The research/src/ directory contains Python scripts that collect and prepare the open data used by Rosie and Jarbas. Each script is focused on a single data source or transformation step.

Dependencies

Scripts share a common set of libraries declared in research/requirements.txt:
aiofiles==0.4.0
aiohttp==3.5.4
beautifulsoup4==4.7.1
geopy==1.18.1
grequests==0.3.0
humanize==0.5.1
numpy==1.16.2
pandas==0.24.1
python-decouple==3.1
requests==2.21.0
serenata-toolbox
tqdm==4.31.1
The serenata-toolbox package is a pip-installable library that handles dataset versioning and remote storage. It is used across multiple scripts for uploading and retrieving datasets.

Reimbursement data

fetch_receipts.py

Downloads PDF receipt images from the Lower House server for every reimbursement in the local datasets.
python research/src/fetch_receipts.py <target_directory> --limit 100
  • Scans all .xz dataset files in research/data/ for receipt URLs.
  • Downloads each PDF to <target_directory>/<applicant_id>/<year>/<document_id>.pdf.
  • Skips receipts that have already been downloaded.
  • Uses a multiprocessing pool (4 workers by default) for parallel downloads (sketched below).
  • Prints a progress report with total size, skipped files, and download errors.
Downloading the complete receipt archive may require more than 1 TB of disk space. Use --limit to download a subset.
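A minimal sketch of the per-receipt download step, assuming a list of (url, target_path) pairs has already been scraped from the datasets; helper and variable names here are illustrative, not the script's actual internals:
import os
from multiprocessing import Pool

import requests

def download_receipt(task):
    """Download one receipt PDF, skipping files that already exist."""
    url, path = task
    if os.path.exists(path):
        return "skipped"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    response = requests.get(url)
    if response.status_code != 200:
        return "error"
    with open(path, "wb") as pdf:
        pdf.write(response.content)
    return "downloaded"

if __name__ == "__main__":
    tasks = []  # (url, target_path) pairs gathered from research/data/*.xz
    with Pool(processes=4) as pool:  # 4 workers, the script's default
        report = pool.map(download_receipt, tasks)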

group_receipts.py

Merges the separate year-based reimbursement datasets (current-year, last-year, previous-years) into a single consolidated reimbursements.xz file. The output file is date-stamped (e.g., 2024-01-15-reimbursements.xz) and stored in research/data/.
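Conceptually the merge is a pandas concatenation of the three per-period files; a minimal sketch, with input file names assumed for illustration:
from datetime import date

import pandas as pd

# The three per-period datasets produced by the fetch step
chunks = [
    pd.read_csv("research/data/reimbursements-{}.xz".format(period),
                dtype="object")
    for period in ("current-year", "last-year", "previous-years")
]

merged = pd.concat(chunks, ignore_index=True)
output = "research/data/{}-reimbursements.xz".format(date.today().isoformat())
merged.to_csv(output, compression="xz", index=False)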

Company and supplier data

fetch_cnpj_info.py

Fetches company registration details from the Brazilian federal revenue service (Receita Federal) for every CNPJ that appears in the reimbursement datasets.
python research/src/fetch_cnpj_info.py ./data/2016-12-10-reimbursements.xz \
    -t 10 \
    -p 177.67.84.135:8080 177.67.82.80:8080
  • Accepts one or more source dataset files as positional arguments.
  • -t / --threads: number of concurrent fetch threads (default: 10).
  • -p / --proxies: optional list of HTTP proxy addresses to distribute requests.
  • Saves progress incrementally to data/companies-partial.xz every 100 records to avoid data loss on interruption (see the sketch below).
  • Translates Brazilian Portuguese column names to English and decomposes nested fields (main activity, secondary activities, partners list).
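A minimal sketch of the threaded fetch-and-checkpoint pattern described above; fetch_company is a hypothetical stand-in for the actual Receita Federal request:
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def fetch_company(cnpj):
    """Hypothetical stand-in for the actual Receita Federal request."""
    raise NotImplementedError

cnpjs = []  # unique CNPJs extracted from the source datasets
results = []

with ThreadPoolExecutor(max_workers=10) as executor:  # -t / --threads
    for count, record in enumerate(executor.map(fetch_company, cnpjs), 1):
        results.append(record)
        if count % 100 == 0:  # checkpoint every 100 records
            pd.DataFrame(results).to_csv("data/companies-partial.xz",
                                         compression="xz", index=False)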

geocode_addresses.py

Adds latitude/longitude coordinates to each company in data/companies.xz using the Google Maps Geocoding API.
  • Reads company addresses from the companies.xz dataset.
  • Calls the Google Maps Geocoder via geopy with up to 40 concurrent threads.
  • Caches results as pickle files in data/companies/ so interrupted runs can resume.
  • Writes the final dataset back to data/companies.xz and removes the temporary cache directory.
Set GOOGLE_API_KEY in your .env file before running this script.
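A minimal sketch of the cache-then-geocode step using geopy's GoogleV3 geocoder; the real script fans this out over up to 40 threads:
import os
import pickle

from decouple import config
from geopy.geocoders import GoogleV3

geocoder = GoogleV3(api_key=config("GOOGLE_API_KEY"))

def geocode_address(address, cache_path):
    """Geocode one address, caching the result so interrupted runs resume."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as cached:
            return pickle.load(cached)
    location = geocoder.geocode(address)
    coordinates = (location.latitude, location.longitude) if location else None
    with open(cache_path, "wb") as cache:
        pickle.dump(coordinates, cache)
    return coordinates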

Congressperson data

fetch_congressperson_details.py

Fetches civil name, birth date, and gender for every congressperson from the Chamber of Deputies web service.
  • Collects all unique congressperson_id values across the current-year, last-year, and previous-years datasets.
  • Calls ObterDetalhesDeputado on the Chamber’s SOAP endpoint for each ID.
  • Outputs a date-stamped file (e.g., 2024-01-15-congressperson-details.xz) to research/data/.
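A minimal sketch of one lookup; the endpoint URL and XML field names are assumptions for illustration:
import xml.etree.ElementTree as ET

import requests

# Endpoint and field names assumed for illustration; the ASMX service
# answers plain HTTP GET as well as SOAP
URL = ("https://www.camara.leg.br/SitCamaraWS/Deputados.asmx/"
       "ObterDetalhesDeputado")

def fetch_details(congressperson_id):
    """Fetch one congressperson's civil name, birth date, and gender."""
    params = {"ideCadastro": congressperson_id, "numLegislatura": ""}
    response = requests.get(URL, params=params)
    tree = ET.fromstring(response.content)
    return {
        "civil_name": tree.findtext(".//nomeCivil"),
        "birth_date": tree.findtext(".//dataNascimento"),
        "gender": tree.findtext(".//sexo"),
    }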

Campaign finance data

fetch_campaign_donations.py

Downloads campaign donation reports for federal elections from the Brazilian Electoral Court (TSE).
  • Covers election years 2010, 2012, 2014, and 2016.
  • Downloads ZIP archives from agencia.tse.jus.br for candidates, parties, and committees.
  • Extracts and merges the text files into a single normalized dataset.
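A minimal sketch of the download-and-extract step; the URL pattern is an assumption, so check the TSE portal for the exact paths:
import io
import zipfile

import requests

# URL pattern assumed for illustration only
BASE = "http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas"

def download_year(year, target_dir):
    """Download and extract one election year's donation archive."""
    url = "{}/prestacao_contas_{}.zip".format(BASE, year)
    response = requests.get(url)
    with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
        archive.extractall(target_dir)

for year in (2010, 2012, 2014, 2016):
    download_year(year, "data/tse")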

fetch_tse_data.py

Fetches additional electoral data from the TSE open data portal, complementing the campaign donation data with candidate and election metadata.

Sanctions data

fetch_federal_sanctions.py

Downloads the federal government sanctions lists published by the Office of the Comptroller General (CGU).
  • Retrieves the most recently published version of each sanctions list by walking backwards from today’s date until a valid file is found (sketched below).
  • Downloads and extracts ZIP archives for each dataset.
  • Translates column names from Portuguese to English.
  • Removes temporary files after processing.
The sanctions datasets are used by Rosie to flag suppliers that appear on federal ineligibility or suspension lists.
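A minimal sketch of the walk-backwards probe, with a hypothetical url_template parameter standing in for each list's download URL:
from datetime import date, timedelta

import requests

def latest_published(url_template, max_lookback=365):
    """Walk backwards from today until a published file answers 200 OK."""
    day = date.today()
    for _ in range(max_lookback):
        url = url_template.format(day.strftime("%Y%m%d"))
        if requests.head(url).status_code == 200:
            return url
        day -= timedelta(days=1)
    return None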

Venue verification data

fetch_foursquare_info.py

Queries the Foursquare Places API for venue information at supplier addresses, which helps verify whether a claimed expense location is consistent with the supplier's category.
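A minimal sketch of such a lookup against the legacy v2 venues search; the credential variable names are assumptions:
import requests
from decouple import config

def search_venues(latitude, longitude, name):
    """Look up candidate venues near a supplier's coordinates."""
    params = {
        "client_id": config("FOURSQUARE_CLIENT_ID"),
        "client_secret": config("FOURSQUARE_CLIENT_SECRET"),
        "v": "20161021",  # API version date required by the v2 endpoints
        "ll": "{},{}".format(latitude, longitude),
        "query": name,
    }
    response = requests.get("https://api.foursquare.com/v2/venues/search",
                            params=params)
    return response.json()["response"].get("venues", [])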

fetch_yelp_info.py

Queries the Yelp Fusion API for business information at supplier addresses, providing an independent venue verification source alongside the Foursquare data.

Utilities

backup_data.py

Uploads all locally generated datasets to the remote dataset repository via serenata-toolbox:
import os

from serenata_toolbox.datasets import Datasets

datasets = Datasets('data')

# Upload every dataset present locally but missing from remote storage
for file_name in datasets.pending:
    file_path = os.path.join(datasets.local.directory, file_name)
    datasets.remote.upload(file_path)
Run this after generating new datasets to make them available to other contributors via the toolbox.

translation_table.py and table_config.json

Shared mapping tables that define column names, data types, and CNPJ column locations for each source dataset. These are imported by several fetch scripts to keep transformations consistent.
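As a rough illustration of how a fetch script might consume the shared configuration (the keys shown are assumptions, not the file's actual schema):
import json

import pandas as pd

# Keys assumed for illustration; see table_config.json for the real schema
with open("research/src/table_config.json") as config_file:
    table_config = json.load(config_file)

dataset = pd.read_csv("research/data/some-dataset.xz", dtype="object")
dataset = dataset.rename(columns=table_config["columns"])  # pt-BR -> English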

utils.py

Common helper functions shared across multiple scripts.

Community notebooks

The community maintains an independent collection of Jupyter notebooks that explore the datasets and prototype new classifiers. You can find them at github.com/okfn-brasil/notebooks.
Use the serenata-toolbox to download the latest versioned datasets instead of re-running all fetch scripts from scratch. Install it with pip install serenata-toolbox.
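For example, a single dataset can be fetched by name with the toolbox's fetch helper:
from serenata_toolbox.datasets import fetch

# Download one versioned dataset into the local data directory
fetch("2016-12-10-reimbursements.xz", "research/data")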
