The pipeline at a glance
Download open data
Rosie uses the serenata-toolbox pip package to download reimbursement records from the Brazilian Chamber of Deputies and Federal Senate. Data is fetched as per-year CSV files, starting from 2009, and a companies dataset is fetched alongside for cross-referencing.
Prepare the dataset
Rosie’s Adapter merges reimbursement records with company data, normalizes column names, coerces date types, and categorizes document types. The result is a single cleaned pandas DataFrame ready for classification.
Run the classifiers
Rosie’s Core object iterates over every configured classifier in sequence. Each classifier fits a model on the dataset and predicts whether each row is suspicious. Numeric classifiers return -1 (suspicious) or 1 (normal); rule-based classifiers return boolean True/False. The Core engine normalizes all predictions to boolean columns in the suspicions DataFrame.
Output suspicions.xz
After all classifiers have run, the suspicions DataFrame is written to a compressed CSV file at /tmp/serenata-data/suspicions.xz. Each row maps a unique reimbursement (identified by applicant_id, year, and document_id) to a boolean column for each classifier.
Import into Jarbas
A Django management command loads suspicions.xz into the PostgreSQL database. Separate commands load the full reimbursements dataset and company records. A searchvector command then builds a PostgreSQL full-text search index.
Serve the dashboard and API
Jarbas runs a Django REST Framework API and an Elm-based frontend dashboard. Citizens can browse reimbursements and filter them by suspicion type, congressperson, date, state, party, and more.
Tweet suspicious findings
The tweets management command instructs Jarbas to post about suspicious reimbursements on Twitter as @RosieDaSerenata, tagging the relevant congressperson and inviting public scrutiny.
Downloading reimbursement data
Rosie delegates all data fetching to the serenata-toolbox package. Inside rosie/rosie/chamber_of_deputies/adapter.py, the Adapter class handles this.
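As a rough illustration of the per-year fetch described in the overview, the sketch below only plans which years would be requested. It is not the Adapter's code and does not call serenata-toolbox; the function names and the file-name pattern are hypothetical, and only the 2009 starting year comes from the text.

```python
from datetime import date

STARTING_YEAR = 2009  # first year of data, as stated in the overview


def years_to_fetch(today=None):
    """Return every year from STARTING_YEAR through the current year."""
    today = today or date.today()
    return list(range(STARTING_YEAR, today.year + 1))


def planned_files(today=None):
    """Map each year to a hypothetical local CSV name (illustrative only)."""
    return {year: f"reimbursements-{year}.csv" for year in years_to_fetch(today)}


if __name__ == "__main__":
    # Pin the date so the result does not depend on when the sketch is run.
    for year, name in planned_files(date(2011, 6, 1)).items():
        print(year, name)
```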
The classifiers
Rosie’s Core object runs each classifier in sequence. Every classifier implements the scikit-learn TransformerMixin interface with fit, transform, and predict methods.
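The loop can be pictured with a minimal duck-typed sketch. It uses plain Python lists instead of pandas and does not import scikit-learn, so it only mimics the fit/transform/predict shape. Both classifiers, the to_bool helper, and run_classifiers are hypothetical names invented for illustration; the -1/1 and True/False conventions are the ones described in the pipeline overview.

```python
class AboveLimitClassifier:
    """Hypothetical numeric classifier: -1 means suspicious, 1 means normal."""

    def __init__(self, limit=1000.0):
        self.limit = limit

    def fit(self, rows):
        return self  # nothing to learn in this toy example

    def transform(self, rows=None):
        return self

    def predict(self, rows):
        return [-1 if row["net_value"] > self.limit else 1 for row in rows]


class MissingRecipientClassifier:
    """Hypothetical rule-based classifier: returns booleans directly."""

    def fit(self, rows):
        return self

    def transform(self, rows=None):
        return self

    def predict(self, rows):
        return [not row["recipient_id"] for row in rows]


def to_bool(prediction):
    """Booleans pass through; numeric -1 becomes True and 1 becomes False."""
    if isinstance(prediction, bool):
        return prediction
    return prediction == -1


def run_classifiers(classifiers, rows):
    """Run each named classifier in order and collect one boolean per row."""
    suspicions = [{} for _ in rows]
    for name, classifier in classifiers.items():
        classifier.fit(rows)
        classifier.transform(rows)
        for suspicion, prediction in zip(suspicions, classifier.predict(rows)):
            suspicion[name] = to_bool(prediction)
    return suspicions


if __name__ == "__main__":
    sample = [
        {"net_value": 50.0, "recipient_id": "123"},
        {"net_value": 5000.0, "recipient_id": ""},
    ]
    configured = {
        "above_limit": AboveLimitClassifier(),
        "missing_recipient": MissingRecipientClassifier(),
    }
    print(run_classifiers(configured, sample))
```

Keying the classifiers by name keeps the output columns aligned with the configuration, which is the same idea as the snake_case keys used for the suspicions columns.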
MealPriceOutlierClassifier
Detects meal expenses whose price is a statistical outlier compared to other reimbursements at the same restaurant. Uses KMeans clustering to group restaurants and flags any expense where the net value exceeds the cluster threshold (mean + 4 standard deviations, or mean + 3 standard deviations for well-known companies).
Key columns: category == "Meal", recipient_id (14-digit CNPJ only)
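The threshold rule for the meal price classifier can be sketched on its own, leaving out the KMeans grouping step. The function names are hypothetical, and the use of the sample standard deviation (statistics.stdev) is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev


def outlier_threshold(cluster_values, well_known=False):
    """mean + 4 standard deviations, or mean + 3 for well-known companies."""
    factor = 3 if well_known else 4
    # Assumption: sample standard deviation; the source does not specify.
    return mean(cluster_values) + factor * stdev(cluster_values)


def flag_outliers(cluster_values, candidate_values, well_known=False):
    """Return True for each candidate net value above the cluster threshold."""
    threshold = outlier_threshold(cluster_values, well_known)
    return [value > threshold for value in candidate_values]


if __name__ == "__main__":
    history = [10, 12, 11, 13, 9]  # made-up net values for one cluster
    print(flag_outliers(history, [16, 18]))
    print(flag_outliers(history, [16, 18], well_known=True))
```

The lower multiplier makes the check stricter for well-known companies, so a price that passes at an unknown restaurant can still be flagged at a well-known one.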
TraveledSpeedsClassifier
Detects reimbursements that would require the congressperson to have traveled at an implausible speed between two locations on the same day. It calculates the geographic distance between expense locations and checks whether the implied travel speed is physically impossible.
Key columns: issue_date, applicant_id, latitude, longitude
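One way to picture the distance and speed check is the haversine sketch below. It is an illustration under stated assumptions, not the classifier's model: the great-circle formula, the elapsed-hours parameter, and the 900 km/h ceiling are all assumptions introduced here, and the text does not say how the real classifier derives speed from dates alone.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def is_implausible(distance_km, hours, max_speed_kmh=900.0):
    """Flag a trip whose implied speed exceeds an assumed ceiling (hypothetical value)."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return distance_km / hours > max_speed_kmh


if __name__ == "__main__":
    # Approximate coordinates for Brasília and São Paulo.
    distance = haversine_km(-15.7939, -47.8828, -23.5505, -46.6333)
    print(round(distance), is_implausible(distance, hours=0.5))
```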
ElectionExpensesClassifier
Flags reimbursements made to companies whose Brazilian Federal Revenue legal entity category is 409-0 - CANDIDATO A CARGO POLITICO ELETIVO, that is, entities registered as electoral candidates. Congressional funds should not be spent at such entities.
Key column: legal_entity
IrregularCompaniesClassifier
Checks the official registration status of the supplier company in the Brazilian Federal Revenue. Flags reimbursements to companies that were in an irregular, suspended, or closed state at the time the expense was made.
Key columns: situation, situation_date, issue_date
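The rule for the irregular companies classifier combines a status check with a date comparison, which the sketch below illustrates. The status labels in IRREGULAR_STATUSES are assumed values chosen for the example, and the strict "status changed before the expense date" comparison is one plausible reading of "at the time the expense was made", not a confirmed detail.

```python
from datetime import date

# Assumed status labels for illustration; the text does not list the exact values.
IRREGULAR_STATUSES = {"BAIXADA", "NULA", "SUSPENSA", "INAPTA"}


def is_irregular(situation, situation_date, issue_date):
    """True when the company was already in an irregular state when the expense was issued."""
    return situation in IRREGULAR_STATUSES and situation_date < issue_date


if __name__ == "__main__":
    print(is_irregular("BAIXADA", date(2015, 1, 1), date(2016, 3, 1)))
    print(is_irregular("BAIXADA", date(2017, 1, 1), date(2016, 3, 1)))
```

Comparing the two dates keeps expenses made while the company was still regular from being flagged after the fact.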
MonthlySubquotaLimitClassifier
Detects cases where a congressperson’s total reimbursements for a given subquota category in a single month exceed the legal limit. Each subquota (expense category) has a defined monthly ceiling; this classifier sums expenditure per applicant, month, year, and subquota and flags overruns.
Key columns: applicant_id, month, year, subquota_number, net_value
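The grouping and summing for the monthly subquota check can be sketched with the standard library alone. The ceiling in MONTHLY_LIMITS is a made-up number, since the text does not give the real limits, and this sketch reports whole groups that overran rather than marking individual rows.

```python
from collections import defaultdict

# Hypothetical ceiling keyed by subquota_number; real limits are not given in the text.
MONTHLY_LIMITS = {3: 6000.0}


def over_limit_groups(rows, limits=MONTHLY_LIMITS):
    """Sum net_value per (applicant, year, month, subquota) and keep the overruns."""
    totals = defaultdict(float)
    for row in rows:
        key = (row["applicant_id"], row["year"], row["month"], row["subquota_number"])
        totals[key] += row["net_value"]
    return {
        key: total
        for key, total in totals.items()
        if key[3] in limits and total > limits[key[3]]
    }


if __name__ == "__main__":
    sample = [
        {"applicant_id": 1, "year": 2016, "month": 5, "subquota_number": 3, "net_value": 4000.0},
        {"applicant_id": 1, "year": 2016, "month": 5, "subquota_number": 3, "net_value": 2500.0},
        {"applicant_id": 2, "year": 2016, "month": 5, "subquota_number": 3, "net_value": 3000.0},
    ]
    print(over_limit_groups(sample))
```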
InvalidCnpjCpfClassifier
Validates the recipient_id field, which holds either a CNPJ (Brazilian company tax ID) or a CPF (Brazilian personal tax ID), by computing the expected check digits and comparing them to the submitted value. An invalid ID may indicate a fictitious supplier.
Key columns: document_type, recipient_id
The classifiers are configured in rosie/rosie/chamber_of_deputies/settings.py. Each classifier is keyed by a human-readable snake_case name that becomes a column in the output suspicions.xz file (e.g., meal_price_outlier, invalid_cnpj_cpf).
The suspicions.xz output
After all classifiers run, Core writes the results to suspicions.xz. Each row is identified by applicant_id, year, and document_id. Columns correspond to each classifier’s name and contain True (suspicious) or False (not suspicious).
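To show what an xz-compressed CSV with this layout looks like, here is a minimal standard-library sketch that writes and reads one. It is not Core's own writer, which works on a pandas DataFrame; the helper names and the sample row are made up, and the file goes to a temporary directory rather than /tmp/serenata-data.

```python
import csv
import lzma
import os
import tempfile


def write_suspicions(rows, path):
    """Write dict rows to an xz-compressed CSV, using the first row's keys as the header."""
    fieldnames = list(rows[0].keys())
    with lzma.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_suspicions(path):
    """Read the file back; every value comes back as a string, e.g. 'True'."""
    with lzma.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    sample = [
        {
            "applicant_id": 1,
            "year": 2016,
            "document_id": 10,
            "meal_price_outlier": True,
            "invalid_cnpj_cpf": False,
        }
    ]
    target = os.path.join(tempfile.mkdtemp(), "suspicions.xz")
    write_suspicions(sample, target)
    print(read_suspicions(target))
```

Because CSV carries no types, a consumer reading the file this way has to convert the "True" and "False" strings back to booleans itself.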
