How Rosie Works
Fetch datasets
Rosie downloads reimbursement CSVs and the companies dataset from public government sources via the serenata-toolbox. Data is fetched for every year back to 2009.
Merge and normalize
The Adapter class merges reimbursements with company registration data (left-joined on CNPJ), renames columns to the Serenata standard, coerces dates, and normalizes category labels.
Run classifiers
The Core engine iterates over every classifier defined in the module’s settings.py. Each classifier receives the full dataset and returns a prediction per row: suspicious (True / -1) or normal (False / 1).
Modules
Rosie covers two legislative bodies, each with its own adapter and classifier settings:
Chamber of Deputies
The primary module. Uses six classifiers covering meal prices, travel speeds, election expenses, irregular companies, monthly subquota limits, and invalid tax IDs.
Unique IDs: applicant_id, year, document_id
Federal Senate
A lighter module that currently runs the InvalidCnpjCpfClassifier against Senate reimbursements. Document types are normalized to unknown because the Senate data does not include a document type column.
Unique IDs: none (full dataset is kept)
Output
After a run finishes, Rosie writes a single compressed file. Predictions are recorded as True (suspicious) or False (normal).
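The classifier loop and the True/False output described above can be illustrated with a minimal, standard-library sketch. It is not Rosie's actual Core code: `to_suspicious`, `run_classifiers`, the two example classifiers, and the `total_net_value` column are all hypothetical names chosen for illustration. The sketch assumes each classifier is a callable that receives every row and returns one prediction per row, either a boolean or the -1/1 convention.

```python
from typing import Callable, Dict, List, Sequence, Union

Row = Dict[str, object]
Prediction = Union[bool, int]
Classifier = Callable[[Sequence[Row]], Sequence[Prediction]]


def to_suspicious(prediction: Prediction) -> bool:
    """Map a raw prediction to True (suspicious) or False (normal)."""
    # Check bool first: in Python, True == 1, so the integer branch
    # below would otherwise misread True as "normal".
    if isinstance(prediction, bool):
        return prediction
    if prediction == -1:
        return True
    if prediction == 1:
        return False
    raise ValueError(f"unexpected prediction: {prediction!r}")


def run_classifiers(
    rows: Sequence[Row], classifiers: Dict[str, Classifier]
) -> List[Row]:
    """Run every classifier on the full dataset; add one boolean column each."""
    results = [dict(row) for row in rows]  # copy so the input is untouched
    for name, predict in classifiers.items():
        predictions = predict(rows)
        if len(predictions) != len(rows):
            raise ValueError(f"{name} must return one prediction per row")
        for result, prediction in zip(results, predictions):
            result[name] = to_suspicious(prediction)
    return results


# Hypothetical data and classifiers, for illustration only.
rows = [
    {"document_id": 1, "total_net_value": 50.0},
    {"document_id": 2, "total_net_value": 900.0},
]
classifiers = {
    # Returns booleans directly.
    "over_limit": lambda rs: [r["total_net_value"] > 500 for r in rs],
    # Returns -1 (suspicious) or 1 (normal).
    "price_outlier": lambda rs: [-1 if r["total_net_value"] > 500 else 1 for r in rs],
}
results = run_classifiers(rows, classifiers)
```

Normalizing both conventions to booleans in one place keeps the output columns uniform regardless of how an individual classifier reports its result.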
Key Dependencies
| Package | Role |
|---|---|
| scikit-learn | Classifier base classes (TransformerMixin) and KMeans clustering |
| pandas | DataFrame manipulation and CSV I/O |
| numpy | Numerical operations and array helpers |
| geopy | Geodesic distance calculation (Vincenty formula) for the travel speed classifier |
| brutils | Brazilian CPF and CNPJ validation |
| serenata-toolbox | Fetches reimbursement and company datasets from government sources |
| docopt | CLI argument parsing for rosie.py |
| scipy | Scientific computing (must be installed before scikit-learn) |
scipy must appear before scikit-learn in requirements.txt so the wheel builds correctly inside Docker.
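To give a sense of what the CPF validation listed for brutils involves, here is a minimal sketch of the standard CPF check-digit rule using only the Python standard library. It does not reproduce brutils' code or API; `is_valid_cpf` and `_check_digit` are hypothetical names, and CNPJ validation (which uses different weights) is not covered.

```python
from typing import List


def _check_digit(digits: List[int]) -> int:
    """Compute the CPF check digit that follows the given leading digits."""
    # Weights descend from len(digits) + 1 down to 2.
    first_weight = len(digits) + 1
    total = sum(d * (first_weight - i) for i, d in enumerate(digits))
    remainder = (total * 10) % 11
    return 0 if remainder == 10 else remainder


def is_valid_cpf(value: str) -> bool:
    """Return True if value holds 11 digits with correct check digits."""
    # Non-digit characters such as "." and "-" are ignored.
    digits = [int(c) for c in value if c.isdigit()]
    if len(digits) != 11:
        return False
    # Sequences like 111.111.111-11 satisfy the arithmetic but are
    # conventionally rejected as invalid.
    if len(set(digits)) == 1:
        return False
    return (
        digits[9] == _check_digit(digits[:9])
        and digits[10] == _check_digit(digits[:10])
    )
```

A classifier built on such a check would flag a reimbursement as suspicious whenever the supplier's tax ID fails validation.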