Rosie relies on two types of datasets: reimbursement records (one CSV per year) and a companies registry. Both are fetched from public government sources via the serenata-toolbox package.
Chamber of Deputies Datasets
Reimbursement Data
Yearly CSV files covering every CEAP reimbursement request filed by a congressperson. Rosie fetches historical data starting from 2009 and updates through the current year.
```python
from datetime import date
from urllib.error import HTTPError

# Import path assumed from serenata-toolbox
from serenata_toolbox.chamber_of_deputies.reimbursements import Reimbursements

STARTING_YEAR = 2009

# Method on the Chamber of Deputies adapter class
def update_reimbursements(self, years=None):
    if not years:
        next_year = date.today().year + 1
        years = range(self.STARTING_YEAR, next_year)
    for year in years:
        self.log.info(f'Updating reimbursements from {year}')
        try:
            Reimbursements(year, self.path)()
        except HTTPError as e:
            self.log.error(f'Could not update Reimbursements from year {year}: {e} - {e.filename}')
```
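For example, assuming an `adapter` instance of this class (the variable name is illustrative), specific years can be refreshed without touching the rest:

```python
adapter.update_reimbursements(years=[2016, 2017])  # refresh only two years
adapter.update_reimbursements()                    # 2009 through the current year
```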
Files are stored with the naming pattern `reimbursements-<year>.csv` (e.g. `reimbursements-2016.csv`). All yearly files are loaded and concatenated into a single DataFrame:
```python
from pathlib import Path
from re import match

import pandas as pd

REIMBURSEMENTS_PATTERN = r'reimbursements-\d{4}\.csv'

@property
def reimbursements(self):
    paths = (
        str(path) for path in Path(self.path).glob('*.csv')
        if match(self.REIMBURSEMENTS_PATTERN, path.name)
    )
    # Collect the yearly frames and concatenate once; DataFrame.append
    # was removed in pandas 2.0 and was quadratic when used in a loop
    dfs = []
    for path in paths:
        self.log.info(f'Loading reimbursements from {path}')
        dfs.append(pd.read_csv(path, dtype=self.DTYPE, low_memory=False))
    return pd.concat(dfs) if dfs else pd.DataFrame()
```
Key columns are loaded with explicit string typing to preserve leading zeros:
| Column | Type |
|---|---|
| applicant_id | string |
| cnpj_cpf | string |
| congressperson_id | string |
| subquota_number | string |
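A sketch of what the `DTYPE` mapping passed to `pd.read_csv` presumably looks like, based on the table above (the exact attribute contents are not shown here):

```python
# Assumed contents of the adapter's DTYPE attribute
DTYPE = {
    'applicant_id': str,
    'cnpj_cpf': str,
    'congressperson_id': str,
    'subquota_number': str,
}
```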
Companies Dataset
A snapshot of the Brazilian Federal Revenue company registry. Rosie downloads a fixed version of this file:
```python
COMPANIES_DATASET = '2016-09-03-companies.xz'
```
It is fetched via serenata-toolbox’s fetch utility and stored in the data directory. When loading, CNPJ values are stripped of non-numeric characters:
```python
from pathlib import Path

import pandas as pd

@property
def companies(self):
    path = Path(self.path) / self.COMPANIES_DATASET
    # np.str was removed from NumPy; plain str works as a pandas dtype
    df = pd.read_csv(path, dtype={'cnpj': str}, low_memory=False)
    # regex=True is required in pandas >= 2.0 for pattern replacement
    df['cnpj'] = df['cnpj'].str.replace(r'\D', '', regex=True)
    return df
```
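A quick illustration of that cleanup (the CNPJ below is made up):

```python
import pandas as pd

cnpjs = pd.Series(['12.345.678/0001-95'])
print(cnpjs.str.replace(r'\D', '', regex=True)[0])  # '12345678000195'
```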
Merging Datasets
The Adapter.dataset property merges both sources on the supplier tax ID:
```python
@property
def dataset(self):
    self.update_datasets()
    df = self.reimbursements.merge(
        self.companies,
        how='left',
        left_on='cnpj_cpf',
        right_on='cnpj'
    )
    self.prepare_dataset(df)
    return df
```
A left join is used so that reimbursements without a matching CNPJ in the companies registry are still retained — they will simply have NaN values for company-related columns.
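A toy example of that behavior (values are made up):

```python
import pandas as pd

reimbursements = pd.DataFrame({'cnpj_cpf': ['111', '222']})
companies = pd.DataFrame({'cnpj': ['111'], 'situation': ['OK']})

merged = reimbursements.merge(companies, how='left',
                              left_on='cnpj_cpf', right_on='cnpj')
# The '222' row is kept, with NaN in 'cnpj' and 'situation'
```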
Column Normalization
After merging, columns are renamed to the Serenata de Amor standard:
```python
RENAME_COLUMNS = {
    'subquota_description': 'category',
    'total_net_value': 'net_value',
    'cnpj_cpf': 'recipient_id',
    'supplier': 'recipient',
}
```
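Presumably applied with a plain pandas rename (the exact call site is not shown here):

```python
df.rename(columns=self.RENAME_COLUMNS, inplace=True)
```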
Two additional columns are created during normalization (see the sketch after this list):

- `document_type` — categorical: `bill_of_sale`, `simple_receipt`, or `expense_made_abroad` (document type codes 3, 4, and 5 are treated as `None` pending clarification from the Chamber)
- `is_party_expense` — boolean: `True` when `congressperson_id` is null, i.e. the expense belongs to a party rather than an individual congressperson
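A minimal sketch of how these columns could be derived, assuming the raw `document_type` column carries integer codes and that codes 0–2 map, in order, to the three categories (the code-to-label mapping is an assumption here):

```python
# Assumed mapping: codes 0-2 are the known types, 3-5 become None
TYPES = ('bill_of_sale', 'simple_receipt', 'expense_made_abroad')
converters = {code: None for code in (3, 4, 5)}
converters.update(enumerate(TYPES))
df['document_type'] = df['document_type'].map(converters).astype('category')

# No congressperson_id means the expense belongs to the party itself
df['is_party_expense'] = df['congressperson_id'].isnull()
```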
The subquota_description value "Congressperson meal" is normalized to "Meal" to match what classifiers expect:
```python
rename = {'Congressperson meal': 'Meal'}
df['subquota_description'] = df['subquota_description'].replace(rename)
```
Federal Senate Dataset
The Federal Senate adapter uses serenata-toolbox’s Dataset class to fetch, translate, and clean Senate reimbursement records:
```python
import os

# Import path assumed from serenata-toolbox
from serenata_toolbox.federal_senate.dataset import Dataset

def update_datasets(self):
    os.makedirs(self.path, exist_ok=True)
    federal_senate = Dataset(self.path)
    federal_senate.fetch()
    federal_senate.translate()
    federal_senate_reimbursements_path = federal_senate.clean()
    return federal_senate_reimbursements_path
```
Column names are mapped to the same standard as the Chamber module:
```python
COLUMNS = {
    'net_value': 'reimbursement_value',
    'recipient_id': 'cnpj_cpf',
    'recipient': 'supplier',
}
```
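Note the dict maps standard name to Senate name, so applying it as a pandas rename presumably requires inverting it first (an assumption; the call site is not shown here):

```python
self._dataset.rename(
    columns={senate: standard for standard, senate in self.COLUMNS.items()},
    inplace=True,
)
```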
Since the Senate data has no document_type column, all rows receive the value 'unknown' — which is a valid type accepted by InvalidCnpjCpfClassifier:
```python
self._dataset['document_type'] = 'unknown'
```
Suspicions Output
After all classifiers have run, the Core engine writes a compressed CSV:
```python
import os

output = os.path.join(self.data_path, 'suspicions.xz')
kwargs = dict(compression='xz', encoding='utf-8', index=False)
self.suspicions.to_csv(output, **kwargs)
```
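To inspect the result, the file can be read straight back; pandas infers xz compression from the extension (the path below is hypothetical):

```python
import pandas as pd

suspicions = pd.read_csv('/tmp/serenata-data/suspicions.xz')
print(suspicions.head())
```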
Chamber of Deputies output columns
The suspicions file contains the unique identifier columns, followed by one boolean column per classifier:
| Column | Description |
|---|---|
| applicant_id | Unique ID of the person who filed the reimbursement |
| year | Quota year of the reimbursement |
| document_id | Document identifier |
| meal_price_outlier | True if the meal price was statistically abnormal |
| over_monthly_subquota_limit | True if the expense exceeded the subquota monthly cap |
| suspicious_traveled_speed_day | True if physically impossible travel was detected |
| invalid_cnpj_cpf | True if the supplier tax ID failed checksum validation |
| election_expenses | True if the supplier is a political candidate entity |
| irregular_companies_classifier | True if the supplier had an irregular registration status |
Federal Senate output columns
Federal Senate suspicions retain the full dataset (no UNIQUE_IDS are defined) and include only:
| Column | Description |
|---|---|
| invalid_cnpj_cpf | True if the supplier tax ID failed checksum validation |
Trained classifier models are cached as .pkl files alongside the datasets. The MealPriceOutlierClassifier and TraveledSpeedsClassifier models are reused on subsequent runs, avoiding the cost of refitting from scratch.
MonthlySubquotaLimitClassifier is never cached because the fitted object exceeds the size joblib can serialize reliably; it is retrained from the full dataset on every run.
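A hypothetical sketch of that load-or-fit pattern, assuming joblib for (de)serialization; the function and variable names here are illustrative, not Rosie's API:

```python
from pathlib import Path

import joblib

def load_or_fit(model, name, data, data_path):
    """Reuse a cached model when present; otherwise fit and cache it."""
    pickle_path = Path(data_path) / f'{name}.pkl'
    if pickle_path.exists():
        return joblib.load(pickle_path)
    model.fit(data)
    joblib.dump(model, pickle_path)
    return model
```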