Rosie relies on two types of datasets: reimbursement records (one CSV per year) and a companies registry. Both are fetched from public government sources via the serenata-toolbox package.

Chamber of Deputies Datasets

Reimbursement Data

Yearly CSV files covering every CEAP reimbursement request filed by a congressperson. Rosie fetches historical data starting from 2009 and updates through the current year.
STARTING_YEAR = 2009

def update_reimbursements(self, years=None):
    if not years:
        next_year = date.today().year + 1
        years = range(self.STARTING_YEAR, next_year)

    for year in years:
        self.log.info(f'Updating reimbursements from {year}')
        try:
            Reimbursements(year, self.path)()
        except HTTPError as e:
            self.log.error(f'Could not update Reimbursements from year {year}: {e} - {e.filename}')
Files are stored with the naming pattern:
reimbursements-YYYY.csv
All yearly files are loaded and concatenated into a single DataFrame:
REIMBURSEMENTS_PATTERN = r'reimbursements-\d{4}\.csv'

@property
def reimbursements(self):
    frames = []
    paths = (
        str(path) for path in Path(self.path).glob('*.csv')
        if match(self.REIMBURSEMENTS_PATTERN, path.name)
    )

    for path in paths:
        self.log.info(f'Loading reimbursements from {path}')
        frames.append(pd.read_csv(path, dtype=self.DTYPE, low_memory=False))

    # DataFrame.append was removed in pandas 2.0; collect and concatenate once
    return pd.concat(frames) if frames else pd.DataFrame()
Key columns loaded with explicit string typing to preserve leading zeros:
Column               Type
applicant_id         string
cnpj_cpf             string
congressperson_id    string
subquota_number      string
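A minimal sketch of why the explicit typing matters: without dtype, pandas parses these columns as integers and silently drops leading zeros. The DTYPE mapping and sample row below are illustrations inferred from the table, not Rosie's exact source.

```python
import io

import pandas as pd

# Illustrative dtype mapping mirroring the table above
DTYPE = {
    'applicant_id': str,
    'cnpj_cpf': str,
    'congressperson_id': str,
    'subquota_number': str,
}

csv = io.StringIO('applicant_id,cnpj_cpf\n007,00123456000101\n')
df = pd.read_csv(csv, dtype=DTYPE)
# With the string dtype, '007' keeps its leading zeros instead of becoming 7
```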

Companies Dataset

A snapshot of the Brazilian Federal Revenue company registry. Rosie downloads a fixed version of this file:
COMPANIES_DATASET = '2016-09-03-companies.xz'
It is fetched via serenata-toolbox’s fetch utility and stored in the data directory. When loading, CNPJ values are stripped of non-numeric characters:
@property
def companies(self):
    path = Path(self.path) / self.COMPANIES_DATASET
    # np.str was removed from NumPy; the built-in str works as a pandas dtype
    df = pd.read_csv(path, dtype={'cnpj': str}, low_memory=False)
    df['cnpj'] = df['cnpj'].str.replace(r'\D', '', regex=True)
    return df

Merging Datasets

The Adapter.dataset property merges both sources on the supplier tax ID:
@property
def dataset(self):
    self.update_datasets()
    df = self.reimbursements.merge(
        self.companies,
        how='left',
        left_on='cnpj_cpf',
        right_on='cnpj'
    )
    self.prepare_dataset(df)
    return df
A left join is used so that reimbursements without a matching CNPJ in the companies registry are still retained — they will simply have NaN values for company-related columns.
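The effect of the left join can be seen in a small sketch (the sample rows below are invented for illustration):

```python
import pandas as pd

reimbursements = pd.DataFrame({
    'document_id': [1, 2],
    'cnpj_cpf': ['00000000000191', '99999999999999'],
})
companies = pd.DataFrame({
    'cnpj': ['00000000000191'],
    'name': ['BANCO DO BRASIL SA'],
})

merged = reimbursements.merge(
    companies, how='left', left_on='cnpj_cpf', right_on='cnpj'
)
# Both reimbursements survive the merge; the unmatched one has NaN
# in the company-related columns
```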

Column Normalization

After merging, columns are renamed to the Serenata de Amor standard:
RENAME_COLUMNS = {
    'subquota_description': 'category',
    'total_net_value': 'net_value',
    'cnpj_cpf': 'recipient_id',
    'supplier': 'recipient'
}
Two additional columns are created during normalization:
  • document_type — categorical: bill_of_sale, simple_receipt, or expense_made_abroad (document type codes 3, 4, 5 are treated as None pending clarification from the Chamber)
  • is_party_expense — boolean: True when congressperson_id is null (i.e. the expense belongs to a party rather than an individual)
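A sketch of how these two columns could be derived, assuming codes 0, 1, and 2 map to the three document types named above (the mapping and sample data are illustrative, not Rosie's exact code):

```python
import pandas as pd

df = pd.DataFrame({
    'document_type': [0, 1, 2, 3],
    'congressperson_id': ['123', None, '456', None],
})

# Assumed code-to-label mapping based on the description above;
# unmapped codes (e.g. 3) become NaN, matching the "treated as None" rule
labels = ('bill_of_sale', 'simple_receipt', 'expense_made_abroad')
df['document_type'] = df['document_type'].map(dict(enumerate(labels)))
df['document_type'] = df['document_type'].astype('category')

# Party expenses carry no individual congressperson ID
df['is_party_expense'] = df['congressperson_id'].isnull()
```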
The subquota_description value "Congressperson meal" is normalized to "Meal" to match what classifiers expect:
rename = {'Congressperson meal': 'Meal'}
df['subquota_description'] = df['subquota_description'].replace(rename)

Federal Senate Dataset

The Federal Senate adapter uses serenata-toolbox’s Dataset class to fetch, translate, and clean Senate reimbursement records:
def update_datasets(self):
    os.makedirs(self.path, exist_ok=True)
    federal_senate = Dataset(self.path)
    federal_senate.fetch()
    federal_senate.translate()
    federal_senate_reimbursements_path = federal_senate.clean()
    return federal_senate_reimbursements_path
Column names are mapped to the same standard as the Chamber module:
COLUMNS = {
    'net_value': 'reimbursement_value',
    'recipient_id': 'cnpj_cpf',
    'recipient': 'supplier',
}
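Applying the mapping is a plain rename. Assuming the keys above are the standard names and the values the original Senate names, the dict is inverted before the call (a sketch with invented sample data, not the adapter's exact code):

```python
import pandas as pd

COLUMNS = {
    'net_value': 'reimbursement_value',
    'recipient_id': 'cnpj_cpf',
    'recipient': 'supplier',
}

senate = pd.DataFrame({
    'reimbursement_value': [100.0],
    'cnpj_cpf': ['00000000000191'],
    'supplier': ['ACME'],
})

# Invert the mapping so Senate column names become the standard names
senate = senate.rename(columns={v: k for k, v in COLUMNS.items()})
```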
Since the Senate data has no document_type column, all rows receive the value 'unknown' — which is a valid type accepted by InvalidCnpjCpfClassifier:
self._dataset['document_type'] = 'unknown'

Suspicions Output

After all classifiers have run, the Core engine writes a compressed CSV:
output = os.path.join(self.data_path, 'suspicions.xz')
kwargs = dict(compression='xz', encoding='utf-8', index=False)
self.suspicions.to_csv(output, **kwargs)
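The file can be read back the same way it was written: pandas infers the xz codec from the file extension. A round-trip sketch with invented data:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'document_id': [1], 'invalid_cnpj_cpf': [True]})

# Same keyword arguments the Core engine uses above
path = os.path.join(tempfile.mkdtemp(), 'suspicions.xz')
df.to_csv(path, compression='xz', encoding='utf-8', index=False)

# Reading back: the xz codec is inferred from the extension
loaded = pd.read_csv(path)
```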

Chamber of Deputies output columns

The suspicions file contains the unique identifier columns followed by one boolean column per classifier:
Column                          Description
applicant_id                    Unique ID of the person who filed the reimbursement
year                            Quota year of the reimbursement
document_id                     Document identifier
meal_price_outlier              True if the meal price was statistically abnormal
over_monthly_subquota_limit     True if the expense exceeded the subquota monthly cap
suspicious_traveled_speed_day   True if physically impossible travel was detected
invalid_cnpj_cpf                True if the supplier tax ID failed checksum validation
election_expenses               True if the supplier is a political candidate entity
irregular_companies_classifier  True if the supplier had an irregular registration status

Federal Senate output columns

Federal Senate suspicions retain the full dataset (no UNIQUE_IDS are defined) and include only:
Column             Description
invalid_cnpj_cpf   True if the supplier tax ID failed checksum validation
Model Caching

Trained classifier models are cached as .pkl files alongside the datasets. The MealPriceOutlierClassifier and TraveledSpeedsClassifier models are reused on subsequent runs, avoiding the cost of refitting from scratch.
MonthlySubquotaLimitClassifier is never cached because the fitted object exceeds a size joblib handles reliably; it is always retrained from the full dataset on each run.
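The caching pattern can be sketched as follows. Here pickle stands in for joblib to keep the example dependency-free, and the cache filename is hypothetical:

```python
import pickle
import tempfile
from pathlib import Path

def load_or_train(path, train):
    """Return a cached model if one exists; otherwise train and cache it."""
    path = Path(path)
    if path.exists():
        with path.open('rb') as fh:
            return pickle.load(fh)
    model = train()
    with path.open('wb') as fh:
        pickle.dump(model, fh)
    return model

# Hypothetical cache path; a dict stands in for a fitted classifier
cache = Path(tempfile.mkdtemp()) / 'meal_price_outlier.pkl'
first = load_or_train(cache, lambda: {'fitted': True})    # trains, writes cache
second = load_or_train(cache, lambda: {'fitted': False})  # loads cached model
```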
