Rosie relies on two types of datasets: reimbursement records (one CSV per year) and a companies registry. Both are fetched from public government sources via the serenata-toolbox package.
Chamber of Deputies Datasets
Reimbursement Data
Yearly CSV files covering every CEAP reimbursement request filed by a congressperson. Rosie fetches historical data starting from 2009 and updates through the current year.
```python
from datetime import date
from urllib.error import HTTPError

# Import path assumed from serenata-toolbox
from serenata_toolbox.chamber_of_deputies.reimbursements import Reimbursements

STARTING_YEAR = 2009

# Method on the Chamber of Deputies adapter class
def update_reimbursements(self, years=None):
    if not years:
        next_year = date.today().year + 1
        years = range(self.STARTING_YEAR, next_year)
    for year in years:
        self.log.info(f'Updating reimbursements from {year}')
        try:
            Reimbursements(year, self.path)()
        except HTTPError as e:
            self.log.error(f'Could not update Reimbursements from year {year}: {e} - {e.filename}')
```
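For example, assuming an `adapter` instance of this class (the variable name is illustrative), specific years can be refreshed without touching the rest:

```python
adapter.update_reimbursements(years=[2016, 2017])  # refresh only two years
adapter.update_reimbursements()                    # 2009 through the current year
```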
Files are stored with the naming pattern `reimbursements-<year>.csv` (e.g. `reimbursements-2016.csv`). All yearly files are loaded and concatenated into a single DataFrame:
```python
from pathlib import Path
from re import match

import pandas as pd

REIMBURSEMENTS_PATTERN = r'reimbursements-\d{4}\.csv'

@property
def reimbursements(self):
    paths = (
        str(path) for path in Path(self.path).glob('*.csv')
        if match(self.REIMBURSEMENTS_PATTERN, path.name)
    )
    # Collect the yearly frames and concatenate once; DataFrame.append
    # was removed in pandas 2.0 and was quadratic when used in a loop
    dfs = []
    for path in paths:
        self.log.info(f'Loading reimbursements from {path}')
        dfs.append(pd.read_csv(path, dtype=self.DTYPE, low_memory=False))
    return pd.concat(dfs) if dfs else pd.DataFrame()
```
Key columns are loaded with explicit string typing to preserve leading zeros:
| Column | Type |
|---|---|
| applicant_id | string |
| cnpj_cpf | string |
| congressperson_id | string |
| subquota_number | string |
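A sketch of what the `DTYPE` mapping passed to `pd.read_csv` presumably looks like, based on the table above (the exact attribute contents are not shown here):

```python
# Assumed contents of the adapter's DTYPE attribute
DTYPE = {
    'applicant_id': str,
    'cnpj_cpf': str,
    'congressperson_id': str,
    'subquota_number': str,
}
```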
Companies Dataset
A snapshot of the Brazilian Federal Revenue company registry. Rosie downloads a fixed version of this file:
```python
COMPANIES_DATASET = '2016-09-03-companies.xz'
```
It is fetched via serenata-toolbox’s fetch utility and stored in the data directory. When loading, CNPJ values are stripped of non-numeric characters:
```python
from pathlib import Path

import pandas as pd

@property
def companies(self):
    path = Path(self.path) / self.COMPANIES_DATASET
    # np.str was removed from NumPy; plain str works as a pandas dtype
    df = pd.read_csv(path, dtype={'cnpj': str}, low_memory=False)
    # regex=True is required in pandas >= 2.0 for pattern replacement
    df['cnpj'] = df['cnpj'].str.replace(r'\D', '', regex=True)
    return df
```
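A quick illustration of that cleanup (the CNPJ below is made up):

```python
import pandas as pd

cnpjs = pd.Series(['12.345.678/0001-95'])
print(cnpjs.str.replace(r'\D', '', regex=True)[0])  # '12345678000195'
```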
Merging Datasets
The Adapter.dataset property merges both sources on the supplier tax ID:
```python
@property
def dataset(self):
    self.update_datasets()
    df = self.reimbursements.merge(
        self.companies,
        how='left',
        left_on='cnpj_cpf',
        right_on='cnpj'
    )
    self.prepare_dataset(df)
    return df
```
A left join is used so that reimbursements without a matching CNPJ in the companies registry are still retained — they will simply have NaN values for company-related columns.
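A toy example of that behavior (values are made up):

```python
import pandas as pd

reimbursements = pd.DataFrame({'cnpj_cpf': ['111', '222']})
companies = pd.DataFrame({'cnpj': ['111'], 'situation': ['OK']})

merged = reimbursements.merge(companies, how='left',
                              left_on='cnpj_cpf', right_on='cnpj')
# The '222' row is kept, with NaN in 'cnpj' and 'situation'
```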
Column Normalization
After merging, columns are renamed to the Serenata de Amor standard:
```python
RENAME_COLUMNS = {
    'subquota_description': 'category',
    'total_net_value': 'net_value',
    'cnpj_cpf': 'recipient_id',
    'supplier': 'recipient',
}
```
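Presumably applied with a plain pandas rename (the exact call site is not shown here):

```python
df.rename(columns=self.RENAME_COLUMNS, inplace=True)
```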
Two additional columns are created during normalization (see the sketch after this list):

- `document_type` — categorical: `bill_of_sale`, `simple_receipt`, or `expense_made_abroad` (document type codes 3, 4, and 5 are treated as `None` pending clarification from the Chamber)
- `is_party_expense` — boolean: `True` when `congressperson_id` is null, i.e. the expense belongs to a party rather than an individual congressperson
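A minimal sketch of how these columns could be derived, assuming the raw `document_type` column carries integer codes and that codes 0–2 map, in order, to the three categories (the code-to-label mapping is an assumption here):

```python
# Assumed mapping: codes 0-2 are the known types, 3-5 become None
TYPES = ('bill_of_sale', 'simple_receipt', 'expense_made_abroad')
converters = {code: None for code in (3, 4, 5)}
converters.update(enumerate(TYPES))
df['document_type'] = df['document_type'].map(converters).astype('category')

# No congressperson_id means the expense belongs to the party itself
df['is_party_expense'] = df['congressperson_id'].isnull()
```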
The subquota_description value "Congressperson meal" is normalized to "Meal" to match what classifiers expect:
```python
rename = {'Congressperson meal': 'Meal'}
df['subquota_description'] = df['subquota_description'].replace(rename)
```
Federal Senate Dataset
The Federal Senate adapter uses serenata-toolbox’s Dataset class to fetch, translate, and clean Senate reimbursement records:
```python
import os

# Import path assumed from serenata-toolbox
from serenata_toolbox.federal_senate.dataset import Dataset

def update_datasets(self):
    os.makedirs(self.path, exist_ok=True)
    federal_senate = Dataset(self.path)
    federal_senate.fetch()
    federal_senate.translate()
    federal_senate_reimbursements_path = federal_senate.clean()
    return federal_senate_reimbursements_path
```
Column names are mapped to the same standard as the Chamber module:
```python
COLUMNS = {
    'net_value': 'reimbursement_value',
    'recipient_id': 'cnpj_cpf',
    'recipient': 'supplier',
}
```
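Note the dict maps standard name to Senate name, so applying it as a pandas rename presumably requires inverting it first (an assumption; the call site is not shown here):

```python
self._dataset.rename(
    columns={senate: standard for standard, senate in self.COLUMNS.items()},
    inplace=True,
)
```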
Since the Senate data has no document_type column, all rows receive the value 'unknown' — which is a valid type accepted by InvalidCnpjCpfClassifier:
```python
self._dataset['document_type'] = 'unknown'
```
Suspicions Output
After all classifiers have run, the Core engine writes a compressed CSV:
```python
import os

output = os.path.join(self.data_path, 'suspicions.xz')
kwargs = dict(compression='xz', encoding='utf-8', index=False)
self.suspicions.to_csv(output, **kwargs)
```
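To inspect the result, the file can be read straight back; pandas infers xz compression from the extension (the path below is hypothetical):

```python
import pandas as pd

suspicions = pd.read_csv('/tmp/serenata-data/suspicions.xz')
print(suspicions.head())
```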
Chamber of Deputies output columns
The suspicions file contains the unique identifier columns, followed by one boolean column per classifier:
| Column | Description |
|---|---|
| applicant_id | Unique ID of the person who filed the reimbursement |
| year | Quota year of the reimbursement |
| document_id | Document identifier |
| meal_price_outlier | True if the meal price was statistically abnormal |
| over_monthly_subquota_limit | True if the expense exceeded the subquota monthly cap |
| suspicious_traveled_speed_day | True if physically impossible travel was detected |
| invalid_cnpj_cpf | True if the supplier tax ID failed checksum validation |
| election_expenses | True if the supplier is a political candidate entity |
| irregular_companies_classifier | True if the supplier had an irregular registration status |
Federal Senate output columns
Federal Senate suspicions retain the full dataset (no UNIQUE_IDS are defined) and include only:
| Column | Description |
|---|---|
| invalid_cnpj_cpf | True if the supplier tax ID failed checksum validation |
Trained classifier models are cached as .pkl files alongside the datasets. The MealPriceOutlierClassifier and TraveledSpeedsClassifier models are reused on subsequent runs, avoiding the cost of refitting from scratch.
MonthlySubquotaLimitClassifier is never cached because the fitted object exceeds the size joblib can serialize reliably; it is retrained from the full dataset on every run.
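A hypothetical sketch of that load-or-fit pattern, assuming joblib for (de)serialization; the function and variable names here are illustrative, not Rosie's API:

```python
from pathlib import Path

import joblib

def load_or_fit(model, name, data, data_path):
    """Reuse a cached model when present; otherwise fit and cache it."""
    pickle_path = Path(data_path) / f'{name}.pkl'
    if pickle_path.exists():
        return joblib.load(pickle_path)
    model.fit(data)
    joblib.dump(model, pickle_path)
    return model
```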