Classifiers

All classifiers extend scikit-learn’s TransformerMixin and implement a standard three-method interface: fit, transform, and predict. The Core engine calls them in sequence, writing each classifier’s output as a boolean column to the suspicions DataFrame.

Convention

Classifiers use one of two prediction conventions. The Core engine normalizes both to a boolean in suspicions.xz:

Classifier type	Raw prediction	Normalized to
Numeric (KMeans, polynomial)	`-1` = suspicious, `1` = normal	`True` / `False`
Boolean (rule-based)	`True` = suspicious, `False` = normal	used as-is

The Core.predict() method handles the conversion:

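def predict(self, model, name):
    model.transform(self.dataset)
    prediction = model.predict(self.dataset)
    self.suspicions[name] = prediction
    # Numeric predictions (1 / -1) are mapped to booleans below;
    # boolean predictions are stored unchanged.
    # Note: the np.int alias was removed in NumPy 1.24, so this check
    # only runs on older NumPy releases.
    if prediction.dtype == np.int:
        self.suspicions.loc[prediction == 1, name] = False
        self.suspicions.loc[prediction == -1, name] = True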
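The normalization rule in the table above can be illustrated without pandas. The following is a minimal sketch, not Rosie's code: the `normalize` helper is a hypothetical name, and it works on plain Python lists rather than on the suspicions DataFrame.

```python
def normalize(prediction):
    """Map raw classifier output to booleans, where True means suspicious."""
    normalized = []
    for value in prediction:
        # bool must be checked first, because True == 1 in Python.
        if isinstance(value, bool):
            normalized.append(value)      # boolean convention: used as-is
        elif value == -1:
            normalized.append(True)       # numeric convention: -1 = suspicious
        elif value == 1:
            normalized.append(False)      # numeric convention: 1 = normal
        else:
            raise ValueError(f"unexpected prediction value: {value!r}")
    return normalized


numeric = normalize([1, -1, 1])       # e.g. output of a KMeans-based classifier
boolean = normalize([False, True])    # e.g. output of a rule-based classifier
```

Both conventions end up as the same kind of boolean column, which is what lets the engine store every classifier's result side by side.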

Chamber of Deputies Classifiers

MealPriceOutlierClassifier

Detects meal reimbursements whose value is abnormally high compared to what other congresspeople paid at the same restaurant (identified by CNPJ).

Source: rosie/chamber_of_deputies/classifiers/meal_price_outlier_classifier.py
Settings key: meal_price_outlier

Required columns

Column	Type	Description
`applicant_id`	string	Personal identifier for the person making expenses
`category`	category	Expense category — only `"Meal"` rows are evaluated
`net_value`	float	Value of the expense in BRL
`recipient`	string	Name of the supplier
`recipient_id`	string	CNPJ or CPF of the supplier

Algorithm

The classifier uses KMeans clustering (k=3) on per-restaurant statistics:

self.cluster_model = KMeans(n_clusters=3)
self.cluster_model.fit(companies[self.CLUSTER_KEYS])  # CLUSTER_KEYS = ['mean', 'std']

1. Only rows where category == "Meal" and recipient_id is a 14-digit CNPJ are considered. Restaurants with “hotel” in their name are excluded to avoid conflating accommodation costs with meals.
2. For every restaurant with more than 3 congresspeople and more than 20 records, Rosie computes the mean and standard deviation of expenses.
3. KMeans groups those restaurants into 3 clusters. Each cluster gets a cluster-level threshold of cluster_mean + 4 * cluster_std.
4. For restaurants with enough data, an individual CNPJ-level threshold of company_mean + 3 * company_std overrides the cluster threshold.
5. Any expense above its applicable threshold is marked suspicious (y = -1).

_X['y'] = 1
is_outlier = self.__applicable_rows(_X) & \
    _X['threshold'].notnull() & \
    (_X['net_value'] > _X['threshold'])
_X.loc[is_outlier, 'y'] = -1

A restaurant must have served more than 3 distinct congresspeople and have more than 20 recorded transactions before it is included in clustering. This prevents small-sample noise from distorting the model.
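The way the two thresholds interact can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, the means and standard deviations are passed in as precomputed numbers (how Rosie aggregates them is not shown on this page), and a missing company-level statistic stands in for "not enough data".

```python
CLUSTER_STDS = 4   # multiplier for the cluster-level threshold described above
COMPANY_STDS = 3   # multiplier for the CNPJ-level threshold described above


def applicable_threshold(cluster_mean, cluster_std,
                         company_mean=None, company_std=None):
    """Return the CNPJ-level threshold when available, else the cluster one."""
    if company_mean is not None and company_std is not None:
        return company_mean + COMPANY_STDS * company_std
    return cluster_mean + CLUSTER_STDS * cluster_std


def classify(net_value, threshold):
    """Follow the numeric convention: -1 = suspicious, 1 = normal."""
    return -1 if net_value > threshold else 1


cluster_only = applicable_threshold(cluster_mean=50.0, cluster_std=10.0)
with_company = applicable_threshold(50.0, 10.0, company_mean=30.0, company_std=5.0)

meal = 60.0
label_cluster = classify(meal, cluster_only)   # compared against 90.0
label_company = classify(meal, with_company)   # compared against 45.0
```

In this made-up example the same R$ 60.00 meal is normal under the cluster threshold but suspicious once the restaurant's own, tighter threshold applies.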

TraveledSpeedsClassifier

Detects cases where a congressperson filed meal expenses at multiple restaurants on the same day in locations that would require physically impossible travel speeds to visit.

Source: rosie/chamber_of_deputies/classifiers/traveled_speeds_classifier.py
Settings key: suspicious_traveled_speed_day

Required columns

Column	Type	Description
`applicant_id`	category	Personal identifier for the expense applicant
`category`	category	Expense category — only `"Meal"` rows are evaluated
`is_party_expense`	bool	Whether the row is a party expense — party expenses are excluded
`issue_date`	datetime	Date when the expense was made
`latitude`	float	Latitude of the expense location
`longitude`	float	Longitude of the expense location

Algorithm

The classifier uses a polynomial regression baseline and a contamination-based threshold:

def fit(self, X):
    _X = self.__aggregate_dataset(X)
    self.polynomial = np.polyfit(
        _X['expenses'].astype(np.long),
        _X['distance_traveled'].astype(np.long),
        3
    )
    self._polynomial_fn = np.poly1d(self.polynomial)

1. Only meal expenses within Brazil’s geographic bounding box (latitude between −33.74 and 5.27, longitude between −74.00 and −34.79) are used.
2. For each (applicant_id, issue_date) pair, the total pairwise geodesic distance between all expense locations is computed using the Vincenty formula via geopy.
3. A degree-3 polynomial is fit to map the number of expenses in a day to the expected total distance traveled.
4. During prediction, two independent outlier flags are raised:
   - expenses_threshold_outlier: more than 8 expenses in a single day (hardcoded limit)
   - traveled_speed_outlier: |expected_distance - actual_distance| exceeds a contamination-calibrated threshold (default contamination = 0.001, i.e. top 0.1%)

is_outlier = self.__applicable_rows(_X) & \
    (_X['expenses_threshold_outlier'] | _X['traveled_speed_outlier'])
y = is_outlier.astype(np.int).replace({1: -1, 0: 1})
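The per-day distance aggregation and the two flags can be sketched with the standard library alone. Note the substitution: this sketch uses the haversine (spherical) formula instead of the Vincenty formula from geopy, so its distances differ slightly from Rosie's. The function names are hypothetical, and the expected distance and the distance threshold are passed in as plain numbers rather than derived from the fitted polynomial and the contamination setting.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0   # mean Earth radius; haversine assumes a sphere
MAX_EXPENSES_PER_DAY = 8   # hardcoded limit described above


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def total_pairwise_distance_km(locations):
    """Sum the distance over every unordered pair of expense locations."""
    return sum(haversine_km(a, b) for a, b in combinations(locations, 2))


def is_outlier(n_expenses, actual_km, expected_km, distance_threshold_km):
    """Combine the two flags with a logical OR, as in the excerpt above."""
    expenses_threshold_outlier = n_expenses > MAX_EXPENSES_PER_DAY
    traveled_speed_outlier = abs(expected_km - actual_km) > distance_threshold_km
    return expenses_threshold_outlier or traveled_speed_outlier


BRASILIA = (-15.7939, -47.8828)
SAO_PAULO = (-23.5505, -46.6333)

# Three meals on one day: two in Brasília, one in São Paulo.
day_locations = [BRASILIA, SAO_PAULO, BRASILIA]
distance = total_pairwise_distance_km(day_locations)
```

Because every pair is counted, a day with several far-apart locations accumulates distance quickly, which is what makes the deviation from the expected distance a useful signal.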

ElectionExpensesClassifier

Flags reimbursements paid to entities that are registered as political candidates in the Brazilian Federal Revenue database.

Source: rosie/chamber_of_deputies/classifiers/election_expenses_classifier.py
Settings key: election_expenses

Required columns

Column	Type	Description
`legal_entity`	string	Brazilian Federal Revenue category of the company, prefixed with its code

Algorithm

This is a rule-based classifier with a single equality check:

ELECTION_LEGAL_ENTITY = '409-0 - CANDIDATO A CARGO POLITICO ELETIVO'

def predict(self, dataframe):
    return dataframe['legal_entity'] == ELECTION_LEGAL_ENTITY

Any reimbursement whose supplier’s legal entity code is 409-0 (political candidate) is marked suspicious. Paying public money to a political campaign is a direct violation of the parliamentary quota rules.

IrregularCompaniesClassifier

Flags reimbursements paid to companies that had an irregular registration status with the Brazilian Federal Revenue at the time the expense was filed.

Source: rosie/chamber_of_deputies/classifiers/irregular_companies_classifier.py
Settings key: irregular_companies_classifier

Required columns

Column	Type	Description
`issue_date`	datetime	Date when the expense was made
`situation`	string	Company registration status from the Brazilian Federal Revenue
`situation_date`	datetime	Date when the situation was last updated

Algorithm

A rule-based classifier combining a date comparison with a list of irregular statuses:

def predict(self, X):
    statuses = ['BAIXADA', 'NULA', 'SUSPENSA', 'INAPTA']
    self._X = X.apply(self.__compare_date, axis=1)
    return np.r_[self._X & X['situation'].isin(statuses)]

def __compare_date(self, row):
    return (row['situation_date'] < row['issue_date'])

A company is considered irregular if:

- Its situation_date is before the issue_date of the expense (the status was already set before the purchase), and
- Its situation is one of: BAIXADA (deregistered), NULA (null), SUSPENSA (suspended), or INAPTA (unfit).

This combination prevents false positives from companies whose status changed after the expense date.
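The two conditions can be tried on individual rows with a small standalone function. This is a sketch rather than the classifier itself: `is_irregular` is a hypothetical helper that takes plain values instead of a DataFrame row, and the dates are made up.

```python
from datetime import date

IRREGULAR_STATUSES = {'BAIXADA', 'NULA', 'SUSPENSA', 'INAPTA'}


def is_irregular(situation, situation_date, issue_date):
    """True when the status was already irregular before the expense date."""
    status_set_before_expense = situation_date < issue_date
    return status_set_before_expense and situation in IRREGULAR_STATUSES


expense_date = date(2016, 5, 10)

# Deregistered a year before the expense: flagged.
closed_before = is_irregular('BAIXADA', date(2015, 5, 10), expense_date)
# Deregistered only after the expense: not flagged.
closed_after = is_irregular('BAIXADA', date(2016, 8, 1), expense_date)
# Regular status: never flagged, whatever the dates are.
active = is_irregular('ATIVA', date(2010, 1, 1), expense_date)
```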

MonthlySubquotaLimitClassifier

Detects expenses that push a congressperson over the official monthly spending cap for a given subquota category.

Source: rosie/chamber_of_deputies/classifiers/monthly_subquota_limit_classifier.py
Settings key: over_monthly_subquota_limit

Required columns

Column	Type	Description
`applicant_id`	string	Personal identifier for the expense applicant
`issue_date`	datetime	Date when the expense was made
`month`	int	Quota month for the expense
`net_value`	float	Value of the expense in BRL
`subquota_number`	category	Numeric code for the expense subquota category
`year`	int	Quota year for the expense

Algorithm

A rule-based cumulative sum approach using time-period limits defined directly in the code:

def predict(self, X=None):
    self._X['is_over_monthly_subquota_limit'] = False
    for metadata in self.limits:
        data, monthly_limit = metadata['data'], metadata['monthly_limit']
        if len(data):
            surplus_reimbursements = self.__find_surplus_reimbursements(data, monthly_limit)
            self._X.loc[surplus_reimbursements.index,
                        'is_over_monthly_subquota_limit'] = True
    results = self._X.loc[self.X.index, 'is_over_monthly_subquota_limit']
    return np.r_[results]

For each subquota category and date range, expenses are sorted chronologically and a running total (cumsum) is computed per (applicant_id, month, year). Any individual expense that causes the cumsum to exceed the monthly limit is flagged. Tracked subquotas and their limits:

Subquota	Period	Monthly limit (BRL)
`120` — Automotive vehicle renting	Dec 2013 – Mar 2015	R$ 10,000.00
`120` — Automotive vehicle renting	Apr 2015 – Apr 2017	R$ 10,900.00
`120` — Automotive vehicle renting	May 2017 onwards	R$ 12,713.00
`122` — Taxi, toll and parking	Dec 2013 – Mar 2015	R$ 2,500.00
`122` — Taxi, toll and parking	Apr 2015 onwards	R$ 2,700.00
`3` — Fuels and lubricants	Jul 2009 – Mar 2015	R$ 4,500.00
`3` — Fuels and lubricants	Apr 2015 – Aug 2015	R$ 4,900.00
`3` — Fuels and lubricants	Sep 2015 onwards	R$ 6,000.00
`8` — Security service	Jul 2009 – Apr 2014	R$ 4,500.00
`8` — Security service	May 2014 – Mar 2015	R$ 8,000.00
`8` — Security service	Apr 2015 onwards	R$ 8,700.00
`137` — Course/event participation	Oct 2015 onwards	R$ 7,697.16
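The running-total rule described above can be sketched without pandas. This is a hedged illustration: `find_surplus` is a hypothetical stand-in for the private `__find_surplus_reimbursements` helper, whose body is not shown on this page, and it assumes that every expense whose running total is above the limit is flagged, not only the first one to cross it. The expenses are made up; the limit is the taxi, toll and parking value from the table.

```python
from collections import defaultdict
from datetime import date


def find_surplus(expenses, monthly_limit):
    """Return the positions of expenses whose running total exceeds the limit."""
    running = defaultdict(float)
    flagged = []
    # Walk the expenses in chronological order, remembering original positions.
    ordered = sorted(enumerate(expenses), key=lambda item: item[1]['issue_date'])
    for position, expense in ordered:
        key = (expense['applicant_id'], expense['month'], expense['year'])
        running[key] += expense['net_value']
        if running[key] > monthly_limit:
            flagged.append(position)
    return sorted(flagged)


TAXI_LIMIT = 2700.00  # subquota 122, Apr 2015 onwards

expenses = [
    {'applicant_id': '1', 'month': 5, 'year': 2016,
     'issue_date': date(2016, 5, 20), 'net_value': 500.00},
    {'applicant_id': '1', 'month': 5, 'year': 2016,
     'issue_date': date(2016, 5, 2), 'net_value': 1500.00},
    {'applicant_id': '1', 'month': 5, 'year': 2016,
     'issue_date': date(2016, 5, 10), 'net_value': 1000.00},
    {'applicant_id': '2', 'month': 5, 'year': 2016,
     'issue_date': date(2016, 5, 3), 'net_value': 2000.00},
]

surplus = find_surplus(expenses, TAXI_LIMIT)
```

Applicant 1's running total goes 1,500.00, then 2,500.00, then 3,000.00, so only the expense dated 20 May (the first item in the list) crosses the limit. Applicant 2 is tracked separately and stays under it.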

MonthlySubquotaLimitClassifier is intentionally excluded from joblib model caching because the trained model is too large for joblib to handle reliably. It is retrained on every run.

Core Classifier

InvalidCnpjCpfClassifier

Validates that each reimbursement’s supplier tax ID (recipient_id) is a mathematically valid Brazilian CNPJ or CPF. An incorrectly formatted ID may indicate a fraudulent or fictitious supplier.

Source: rosie/core/classifiers/invalid_cnpj_cpf_classifier.py
Settings key: invalid_cnpj_cpf
Modules: Chamber of Deputies, Federal Senate

Required columns

Column	Type	Description
`document_type`	category	Document type — only `bill_of_sale`, `simple_receipt`, and `unknown` are validated
`recipient_id`	string	CNPJ (14 digits) or CPF (11 digits) of the supplier

Algorithm

A rule-based classifier using the brutils library for checksum validation:

def predict(self, dataframe):
    def is_invalid(row):
        valid_cpf = cpf.validate(str(row['recipient_id']).zfill(11))
        valid_cnpj = cnpj.validate(str(row['recipient_id']).zfill(14))
        good_doctype = row['document_type'] in ('bill_of_sale', 'simple_receipt', 'unknown')
        return good_doctype and (not (valid_cpf or valid_cnpj))
    return np.r_[dataframe.apply(is_invalid, axis=1)]

The recipient_id is zero-padded to the appropriate length before validation. A row is flagged if the document type warrants validation and the ID fails both CPF and CNPJ checksum tests.
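For readers unfamiliar with the checksum tests, the sketch below reimplements the standard CPF and CNPJ check-digit rules (weighted sum modulo 11) in plain Python. It is an illustration only and does not call brutils: the helper names are hypothetical, and brutils may apply extra rules that this sketch omits.

```python
VALIDATED_DOCUMENT_TYPES = ('bill_of_sale', 'simple_receipt', 'unknown')

CPF_WEIGHTS_1 = list(range(10, 1, -1))                  # 10, 9, ..., 2
CPF_WEIGHTS_2 = list(range(11, 1, -1))                  # 11, 10, ..., 2
CNPJ_WEIGHTS_1 = [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]
CNPJ_WEIGHTS_2 = [6] + CNPJ_WEIGHTS_1


def check_digit(digits, weights):
    """Weighted sum modulo 11; remainders below 2 map to a check digit of 0."""
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def has_valid_check_digits(number, length, weights_1, weights_2):
    """Verify the last two digits of a CPF (11 digits) or CNPJ (14 digits)."""
    if len(number) != length or not (number.isascii() and number.isdigit()):
        return False
    digits = [int(char) for char in number]
    first = check_digit(digits[:length - 2], weights_1)
    second = check_digit(digits[:length - 1], weights_2)
    return digits[-2:] == [first, second]


def is_invalid(recipient_id, document_type):
    """Mirror the rule above: flag only when both checksum tests fail."""
    as_cpf = str(recipient_id).zfill(11)
    as_cnpj = str(recipient_id).zfill(14)
    valid_cpf = has_valid_check_digits(as_cpf, 11, CPF_WEIGHTS_1, CPF_WEIGHTS_2)
    valid_cnpj = has_valid_check_digits(as_cnpj, 14, CNPJ_WEIGHTS_1, CNPJ_WEIGHTS_2)
    return document_type in VALIDATED_DOCUMENT_TYPES and not (valid_cpf or valid_cnpj)


good_cpf = is_invalid('52998224725', 'bill_of_sale')       # well-formed CPF
good_cnpj = is_invalid('11222333000181', 'simple_receipt')  # well-formed CNPJ
bad_cnpj = is_invalid('11222333000182', 'unknown')          # last digit altered
skipped = is_invalid('11222333000182', 'other_type')        # type not validated
```

The two sample IDs are commonly used test values, not real suppliers. Changing a single digit breaks the checksum, which is why the altered CNPJ is flagged while the same ID under a non-validated document type is not.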

The unknown document type is used exclusively for Federal Senate data, which does not include a document type column. All Senate rows are treated as requiring validation.

Classifier Summary

Classifier	Module	Algorithm	Key column(s)
`MealPriceOutlierClassifier`	Chamber	KMeans (k=3)	`net_value`, `recipient_id`, `category`
`TraveledSpeedsClassifier`	Chamber	Polynomial regression + Vincenty distance	`latitude`, `longitude`, `issue_date`
`ElectionExpensesClassifier`	Chamber	Rule-based equality	`legal_entity`
`IrregularCompaniesClassifier`	Chamber	Rule-based date + status	`situation`, `situation_date`, `issue_date`
`MonthlySubquotaLimitClassifier`	Chamber	Cumulative sum against time-period limits	`net_value`, `subquota_number`, `month`, `year`
`InvalidCnpjCpfClassifier`	Core (both)	Checksum validation via brutils	`recipient_id`, `document_type`
