All classifiers extend scikit-learn’s TransformerMixin and implement a standard three-method interface: fit, transform, and predict. The Core engine calls them in sequence, writing each classifier’s output as a boolean column to the suspicions DataFrame.
Convention
Classifiers use one of two prediction conventions. The Core engine normalizes both to a boolean in suspicions.xz:
| Classifier type | Raw prediction | Normalized to |
|---|---|---|
| Numeric (KMeans, polynomial) | -1 = suspicious, 1 = normal | True / False |
| Boolean (rule-based) | True = suspicious, False = normal | used as-is |
The Core.predict() method handles the conversion:
def predict(self, model, name):
    model.transform(self.dataset)
    prediction = model.predict(self.dataset)
    self.suspicions[name] = prediction
    if prediction.dtype == np.int:
        self.suspicions.loc[prediction == 1, name] = False
        self.suspicions.loc[prediction == -1, name] = True
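The normalization can be illustrated without pandas. The following is a minimal sketch in plain Python, assuming predictions arrive as a list; `normalize_prediction` is a hypothetical helper and not part of Rosie.

```python
def normalize_prediction(prediction):
    """Return a list of booleans where True means suspicious.

    Hypothetical helper: accepts either booleans (used as-is) or the
    numeric convention (-1 = suspicious, 1 = normal).
    """
    # bool is a subclass of int in Python, so check for bool first.
    if all(isinstance(value, bool) for value in prediction):
        return list(prediction)
    mapping = {-1: True, 1: False}
    return [mapping[value] for value in prediction]


numeric = normalize_prediction([1, -1, 1])     # numeric convention
boolean = normalize_prediction([False, True])  # boolean convention
```

Either way, the value stored for a row is True when the classifier considers it suspicious.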
Chamber of Deputies Classifiers
MealPriceOutlierClassifier
Detects meal reimbursements whose value is abnormally high compared to what other congresspeople paid at the same restaurant (identified by CNPJ).
Source: rosie/chamber_of_deputies/classifiers/meal_price_outlier_classifier.py
Settings key: meal_price_outlier
Required columns
| Column | Type | Description |
|---|---|---|
| applicant_id | string | Personal identifier for the person making expenses |
| category | category | Expense category; only "Meal" rows are evaluated |
| net_value | float | Value of the expense in BRL |
| recipient | string | Name of the supplier |
| recipient_id | string | CNPJ or CPF of the supplier |
Algorithm
The classifier uses KMeans clustering (k=3) on per-restaurant statistics:
self.cluster_model = KMeans(n_clusters=3)
self.cluster_model.fit(companies[self.CLUSTER_KEYS]) # CLUSTER_KEYS = ['mean', 'std']
- Only rows where category == "Meal" and recipient_id is a 14-digit CNPJ are considered. Restaurants with “hotel” in their name are excluded to avoid conflating accommodation costs with meals.
- For every restaurant with more than 3 congresspeople and more than 20 records, Rosie computes the mean and standard deviation of expenses.
- KMeans groups those restaurants into 3 clusters. Each cluster gets a cluster-level threshold of cluster_mean + 4 * cluster_std.
- For restaurants with enough data, an individual CNPJ-level threshold of company_mean + 3 * company_std overrides the cluster threshold.
- Any expense above its applicable threshold is marked suspicious (y = -1).
_X['y'] = 1
is_outlier = self.__applicable_rows(_X) & \
_X['threshold'].notnull() & \
(_X['net_value'] > _X['threshold'])
_X.loc[is_outlier, 'y'] = -1
A restaurant must have served more than 3 distinct congresspeople and have more than 20 recorded transactions before it is included in clustering. This prevents small-sample noise from distorting the model.
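The eligibility rule and the threshold formulas can be sketched in plain Python. This is an illustration under assumptions, not Rosie's code: the helper names are hypothetical, the population standard deviation is assumed, and the KMeans step is left out, so only a per-restaurant (CNPJ-level) threshold is computed here.

```python
from statistics import mean, pstdev


def restaurant_stats(expenses):
    """Summarize one restaurant from (applicant_id, net_value) pairs."""
    values = [value for _, value in expenses]
    return {
        "records": len(values),
        "congresspeople": len({applicant for applicant, _ in expenses}),
        "mean": mean(values),
        "std": pstdev(values),  # assumption: population standard deviation
    }


def is_eligible(stats):
    """More than 3 distinct congresspeople and more than 20 records."""
    return stats["congresspeople"] > 3 and stats["records"] > 20


def threshold(stats, multiplier):
    """mean + multiplier * std (4 for a cluster, 3 for a single CNPJ)."""
    return stats["mean"] + multiplier * stats["std"]


# 24 meals from 4 congresspeople, alternating between R$ 10 and R$ 30.
expenses = [(f"deputy-{i % 4}", 10.0 if i % 2 == 0 else 30.0) for i in range(24)]
stats = restaurant_stats(expenses)
company_threshold = threshold(stats, 3)
suspicious = [value for value in (45.0, 55.0) if value > company_threshold]
```

With a mean of 20 and a standard deviation of 10, the CNPJ-level threshold is 50, so only the R$ 55 meal would be flagged.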
TraveledSpeedsClassifier
Detects cases where a congressperson filed meal expenses on the same day at restaurants so far apart that visiting all of them would require physically impossible travel speeds.
Source: rosie/chamber_of_deputies/classifiers/traveled_speeds_classifier.py
Settings key: suspicious_traveled_speed_day
Required columns
| Column | Type | Description |
|---|---|---|
| applicant_id | category | Personal identifier for the expense applicant |
| category | category | Expense category; only "Meal" rows are evaluated |
| is_party_expense | bool | Whether the row is a party expense; party expenses are excluded |
| issue_date | datetime | Date when the expense was made |
| latitude | float | Latitude of the expense location |
| longitude | float | Longitude of the expense location |
Algorithm
The classifier uses a polynomial regression baseline and a contamination-based threshold:
def fit(self, X):
    _X = self.__aggregate_dataset(X)
    self.polynomial = np.polyfit(
        _X['expenses'].astype(np.long),
        _X['distance_traveled'].astype(np.long),
        3
    )
    self._polynomial_fn = np.poly1d(self.polynomial)
- Only meal expenses within Brazil’s geographic bounding box (latitude between −33.74 and 5.27, longitude between −74.00 and −34.79) are used.
- For each (applicant_id, issue_date) pair, the total pairwise geodesic distance between all expense locations is computed using the Vincenty formula via geopy.
- A degree-3 polynomial is fit to map the number of expenses in a day to the expected total distance traveled.
- During prediction, two independent outlier flags are raised:
  - expenses_threshold_outlier: more than 8 expenses in a single day (hardcoded limit)
  - traveled_speed_outlier: |expected_distance - actual_distance| exceeds a contamination-calibrated threshold (default contamination = 0.001, i.e. top 0.1%)
is_outlier = self.__applicable_rows(_X) & \
(_X['expenses_threshold_outlier'] | _X['traveled_speed_outlier'])
y = is_outlier.astype(np.int).replace({1: -1, 0: 1})
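To make the distance step concrete, here is a rough sketch of the bounding-box filter and the total pairwise distance for one congressperson-day. It substitutes the haversine great-circle formula for the Vincenty formula that the classifier uses via geopy, so results will differ slightly. The function names and the Earth-radius constant are assumptions of this sketch.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def in_brazil(latitude, longitude):
    """Bounding box quoted above for Brazilian territory."""
    return -33.74 <= latitude <= 5.27 and -74.00 <= longitude <= -34.79


def haversine_km(a, b):
    """Great-circle distance between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def total_pairwise_distance(points):
    """Sum of distances over every pair of expense locations in one day."""
    return sum(haversine_km(a, b) for a, b in combinations(points, 2))


# Three meals on the same day in three different cities (approximate coordinates).
day = [(-15.79, -47.88), (-23.55, -46.63), (-22.91, -43.17)]
day = [point for point in day if in_brazil(*point)]
distance_traveled = total_pairwise_distance(day)
```

A day like this one adds up to roughly two thousand kilometers across three meals, which is the kind of value the polynomial baseline is compared against.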
ElectionExpensesClassifier
Flags reimbursements paid to entities that are registered as political candidates in the Brazilian Federal Revenue database.
Source: rosie/chamber_of_deputies/classifiers/election_expenses_classifier.py
Settings key: election_expenses
Required columns
| Column | Type | Description |
|---|---|---|
| legal_entity | string | Brazilian Federal Revenue category of the company, prefixed with its code |
Algorithm
This is a rule-based classifier with a single equality check:
ELECTION_LEGAL_ENTITY = '409-0 - CANDIDATO A CARGO POLITICO ELETIVO'
def predict(self, dataframe):
    return dataframe['legal_entity'] == ELECTION_LEGAL_ENTITY
Any reimbursement whose supplier’s legal entity code is 409-0 (political candidate) is marked suspicious. Paying public money to a political campaign is a direct violation of the parliamentary quota rules.
IrregularCompaniesClassifier
Flags reimbursements paid to companies that had an irregular registration status with the Brazilian Federal Revenue at the time the expense was filed.
Source: rosie/chamber_of_deputies/classifiers/irregular_companies_classifier.py
Settings key: irregular_companies_classifier
Required columns
| Column | Type | Description |
|---|---|---|
| issue_date | datetime | Date when the expense was made |
| situation | string | Company registration status from the Brazilian Federal Revenue |
| situation_date | datetime | Date when the situation was last updated |
Algorithm
A rule-based classifier combining a date comparison with a list of irregular statuses:
def predict(self, X):
    statuses = ['BAIXADA', 'NULA', 'SUSPENSA', 'INAPTA']
    self._X = X.apply(self.__compare_date, axis=1)
    return np.r_[self._X & X['situation'].isin(statuses)]

def __compare_date(self, row):
    return (row['situation_date'] < row['issue_date'])
A company is considered irregular if:
- Its situation_date is before the issue_date of the expense (the status was already set before the purchase), and
- Its situation is one of: BAIXADA (deregistered), NULA (null), SUSPENSA (suspended), or INAPTA (unfit).
This combination prevents false positives from companies whose status changed after the expense date.
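The effect of combining both conditions can be seen in a small worked example. The sketch below uses plain Python instead of pandas; `is_irregular` is a hypothetical helper, and "ATIVA" is used only as an illustrative regular status.

```python
from datetime import date

IRREGULAR_STATUSES = {"BAIXADA", "NULA", "SUSPENSA", "INAPTA"}


def is_irregular(situation, situation_date, issue_date):
    """True when an irregular status was already in place at purchase time."""
    return situation in IRREGULAR_STATUSES and situation_date < issue_date


# Status set before the purchase: flagged.
before = is_irregular("BAIXADA", date(2015, 1, 10), date(2015, 3, 1))
# Status changed only after the purchase: not flagged.
after = is_irregular("BAIXADA", date(2015, 6, 1), date(2015, 3, 1))
# Regular company: never flagged, whatever the dates.
regular = is_irregular("ATIVA", date(2010, 1, 1), date(2015, 3, 1))
```

Only the first case is flagged, because it is the only one where the company was already irregular when the expense was made.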
MonthlySubquotaLimitClassifier
Detects expenses that push a congressperson over the official monthly spending cap for a given subquota category.
Source: rosie/chamber_of_deputies/classifiers/monthly_subquota_limit_classifier.py
Settings key: over_monthly_subquota_limit
Required columns
| Column | Type | Description |
|---|---|---|
| applicant_id | string | Personal identifier for the expense applicant |
| issue_date | datetime | Date when the expense was made |
| month | int | Quota month for the expense |
| net_value | float | Value of the expense in BRL |
| subquota_number | category | Numeric code for the expense subquota category |
| year | int | Quota year for the expense |
Algorithm
A rule-based cumulative sum approach using time-period limits defined directly in the code:
def predict(self, X=None):
    self._X['is_over_monthly_subquota_limit'] = False
    for metadata in self.limits:
        data, monthly_limit = metadata['data'], metadata['monthly_limit']
        if len(data):
            surplus_reimbursements = self.__find_surplus_reimbursements(data, monthly_limit)
            self._X.loc[surplus_reimbursements.index,
                        'is_over_monthly_subquota_limit'] = True
    results = self._X.loc[self.X.index, 'is_over_monthly_subquota_limit']
    return np.r_[results]
For each subquota category and date range, expenses are sorted chronologically and a running total (cumsum) is computed per (applicant_id, month, year). Any individual expense that causes the cumsum to exceed the monthly limit is flagged.
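The running-total idea can be sketched in plain Python. This is not the classifier's pandas implementation: the record layout and the `find_surplus` name are hypothetical, and amounts are kept as integer cents in this sketch to avoid floating-point rounding.

```python
from collections import defaultdict


def find_surplus(expenses, monthly_limit):
    """Return indexes of expenses whose running total exceeds the limit.

    expenses: list of dicts with applicant_id, year, month, issue_date
    (ISO date string) and net_value in cents. Hypothetical record layout.
    """
    running_totals = defaultdict(int)
    surplus = []
    # Process in chronological order so the running total is meaningful.
    ordered = sorted(enumerate(expenses), key=lambda item: item[1]["issue_date"])
    for index, expense in ordered:
        key = (expense["applicant_id"], expense["month"], expense["year"])
        running_totals[key] += expense["net_value"]
        if running_totals[key] > monthly_limit:
            surplus.append(index)
    return surplus


expenses = [
    {"applicant_id": "A", "year": 2016, "month": 5,
     "issue_date": "2016-05-20", "net_value": 150_000},
    {"applicant_id": "A", "year": 2016, "month": 5,
     "issue_date": "2016-05-03", "net_value": 200_000},
    {"applicant_id": "B", "year": 2016, "month": 5,
     "issue_date": "2016-05-10", "net_value": 100_000},
]
flagged = find_surplus(expenses, monthly_limit=270_000)
```

Applicant A's later expense takes the May total to R$ 3,500.00, above the R$ 2,700.00 limit used here, so only that row is flagged. In this sketch, any further expense by A in the same month would be flagged as well, since the running total stays above the limit.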
Tracked subquotas and their limits:
| Subquota | Period | Monthly limit (BRL) |
|---|---|---|
| 120 - Automotive vehicle renting | Dec 2013 – Mar 2015 | R$ 10,000.00 |
| 120 - Automotive vehicle renting | Apr 2015 – Apr 2017 | R$ 10,900.00 |
| 120 - Automotive vehicle renting | May 2017 onwards | R$ 12,713.00 |
| 122 - Taxi, toll and parking | Dec 2013 – Mar 2015 | R$ 2,500.00 |
| 122 - Taxi, toll and parking | Apr 2015 onwards | R$ 2,700.00 |
| 3 - Fuels and lubricants | Jul 2009 – Mar 2015 | R$ 4,500.00 |
| 3 - Fuels and lubricants | Apr 2015 – Aug 2015 | R$ 4,900.00 |
| 3 - Fuels and lubricants | Sep 2015 onwards | R$ 6,000.00 |
| 8 - Security service | Jul 2009 – Apr 2014 | R$ 4,500.00 |
| 8 - Security service | May 2014 – Mar 2015 | R$ 8,000.00 |
| 8 - Security service | Apr 2015 onwards | R$ 8,700.00 |
| 137 - Course/event participation | Oct 2015 onwards | R$ 7,697.16 |
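One way to picture time-period limits is as a list of date ranges that is searched by expense date. The sketch below does this for subquota 122 using the values from the table above. The tuple layout and the `monthly_limit` helper are hypothetical, and it assumes each period runs from the first day of its starting month to the last day of its ending month.

```python
from datetime import date

# (start, end, monthly limit in BRL) for subquota 122; end=None means "onwards".
# Hypothetical layout, not the structure used in Rosie's source.
TAXI_LIMITS = [
    (date(2013, 12, 1), date(2015, 3, 31), 2500.00),
    (date(2015, 4, 1), None, 2700.00),
]


def monthly_limit(limits, issue_date):
    """Return the limit in force on issue_date, or None if untracked."""
    for start, end, limit in limits:
        if start <= issue_date and (end is None or issue_date <= end):
            return limit
    return None


limit_2014 = monthly_limit(TAXI_LIMITS, date(2014, 6, 15))
limit_2016 = monthly_limit(TAXI_LIMITS, date(2016, 1, 20))
```

Dates before the first tracked period return None, which matches the idea that expenses outside the listed periods are not checked against a limit.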
MonthlySubquotaLimitClassifier is intentionally excluded from joblib model caching because its trained model is too large for joblib to handle reliably. It is retrained on every run.
Core Classifier
InvalidCnpjCpfClassifier
Validates that each reimbursement’s supplier tax ID (recipient_id) is a mathematically valid Brazilian CNPJ or CPF. An ID that fails checksum validation may indicate a fraudulent or fictitious supplier.
Source: rosie/core/classifiers/invalid_cnpj_cpf_classifier.py
Settings key: invalid_cnpj_cpf
Modules: Chamber of Deputies, Federal Senate
Required columns
| Column | Type | Description |
|---|---|---|
| document_type | category | Document type; only bill_of_sale, simple_receipt, and unknown are validated |
| recipient_id | string | CNPJ (14 digits) or CPF (11 digits) of the supplier |
Algorithm
A rule-based classifier using the brutils library for checksum validation:
def predict(self, dataframe):
    def is_invalid(row):
        valid_cpf = cpf.validate(str(row['recipient_id']).zfill(11))
        valid_cnpj = cnpj.validate(str(row['recipient_id']).zfill(14))
        good_doctype = row['document_type'] in ('bill_of_sale', 'simple_receipt', 'unknown')
        return good_doctype and (not (valid_cpf or valid_cnpj))
    return np.r_[dataframe.apply(is_invalid, axis=1)]
The recipient_id is zero-padded to the appropriate length before validation. A row is flagged if the document type warrants validation and the ID fails both CPF and CNPJ checksum tests.
The unknown document type is used exclusively for Federal Senate data, which does not include a document type column. All Senate rows are treated as requiring validation.
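For readers unfamiliar with the checksum itself, the sketch below implements the standard modulo-11 check digits for CPF and CNPJ in plain Python. It stands in for the brutils calls rather than reproducing them: the function names are hypothetical, and any extra rules brutils may apply beyond the checksum (for example, rejecting IDs made of a single repeated digit) are not modeled.

```python
def _check_digit(digits, weights):
    """Modulo-11 check digit shared by CPF and CNPJ."""
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def is_valid_cpf(value):
    if len(value) != 11 or not value.isdigit():
        return False
    digits = [int(char) for char in value]
    first = _check_digit(digits[:9], range(10, 1, -1))
    second = _check_digit(digits[:10], range(11, 1, -1))
    return digits[9:] == [first, second]


CNPJ_WEIGHTS = [6, 5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]


def is_valid_cnpj(value):
    if len(value) != 14 or not value.isdigit():
        return False
    digits = [int(char) for char in value]
    first = _check_digit(digits[:12], CNPJ_WEIGHTS[1:])
    second = _check_digit(digits[:13], CNPJ_WEIGHTS)
    return digits[12:] == [first, second]


def is_invalid(recipient_id, document_type):
    """Flag IDs that pass neither check, for document types that are validated."""
    raw = str(recipient_id)
    valid = is_valid_cpf(raw.zfill(11)) or is_valid_cnpj(raw.zfill(14))
    checked = document_type in ("bill_of_sale", "simple_receipt", "unknown")
    return checked and not valid


# A CPF stored as an integer loses its leading zero; zero-padding restores it.
padded_ok = is_invalid(1234567890, "simple_receipt")
# A 14-digit ID with a wrong final check digit fails both tests.
bad_cnpj = is_invalid("11222333000180", "bill_of_sale")
```

The first call is not flagged because `01234567890` is a checksum-valid CPF once padded, while the second is flagged because neither check passes.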
Classifier Summary
| Classifier | Module | Algorithm | Key column(s) |
|---|---|---|---|
| MealPriceOutlierClassifier | Chamber | KMeans (k=3) | net_value, recipient_id, category |
| TraveledSpeedsClassifier | Chamber | Polynomial regression + Vincenty distance | latitude, longitude, issue_date |
| ElectionExpensesClassifier | Chamber | Rule-based equality | legal_entity |
| IrregularCompaniesClassifier | Chamber | Rule-based date + status | situation, situation_date, issue_date |
| MonthlySubquotaLimitClassifier | Chamber | Cumulative sum against time-period limits | net_value, subquota_number, month, year |
| InvalidCnpjCpfClassifier | Core (both) | Checksum validation via brutils | recipient_id, document_type |