All classifiers extend scikit-learn’s TransformerMixin and implement a standard three-method interface: fit, transform, and predict. The Core engine calls them in sequence, writing each classifier’s output as a boolean column to the suspicions DataFrame.

Convention

Classifiers use one of two prediction conventions. The Core engine normalizes both to a boolean in suspicions.xz:
| Classifier type | Raw prediction | Normalized to |
| --- | --- | --- |
| Numeric (KMeans, polynomial) | -1 = suspicious, 1 = normal | True / False |
| Boolean (rule-based) | True = suspicious, False = normal | used as-is |
The Core.predict() method handles the conversion:
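import numpy as np

def predict(self, model, name):
    model.transform(self.dataset)
    prediction = model.predict(self.dataset)
    self.suspicions[name] = prediction
    # np.int was removed in NumPy 1.24; np.issubdtype matches any integer dtype.
    if np.issubdtype(prediction.dtype, np.integer):
        self.suspicions.loc[prediction == 1, name] = False   # 1 = normal
        self.suspicions.loc[prediction == -1, name] = True   # -1 = suspicious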
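The normalization rule can be illustrated without pandas. The sketch below is a simplified stand-in rather than Rosie's code: `normalize` is a hypothetical helper, and plain lists take the place of the DataFrame column.

```python
def normalize(prediction):
    """Map raw classifier output to booleans, where True means suspicious.

    Hypothetical helper for illustration only.
    """
    normalized = []
    for value in prediction:
        if isinstance(value, bool):
            # Rule-based classifiers already return True/False.
            # This check comes first because True == 1 in Python.
            normalized.append(value)
        elif value == -1:
            normalized.append(True)    # numeric convention: -1 = suspicious
        elif value == 1:
            normalized.append(False)   # numeric convention: 1 = normal
        else:
            raise ValueError(f"unexpected prediction value: {value!r}")
    return normalized


numeric_output = [1, -1, 1]             # e.g. from a KMeans-based classifier
boolean_output = [False, True, False]   # e.g. from a rule-based classifier

# Both conventions end up as the same boolean column.
assert normalize(numeric_output) == normalize(boolean_output)
```

Checking for `bool` before comparing against `1` matters here, because Python treats `True == 1` as true and a boolean input would otherwise be inverted.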

Chamber of Deputies Classifiers

MealPriceOutlierClassifier

Detects meal reimbursements whose value is abnormally high compared to what other congresspeople paid at the same restaurant (identified by CNPJ).
Source: rosie/chamber_of_deputies/classifiers/meal_price_outlier_classifier.py
Settings key: meal_price_outlier

Required columns

| Column | Type | Description |
| --- | --- | --- |
| applicant_id | string | Personal identifier for the person making expenses |
| category | category | Expense category; only "Meal" rows are evaluated |
| net_value | float | Value of the expense in BRL |
| recipient | string | Name of the supplier |
| recipient_id | string | CNPJ or CPF of the supplier |

Algorithm

The classifier uses KMeans clustering (k=3) on per-restaurant statistics:
self.cluster_model = KMeans(n_clusters=3)
self.cluster_model.fit(companies[self.CLUSTER_KEYS])  # CLUSTER_KEYS = ['mean', 'std']
  1. Only rows where category == "Meal" and recipient_id is a 14-digit CNPJ are considered. Restaurants with “hotel” in their name are excluded to avoid conflating accommodation costs with meals.
  2. For every restaurant with more than 3 congresspeople and more than 20 records, Rosie computes the mean and standard deviation of expenses.
  3. KMeans groups those restaurants into 3 clusters. Each cluster gets a cluster-level threshold of cluster_mean + 4 * cluster_std.
  4. For restaurants with enough data, an individual CNPJ-level threshold of company_mean + 3 * company_std overrides the cluster threshold.
  5. Any expense above its applicable threshold is marked suspicious (y = -1).
_X['y'] = 1
is_outlier = self.__applicable_rows(_X) & \
    _X['threshold'].notnull() & \
    (_X['net_value'] > _X['threshold'])
_X.loc[is_outlier, 'y'] = -1
A restaurant must have served more than 3 distinct congresspeople and have more than 20 recorded transactions before it is included in clustering. This prevents small-sample noise from distorting the model.
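The eligibility rule and the company-level threshold from steps 2 and 4 can be sketched in plain Python. This is an illustration under stated assumptions, not the classifier's implementation: `company_thresholds` and the sample data are hypothetical, the KMeans step is left out, and the sample standard deviation (`statistics.stdev`) is assumed to be the intended measure.

```python
from collections import defaultdict
from statistics import mean, stdev


def company_thresholds(expenses, min_applicants=3, min_records=20):
    """Return {recipient_id: mean + 3 * std} for restaurants with enough data.

    `expenses` is an iterable of (recipient_id, applicant_id, net_value) tuples.
    """
    values = defaultdict(list)
    applicants = defaultdict(set)
    for recipient_id, applicant_id, net_value in expenses:
        values[recipient_id].append(net_value)
        applicants[recipient_id].add(applicant_id)

    thresholds = {}
    for recipient_id, nets in values.items():
        # Both conditions are strict: "more than 3" and "more than 20".
        enough_people = len(applicants[recipient_id]) > min_applicants
        enough_records = len(nets) > min_records
        if enough_people and enough_records:
            thresholds[recipient_id] = mean(nets) + 3 * stdev(nets)
    return thresholds


# Made-up data: 24 meals at one restaurant, paid by 4 different congresspeople.
expenses = [("00000000000191", f"deputy-{i % 4}", 40.0 + (i % 5)) for i in range(24)]
# A restaurant with a single record is skipped.
expenses.append(("11111111111111", "deputy-0", 500.0))

thresholds = company_thresholds(expenses)
threshold = thresholds["00000000000191"]
print(round(threshold, 2))
print(900.0 > threshold)  # a R$ 900 meal at this restaurant would be flagged
```

In the classifier itself, restaurants that fail this filter are not dropped from prediction. According to the steps above, they fall back to the threshold of their cluster.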

TraveledSpeedsClassifier

Detects cases where a congressperson filed meal expenses on the same day at restaurants so far apart that visiting them all would require physically impossible travel speeds.
Source: rosie/chamber_of_deputies/classifiers/traveled_speeds_classifier.py
Settings key: suspicious_traveled_speed_day

Required columns

| Column | Type | Description |
| --- | --- | --- |
| applicant_id | category | Personal identifier for the expense applicant |
| category | category | Expense category; only "Meal" rows are evaluated |
| is_party_expense | bool | Whether the row is a party expense; party expenses are excluded |
| issue_date | datetime | Date when the expense was made |
| latitude | float | Latitude of the expense location |
| longitude | float | Longitude of the expense location |

Algorithm

The classifier uses a polynomial regression baseline and a contamination-based threshold:
def fit(self, X):
    _X = self.__aggregate_dataset(X)
    self.polynomial = np.polyfit(
        _X['expenses'].astype(np.int64),  # np.long is unavailable in some NumPy releases
        _X['distance_traveled'].astype(np.int64),
        3
    )
    self._polynomial_fn = np.poly1d(self.polynomial)
  1. Only meal expenses within Brazil’s geographic bounding box (latitude between −33.74 and 5.27, longitude between −74.00 and −34.79) are used.
  2. For each (applicant_id, issue_date) pair, the total pairwise geodesic distance between all expense locations is computed using the Vincenty formula via geopy.
  3. A degree-3 polynomial is fit to map the number of expenses in a day to the expected total distance traveled.
  4. During prediction, two independent outlier flags are raised:
    • expenses_threshold_outlier: more than 8 expenses in a single day (hardcoded limit)
    • traveled_speed_outlier: |expected_distance - actual_distance| exceeds a contamination-calibrated threshold (default contamination = 0.001, i.e. top 0.1%)
is_outlier = self.__applicable_rows(_X) & \
    (_X['expenses_threshold_outlier'] | _X['traveled_speed_outlier'])
# np.int was removed in NumPy 1.24; the built-in int gives the same result.
y = is_outlier.astype(int).replace({1: -1, 0: 1})
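The geographic filter, the daily pairwise distance, and the contamination threshold from steps 1, 2, and 4 can be sketched with the standard library alone. Several things here are assumptions made for illustration: the haversine formula stands in for the Vincenty formula (it treats the Earth as a sphere, so results differ slightly), the threshold is taken to be the (1 - contamination) percentile of the deviations, and all function names are hypothetical.

```python
import math
from itertools import combinations

# Bounding box quoted in step 1 (degrees).
LAT_RANGE = (-33.74, 5.27)
LON_RANGE = (-74.00, -34.79)
EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def in_brazil(lat, lon):
    """True when the point lies inside the bounding box."""
    return LAT_RANGE[0] <= lat <= LAT_RANGE[1] and LON_RANGE[0] <= lon <= LON_RANGE[1]


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def total_pairwise_distance(points):
    """Sum of distances over every pair of expense locations in one day."""
    return sum(haversine_km(a, b) for a, b in combinations(points, 2))


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def speed_outliers(deviations, contamination=0.001):
    """Flag deviations above the (1 - contamination) percentile."""
    threshold = percentile(deviations, 100 * (1 - contamination))
    return [d > threshold for d in deviations]


brasilia = (-15.7939, -47.8828)
sao_paulo = (-23.5505, -46.6333)
day = [p for p in (brasilia, sao_paulo) if in_brazil(*p)]
print(round(total_pairwise_distance(day)))  # roughly 870 km between the two cities
```

With 1,000 evenly spread deviations and the default contamination of 0.001, `speed_outliers` flags only the single largest value, which matches the "top 0.1%" reading.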

ElectionExpensesClassifier

Flags reimbursements paid to entities that are registered as political candidates in the Brazilian Federal Revenue database.
Source: rosie/chamber_of_deputies/classifiers/election_expenses_classifier.py
Settings key: election_expenses

Required columns

| Column | Type | Description |
| --- | --- | --- |
| legal_entity | string | Brazilian Federal Revenue category of the company, prefixed with its code |

Algorithm

This is a rule-based classifier with a single equality check:
ELECTION_LEGAL_ENTITY = '409-0 - CANDIDATO A CARGO POLITICO ELETIVO'

def predict(self, dataframe):
    return dataframe['legal_entity'] == ELECTION_LEGAL_ENTITY
Any reimbursement whose supplier’s legal entity code is 409-0 (political candidate) is marked suspicious. Paying public money to a political campaign is a direct violation of the parliamentary quota rules.

IrregularCompaniesClassifier

Flags reimbursements paid to companies that had an irregular registration status with the Brazilian Federal Revenue at the time the expense was filed.
Source: rosie/chamber_of_deputies/classifiers/irregular_companies_classifier.py
Settings key: irregular_companies_classifier

Required columns

| Column | Type | Description |
| --- | --- | --- |
| issue_date | datetime | Date when the expense was made |
| situation | string | Company registration status from the Brazilian Federal Revenue |
| situation_date | datetime | Date when the situation was last updated |

Algorithm

A rule-based classifier combining a date comparison with a fixed list of irregular statuses:
def predict(self, X):
    statuses = ['BAIXADA', 'NULA', 'SUSPENSA', 'INAPTA']
    self._X = X.apply(self.__compare_date, axis=1)
    return np.r_[self._X & X['situation'].isin(statuses)]

def __compare_date(self, row):
    return (row['situation_date'] < row['issue_date'])
A company is considered irregular if:
  1. Its situation_date is before the issue_date of the expense (the status was already set before the purchase), and
  2. Its situation is one of: BAIXADA (deregistered), NULA (null), SUSPENSA (suspended), or INAPTA (unfit).
This combination prevents false positives from companies whose status changed after the expense date.
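A small worked example shows how the two conditions interact. It is a plain-Python sketch with made-up rows, and `is_irregular` is a hypothetical name. ATIVA (active) is used as an example of a status outside the irregular list.

```python
from datetime import date

IRREGULAR_STATUSES = {"BAIXADA", "NULA", "SUSPENSA", "INAPTA"}


def is_irregular(expense):
    """True when the supplier was already irregular before the expense date."""
    status_predates_expense = expense["situation_date"] < expense["issue_date"]
    return status_predates_expense and expense["situation"] in IRREGULAR_STATUSES


expenses = [
    # Deregistered in 2014, expense in 2015: flagged.
    {"situation": "BAIXADA", "situation_date": date(2014, 3, 1), "issue_date": date(2015, 6, 10)},
    # Deregistered after the expense was made: not flagged.
    {"situation": "BAIXADA", "situation_date": date(2016, 1, 5), "issue_date": date(2015, 6, 10)},
    # Active company: not flagged, whatever the dates.
    {"situation": "ATIVA", "situation_date": date(2010, 1, 1), "issue_date": date(2015, 6, 10)},
]

flags = [is_irregular(e) for e in expenses]
assert flags == [True, False, False]
```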

MonthlySubquotaLimitClassifier

Detects expenses that push a congressperson over the official monthly spending cap for a given subquota category.
Source: rosie/chamber_of_deputies/classifiers/monthly_subquota_limit_classifier.py
Settings key: over_monthly_subquota_limit

Required columns

| Column | Type | Description |
| --- | --- | --- |
| applicant_id | string | Personal identifier for the expense applicant |
| issue_date | datetime | Date when the expense was made |
| month | int | Quota month for the expense |
| net_value | float | Value of the expense in BRL |
| subquota_number | category | Numeric code for the expense subquota category |
| year | int | Quota year for the expense |

Algorithm

A rule-based cumulative sum approach using time-period limits defined directly in the code:
def predict(self, X=None):
    self._X['is_over_monthly_subquota_limit'] = False
    for metadata in self.limits:
        data, monthly_limit = metadata['data'], metadata['monthly_limit']
        if len(data):
            surplus_reimbursements = self.__find_surplus_reimbursements(data, monthly_limit)
            self._X.loc[surplus_reimbursements.index,
                        'is_over_monthly_subquota_limit'] = True
    results = self._X.loc[self.X.index, 'is_over_monthly_subquota_limit']
    return np.r_[results]
For each subquota category and date range, expenses are sorted chronologically and a running total (cumsum) is computed per (applicant_id, month, year). Any individual expense that causes the cumsum to exceed the monthly limit is flagged. Tracked subquotas and their limits:
| Subquota | Period | Monthly limit (BRL) |
| --- | --- | --- |
| 120: Automotive vehicle renting | Dec 2013 – Mar 2015 | R$ 10,000.00 |
| 120: Automotive vehicle renting | Apr 2015 – Apr 2017 | R$ 10,900.00 |
| 120: Automotive vehicle renting | May 2017 onwards | R$ 12,713.00 |
| 122: Taxi, toll and parking | Dec 2013 – Mar 2015 | R$ 2,500.00 |
| 122: Taxi, toll and parking | Apr 2015 onwards | R$ 2,700.00 |
| 3: Fuels and lubricants | Jul 2009 – Mar 2015 | R$ 4,500.00 |
| 3: Fuels and lubricants | Apr 2015 – Aug 2015 | R$ 4,900.00 |
| 3: Fuels and lubricants | Sep 2015 onwards | R$ 6,000.00 |
| 8: Security service | Jul 2009 – Apr 2014 | R$ 4,500.00 |
| 8: Security service | May 2014 – Mar 2015 | R$ 8,000.00 |
| 8: Security service | Apr 2015 onwards | R$ 8,700.00 |
| 137: Course/event participation | Oct 2015 onwards | R$ 7,697.16 |
MonthlySubquotaLimitClassifier is intentionally excluded from joblib model caching because the trained model exceeds a size that joblib handles reliably. It is always retrained on each run.
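The running-total rule described in this section can be sketched in plain Python. This is a simplified illustration, not the classifier's code: `over_monthly_limit` is a hypothetical function, it handles a single subquota and a single limit, and the sample expenses are made up.

```python
from collections import defaultdict


def over_monthly_limit(expenses, monthly_limit):
    """Flag each expense whose running monthly total exceeds `monthly_limit`.

    `expenses` is a list of dicts with applicant_id, year, month, issue_date
    and net_value. Returns booleans in the original row order.
    """
    running_total = defaultdict(float)
    flags = [False] * len(expenses)
    # Walk the rows chronologically while remembering their original positions.
    order = sorted(range(len(expenses)), key=lambda i: expenses[i]["issue_date"])
    for index in order:
        expense = expenses[index]
        key = (expense["applicant_id"], expense["year"], expense["month"])
        running_total[key] += expense["net_value"]
        flags[index] = running_total[key] > monthly_limit
    return flags


# Taxi, toll and parking expenses checked against the R$ 2,700.00 limit.
expenses = [
    {"applicant_id": "A", "year": 2016, "month": 5, "issue_date": "2016-05-02", "net_value": 1500.0},
    {"applicant_id": "A", "year": 2016, "month": 5, "issue_date": "2016-05-20", "net_value": 500.0},
    {"applicant_id": "A", "year": 2016, "month": 5, "issue_date": "2016-05-10", "net_value": 1000.0},
    {"applicant_id": "B", "year": 2016, "month": 5, "issue_date": "2016-05-03", "net_value": 2000.0},
]

flags = over_monthly_limit(expenses, monthly_limit=2700.0)
assert flags == [False, True, False, False]
```

Applicant A's totals are 1,500, 2,500, and 3,000 in date order, so only the expense dated 2016-05-20 is flagged, even though it appears second in the input. In this sketch, once the total has crossed the limit, every later expense in the same month is flagged as well.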

Core Classifier

InvalidCnpjCpfClassifier

Validates that each reimbursement’s supplier tax ID (recipient_id) is a mathematically valid Brazilian CNPJ or CPF. An ID that fails checksum validation may indicate a fraudulent or fictitious supplier.
Source: rosie/core/classifiers/invalid_cnpj_cpf_classifier.py
Settings key: invalid_cnpj_cpf
Modules: Chamber of Deputies, Federal Senate

Required columns

| Column | Type | Description |
| --- | --- | --- |
| document_type | category | Document type; only bill_of_sale, simple_receipt, and unknown are validated |
| recipient_id | string | CNPJ (14 digits) or CPF (11 digits) of the supplier |

Algorithm

A rule-based classifier using the brutils library for checksum validation:
def predict(self, dataframe):
    def is_invalid(row):
        valid_cpf = cpf.validate(str(row['recipient_id']).zfill(11))
        valid_cnpj = cnpj.validate(str(row['recipient_id']).zfill(14))
        good_doctype = row['document_type'] in ('bill_of_sale', 'simple_receipt', 'unknown')
        return good_doctype and (not (valid_cpf or valid_cnpj))
    return np.r_[dataframe.apply(is_invalid, axis=1)]
The recipient_id is zero-padded to the appropriate length before validation. A row is flagged if the document type warrants validation and the ID fails both CPF and CNPJ checksum tests.
The unknown document type is used exclusively for Federal Senate data, which does not include a document type column. All Senate rows are treated as requiring validation.
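To make "checksum validation" concrete, the sketch below implements the standard CPF check-digit calculation in plain Python. It is not the brutils implementation, and `cpf_check_digit` and `is_valid_cpf` are hypothetical names. Rejecting numbers made of one repeated digit is a common convention that is assumed here. The CNPJ check works on the same principle with different weights and is omitted.

```python
def cpf_check_digit(digits):
    """Compute one CPF check digit from the digits that precede it."""
    # Weights run 10..2 for the first check digit and 11..2 for the second.
    weights = range(len(digits) + 1, 1, -1)
    remainder = sum(d * w for d, w in zip(digits, weights)) * 10 % 11
    return 0 if remainder == 10 else remainder


def is_valid_cpf(raw):
    """Checksum validation for a CPF given as a str or int."""
    cpf = str(raw).zfill(11)  # zero-pad, as the classifier does before validating
    if len(cpf) != 11 or not cpf.isdigit():
        return False
    if len(set(cpf)) == 1:
        # Assumption: repeated-digit numbers are rejected by convention,
        # even though they satisfy the checksum.
        return False
    digits = [int(c) for c in cpf]
    first = cpf_check_digit(digits[:9])
    second = cpf_check_digit(digits[:10])
    return digits[9] == first and digits[10] == second


print(is_valid_cpf("52998224725"))  # True: both check digits match
print(is_valid_cpf("52998224726"))  # False: last check digit is wrong
print(is_valid_cpf(191))            # True: padded to "00000000191" before checking
```

The last call shows why zero-padding matters: an ID stored as a number loses its leading zeros and would fail the length check without it.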

Classifier Summary

| Classifier | Module | Algorithm | Key column(s) |
| --- | --- | --- | --- |
| MealPriceOutlierClassifier | Chamber | KMeans (k=3) | net_value, recipient_id, category |
| TraveledSpeedsClassifier | Chamber | Polynomial regression + Vincenty distance | latitude, longitude, issue_date |
| ElectionExpensesClassifier | Chamber | Rule-based equality | legal_entity |
| IrregularCompaniesClassifier | Chamber | Rule-based date + status | situation, situation_date, issue_date |
| MonthlySubquotaLimitClassifier | Chamber | Cumulative sum against time-period limits | net_value, subquota_number, month, year |
| InvalidCnpjCpfClassifier | Core (both) | Checksum validation via brutils | recipient_id, document_type |
