How it works

Serenata de Amor works as a pipeline. Open government data enters one end; a public record of suspicious reimbursements, a searchable dashboard, and tweets engaging citizens come out the other.

The pipeline at a glance

Download open data

Rosie uses the serenata-toolbox pip package to download reimbursement records from the Brazilian Chamber of Deputies and Federal Senate. Data is fetched as CSV files per year, starting from 2009, and a companies dataset is fetched alongside for cross-referencing.

Prepare the dataset

Rosie’s Adapter merges reimbursement records with company data, normalizes column names, coerces date types, and categorizes document types. The result is a single cleaned pandas DataFrame ready for classification.

Run the classifiers

Rosie’s Core object iterates over every configured classifier in sequence. Each classifier fits a model on the dataset and predicts whether each row is suspicious or not. Numeric classifiers return -1 (suspicious) or 1 (normal); rule-based classifiers return boolean True/False. The Core engine normalizes all predictions to boolean columns in the suspicions DataFrame.
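The normalization step described above can be sketched as follows. This is a minimal illustration, not Rosie's actual code: the helper `normalize_predictions` and the column names `numeric_example` and `rule_example` are hypothetical.

```python
import pandas as pd


def normalize_predictions(predictions: pd.Series) -> pd.Series:
    """Return a boolean Series where True means suspicious.

    Hypothetical helper: boolean output is passed through unchanged,
    numeric output is mapped so that -1 becomes True and 1 becomes False.
    """
    if predictions.dtype == bool:
        return predictions
    return predictions == -1


# Example output from a numeric classifier and a rule-based classifier
numeric = pd.Series([1, -1, 1])
rule_based = pd.Series([False, True, False])

suspicions = pd.DataFrame({
    "numeric_example": normalize_predictions(numeric),
    "rule_example": normalize_predictions(rule_based),
})
```

Both columns end up with the same boolean meaning, which is what lets every classifier share one suspicions DataFrame.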

Output suspicions.xz

After all classifiers have run, the suspicions DataFrame is written to a compressed CSV file at /tmp/serenata-data/suspicions.xz. Each row maps a unique reimbursement (identified by applicant_id, year, and document_id) to a boolean column for each classifier.

Import into Jarbas

A Django management command loads suspicions.xz into the PostgreSQL database. Separate commands load the full reimbursements dataset and company records. A searchvector command then builds a PostgreSQL full-text search index.

Serve the dashboard and API

Jarbas runs a Django REST Framework API and an Elm-based frontend dashboard. Citizens can browse reimbursements and filter them by suspicion type, congressperson, date, state, party, and more.

Tweet suspicious findings

The tweets management command instructs Jarbas to post about suspicious reimbursements on Twitter as @RosieDaSerenata, tagging the relevant congressperson and inviting public scrutiny.

Downloading reimbursement data

Rosie delegates all data fetching to the serenata-toolbox package. Inside rosie/rosie/chamber_of_deputies/adapter.py, the Adapter class handles this:

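from datetime import date

from serenata_toolbox.chamber_of_deputies.reimbursements import Reimbursements
from serenata_toolbox.datasets import fetch

class Adapter:
    STARTING_YEAR = 2009
    COMPANIES_DATASET = '2016-09-03-companies.xz'

    # self.path (the directory for downloaded data) is set elsewhere
    # in the class and is not shown in this excerpt.

    def update_reimbursements(self, years=None):
        # Default to every year from STARTING_YEAR through the current year
        if not years:
            next_year = date.today().year + 1
            years = range(self.STARTING_YEAR, next_year)

        # Fetch each year's reimbursements into self.path
        for year in years:
            Reimbursements(year, self.path)()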

Reimbursement CSV files are downloaded for every year from 2009 to the current year. A companies dataset is also fetched and merged so that each reimbursement row includes the company’s registration details.

The classifiers

Rosie’s Core object runs each classifier in sequence. Every classifier implements the scikit-learn TransformerMixin interface with fit, transform, and predict methods.

MealPriceOutlierClassifier

Detects meal expenses whose price is a statistical outlier compared to other reimbursements at the same restaurant. Uses KMeans clustering to group restaurants and flags any expense where the net value exceeds the cluster threshold (mean + 4 standard deviations, or mean + 3 standard deviations for well-known companies).

Key column: category == "Meal", recipient_id (14-digit CNPJ only)
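The threshold rule can be illustrated with the standard library alone. This sketch assumes the prices passed in already belong to a single cluster (the KMeans grouping is left out) and uses the population standard deviation. The source does not say which estimator Rosie uses, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def outlier_threshold(prices, well_known=False):
    """Mean plus 4 standard deviations, or plus 3 for well-known companies."""
    multiplier = 3 if well_known else 4
    return mean(prices) + multiplier * pstdev(prices)


def flag_outliers(prices, well_known=False):
    """Return one boolean per price: True when it exceeds the threshold."""
    limit = outlier_threshold(prices, well_known)
    return [price > limit for price in prices]


# Thirty ordinary meals and one very expensive one
prices = [20.0] * 30 + [500.0]
flags = flag_outliers(prices)
```

With this sample only the last meal is flagged. Note that a single extreme value also inflates the standard deviation, so small groups need many ordinary prices before any value can exceed the threshold.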

TraveledSpeedsClassifier

Detects reimbursements that would require the congressperson to have traveled at an implausible speed between two locations on the same day. It calculates the geographic distance between expense locations and checks whether the implied travel speed is physically impossible.

Key columns: issue_date, applicant_id, latitude, longitude
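One way to picture the distance and speed check is the haversine formula applied to a day's expense locations. This is a sketch under stated assumptions, not the classifier's implementation: the speed limit `MAX_PLAUSIBLE_KMH`, the 24-hour window, and the function names are all hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 900.0  # hypothetical limit, not taken from Rosie


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def minimum_average_speed_kmh(points, hours=24.0):
    """Lowest average speed needed to visit the points, in order, within `hours`."""
    legs = zip(points, points[1:])
    total_km = sum(haversine_km(*start, *end) for start, end in legs)
    return total_km / hours


def is_implausible(points, hours=24.0, limit_kmh=MAX_PLAUSIBLE_KMH):
    """True when the required average speed exceeds the assumed limit."""
    return minimum_average_speed_kmh(points, hours) > limit_kmh


# (latitude, longitude) pairs for two expenses on the same day
brasilia = (-15.7939, -47.8828)
sao_paulo = (-23.5505, -46.6333)
distance = haversine_km(*brasilia, *sao_paulo)
```

Because only the expense date is known, the sketch treats the whole day as available travel time, which makes the computed speed a lower bound.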

ElectionExpensesClassifier

Flags reimbursements made to companies whose Brazilian Federal Revenue legal entity category is 409-0 - CANDIDATO A CARGO POLITICO ELETIVO, that is, entities registered as electoral candidates. Congressional funds should not be spent at such entities.

Key column: legal_entity

IrregularCompaniesClassifier

Checks the official registration status of the supplier company in the Brazilian Federal Revenue. Flags reimbursements to companies that were in an irregular, suspended, or closed state at the time the expense was made.

Key columns: situation, situation_date, issue_date
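The rule compares the company's status and the date that status took effect against the expense date. The sketch below is illustrative only: the status labels in `IRREGULAR_SITUATIONS` are assumed values, and the function name is hypothetical.

```python
from datetime import date

# Assumed status labels; the values in the real dataset may differ.
IRREGULAR_SITUATIONS = {"BAIXADA", "NULA", "SUSPENSA", "INAPTA"}


def is_irregular(situation, situation_date, issue_date):
    """True when the company was already irregular when the expense was issued."""
    return situation in IRREGULAR_SITUATIONS and situation_date < issue_date


closed_before = is_irregular("BAIXADA", date(2015, 1, 1), date(2016, 3, 1))
closed_after = is_irregular("BAIXADA", date(2017, 1, 1), date(2016, 3, 1))
active = is_irregular("ATIVA", date(2010, 1, 1), date(2016, 3, 1))
```

Comparing the two dates matters: a company that closed after the expense was issued was still a legitimate supplier at the time.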

MonthlySubquotaLimitClassifier

Detects cases where a congressperson’s total reimbursements for a given subquota category in a single month exceed the legal limit. Each subquota (expense category) has a defined monthly ceiling; this classifier sums expenditure per applicant, month, year, and subquota and flags overruns.

Key columns: applicant_id, month, year, subquota_number, net_value
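The per-month summation maps naturally onto a pandas groupby. This is a simplified sketch: the ceiling in `MONTHLY_LIMITS` is a made-up figure, the function name is hypothetical, and every row of an overrun month is flagged, which may differ from how Rosie selects rows.

```python
import pandas as pd

# Hypothetical ceiling keyed by subquota_number; not the legal values.
MONTHLY_LIMITS = {3: 6000.0}


def flag_monthly_overruns(reimbursements, limits):
    """Flag rows whose applicant/month/subquota total exceeds the ceiling."""
    keys = ["applicant_id", "year", "month", "subquota_number"]
    # Total spend per group, repeated on every row of that group
    totals = reimbursements.groupby(keys)["net_value"].transform("sum")
    # Subquotas without a configured ceiling become NaN and are never flagged
    ceilings = reimbursements["subquota_number"].map(limits)
    return totals > ceilings


reimbursements = pd.DataFrame({
    "applicant_id": [1, 1, 2],
    "year": [2016, 2016, 2016],
    "month": [1, 1, 1],
    "subquota_number": [3, 3, 3],
    "net_value": [4000.0, 3000.0, 5000.0],
})
flags = flag_monthly_overruns(reimbursements, MONTHLY_LIMITS)
```

Applicant 1 spends 7000.0 in the month and is flagged on both rows, while applicant 2 stays under the assumed ceiling.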

InvalidCnpjCpfClassifier

Validates the recipient_id field, which holds either a CNPJ (Brazilian company tax ID) or a CPF (Brazilian personal tax ID), by computing the expected check digit and comparing it to the submitted value. An invalid ID may indicate a fictitious supplier.

Key columns: document_type, recipient_id
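The check-digit computation uses the standard modulo-11 scheme for both ID types. The sketch below is a self-contained illustration rather than Rosie's code. The function names are hypothetical, and rejecting IDs made of one repeated digit is a common convention assumed here, not something stated in the source.

```python
CNPJ_WEIGHTS_FIRST = [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]
CNPJ_WEIGHTS_SECOND = [6] + CNPJ_WEIGHTS_FIRST


def _check_digit(digits, weights):
    """Modulo-11 check digit: remainders below 2 map to 0, others to 11 - r."""
    remainder = sum(d * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def _only_digits(value):
    """Extract the digits, ignoring punctuation such as '.', '/' and '-'."""
    return [int(char) for char in value if char.isdigit()]


def is_valid_cpf(value):
    digits = _only_digits(value)
    # Assumption: IDs made of a single repeated digit are rejected
    if len(digits) != 11 or len(set(digits)) == 1:
        return False
    first = _check_digit(digits[:9], range(10, 1, -1))
    second = _check_digit(digits[:10], range(11, 1, -1))
    return digits[9:] == [first, second]


def is_valid_cnpj(value):
    digits = _only_digits(value)
    if len(digits) != 14 or len(set(digits)) == 1:
        return False
    first = _check_digit(digits[:12], CNPJ_WEIGHTS_FIRST)
    second = _check_digit(digits[:13], CNPJ_WEIGHTS_SECOND)
    return digits[12:] == [first, second]


def is_valid_recipient_id(value):
    """Dispatch on length: 11 digits is a CPF, 14 digits is a CNPJ."""
    length = len(_only_digits(value))
    if length == 11:
        return is_valid_cpf(value)
    if length == 14:
        return is_valid_cnpj(value)
    return False
```

For example, `is_valid_cpf("529.982.247-25")` and `is_valid_cnpj("11.222.333/0001-81")` both pass, and changing the last digit of either makes it fail.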

The classifiers are configured in rosie/rosie/chamber_of_deputies/settings.py. Each classifier is keyed by a human-readable snake_case name that becomes a column in the output suspicions.xz file (e.g., meal_price_outlier, invalid_cnpj_cpf).

The suspicions.xz output

After all classifiers run, Core writes the results:

output = os.path.join(self.data_path, 'suspicions.xz')
kwargs = dict(compression='xz', encoding='utf-8', index=False)
self.suspicions.to_csv(output, **kwargs)

The file is a pandas-written, XZ-compressed CSV. Each row represents one reimbursement, identified by applicant_id, year, and document_id. Columns correspond to each classifier’s name and contain True (suspicious) or False (not suspicious).
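Reading the file back is the mirror image of the write call. The sketch below writes a small sample with the same keyword arguments and loads it again. The sample rows and the helper `flagged_rows` are hypothetical; the column names come from the text above.

```python
import os
import tempfile

import pandas as pd

KEYS = ["applicant_id", "year", "document_id"]


def flagged_rows(suspicions):
    """Keep the reimbursements flagged by at least one classifier."""
    classifier_columns = [c for c in suspicions.columns if c not in KEYS]
    return suspicions[suspicions[classifier_columns].any(axis=1)]


# Made-up sample shaped like the output described above
sample = pd.DataFrame({
    "applicant_id": [1, 2],
    "year": [2016, 2016],
    "document_id": [100, 200],
    "meal_price_outlier": [False, True],
    "invalid_cnpj_cpf": [False, False],
})

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "suspicions.xz")
    sample.to_csv(path, compression="xz", encoding="utf-8", index=False)
    loaded = pd.read_csv(path, compression="xz", encoding="utf-8")

flagged = flagged_rows(loaded)
```

pandas parses the True and False values back into boolean columns, so the classifier columns can be combined with `any` without extra conversion.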

Loading data into Jarbas

Jarbas provides Django management commands to import Rosie’s output and the datasets it depends on, build the search index, and post tweets:

# Load the full reimbursements dataset (CSV format)
docker-compose run --rm django python manage.py reimbursements /mnt/data/reimbursements.csv

# Load company data
docker-compose run --rm django python manage.py companies /mnt/data/companies.xz

# Load Rosie's suspicions
docker-compose run --rm django python manage.py suspicions /mnt/data/suspicions.xz

# Build the full-text search index
docker-compose run --rm django python manage.py searchvector

# Post suspicious findings to Twitter
docker-compose run --rm django python manage.py tweets

The update cycle

Although Jarbas runs 24/7, the core analysis is not automated end-to-end. Roughly once a month, the team manually runs Rosie and updates Jarbas with fresh data. New versioned datasets are also uploaded to the toolbox a few times per year.

Always run Rosie against a fresh dataset by invoking serenata-toolbox before generating a new suspicions.xz. Using a stale dataset may miss recent reimbursements.

Jarbas REST API

Once data is loaded, Jarbas exposes a REST API for programmatic access. For example:

GET /api/chamber_of_deputies/reimbursement/?year=2016&suspicions=1&order_by=probability

This returns all 2016 reimbursements that Rosie flagged as suspicious, ordered by probability. See the API reference for the full list of endpoints and filter parameters.
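A request URL like the one above can be assembled with the standard library. In this sketch the host in `BASE_URL` is a placeholder and the helper `reimbursement_url` is hypothetical. Only the path and the three query parameters come from the example above.

```python
from urllib.parse import urlencode

BASE_URL = "https://jarbas.example.org"  # placeholder host, replace with a real one


def reimbursement_url(base_url, **filters):
    """Build a reimbursement list URL from keyword filters."""
    query = urlencode(filters)
    return f"{base_url}/api/chamber_of_deputies/reimbursement/?{query}"


url = reimbursement_url(BASE_URL, year=2016, suspicions=1, order_by="probability")
```

The resulting string can be passed to any HTTP client. Keeping the filters as keyword arguments makes it easy to add or drop a filter without editing the query string by hand.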
