Serenata de Amor consists of two independently deployable subsystems — Rosie and Jarbas — connected by a shared data artifact: suspicions.xz.

System overview

┌─────────────────────────────────────────────────────────────────┐
│  serenata-toolbox (pip)                                         │
│  Downloads reimbursement CSVs + companies.xz from open sources  │
└────────────────────────┬────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐
│  Rosie  (Python / scikit-learn)                                 │
│  Adapter → Core → Classifiers → suspicions.xz                  │
└────────────────────────┬────────────────────────────────────────┘
                         │  suspicions.xz

┌─────────────────────────────────────────────────────────────────┐
│  Jarbas  (Django / DRF / Elm)                                   │
│  manage.py suspicions → PostgreSQL → REST API + Dashboard       │
└─────────────────────────────────────────────────────────────────┘

Rosie subsystem

Rosie is a pure Python application with no persistent service. It is run on demand and exits after writing suspicions.xz.

Module structure

rosie/
├── rosie.py                        # CLI entry point (docopt)
└── rosie/
    ├── core/
    │   ├── __init__.py             # Core pipeline class
    │   └── classifiers/
    │       └── invalid_cnpj_cpf_classifier.py
    ├── chamber_of_deputies/
    │   ├── __init__.py             # main() wires Adapter + Core
    │   ├── adapter.py              # Downloads + prepares dataset
    │   ├── settings.py             # CLASSIFIERS dict + UNIQUE_IDS
    │   └── classifiers/
    │       ├── election_expenses_classifier.py
    │       ├── irregular_companies_classifier.py
    │       ├── meal_price_outlier_classifier.py
    │       ├── monthly_subquota_limit_classifier.py
    │       └── traveled_speeds_classifier.py
    └── federal_senate/
        └── __init__.py             # Equivalent structure for Senate

Core pipeline class

rosie/rosie/core/__init__.py implements the Core class, which is the generic analysis engine:
import os


class Core:
    def __init__(self, settings, adapter):
        self.settings = settings
        self.dataset = adapter.dataset       # cleaned pandas DataFrame
        self.data_path = adapter.path
        # suspicions starts as a copy of the unique-ID columns (UNIQUE_IDS
        # in settings.py); each classifier adds a boolean column to it
        self.suspicions = self.dataset[self.settings.UNIQUE_IDS].copy()

    def __call__(self):
        for name, classifier in self.settings.CLASSIFIERS.items():
            model = self.load_trained_model(classifier)
            self.predict(model, name)        # writes a boolean column into self.suspicions

        output = os.path.join(self.data_path, 'suspicions.xz')
        self.suspicions.to_csv(output, compression='xz',
                               encoding='utf-8', index=False)
Trained models are persisted as .pkl files via joblib so subsequent runs skip the fit step. MonthlySubquotaLimitClassifier is exempt from caching because its serialized model is too large for caching to be practical.

Adapter

The Adapter (rosie/rosie/chamber_of_deputies/adapter.py) is responsible for:
  1. Calling serenata-toolbox to download companies and per-year reimbursement CSVs
  2. Merging the two datasets on cnpj_cpf / cnpj
  3. Normalizing column names, document types, and date fields
from datetime import date

from serenata_toolbox.chamber_of_deputies.reimbursements import Reimbursements
from serenata_toolbox.datasets import fetch

class Adapter:
    STARTING_YEAR = 2009

    def update_reimbursements(self, years=None):
        if years is None:
            # default to every year since the dataset begins
            next_year = date.today().year + 1
            years = range(self.STARTING_YEAR, next_year)
        for year in years:
            Reimbursements(year, self.path)()
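The merge in step 2 can be illustrated with toy data. Only the join-column names (cnpj_cpf on reimbursements, cnpj on companies) come from the text; the CNPJ values and other columns here are made up:

```python
import pandas as pd

# Toy frames standing in for the real datasets
reimbursements = pd.DataFrame({
    'document_id': [1, 2],
    'cnpj_cpf': ['12345678000195', '98765432000198'],
})
companies = pd.DataFrame({
    'cnpj': ['12345678000195'],
    'situation': ['ABERTA'],
})

# A left join keeps every reimbursement, even when the supplier
# does not appear in the companies dataset
merged = reimbursements.merge(
    companies, how='left', left_on='cnpj_cpf', right_on='cnpj')
```

Reimbursements without a matching company simply carry NaN in the company columns, so no expense records are dropped before classification.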

Classifiers

Each classifier lives in rosie/rosie/chamber_of_deputies/classifiers/ and inherits from scikit-learn’s TransformerMixin. The full set configured in settings.py:
| Key in suspicions.xz | Class |
| --- | --- |
| meal_price_outlier | MealPriceOutlierClassifier |
| over_monthly_subquota_limit | MonthlySubquotaLimitClassifier |
| suspicious_traveled_speed_day | TraveledSpeedsClassifier |
| invalid_cnpj_cpf | InvalidCnpjCpfClassifier |
| election_expenses | ElectionExpensesClassifier |
| irregular_companies_classifier | IrregularCompaniesClassifier |
InvalidCnpjCpfClassifier lives in rosie/rosie/core/classifiers/ because it is shared between the Chamber of Deputies and Federal Senate modules.
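The interface every classifier exposes can be sketched without scikit-learn. The class below is hypothetical (real classifiers subclass TransformerMixin and operate on pandas DataFrames); it only mimics the fit/transform/predict shape Core relies on:

```python
class InvalidDocumentLengthClassifier:
    """Hypothetical classifier following Rosie's fit/transform/predict shape.

    Real classifiers subclass scikit-learn's TransformerMixin and work on
    pandas DataFrames; plain dicts are used here to stay dependency-free.
    """

    def fit(self, records):
        return self                 # rule-based: nothing to learn

    def transform(self, records=None):
        return self                 # no feature engineering needed

    def predict(self, records):
        # Flag documents whose cnpj_cpf has an invalid length
        # (a CNPJ has 14 digits, a CPF has 11)
        return [len(r['cnpj_cpf']) not in (11, 14) for r in records]
```

Core treats each entry in CLASSIFIERS uniformly: predict returns one boolean per record, which becomes that classifier's column in suspicions.xz.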

External dependency: serenata-toolbox

Rosie depends on serenata-toolbox (a separate pip package at github.com/okfn-brasil/serenata-toolbox) for all I/O against government data sources. Rosie itself contains no direct HTTP calls to government APIs.

Jarbas subsystem

Jarbas is a long-running web application. It serves a REST API and an Elm-compiled frontend dashboard, backed by PostgreSQL and Celery/RabbitMQ for background task processing.

Components

Django + DRF

The web layer. Django REST Framework handles all API serialization and filtering. Gunicorn serves the WSGI application. URL routing is defined in jarbas/urls.py.

Elm frontend

The dashboard UI is compiled from Elm source. Static assets are generated by Node.js (npm run assets) and collected via Django’s collectstatic.

Celery workers

Background task workers (tasks service) process long-running import jobs dispatched by management commands. A beat service handles scheduled tasks such as the search vector refresh.

PostgreSQL

The primary datastore. Jarbas requires PostgreSQL specifically because it uses Django’s JSONField and full-text search via SearchVector.

URL structure

Defined in jarbas/urls.py:
urlpatterns = [
    path('dashboard/', include('jarbas.dashboard.urls')),
    path('layers/', include('jarbas.layers.urls', namespace='layers')),
    path('api/', include('jarbas.core.urls', namespace='core')),
    path('api/chamber_of_deputies/',
         include('jarbas.chamber_of_deputies.urls',
                 namespace='chamber_of_deputies')),
    path('healthcheck/', healthcheck, name='healthcheck'),
]
The /api/chamber_of_deputies/ prefix hosts the reimbursement, subquota, applicant, and company endpoints documented in the API reference.

Data loading via management commands

Jarbas does not pull data directly from Rosie. Data is loaded manually using Django management commands:
python manage.py reimbursements <path/to/reimbursements.csv>
python manage.py companies      <path/to/companies.xz>
python manage.py suspicions     <path/to/suspicions.xz>
python manage.py searchvector
python manage.py tweets
These commands read the compressed CSV files produced by Rosie (and the toolbox) and write records into PostgreSQL via Django ORM.
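Conceptually, each import command starts by streaming rows out of an xz-compressed CSV, which the stdlib can do directly. A minimal sketch (the real commands use Jarbas-specific column handling and Django ORM bulk inserts, which are omitted here):

```python
import csv
import lzma


def read_xz_csv(path):
    """Yield one dict per row from an xz-compressed CSV file.

    Minimal stdlib sketch of the reading step only; the real management
    commands go on to write each row into PostgreSQL via the Django ORM.
    """
    with lzma.open(path, 'rt', encoding='utf-8') as fh:
        yield from csv.DictReader(fh)
```

Streaming with a generator matters because the reimbursement files span many years and would be expensive to hold in memory at once.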

Docker Compose topology

The docker-compose.yml defines seven services. Four are application services (the remaining three are infrastructure: the PostgreSQL database, the cache, and the RabbitMQ queue):

| Service | Image | Role |
| --- | --- | --- |
| django | serenata/django | Gunicorn WSGI server for the API and dashboard |
| tasks | serenata/django | Celery worker that processes import task queues |
| beat | serenata/django | Celery beat, which runs scheduled jobs (e.g., search vector refresh) |
| elm | serenata/elm | Compiles and serves Elm frontend assets |

Dependency graph

django
  └── depends_on: cache, elm, tasks
tasks
  └── depends_on: queue (rabbitmq)
beat
  └── depends_on: queue (rabbitmq)
rosie has no depends_on entries — it runs independently and shares a data volume (/tmp/serenata-data) with the host.
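Put together, the relevant parts of the Compose file look roughly like this. Only the service names, images, dependency edges, and the /tmp/serenata-data volume path come from the text; the exact keys and layout are assumed:

```yaml
# Sketch of docker-compose.yml (structure assumed, not verbatim)
services:
  rosie:
    image: serenata/rosie
    volumes:
      - /tmp/serenata-data:/tmp/serenata-data   # shared with the host; no depends_on
  django:
    image: serenata/django
    depends_on:
      - cache
      - elm
      - tasks
  tasks:
    image: serenata/django
    depends_on:
      - queue
  beat:
    image: serenata/django
    depends_on:
      - queue
```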

Running Rosie via Docker

docker run --rm \
  -v /tmp/serenata-data:/tmp/serenata-data \
  serenata/rosie \
  python rosie.py run chamber_of_deputies
The --rm flag removes the container after it exits. The volume mount makes suspicions.xz available on the host for subsequent import into Jarbas.

Data flow summary

serenata-toolbox

    │  reimbursements-<year>.csv, companies.xz

Adapter (rosie/chamber_of_deputies/adapter.py)

    │  merged + cleaned pandas DataFrame

Core (rosie/core/__init__.py)

    │  runs 6 classifiers sequentially

suspicions.xz   ◄── written to /tmp/serenata-data/

    │  manual copy to Jarbas /mnt/data/

manage.py suspicions

    │  INSERT rows into PostgreSQL

Django REST API + Elm dashboard

    │  JSON responses to browser / API clients

Twitter (@RosieDaSerenata)
PostgreSQL is the only mandatory database for Jarbas. The JSONField usage and SearchVector full-text search are PostgreSQL-specific features — other databases are not supported.
