Rosie is invoked through rosie.py, a small CLI wrapper built with docopt. It supports two sub-commands: run (produce suspicions) and test (run the test suite).
Usage:
rosie.py run (chamber_of_deputies|federal_senate) [--output=<directory>]
rosie.py test [chamber_of_deputies|federal_senate|core]
Running with Docker (Recommended)
Docker is the simplest way to run Rosie because all Python dependencies and the correct environment are bundled in the image.
Pull or build the image
The pre-built image is serenata/rosie. No extra configuration is needed.
Run the analysis
Mount a host directory to /tmp/serenata-data so the output file is accessible after the container exits.
Chamber of Deputies
docker run --rm \
    -v /tmp/serenata-data:/tmp/serenata-data \
    serenata/rosie \
    python rosie.py run chamber_of_deputies
Federal Senate
docker run --rm \
    -v /tmp/serenata-data:/tmp/serenata-data \
    serenata/rosie \
    python rosie.py run federal_senate
Retrieve the output
After the container exits, the results are on the host at: /tmp/serenata-data/suspicions.xz
Running tests with Docker
docker run --rm \
-v /tmp/serenata-data:/tmp/serenata-data \
serenata/rosie \
python rosie.py test
Running without Docker
Install Anaconda
Download and install Anaconda for your platform.
Create and activate the environment
conda update conda
conda create --name serenata python=3
conda activate serenata
Install Python dependencies
pip install -r requirements.txt
scipy is intentionally listed before scikit-learn in requirements.txt; this ordering is required for the wheel to build correctly.
Run the analysis
Chamber of Deputies
python rosie.py run chamber_of_deputies
Federal Senate
python rosie.py run federal_senate
Output is written to /tmp/serenata-data/suspicions.xz by default.
Custom output directory
Use the --output flag to write the output to a different location:
python rosie.py run chamber_of_deputies --output /my/serenata/directory/
The directory will be created automatically if it does not exist.
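The default-or-custom output behaviour described above can be sketched as follows. This is a minimal illustration, not Rosie's actual code: `resolve_output_path` and `DEFAULT_OUTPUT_DIR` are hypothetical names, and the only facts taken from this document are the default directory, the `suspicions.xz` file name, and that a missing directory is created.

```python
import os
import tempfile

# Default location stated in this document
DEFAULT_OUTPUT_DIR = '/tmp/serenata-data'


def resolve_output_path(output_dir=None):
    """Return the path of suspicions.xz, creating the directory if needed.

    Hypothetical helper for illustration; not part of Rosie's API.
    """
    directory = output_dir or DEFAULT_OUTPUT_DIR
    # exist_ok=True makes this a no-op when the directory already exists
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, 'suspicions.xz')


if __name__ == '__main__':
    # Use a throwaway base directory so the demo leaves nothing behind
    with tempfile.TemporaryDirectory() as base:
        target = os.path.join(base, 'my', 'serenata', 'directory')
        print(resolve_output_path(target))
```

Using `os.makedirs(..., exist_ok=True)` creates any missing parent directories too, so a nested path passed to `--output` would not need to be prepared by hand.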
Running Tests
All tests
python rosie.py test
Core module only
python rosie.py test core
Chamber of Deputies only
python rosie.py test chamber_of_deputies
Federal Senate only
python rosie.py test federal_senate
Tests are discovered automatically using unittest.TestLoader.discover starting from the rosie/ directory (or a subdirectory when a module name is passed). The runner exits with code 1 if any test fails.
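import os
import sys
import unittest


def test(module=None):
    """Discover and run the tests under rosie/ (or rosie/<module>)."""
    loader = unittest.TestLoader()
    tests_path = 'rosie'
    if module:
        # Narrow discovery to a single module, e.g. rosie/core
        tests_path = os.path.join(tests_path, module)
    tests = loader.discover(tests_path)
    test_runner = unittest.TextTestRunner()
    result = test_runner.run(tests)
    if not result.wasSuccessful():
        sys.exit(1)  # non-zero exit code signals a failed run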
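To see how discovery and the exit code interact without touching the Rosie tree, the same pattern can be tried on throwaway test files. The sketch below is an assumption-laden demo: `run_discovered_tests`, `demo`, and the two generated test modules are hypothetical, and the function returns the exit code instead of exiting so both outcomes can be observed in one process.

```python
import os
import tempfile
import textwrap
import unittest

# Source for two hypothetical test modules, one passing and one failing
TEMPLATE = textwrap.dedent('''
    import unittest

    class TestSample(unittest.TestCase):
        def test_sum(self):
            self.assertEqual(1 + 1, {expected})
''')


def run_discovered_tests(tests_path):
    """Discover tests under tests_path and return 0 on success, 1 otherwise."""
    suite = unittest.TestLoader().discover(tests_path)
    # Send the runner's report to os.devnull to keep the demo quiet
    with open(os.devnull, 'w') as stream:
        result = unittest.TextTestRunner(stream=stream).run(suite)
    return 0 if result.wasSuccessful() else 1


def demo():
    """Run discovery on a passing and a failing directory; return both codes."""
    codes = {}
    with tempfile.TemporaryDirectory() as base:
        for name, expected in (('passing', 2), ('failing', 3)):
            directory = os.path.join(base, name)
            os.makedirs(directory)
            # File names must match the default discovery pattern test*.py
            filename = os.path.join(directory, 'test_{}.py'.format(name))
            with open(filename, 'w') as handle:
                handle.write(TEMPLATE.format(expected=expected))
            codes[name] = run_discovered_tests(directory)
    return codes


if __name__ == '__main__':
    print(demo())
```

The two generated modules have distinct names on purpose: `discover` imports test files as top-level modules, so reusing one module name across directories in the same process would cause an import conflict.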
Output File
After a successful run, Rosie produces:
<output_directory>/suspicions.xz
This is a UTF-8 CSV compressed with xz. Each row corresponds to one reimbursement and includes:
- The unique identifiers for the reimbursement (applicant_id, year, document_id for Chamber of Deputies)
- One boolean column per classifier (e.g. meal_price_outlier, invalid_cnpj_cpf)
The file is written by the Core engine:
output = os.path.join(self.data_path, 'suspicions.xz')
kwargs = dict(compression='xz', encoding='utf-8', index=False)
self.suspicions.to_csv(output, **kwargs)
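Because the file is a plain CSV inside an xz container, it can be read back with the standard library alone. The sketch below assumes the boolean columns are serialized as the strings `True` and `False` (which is how pandas writes booleans to CSV); `count_suspicions` and `write_sample` are hypothetical helpers, and the sample rows are made up purely to exercise the reader.

```python
import csv
import lzma
import os
import tempfile


def count_suspicions(path, classifier):
    """Count the rows whose classifier column is flagged as True."""
    with lzma.open(path, mode='rt', encoding='utf-8', newline='') as handle:
        reader = csv.DictReader(handle)
        return sum(1 for row in reader if row[classifier] == 'True')


def write_sample(path):
    """Write a tiny made-up file in the same format (xz-compressed UTF-8 CSV)."""
    fieldnames = ['applicant_id', 'year', 'document_id', 'meal_price_outlier']
    rows = [
        ['1', '2016', '10', 'True'],
        ['1', '2016', '11', 'False'],
        ['2', '2016', '12', 'True'],
    ]
    with lzma.open(path, mode='wt', encoding='utf-8', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(fieldnames)
        writer.writerows(rows)


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as directory:
        sample = os.path.join(directory, 'suspicions.xz')
        write_sample(sample)
        print(count_suspicions(sample, 'meal_price_outlier'))
```

For larger analyses, loading the file with pandas is likely more convenient; the standard-library version is shown only because it runs without extra dependencies.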
Trained classifier models (except MonthlySubquotaLimitClassifier) are cached as .pkl files in the output directory using joblib. Re-running Rosie will reuse these cached models, making subsequent runs faster.
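The load-or-train pattern behind this caching can be sketched as below. Rosie uses joblib; this sketch swaps in the standard library's `pickle` as a stand-in so it runs without extra dependencies. The function `load_or_train`, the file naming scheme, and the placeholder "model" are all assumptions for illustration, not Rosie's actual implementation.

```python
import os
import pickle
import tempfile


def load_or_train(cache_dir, name, train):
    """Return the cached model if <name>.pkl exists, else train and cache it.

    Hypothetical helper; pickle stands in for joblib here.
    """
    path = os.path.join(cache_dir, name + '.pkl')
    if os.path.isfile(path):
        with open(path, 'rb') as handle:
            return pickle.load(handle)
    model = train()  # the expensive step, skipped on later runs
    with open(path, 'wb') as handle:
        pickle.dump(model, handle)
    return model


if __name__ == '__main__':
    calls = []

    def train():
        # Placeholder for a real training routine
        calls.append(1)
        return {'threshold': 3}

    with tempfile.TemporaryDirectory() as directory:
        load_or_train(directory, 'meal_price_outlier', train)
        load_or_train(directory, 'meal_price_outlier', train)
    print(len(calls))
```

One consequence of this pattern is that a stale cache is reused silently: under these assumptions, deleting the .pkl files from the output directory would be the way to force retraining.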