The homework repository is a monorepo containing three independently runnable subsystems that share no Python packages at runtime but are linked by data artefacts and the MLflow Model Registry. Understanding the layout before you start coding will save you from path errors, missing-file surprises, and confusing CI failures. Each top-level directory has its own requirements.txt, its own test suite, and its own responsibility in the ML lifecycle.
Directory Tree
Data Flow
Data moves through the system in a linear chain. Each step produces an artefact that feeds the next.
Raw CSV ingestion
You download songs.csv from Kaggle and place it in data_pipeline/. DVC tracks the file via songs.csv.dvc, recording its MD5 hash so any accidental modification is caught before the pipeline runs.
DVC pipeline execution
Running dvc repro inside data_pipeline/ triggers the four-stage pipeline defined in dvc.yaml:
- load: copies songs.csv → data/raw.csv
- process: splits data/raw.csv on year <= 2010 → data/train.csv and data/prod_sim.csv
- train: trains Logistic Regression and XGBoost, logging every run to MLflow
- evaluate: queries MLflow for the highest-accuracy run, registers the model, and assigns the @champion alias
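The temporal split in the process stage can be illustrated with a minimal, standard-library sketch. It assumes the CSV has a numeric column named year, that rows with year <= 2010 go to data/train.csv, and that all later rows go to data/prod_sim.csv. The function name temporal_split and its parameters are hypothetical, and the repository's own script may be implemented differently (for example with pandas).

```python
import csv
from pathlib import Path


def temporal_split(raw_path, train_path, prod_path, threshold=2010, year_column="year"):
    """Split a CSV by year: rows <= threshold go to train, later rows to prod.

    Hypothetical helper for illustration; the column name and the direction
    of the split are assumptions, not taken from the repository.
    """
    with open(raw_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # float() first so values such as "2010.0" are also accepted.
    train = [r for r in rows if int(float(r[year_column])) <= threshold]
    prod = [r for r in rows if int(float(r[year_column])) > threshold]

    for path, subset in ((train_path, train), (prod_path, prod)):
        Path(path).parent.mkdir(parents=True, exist_ok=True)
        with open(path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.DictWriter(dst, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(subset)

    return len(train), len(prod)
```

Calling temporal_split("data/raw.csv", "data/train.csv", "data/prod_sim.csv") returns the two row counts, which is a quick sanity check that neither split is empty.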
MLflow Model Registry
The evaluate stage calls the MLflow client API to create a registered model named spotify-genre-classifier and sets the @champion alias on the best version. From this point on, any process can reference the model by alias rather than by run ID.
Docker container build
The model_serving/Dockerfile accepts MLFLOW_TRACKING_URI as a build argument and runs mlflow models download -m "models:/spotify-genre-classifier@champion" to bake the champion model into the image at build time. The running container needs no live MLflow connection.
FastAPI inference
The containerised FastAPI application exposes POST /predict. Each request is validated by a Pydantic schema and routed to the baked-in model. The response includes the predicted genre and a confidence score.
JSONL request logging
A FastAPI middleware intercepts every prediction request and appends a timestamped JSON line to model_serving/logs/api_requests.jsonl. This file is the input for online drift analysis.
Key Configuration Files
| File | Location | Purpose |
|---|---|---|
| dvc.yaml | data_pipeline/dvc.yaml | Declares the four pipeline stages, their commands, dependencies, parameters, and outputs. Edit this if you add a new pipeline step. |
| params.yaml | data_pipeline/params.yaml | Stores the train_year_threshold (2010) and all model hyperparameters (C, max_depth, learning_rate, etc.). DVC tracks parameter changes and re-runs only affected stages. |
| .env | Repository root | Sets MLFLOW_TRACKING_URI (and optionally DVC_REMOTE_URL). Created by copying .env.example. Never commit this file. |
| ci.yml | .github/workflows/ci.yml | Runs flake8, pytest data_pipeline/tests, and pytest model_serving/tests on every push to a PR. A green check is worth 1 grading point. |
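As an illustration of how a script might pick up the settings in .env, the sketch below parses simple KEY=VALUE lines and fails loudly when MLFLOW_TRACKING_URI is missing. Both function names are hypothetical, the parser handles only the plain KEY=VALUE form, and the repository may load the file another way (the python-dotenv package is a common choice).

```python
def parse_env_file(path):
    """Parse KEY=VALUE lines, skipping blank lines and # comments.

    Hypothetical helper; it does not handle multi-line or escaped values.
    """
    values = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop optional surrounding quotes around the value.
            values[key.strip()] = value.strip().strip("\"'")
    return values


def require_tracking_uri(values):
    """Return MLFLOW_TRACKING_URI or raise if it is unset or empty."""
    uri = values.get("MLFLOW_TRACKING_URI")
    if not uri:
        raise RuntimeError("MLFLOW_TRACKING_URI is not set; copy .env.example to .env")
    return uri
```

Failing early with a clear message is usually easier to debug than a connection error raised deep inside a pipeline stage.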
Component Overview
Data Pipeline
DVC-orchestrated four-stage pipeline that takes the raw Kaggle CSV from ingestion through temporal splitting, model training, and champion registration.
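The champion registration at the end of this pipeline could look roughly like the sketch below. It assumes each run logs a metric named accuracy and stores its model under the artifact path model; both names, and the two function names, are assumptions for illustration. The MLflow calls used (mlflow.register_model and MlflowClient.set_registered_model_alias) require a reachable tracking server and an MLflow version that supports model aliases.

```python
def pick_best_run(runs, metric="accuracy"):
    """Return the run dict with the highest value for `metric`.

    `runs` is a list of dicts shaped like
    {"run_id": "...", "metrics": {"accuracy": 0.91}} (hypothetical shape).
    """
    scored = [r for r in runs if metric in r["metrics"]]
    if not scored:
        raise ValueError(f"no run has logged the metric {metric!r}")
    return max(scored, key=lambda r: r["metrics"][metric])


def register_champion(run_id, model_name="spotify-genre-classifier",
                      alias="champion", artifact_path="model"):
    """Register the run's model and point the alias at the new version.

    Imported lazily so the pure selection logic above has no MLflow dependency.
    The artifact path "model" is an assumption about how the run was logged.
    """
    import mlflow
    from mlflow.tracking import MlflowClient

    version = mlflow.register_model(f"runs:/{run_id}/{artifact_path}", model_name)
    MlflowClient().set_registered_model_alias(model_name, alias, version.version)
    return version.version
```

Keeping the selection logic separate from the registry calls makes it possible to unit-test the "highest accuracy wins" rule without a tracking server.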
Model Serving
FastAPI application packaged in Docker. Validates audio features with Pydantic, runs inference against the baked-in champion model, and logs every request for drift analysis.
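The request logging mentioned here boils down to appending one timestamped JSON object per line. The sketch below shows that idea with the standard library only; in the real service the call would sit inside a FastAPI middleware, and the function names and the record fields (timestamp, request) are assumptions rather than the project's actual schema.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_request_log(log_path, payload):
    """Append one timestamped JSON line describing a prediction request."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request": payload,
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier requests; one JSON object per line (JSONL).
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def read_request_log(log_path):
    """Load every logged record back, e.g. as input for drift analysis."""
    with open(log_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because each line is an independent JSON document, a drift script can stream the file line by line without loading a single large JSON array.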
Drift Monitoring
Offline KS-test script that compares training-data distributions to either the held-out production split or live API logs, flagging features whose distribution has shifted.
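To make the KS-test comparison concrete, here is a dependency-free sketch of the two-sample Kolmogorov-Smirnov statistic together with the standard large-sample critical value. The function names and the 0.05 significance level are assumptions for illustration; a real script would more likely call a library routine such as scipy.stats.ks_2samp, which also returns a p-value.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples."""
    ref, cur = sorted(reference), sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    for v in ref + cur:
        # Fraction of each sample that is <= v.
        f_ref = bisect_right(ref, v) / len(ref)
        f_cur = bisect_right(cur, v) / len(cur)
        d = max(d, abs(f_ref - f_cur))
    return d


def has_drifted(reference, current, alpha=0.05):
    """Flag drift when the statistic exceeds the asymptotic critical value."""
    n, m = len(reference), len(current)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


def drifted_features(train, prod, alpha=0.05):
    """Names of features (dict keys) whose distributions differ."""
    return [name for name in train if has_drifted(train[name], prod[name], alpha)]
```

The critical value is a large-sample approximation, so it is only a rough guide when either sample has just a handful of rows.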
CI/CD
GitHub Actions workflow that lints with flake8 and runs both test suites automatically on every push to a pull request.