

The MLOps Fundamentals Homework takes you from raw Kaggle data to a deployed, monitored machine learning system. Across four stages, you will implement a DVC-orchestrated training pipeline, track experiments with MLflow, serve predictions through a FastAPI application packaged in Docker, and detect data drift with Kolmogorov-Smirnov tests, mirroring the work of a production MLOps role.

Introduction

Understand the homework structure, learning objectives, and what you will build.

Setup

Install dependencies, configure environment variables, and download the dataset.

Project Structure

Explore the monorepo layout and understand how the three subsystems connect.

How to Submit

Fork the repo, open a PR, and get a green CI checkmark before the deadline.

The Four Stages

1. Data Pipeline (6 pts)

Use DVC to orchestrate a four-stage pipeline: load → process → train → evaluate. Split the 550k Spotify Songs dataset at the 2010 streaming era boundary, train Logistic Regression and XGBoost classifiers, and register the champion model in the MLflow Model Registry.
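
The split at the 2010 boundary can be sketched in a few lines. This is a minimal illustration, not the homework's reference solution: the `year` column name, the `split_by_year` helper, and the choice to send pre-2010 rows to training and later rows to production are all assumptions, since the text above does not specify them. The actual pipeline would likely operate on a DataFrame inside a DVC stage rather than on plain dictionaries.

```python
# Assumption: each row carries an integer "year" field. The real column name
# and which side of the boundary becomes training data are not stated above.
BOUNDARY_YEAR = 2010


def split_by_year(rows, boundary=BOUNDARY_YEAR, year_key="year"):
    """Return (train_rows, production_rows) split at `boundary`."""
    train, production = [], []
    for row in rows:
        # Rows released before the boundary go to training; rows from the
        # boundary year onward are treated as later "production" data.
        if int(row[year_key]) < boundary:
            train.append(row)
        else:
            production.append(row)
    return train, production


# Tiny hypothetical sample standing in for the Spotify dataset.
songs = [
    {"track": "a", "year": 1998},
    {"track": "b", "year": 2009},
    {"track": "c", "year": 2010},
    {"track": "d", "year": 2021},
]
train_rows, production_rows = split_by_year(songs)
print(len(train_rows), len(production_rows))
```

Splitting on time rather than at random keeps later data out of training, which is what makes the later train vs. production drift comparison meaningful.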

2. Model Serving (5 pts)

Implement a FastAPI application that exposes GET /health and POST /predict endpoints. Add a logging middleware that writes every prediction request to a JSONL file, and containerize the service with Docker — baking the champion model into the image at build time.
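
The request log mentioned above is a JSONL file: one JSON object per line, appended on every prediction. The sketch below shows only that logging step using the standard library; wiring it into FastAPI middleware is left out. The function names, the record fields (`timestamp`, `features`, `prediction`), and the feature names in the demo are hypothetical, not taken from the homework repository.

```python
import json
import tempfile
import time
from pathlib import Path


def log_prediction(log_path, features, prediction):
    """Append one prediction request to a JSONL file (one JSON object per line)."""
    record = {
        "timestamp": time.time(),
        "features": features,      # request payload as received
        "prediction": prediction,  # model output returned to the caller
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier requests; each record ends with a newline.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def read_log(log_path):
    """Load every logged request back as a list of dicts."""
    with Path(log_path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo in a temporary directory so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "logs" / "requests.jsonl"
    log_prediction(demo_path, {"feature_a": 0.7, "feature_b": 0.4}, 1)
    log_prediction(demo_path, {"feature_a": 0.2, "feature_b": 0.9}, 0)
    demo_records = read_log(demo_path)
    print(len(demo_records))
```

Because every line is an independent JSON document, the file can be read back incrementally, which is convenient for the online drift check that consumes these logs.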

3. Drift Monitoring (3 pts)

Run Kolmogorov-Smirnov tests across all 12 audio features in two modes: batch (comparing train vs. production CSV splits) and online (comparing training data against live API request logs).
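
A per-feature drift check with `scipy.stats.ks_2samp` might look like the sketch below. The `detect_drift` helper, the 0.05 significance level, the report layout, and the synthetic features are assumptions made for illustration; the homework may define its own threshold and output format. The same function serves both modes, since only the source of the `current` samples changes (a production CSV split or values parsed from the request log).

```python
import random

from scipy.stats import ks_2samp

ALPHA = 0.05  # assumed significance level, not specified in the text above


def detect_drift(reference, current, alpha=ALPHA):
    """Run a two-sample KS test for each feature present in both datasets.

    `reference` and `current` map feature names to lists of numeric values.
    """
    report = {}
    for feature in sorted(set(reference) & set(current)):
        result = ks_2samp(reference[feature], current[feature])
        report[feature] = {
            "statistic": float(result.statistic),
            "p_value": float(result.pvalue),
            # A small p-value means the two samples are unlikely to share
            # the same underlying distribution.
            "drift": bool(result.pvalue < alpha),
        }
    return report


# Synthetic data: feature_a is unchanged, feature_b has a shifted mean.
rng = random.Random(0)
stable = [rng.gauss(0.0, 1.0) for _ in range(500)]
reference = {
    "feature_a": stable,
    "feature_b": [rng.gauss(0.0, 1.0) for _ in range(500)],
}
current = {
    "feature_a": list(stable),
    "feature_b": [rng.gauss(3.0, 1.0) for _ in range(500)],
}
report = detect_drift(reference, current)
for name, row in report.items():
    print(name, row["drift"])
```

With 12 features tested at once, some false alarms are expected at a fixed threshold, so a multiple-testing correction may be worth considering if the assignment allows it.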

4. Testing & CI/CD (4 pts)

Every push to your pull request triggers a GitHub Actions workflow that runs flake8 linting and pytest for both the data pipeline and model serving modules. Unit tests earn 2 pts, code quality 1 pt, and a green Actions checkmark 1 pt.

Key Technologies

DVC

Data version control and pipeline orchestration via dvc.yaml and params.yaml.

MLflow

Experiment tracking, model registry, and the @champion alias for deployment.

FastAPI

Async REST API with Pydantic validation and HTTP middleware for request logging.

Docker

Self-contained container image with the champion model baked in at build time.

scikit-learn & XGBoost

Logistic Regression and XGBoost classifiers with StandardScaler preprocessing.

SciPy KS Test

scipy.stats.ks_2samp to detect distribution shift across 12 audio features.

Your grade depends on a passing CI run on your pull request. Implement all TODOs, push your changes, and verify the GitHub Actions checkmark before submitting.
