This page walks you through everything you need to do before running a single line of the pipeline. You will install the three independent requirements files, copy and populate the .env configuration, start the MLflow tracking server, download the Kaggle dataset, and verify that DVC sees the correct file hash. Complete each step in order: every subsequent stage in the pipeline depends on the outputs produced here.
Prerequisites
Make sure the following are available on your machine before you begin.

| Requirement | Minimum Version | Notes |
|---|---|---|
| Python | 3.9+ | Run python --version to check |
| pip | Latest | Bundled with Python |
| Git | Any recent | Needed to fork and clone the repo |
| Docker | 20.10+ | Required for the model-serving container |
| Kaggle account | — | Free at kaggle.com; API key required |
You will need approximately 2–3 GB of free disk space to store the raw dataset, DVC-tracked outputs, and MLflow model artifacts.
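The Python version and disk space requirements above can be checked with a short script. This is a minimal sketch using only the standard library; the function name check_prerequisites and the 3 GB threshold (the upper end of the estimate above) are illustrative choices, not part of the project.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)   # minimum version from the prerequisites table
MIN_FREE_GB = 3       # assumption: upper end of the 2-3 GB estimate


def check_prerequisites(path=".", version=None, free_bytes=None):
    """Return a list of human-readable problems; an empty list means OK."""
    version = tuple(version or sys.version_info[:2])
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    problems = []
    if version < MIN_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} is older than 3.9")
    if free_bytes < MIN_FREE_GB * 1024**3:
        problems.append(f"less than {MIN_FREE_GB} GB free at {path!r}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

The version and free_bytes parameters exist only so the check can be exercised without changing the machine; called with no arguments, it inspects the running interpreter and the current directory.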
Install and Configure
Install the data pipeline dependencies
The data pipeline requirements cover pandas, scikit-learn, XGBoost, MLflow, DVC, and PyYAML.
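After the install, you can confirm that the packages listed above resolve in the active environment. The sketch below is an illustrative check, not a script shipped with the project. It uses the import names of each package, which differ from the distribution names in two cases (scikit-learn imports as sklearn, PyYAML as yaml).

```python
from importlib.util import find_spec

# Import names for the packages named in the paragraph above.
DATA_PIPELINE_MODULES = ["pandas", "sklearn", "xgboost", "mlflow", "dvc", "yaml"]


def missing_modules(modules):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(DATA_PIPELINE_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```

find_spec only locates a module without importing it, so the check stays fast even for heavy packages such as XGBoost.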
Install the model serving dependencies
The model serving requirements add FastAPI, Uvicorn, and Pydantic v2 on top of the shared ML stack.
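Because the serving stack depends on Pydantic v2 specifically, it can be worth confirming the installed major version once the install finishes. A minimal sketch using the standard library's importlib.metadata; the helper names parse_major and installed_major are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_major(version_string):
    """Extract the leading integer from a version string such as '2.7.1'."""
    return int(version_string.split(".")[0])


def installed_major(dist_name):
    """Return the installed major version of a distribution, or None if absent."""
    try:
        return parse_major(version(dist_name))
    except PackageNotFoundError:
        return None


# Example: installed_major("pydantic") == 2 is the result you want to see.
```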
Install the drift monitoring dependencies
The drift monitoring requirements add SciPy for the Kolmogorov–Smirnov tests.
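For intuition about what the Kolmogorov–Smirnov test measures, the two-sample statistic is the largest vertical gap between the empirical distribution functions of two samples. The pure-Python sketch below illustrates that definition only; it is not the project's monitoring code, which relies on SciPy, and the function name ks_statistic is hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (the K-S statistic D)."""
    ref, cur = sorted(reference), sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # Both CDFs are step functions that only change at observed values,
    # so it is enough to compare them at every data point.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)   # fraction of ref <= x
        cdf_cur = bisect_right(cur, x) / len(cur)   # fraction of cur <= x
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap
```

Identical samples give 0.0 and fully separated samples give 1.0. In practice scipy.stats.ks_2samp returns this statistic together with a p-value, which is what a drift check would normally threshold on.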
Install the Kaggle CLI
The Kaggle CLI is not listed in any requirements file because it is only needed for the one-time dataset download.
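Once the CLI is installed, a quick sanity check is to confirm that the kaggle executable is on your PATH and that the credentials file used later on this page (~/.kaggle/kaggle.json) exists. The sketch below is illustrative; kaggle_problems is a hypothetical helper, and its parameters exist only to make it easy to try against a different home directory.

```python
import shutil
from pathlib import Path


def kaggle_problems(home=None, which=shutil.which):
    """Return a list of setup problems; an empty list means ready to download."""
    home = Path(home) if home is not None else Path.home()
    problems = []
    if which("kaggle") is None:
        problems.append("kaggle executable not found on PATH")
    if not (home / ".kaggle" / "kaggle.json").is_file():
        problems.append("missing ~/.kaggle/kaggle.json")
    return problems
```

Calling kaggle_problems() with no arguments inspects the current user's environment.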
Configure environment variables
Copy the provided example file to .env and edit the variables it contains to match your environment. Then load the variables into your current shell session.
Start the MLflow tracking server
Open a separate terminal and leave this process running for the duration of your work session. All training runs and model registrations are sent to this server.
The MLflow UI will be available at http://localhost:5000.
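Before moving on, you may want to confirm that something is actually answering at that address. The following is a minimal sketch using only the standard library; server_is_up is a hypothetical helper, and it only verifies that an HTTP server responds, not that the responder is MLflow.

```python
import urllib.error
import urllib.request


def server_is_up(url="http://localhost:5000", timeout=3.0):
    """Return True if an HTTP server answers at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False
```

If server_is_up() returns False, check that the terminal running the tracking server is still open and that nothing else is bound to port 5000.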
Download the dataset from Kaggle
The model is trained on the 550k Spotify Songs dataset:
https://www.kaggle.com/datasets/serkantysz/550k-spotify-songs-audio-lyrics-and-genres
First authenticate with the Kaggle API. You will need credentials from kaggle.com/settings/account saved as ~/.kaggle/kaggle.json.
Then download and unzip the dataset, and move the CSV into the expected location.
The DVC pipeline expects the file at data_pipeline/songs.csv. The filename must be exactly songs.csv.
Verify dataset integrity with DVC
Change into the data_pipeline/ directory and run dvc status to confirm DVC recognises the file you downloaded.
After running dvc repro, the hash for songs.csv is recorded in dvc.lock. If dvc status songs.csv.dvc shows a mismatch, you have a different version of the dataset and the grader’s hash check will fail; re-download from the Kaggle link above.
Once the file is verified, run the full pipeline with dvc repro. DVC will execute the four stages in order (load, process, train, evaluate) and cache each output for future runs.
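If you want to inspect the hash yourself, the sketch below computes the MD5 digest of a file and checks whether it appears in a lock file. It assumes DVC's default MD5 content hashing over raw bytes, as in DVC 3.x. Earlier DVC versions normalised line endings in text files before hashing, so a mismatch under an older version is not conclusive. The helper names are hypothetical, and dvc status remains the authoritative check.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Stream the file through MD5 so a large CSV need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_recorded(md5_hex, lock_path):
    """True if the digest appears anywhere in the lock file's text."""
    return md5_hex in Path(lock_path).read_text(encoding="utf-8")


# Example, run from inside data_pipeline/ after dvc repro:
#   hash_recorded(file_md5("songs.csv"), "dvc.lock")
```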