This page walks you through everything you need to do before running a single line of the pipeline. You will install the three independent requirements files, copy and populate the .env configuration, start the MLflow tracking server, download the Kaggle dataset, and verify that DVC sees the correct file hash. Complete the steps in order, because every later stage of the pipeline depends on the outputs produced here.

Prerequisites

Make sure the following are available on your machine before you begin.
| Requirement | Minimum Version | Notes |
| --- | --- | --- |
| Python | 3.9+ | python --version to check |
| pip | Latest | Bundled with Python |
| Git | Any recent | Needed to fork and clone the repo |
| Docker | 20.10+ | Required for the model-serving container |
| Kaggle account | - | Free at kaggle.com; API key required |
You will need approximately 2–3 GB of free disk space to store the raw dataset, DVC-tracked outputs, and MLflow model artifacts.
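The checks in the table can be scripted. The following is a minimal sketch, not part of the repository: the function name check_prerequisites and the 3 GB threshold (the upper end of the estimate above) are assumptions made for illustration. It only confirms that git and docker are on PATH and does not parse the Docker version.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)               # minimum version from the prerequisites table
MIN_FREE_GB = 3                   # assumption: upper end of the 2-3 GB estimate
REQUIRED_TOOLS = ("git", "docker")


def check_prerequisites(path="."):
    """Return a list of human-readable problems; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        found = sys.version.split()[0]
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found {found}")
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < MIN_FREE_GB:
        problems.append(f"only {free_gb:.1f} GB free, about {MIN_FREE_GB} GB recommended")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Returning a list rather than raising on the first failure lets you see every missing prerequisite in one run.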

Install and Configure

1. Create a virtual environment

Isolate project dependencies from your system Python.
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
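To confirm that the activation took effect, you can ask the interpreter itself. This is a small sketch (the helper name in_virtualenv is hypothetical) that relies on the standard behaviour that sys.prefix differs from sys.base_prefix inside a venv.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv (its prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

If the first line prints False, the later pip install commands would write into your system Python instead of the project environment.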
2. Install the data pipeline dependencies

The data pipeline requirements cover pandas, scikit-learn, XGBoost, MLflow, DVC, and PyYAML.
pip install -r data_pipeline/requirements.txt
3. Install the model serving dependencies

The model serving requirements add FastAPI, Uvicorn, and Pydantic v2 on top of the shared ML stack.
pip install -r model_serving/requirements.txt
4. Install the drift monitoring dependencies

The drift monitoring requirements add SciPy for the Kolmogorov–Smirnov tests.
pip install -r drift_monitoring/requirements.txt
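Once the three requirements files are installed, a quick import check can catch a failed or partial install early. The sketch below is illustrative only: the module list is derived from the package names mentioned in the steps above (using import names such as sklearn and yaml rather than pip names), and the requirements files may contain additional packages that it does not cover.

```python
import importlib.util

# Import names (not pip names) for the packages named in the install steps.
EXPECTED_MODULES = [
    "pandas", "sklearn", "xgboost", "mlflow", "dvc", "yaml",   # data pipeline
    "fastapi", "uvicorn", "pydantic",                          # model serving
    "scipy",                                                   # drift monitoring
]


def missing_modules(names=EXPECTED_MODULES):
    """Return the top-level modules that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("all expected modules found" if not missing else f"missing: {missing}")
```

find_spec locates a module without importing it, so the check stays fast even for heavy packages.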
5. Install the Kaggle CLI

The Kaggle CLI is not listed in any requirements file because it is only needed for the one-time dataset download.
pip install kaggle
6. Configure environment variables

Copy the provided example file and edit it to match your environment.
cp .env.example .env
The .env file contains the following variables:
# MLflow tracking server URI
# Default: http://localhost:5000 (local server)
MLFLOW_TRACKING_URI=http://localhost:5000

# Optional: DVC remote storage (if using cloud storage)
# DVC_REMOTE_URL=s3://your-bucket/dvc-storage
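Load the variables into your current shell session. A plain source .env defines them without exporting them, so wrap the call in set -a / set +a to make them visible to child processes such as mlflow and dvc:
set -a; source .env; set +a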
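The same KEY=VALUE file can also be read from Python, which is convenient on shells that cannot source it. This is a minimal sketch under stated assumptions: load_env_file is a hypothetical helper, not something the repository provides, and it handles only simple KEY=VALUE lines and # comments (no variable expansion or multi-line values).

```python
import os


def load_env_file(path=".env", override=False):
    """Parse simple KEY=VALUE lines and copy them into os.environ.

    Returns a dict of the keys found in the file with their effective values.
    """
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blank lines, comments, and anything that is not an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key = key.strip()
            value = value.strip().strip('"').strip("'")
            # By default an already-set environment variable wins over the file.
            if override or key not in os.environ:
                os.environ[key] = value
            loaded[key] = os.environ[key]
    return loaded


if __name__ == "__main__":
    print(load_env_file())
```

Commented-out entries such as the optional DVC_REMOTE_URL line are ignored until you uncomment them.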
7. Start the MLflow tracking server

Open a separate terminal and leave this process running for the duration of your work session. All training runs and model registrations are sent to this server.
mlflow server --host 0.0.0.0 --port 5000
The MLflow UI will be available at http://localhost:5000.
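Before moving on, it can help to confirm that the server is actually answering. The sketch below polls the server's /health endpoint using only the standard library. The helper name tracking_server_is_up is hypothetical, and the sketch assumes your MLflow version exposes /health; if it does not, opening the UI address in a browser is an equivalent check.

```python
import os
import urllib.error
import urllib.request


def tracking_server_is_up(base_url=None, timeout=3.0):
    """Return True if GET <base_url>/health answers with HTTP 200, otherwise False."""
    base_url = base_url or os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000")
    try:
        url = base_url.rstrip("/") + "/health"
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # URLError/OSError: refused, timed out, or HTTP error status.
        # ValueError: the URI is not an http(s) URL (for example a local path).
        return False


if __name__ == "__main__":
    print("MLflow tracking server reachable:", tracking_server_is_up())
```

The function falls back to MLFLOW_TRACKING_URI from the environment, so it checks the same address the pipeline will use.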
8. Download the dataset from Kaggle

The model is trained on the 550k Spotify Songs dataset: https://www.kaggle.com/datasets/serkantysz/550k-spotify-songs-audio-lyrics-and-genres
First, authenticate with the Kaggle API. You will need credentials from kaggle.com/settings/account saved as ~/.kaggle/kaggle.json:
kaggle auth
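If the download later fails with an authentication error, the credentials file is the first thing to inspect. This is an illustrative sketch (check_kaggle_credentials is a hypothetical helper) that assumes the usual kaggle.json layout with username and key fields, and that also flags a file readable by other users on POSIX systems.

```python
import json
import os
import stat


def check_kaggle_credentials(path=None):
    """Return a list of problems found with the Kaggle credentials file (empty if none)."""
    path = path or os.path.join(os.path.expanduser("~"), ".kaggle", "kaggle.json")
    if not os.path.isfile(path):
        return [f"{path} not found"]
    try:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return [f"{path} should contain a JSON object"]
    problems = [f"missing or empty field: {field}"
                for field in ("username", "key") if not data.get(field)]
    # On POSIX, warn when group or others have any access to the API key.
    if os.name == "posix" and stat.S_IMODE(os.stat(path).st_mode) & 0o077:
        problems.append("file is accessible to other users; consider chmod 600")
    return problems


if __name__ == "__main__":
    for problem in check_kaggle_credentials():
        print("WARNING:", problem)
```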
Then download and unzip the dataset:
kaggle datasets download -d serkantysz/550k-spotify-songs-audio-lyrics-and-genres
unzip 550k-spotify-songs-audio-lyrics-and-genres.zip
Move the CSV into the expected location:
mv songs.csv data_pipeline/songs.csv
The DVC pipeline expects the file at data_pipeline/songs.csv. The filename must be exactly songs.csv.
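A short check that the file landed in the right place, and that it opens as a CSV, can save a failed pipeline run. The sketch below is an assumption-laden illustration: inspect_dataset is a hypothetical helper, it assumes the file is UTF-8 encoded, and it reads only the header row rather than the whole file. Run it from the repository root.

```python
import csv
import os


def inspect_dataset(path=os.path.join("data_pipeline", "songs.csv")):
    """Return (size_in_bytes, column_names); raises FileNotFoundError if the file is missing."""
    size = os.path.getsize(path)
    with open(path, newline="", encoding="utf-8") as handle:
        # Read just the first row so large files are not loaded into memory.
        header = next(csv.reader(handle), [])
    return size, header


if __name__ == "__main__":
    size, header = inspect_dataset()
    print(f"{size / 1024**2:.1f} MiB, {len(header)} columns: {header}")
```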
9. Verify dataset integrity with DVC

Change into the data_pipeline/ directory and run dvc status to confirm DVC recognises the file you downloaded.
cd data_pipeline
dvc status songs.csv.dvc
After running dvc repro, the hash for songs.csv is recorded in dvc.lock. If dvc status songs.csv.dvc shows a mismatch, you have a different version of the dataset and the grader’s hash check will fail. In that case, re-download from the Kaggle link above.
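To see what the comparison involves, you can hash the file yourself and compare it with the value stored in the .dvc file. This is a rough sketch and not a substitute for dvc status: the helper names are hypothetical, it assumes the .dvc file records a plain MD5 of the file bytes under an md5: key, and older DVC releases may normalise line endings in text files before hashing, in which case the raw MD5 can differ even for a correct download.

```python
import hashlib
import re


def file_md5(path, chunk_size=1024 * 1024):
    """Stream the file in chunks so a multi-gigabyte CSV does not need to fit in memory."""
    digest = hashlib.md5(usedforsecurity=False)  # keyword available on Python 3.9+
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recorded_md5(dvc_file):
    """Extract the first 'md5:' value from a .dvc file without needing a YAML parser."""
    with open(dvc_file, encoding="utf-8") as handle:
        match = re.search(r"^\s*-?\s*md5:\s*([0-9a-f]{32})\s*$", handle.read(), re.MULTILINE)
    return match.group(1) if match else None


def dataset_matches(csv_path="songs.csv", dvc_file="songs.csv.dvc"):
    """True when the file's MD5 equals the hash recorded in the .dvc file."""
    expected = recorded_md5(dvc_file)
    return expected is not None and file_md5(csv_path) == expected


if __name__ == "__main__":
    # Run from inside data_pipeline/, as in the step above.
    print("hash matches:", dataset_matches())
```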
Once the file is verified, run the full pipeline:
dvc repro
DVC will execute the four stages in order (load, process, train, evaluate) and cache each output for future runs.