Environment Setup for the MLOps Fundamentals Homework

This page walks you through everything you need to do before running a single line of the pipeline. You will install the three independent requirements files, copy and populate the .env configuration, start the MLflow tracking server, download the Kaggle dataset, and verify that DVC sees the correct file hash. Complete each step in order — every subsequent stage in the pipeline depends on the outputs produced here.

Prerequisites

Make sure the following are available on your machine before you begin.

Requirement	Minimum Version	Notes
Python	3.9+	`python --version` to check
pip	Latest	Bundled with Python
Git	Any recent	Needed to fork and clone the repo
Docker	20.10+	Required for the model-serving container
Kaggle account	—	Free at kaggle.com; API key required

You will need approximately 2–3 GB of free disk space to store the raw dataset, DVC-tracked outputs, and MLflow model artifacts.
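The checks above can be scripted. The following is a minimal sketch using only the Python standard library; the function name check_prerequisites and the 3 GB threshold are illustrative choices, not part of the homework repository. It confirms that the tools are on your PATH but does not verify the Docker version.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)               # from the prerequisites table
MIN_FREE_GB = 3.0                 # upper end of the 2-3 GB estimate (assumption)
REQUIRED_TOOLS = ("git", "docker")


def check_prerequisites(path="."):
    """Return a dict mapping each check name to True or False."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    results = {
        "python": sys.version_info[:2] >= MIN_PYTHON,
        "disk": free_gb >= MIN_FREE_GB,
    }
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name:8s} {'ok' if ok else 'missing or below minimum'}")
```

Run it from the repository root so that the disk check measures the volume where the dataset will be stored.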

Install and Configure

Create a virtual environment

Isolate project dependencies from your system Python.

python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate

Install the data pipeline dependencies

The data pipeline requirements cover pandas, scikit-learn, XGBoost, MLflow, DVC, and PyYAML.

pip install -r data_pipeline/requirements.txt

Install the model serving dependencies

The model serving requirements add FastAPI, Uvicorn, and Pydantic v2 on top of the shared ML stack.

pip install -r model_serving/requirements.txt

Install the drift monitoring dependencies

The drift monitoring requirements add SciPy for the Kolmogorov–Smirnov tests.

pip install -r drift_monitoring/requirements.txt

Install the Kaggle CLI

The Kaggle CLI is not listed in any requirements file because it is only needed for the one-time dataset download.

pip install kaggle

Configure environment variables

Copy the provided example file and edit it to match your environment.

cp .env.example .env

The .env file contains the following variables:

# MLflow tracking server URI
# Default: http://localhost:5000 (local server)
MLFLOW_TRACKING_URI=http://localhost:5000

# Optional: DVC remote storage (if using cloud storage)
# DVC_REMOTE_URL=s3://your-bucket/dvc-storage

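Load the variables into your current shell session. A plain source .env only sets shell variables; wrapping it in set -a and set +a also exports them, so child processes such as Python, MLflow, and DVC can read them:

set -a; source .env; set +a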
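If a script is launched outside that shell session, the same file can be read from Python instead. This is a minimal standard-library sketch, not the loader the homework code uses; parse_env and load_env are hypothetical helper names, and the parser handles only simple KEY=VALUE lines like those shown above.

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env(path=".env"):
    """Copy variables from the file into os.environ without overwriting."""
    with open(path, encoding="utf-8") as fh:
        for key, value in parse_env(fh.read()).items():
            os.environ.setdefault(key, value)
```

Calling load_env() at the top of a script makes MLFLOW_TRACKING_URI available through os.environ. Values already set in the shell take precedence because setdefault does not overwrite them.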

Start the MLflow tracking server

Open a separate terminal and leave this process running for the duration of your work session. All training runs and model registrations are sent to this server.

mlflow server --host 0.0.0.0 --port 5000

The MLflow UI will be available at http://localhost:5000.
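Before starting a training run, you can confirm that the server is reachable. The sketch below assumes the tracking server exposes its /health endpoint at the configured URI; mlflow_is_up is a hypothetical helper name, not part of MLflow or the homework code.

```python
import os
import urllib.request


def mlflow_is_up(base_url=None, timeout=2.0):
    """Return True if the tracking server answers its /health endpoint."""
    base_url = base_url or os.environ.get(
        "MLFLOW_TRACKING_URI", "http://localhost:5000"
    )
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP error
        # statuses; ValueError covers URIs that are not http(s) URLs
        return False
```

A False result usually means the server terminal was closed or MLFLOW_TRACKING_URI points at a different port.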

Download the dataset from Kaggle

The model is trained on the 550k Spotify Songs dataset: https://www.kaggle.com/datasets/serkantysz/550k-spotify-songs-audio-lyrics-and-genres

First authenticate with the Kaggle API. You will need credentials from kaggle.com/settings/account saved as ~/.kaggle/kaggle.json:

kaggle auth
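If authentication fails, the token file is the first thing to inspect. The following sketch checks that ~/.kaggle/kaggle.json exists, contains the username and key fields, and is not readable by other users. The helper name check_kaggle_token is hypothetical, and the permission check assumes a POSIX system.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path=None):
    """Return a list of problems found with the token file (empty if none)."""
    path = Path(path) if path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return [f"{path} does not exist"]
    try:
        token = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{path} does not contain a JSON object"]
    problems = []
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing field: {field}")
    # Flag any group/other permission bits on POSIX systems
    if os.name == "posix" and stat.S_IMODE(path.stat().st_mode) & 0o077:
        problems.append("file is accessible to other users; run chmod 600")
    return problems
```

An empty list means the file looks well formed. It does not prove that the key is still valid on Kaggle's side.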

Then download and unzip the dataset:

kaggle datasets download -d serkantysz/550k-spotify-songs-audio-lyrics-and-genres
unzip 550k-spotify-songs-audio-lyrics-and-genres.zip

Move the CSV into the expected location:

mv songs.csv data_pipeline/songs.csv

The DVC pipeline expects the file at data_pipeline/songs.csv. The filename must be exactly songs.csv.

Verify dataset integrity with DVC

Change into the data_pipeline/ directory and run dvc status to confirm DVC recognises the file you downloaded.

cd data_pipeline
dvc status songs.csv.dvc

After running dvc repro, the hash for songs.csv is recorded in dvc.lock. If dvc status songs.csv.dvc shows a mismatch, you have a different version of the dataset and the grader’s hash check will fail — re-download from the Kaggle link above.
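To see what a hash comparison involves, you can compute the file's MD5 yourself and compare it with the value stored in the DVC metadata. This sketch assumes songs.csv.dvc records a plain md5 entry for a single-file output. file_md5 and recorded_md5 are hypothetical helpers, and this is not how the grader is known to work. Some older DVC releases normalised line endings before hashing text files, so a mismatch here is a hint to investigate, not proof of a wrong dataset.

```python
import hashlib
import re


def file_md5(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so a large CSV never sits in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recorded_md5(dvc_text):
    """Pull the first 'md5: <32 hex chars>' entry out of .dvc file text."""
    match = re.search(r"\bmd5:\s*([0-9a-f]{32})\b", dvc_text)
    return match.group(1) if match else None
```

From inside data_pipeline/, comparing file_md5("songs.csv") with recorded_md5(open("songs.csv.dvc").read()) shows whether the two values agree.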

Once the file is verified, run the full pipeline:

dvc repro

DVC will execute the four stages in order — load, process, train, evaluate — and cache each output for future runs.
