The data pipeline is a four-stage, DVC-orchestrated workflow that takes a raw Kaggle Spotify CSV all the way through to a registered champion model in MLflow. Each stage has a single responsibility and hands its output to the next: load ingests the raw file, process splits it into temporal train/prod sets, train fits Logistic Regression and XGBoost classifiers with MLflow tracking (student-implemented), and evaluate picks the winner and registers it in the Model Registry (student-implemented).
Pipeline Stages
| Stage | Script | Input | Output |
|---|---|---|---|
| load | src/load.py | songs.csv | data/raw.csv |
| process | src/process.py | data/raw.csv | data/train.csv, data/prod_sim.csv |
| train | src/train.py | data/train.csv | models/ |
| evaluate | src/evaluate.py | models/ | metrics.json |
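
The hand-off between stages can be checked mechanically. The sketch below is illustrative only: it copies the table above into a Python list and walks it in order, confirming that every stage's input is either the external `songs.csv` or the output of an earlier stage. The `STAGES` structure and the `check_chain` helper are hypothetical names introduced for this example; they are not part of the repository and do not read `dvc.yaml`.

```python
# Hypothetical mirror of the stage table above; not loaded from dvc.yaml.
STAGES = [
    {"name": "load", "inputs": ["songs.csv"], "outputs": ["data/raw.csv"]},
    {
        "name": "process",
        "inputs": ["data/raw.csv"],
        "outputs": ["data/train.csv", "data/prod_sim.csv"],
    },
    {"name": "train", "inputs": ["data/train.csv"], "outputs": ["models/"]},
    {"name": "evaluate", "inputs": ["models/"], "outputs": ["metrics.json"]},
]


def check_chain(stages, external_inputs):
    """Return stage names in order, raising if an input has no producer."""
    available = set(external_inputs)
    order = []
    for stage in stages:
        missing = [path for path in stage["inputs"] if path not in available]
        if missing:
            raise ValueError(
                f"{stage['name']} needs {missing}, which no earlier stage produces"
            )
        # Outputs of this stage become valid inputs for later stages.
        available.update(stage["outputs"])
        order.append(stage["name"])
    return order


if __name__ == "__main__":
    print(check_chain(STAGES, external_inputs={"songs.csv"}))
```

Reordering the list so that a stage runs before its producer makes `check_chain` raise, which is roughly the kind of dependency error a pipeline tool guards against.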
Running the Pipeline
Use standard DVC commands from inside the data_pipeline/ directory, for example dvc repro to run the pipeline and dvc status to see which stages are stale.
Parameterization
All hyperparameters and data configuration live in params.yaml. Changing any value there causes DVC to mark the dependent stages as stale on the next dvc status or dvc repro.
data.train_year_threshold controls the temporal boundary used by process.py to create the intentional distribution shift between the training set (pre-streaming era) and the production simulation set (Spotify streaming era).
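
A temporal split of this kind can be sketched in a few lines. The example below is a minimal illustration, not the code in process.py: it assumes the release year lives in a column named `year`, that rows strictly before the threshold go to the training set, and it uses a placeholder threshold of 2010 with made-up rows. The real column name, boundary rule, and threshold value may differ.

```python
import csv
import io


def temporal_split(rows, year_threshold, year_column="year"):
    """Split rows into (train, prod_sim) around a release-year boundary.

    Assumption: rows released strictly before the threshold go to train,
    and everything else goes to the production simulation set.
    """
    train, prod_sim = [], []
    for row in rows:
        target = train if int(row[year_column]) < year_threshold else prod_sim
        target.append(row)
    return train, prod_sim


# Tiny made-up sample; the real columns come from the Kaggle CSV.
SAMPLE_CSV = """name,year
Song A,1995
Song B,2005
Song C,2015
Song D,2020
"""

if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
    # 2010 is a placeholder, not the value stored in params.yaml.
    train, prod_sim = temporal_split(rows, year_threshold=2010)
    print(len(train), "train rows;", len(prod_sim), "prod-sim rows")
```

Because the two sets never overlap in time, any change in how songs look after the boundary shows up as distribution shift between train and prod-sim, which is the effect the threshold is meant to create.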
Stage Definitions (dvc.yaml)
The full pipeline graph is declared in dvc.yaml.
Dependencies
Install all required packages before running the pipeline:

| Package | Version |
|---|---|
| pandas | >=2.0.0 |
| scikit-learn | >=1.2.0 |
| xgboost | >=1.7.0 |
| mlflow | >=2.3.0 |
| dvc | >=3.0.0 |
| pyyaml | >=6.0 |
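
One way to confirm the environment meets these minimums before running anything is a small check built on the standard library's `importlib.metadata`. The sketch below is an assumption-laden convenience, not part of the project: `version_tuple`, `meets_minimum`, and `find_problems` are hypothetical helpers, and the comparison only looks at leading numeric components, so it is cruder than a full PEP 440 parser.

```python
from importlib import metadata

# Minimum versions copied from the table above.
MINIMUM_VERSIONS = {
    "pandas": "2.0.0",
    "scikit-learn": "1.2.0",
    "xgboost": "1.7.0",
    "mlflow": "2.3.0",
    "dvc": "3.0.0",
    "pyyaml": "6.0",
}


def version_tuple(version):
    """Turn '2.3.0' into (2, 3, 0); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    have, need = version_tuple(installed), version_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def find_problems(minimums=MINIMUM_VERSIONS):
    """Return one message per package that is missing or too old."""
    problems = []
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed (need >={minimum})")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{package} {installed} is older than {minimum}")
    return problems


if __name__ == "__main__":
    for message in find_problems() or ["all minimum versions satisfied"]:
        print(message)
```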
Stage Pages
| Page | Description |
|---|---|
| Load Stage | Ingest the raw Kaggle CSV and write data/raw.csv with all columns intact. |
| Process Stage | Split data/raw.csv by release year into temporal train and prod-sim sets. |
| Train Stage | Train Logistic Regression and XGBoost classifiers with MLflow experiment tracking. |
| Evaluate Stage | Select the best run and register it as the @champion model in MLflow. |
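
The select-and-register step of the evaluate stage can be outlined as follows. This is a rough sketch under stated assumptions rather than a reference solution: the comparison metric (`f1`), the artifact path (`model`), the run IDs, the metric values, and the helper names `pick_champion` and `register_champion` are all placeholders invented for illustration. The two MLflow calls, `mlflow.register_model` and `MlflowClient.set_registered_model_alias`, are real APIs; aliases are available from MLflow 2.3.0, which matches the minimum version listed above.

```python
# Made-up runs for illustration; real values would come from MLflow tracking.
EXAMPLE_RUNS = [
    {"run_id": "run-logreg", "metrics": {"f1": 0.71}},
    {"run_id": "run-xgb", "metrics": {"f1": 0.78}},
]


def pick_champion(runs, metric="f1"):
    """Return the run dict with the highest value for `metric`."""
    if not runs:
        raise ValueError("no runs to compare")
    return max(runs, key=lambda run: run["metrics"][metric])


def register_champion(run_id, model_name, artifact_path="model"):
    """Register a run's model and point the 'champion' alias at it.

    Assumption: the model was logged under `artifact_path` in that run.
    """
    # Imported lazily so pick_champion stays usable without MLflow installed.
    import mlflow
    from mlflow import MlflowClient

    version = mlflow.register_model(f"runs:/{run_id}/{artifact_path}", model_name)
    MlflowClient().set_registered_model_alias(
        model_name, "champion", version.version
    )
    return version.version


if __name__ == "__main__":
    best = pick_champion(EXAMPLE_RUNS)
    print("champion candidate:", best["run_id"])
    # register_champion(best["run_id"], "hypothetical-model-name")
```

Once the alias is set, the model can be referenced with a URI of the form `models:/<model name>@champion`, so downstream code does not need to know the version number.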