One of the most important design decisions in this homework is that the dataset is split by release year, not randomly. Spotify launched in Sweden in 2008 and in the United States in 2011. By 2010 the streaming era was beginning to reshape how music was produced, mixed, and mastered: songwriters and producers started optimising for short attention spans, algorithmic playlists, and compressed audio formats. This means that tracks released before and after 2010 have measurably different audio feature distributions, making the year boundary an ideal and realistic source of data drift for you to detect.
The Split Logic
The process stage reads train_year_threshold from params.yaml and uses it to partition data/raw.csv into two CSV files:
- data/train.csv: tracks where year <= 2010 (CD/iTunes era)
- data/prod_sim.csv: tracks where year > 2010 (streaming era)
src/process.py already wires up the argument parsing and file paths. Your job is to implement the filtering and saving logic inside process_data().
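
The filtering and saving step can be sketched as follows. This is a minimal illustration, not the reference solution: the function name split_by_year and its signature are hypothetical, and it uses only the standard-library csv module, whereas the real process_data() may well operate on a pandas DataFrame. It assumes data/raw.csv has a numeric year column.

```python
import csv
from pathlib import Path


def split_by_year(raw_path, train_path, prod_path, year_threshold=2010):
    """Split a CSV on its `year` column (hypothetical helper, not the course API).

    Rows with year <= year_threshold are written to train_path,
    all later rows to prod_path. Returns (train_count, prod_count).
    """
    with open(raw_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        if not fieldnames or "year" not in fieldnames:
            raise ValueError("expected a 'year' column in the raw CSV")
        rows = list(reader)

    # int(float(...)) tolerates years stored as "2010" or "2010.0".
    train_rows = [r for r in rows if int(float(r["year"])) <= year_threshold]
    prod_rows = [r for r in rows if int(float(r["year"])) > year_threshold]

    for path, subset in ((train_path, train_rows), (prod_path, prod_rows)):
        # Create the output directory if it does not exist yet.
        Path(path).parent.mkdir(parents=True, exist_ok=True)
        with open(path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.DictWriter(dst, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(subset)

    return len(train_rows), len(prod_rows)
```

Called as split_by_year("data/raw.csv", "data/train.csv", "data/prod_sim.csv", 2010), this would write the two files described above. Note that the boundary year itself (2010) lands in the training split because the comparison is <=.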
Why Audio Features Drift
The table below shows the direction and approximate magnitude of drift in several key features across the 2010 boundary. These are the signals your Kolmogorov–Smirnov tests are expected to flag.

| Feature | Pre-2010 (CD/iTunes era) | Post-2010 (Streaming era) | Drift direction |
|---|---|---|---|
| loudness | Lower | Higher (+1.56 dB average) | Increases: loudness wars and heavy compression |
| acousticness | Higher | Lower (−5.75%) | Decreases: more synthesised production |
| valence | Higher | Lower (−6.5%) | Decreases: music became moodier |
| energy | Lower | Higher (+4.3%) | Increases: more intense production |
| duration_ms | Longer | Shorter (−8.4 s average) | Decreases: streaming optimisation |
| danceability | Lower | Higher | Increases: algorithmic playlist compatibility |

These shifts are expected to be significant at the p < 0.05 threshold used in the drift monitoring script, meaning every feature in the table above should appear in the drifted_features list of your batch drift report.
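
To make the per-feature check concrete, here is a rough sketch of a two-sample Kolmogorov–Smirnov test applied feature by feature. It is an assumption-laden illustration, not the drift monitoring script itself: the function names are hypothetical, and the p-value uses an asymptotic approximation written in pure Python. A real script would more likely call a library routine such as scipy.stats.ks_2samp, whose p-values may differ slightly.

```python
import math
from bisect import bisect_right


def ks_two_sample(reference, current):
    """Return (D, approximate p-value) for a two-sample KS test (illustrative)."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")

    # D is the largest gap between the two empirical CDFs. The gap can only
    # change at observed values, so checking those points is sufficient.
    d = max(
        abs(bisect_right(ref, x) / n - bisect_right(cur, x) / m)
        for x in set(ref) | set(cur)
    )

    # Asymptotic Kolmogorov distribution with a small-sample correction term.
    # This is an approximation, not an exact p-value.
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        return d, 1.0  # the series converges poorly here; the p-value is ~1
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


def find_drifted_features(reference, current, alpha=0.05):
    """List features whose KS p-value falls below alpha.

    `reference` and `current` map feature name -> list of numeric values
    (a hypothetical input shape chosen for this sketch).
    """
    drifted = []
    for name in reference:
        _, p_value = ks_two_sample(reference[name], current[name])
        if p_value < alpha:
            drifted.append(name)
    return drifted
```

Here the reference sample would come from data/train.csv and the current sample from data/prod_sim.csv. With large samples, even modest shifts like those in the table tend to produce very small p-values, which is why all six features are expected to be flagged.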
Configuring the Threshold
The year threshold is controlled by a single parameter in data_pipeline/params.yaml. DVC reads this value at pipeline execution time and passes it to process.py via a command-line argument, so changing the value and re-running dvc repro will automatically re-process the data.
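
On the receiving side, the command-line argument might be parsed roughly like this. The helper names build_parser and parse_year_threshold are hypothetical, and the actual argument handling already provided in src/process.py may differ; only the --year_threshold flag name comes from the pipeline command.

```python
import argparse


def build_parser():
    """Build a CLI parser for the process stage (illustrative sketch only)."""
    parser = argparse.ArgumentParser(
        description="Split raw tracks into train and production-simulation sets."
    )
    # DVC fills in the value from params.yaml before running the command,
    # so the script itself only ever sees a plain integer.
    parser.add_argument(
        "--year_threshold",
        type=int,
        required=True,
        help="Last release year included in the training split.",
    )
    return parser


def parse_year_threshold(argv=None):
    """Return the threshold parsed from argv (defaults to sys.argv[1:])."""
    return build_parser().parse_args(argv).year_threshold
```

Keeping the value in params.yaml rather than hard-coding it in the script means DVC can detect a change to the parameter and re-run the stage on the next dvc repro.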
The grading rubric checks that train_year_threshold is set to exactly 2010 in params.yaml. Do not change this value. The DVC pipeline command for the process stage passes it as --year_threshold ${data.train_year_threshold}, so updating the YAML is all that is needed if you ever need to experiment.
Downstream Effects
Understanding where each split file travels through the rest of the system helps you debug issues end-to-end:
- data/train.csv → src/train.py (model training features and labels)
- data/train.csv → src/evaluate.py (reference distribution for evaluation)
- data/train.csv → drift_monitoring/src/analyze_drift.py (reference distribution for KS tests)
- data/prod_sim.csv → drift_monitoring/src/analyze_drift.py --mode batch (production simulation for batch drift)