One of the most important design decisions in this homework is that the dataset is split by release year, not randomly. Spotify launched in Sweden in 2008 and in the United States in 2011. By 2010 the streaming era was beginning to reshape how music was produced, mixed, and mastered — songwriters and producers started optimising for short attention spans, algorithmic playlists, and compressed audio formats. This means that tracks released before and after 2010 have measurably different audio feature distributions, making the year boundary an ideal and realistic source of data drift for you to detect.

The Split Logic

The process stage receives the year threshold (stored as train_year_threshold in params.yaml and passed to the script as --year_threshold) and uses it to partition data/raw.csv into two CSV files:
  • data/train.csv — tracks where year <= 2010 (CD/iTunes era)
  • data/prod_sim.csv — tracks where year > 2010 (streaming era)
The skeleton in src/process.py already wires up the argument parsing and file paths. Your job is to implement the filtering and saving logic inside process_data().
# src/process.py  (simplified illustration)
train_df = df[df["year"] <= year_threshold]
prod_df  = df[df["year"] >  year_threshold]

train_df.to_csv(train_output, index=False)
prod_df.to_csv(prod_output,   index=False)
The boundary condition is inclusive on the training side: tracks from exactly 2010 go into train.csv (<=), and tracks from 2011 onwards go into prod_sim.csv (>). Do not use < 2010 for the training filter or >= 2010 for the production filter. Off-by-one errors here shift thousands of tracks across the boundary and change the drift statistics the grader checks.
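The boundary rule above can be checked on a tiny sample before running the full pipeline. The following is a minimal sketch, not the reference solution: the helper name split_by_year and the sample rows are hypothetical, and only the <= / > comparison comes from the text above.

```python
import pandas as pd


def split_by_year(df: pd.DataFrame, year_threshold: int):
    """Return (train_df, prod_df); the threshold year stays on the training side."""
    train_df = df[df["year"] <= year_threshold]
    prod_df = df[df["year"] > year_threshold]
    return train_df, prod_df


# Made-up rows, used only to illustrate the boundary behaviour.
sample = pd.DataFrame(
    {
        "year": [2009, 2010, 2010, 2011, 2015],
        "track": ["a", "b", "c", "d", "e"],
    }
)
train_df, prod_df = split_by_year(sample, year_threshold=2010)

# Every row lands in exactly one split, and 2010 stays in the training set.
assert len(train_df) + len(prod_df) == len(sample)
assert train_df["year"].max() == 2010
assert prod_df["year"].min() == 2011
```

Running the same three assertions against your real train.csv and prod_sim.csv is a cheap way to catch an off-by-one mistake before the grader does.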

Why Audio Features Drift

The table below shows the direction and approximate magnitude of drift in several key features across the 2010 boundary. These are the signals your Kolmogorov–Smirnov tests are expected to flag.
| Feature | Pre-2010 (CD/iTunes era) | Post-2010 (Streaming era) | Drift direction |
| --- | --- | --- | --- |
| loudness | Lower | Higher (+1.56 dB average) | Increases: loudness wars and heavy compression |
| acousticness | Higher | Lower (−5.75%) | Decreases: more synthesised production |
| valence | Higher | Lower (−6.5%) | Decreases: music became moodier |
| energy | Lower | Higher (+4.3%) | Increases: more intense production |
| duration_ms | Longer | Shorter (−8.4 s average) | Decreases: streaming optimisation |
| danceability | Lower | Higher | Increases: algorithmic playlist compatibility |
These shifts are statistically significant at the default p < 0.05 threshold used in the drift monitoring script, meaning every feature in the table above should appear in the drifted_features list of your batch drift report.
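To make the p < 0.05 rule concrete, here is a small sketch of a two-sample Kolmogorov–Smirnov check using scipy.stats.ks_2samp. It is not the course's analyze_drift.py: the data is synthetic, the means and standard deviations are invented, and control_feature is a hypothetical column added only for contrast. Only the +1.56 dB loudness shift, the 0.05 threshold, and the drifted_features name come from this page.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)
n = 2000  # arbitrary sample size for the illustration

# Synthetic stand-ins for columns of data/train.csv (reference)
# and data/prod_sim.csv (production). The distributions are made up.
reference = {
    "loudness": rng.normal(-9.0, 4.0, n),
    "control_feature": rng.normal(0.0, 1.0, n),  # hypothetical, no shift applied
}
production = {
    "loudness": rng.normal(-9.0 + 1.56, 4.0, n),  # shifted by +1.56 dB
    "control_feature": rng.normal(0.0, 1.0, n),
}

P_VALUE_THRESHOLD = 0.05

drifted_features = []
for name in reference:
    # ks_2samp compares the empirical distributions of the two samples.
    result = ks_2samp(reference[name], production[name])
    if result.pvalue < P_VALUE_THRESHOLD:
        drifted_features.append(name)

print(drifted_features)
```

With a shift of this size and a few thousand rows per side, the loudness test rejects comfortably. Smaller shifts need more data to reach the same threshold, which is one reason the full split files are used as the reference and production sets.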

Configuring the Threshold

The year threshold is controlled by a single parameter in data_pipeline/params.yaml. DVC reads this value at pipeline execution time and passes it to process.py via a command-line argument — so changing the value and re-running dvc repro will automatically re-process the data.
# data_pipeline/params.yaml
data:
  source_path: "songs.csv"
  train_year_threshold: 2010  # Pre-2010 = training, Post-2010 = production simulation
The grading rubric checks that train_year_threshold is set to exactly 2010 in params.yaml, so do not change this value in your submission. The DVC pipeline command for the process stage passes it as --year_threshold ${data.train_year_threshold}. If you want to experiment locally, updating the YAML and re-running dvc repro is all that is needed; restore the value to 2010 before you submit.
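On the receiving side, the script has to parse that flag. The sketch below shows one way this could look with argparse; the skeleton in src/process.py already handles argument parsing, so this is illustrative only. The --year_threshold flag name comes from the pipeline command above, while build_parser, the int type, and required=True are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; the real skeleton may define more arguments."""
    parser = argparse.ArgumentParser(description="Split raw data by release year.")
    # Flag name taken from the pipeline command; type and required are assumptions.
    parser.add_argument(
        "--year_threshold",
        type=int,
        required=True,
        help="Last release year that still belongs to the training split.",
    )
    return parser


# Simulates the arguments DVC would pass after substituting
# ${data.train_year_threshold} with the value from params.yaml.
args = build_parser().parse_args(["--year_threshold", "2010"])
assert args.year_threshold == 2010
```

Parsing the value as an int matters here: a string "2010" compared against a numeric year column would either fail or behave unexpectedly.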

Downstream Effects

Understanding where each split file travels through the rest of the system helps you debug issues end-to-end:
  • data/train.csv → src/train.py (model training features and labels)
  • data/train.csv → src/evaluate.py (reference distribution for evaluation)
  • data/train.csv → drift_monitoring/src/analyze_drift.py (reference distribution for KS tests)
  • data/prod_sim.csv → drift_monitoring/src/analyze_drift.py --mode batch (production simulation for batch drift)
