process.py splits the raw Spotify dataset into two temporally-separated subsets using the song release year as the boundary. Tracks released on or before 2010 — the pre-streaming, CD/iTunes era — go into data/train.csv. Tracks released after 2010 — the Spotify streaming era — go into data/prod_sim.csv. This intentional distribution shift across the 2010 boundary is the signal that the drift monitoring component will later detect: audio feature distributions changed measurably when streaming economics reshaped what music gets made and released.
Function Signature
| Argument | Type | Default | Description |
|---|---|---|---|
| input_path | str | required | Path to data/raw.csv produced by load.py |
| train_output | str | required | Destination path for the training split (year <= year_threshold) |
| prod_output | str | required | Destination path for the production simulation split (year > year_threshold) |
| year_threshold | int | 2010 | Year boundary; rows at exactly this year go to train |
What to Implement (TODO)
This function is a student exercise. The scaffold reads and logs the raw dataset; you must complete the two TODO blocks to produce the final CSV outputs.
Step 1 — Split by year using boolean indexing:
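As a rough illustration of this step, the sketch below applies boolean indexing to a small in-memory frame. The sample rows and the variable names `train_df` and `prod_df` are hypothetical; only the `year` column and the `<=` / `>` comparisons come from the argument table above.

```python
import pandas as pd

# Hypothetical stand-in for the raw dataset; the real file has many more columns.
df = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "year": [2009, 2010, 2011, 2012],
        "energy": [0.4, 0.5, 0.7, 0.8],
    }
)
year_threshold = 2010

# Each comparison yields a boolean Series; indexing with it keeps the True rows.
train_df = df[df["year"] <= year_threshold]  # boundary year goes to train
prod_df = df[df["year"] > year_threshold]    # strictly later years go to prod

print(len(train_df), len(prod_df))
```

Writing both comparisons explicitly, rather than negating the first mask, means a row with a missing year matches neither condition and is left out of both splits instead of silently landing in prod.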
DVC Stage
The process stage takes its year_threshold from params.yaml through the ${data.train_year_threshold} interpolation.
Changing train_year_threshold in params.yaml will trigger a re-run of process and all downstream stages.
CLI Usage
The stage can also be run directly, without DVC. The --year_threshold flag is optional and defaults to 2010 if omitted.
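The optional-flag behaviour can be sketched with `argparse`. This is an assumption about how the script parses its arguments, not the project's confirmed implementation: only `--year_threshold` and its default of 2010 are stated above, while the other flag names simply mirror the function arguments and the helper `build_parser` is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; flag names other than --year_threshold are assumed."""
    parser = argparse.ArgumentParser(description="Split raw data by release year.")
    parser.add_argument("--input_path", required=True)
    parser.add_argument("--train_output", required=True)
    parser.add_argument("--prod_output", required=True)
    # Optional: falls back to 2010 when the flag is not supplied.
    parser.add_argument("--year_threshold", type=int, default=2010)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    [
        "--input_path", "data/raw.csv",
        "--train_output", "data/train.csv",
        "--prod_output", "data/prod_sim.csv",
    ]
)
print(args.year_threshold)
```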
Tests
Three unit tests in tests/test_process.py cover the splitting contract:
| Test | What it verifies |
|---|---|
| test_process_data_temporal_split | Given 5 rows spanning 2005–2015, exactly 3 land in train (≤ 2010) and 2 in prod (> 2010) |
| test_process_data_preserves_audio_features | All 12 expected audio feature columns (danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms) are present in both output files |
| test_process_data_year_boundary_condition | year=2010 appears in train, year=2011 appears in prod; the exact sets {2009, 2010} and {2011, 2012} are verified |
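To make the boundary contract concrete, here is a minimal sketch of how such a check could look. It is not the project's actual test: the `process_data` body is a hypothetical stand-in for the student's solution, the helper name `check_year_boundary` and the `tempo` values are invented for illustration, and the real tests may use pytest fixtures rather than `tempfile`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def process_data(input_path, train_output, prod_output, year_threshold=2010):
    """Hypothetical stand-in for the function under test."""
    df = pd.read_csv(input_path)
    df[df["year"] <= year_threshold].to_csv(train_output, index=False)
    df[df["year"] > year_threshold].to_csv(prod_output, index=False)


def check_year_boundary(work_dir: Path):
    """Write a tiny raw file, run the split, and return the years in each output."""
    raw = work_dir / "raw.csv"
    train = work_dir / "train.csv"
    prod = work_dir / "prod_sim.csv"
    pd.DataFrame(
        {"year": [2009, 2010, 2011, 2012], "tempo": [120.0, 98.5, 130.2, 101.0]}
    ).to_csv(raw, index=False)

    process_data(str(raw), str(train), str(prod), year_threshold=2010)

    train_years = set(pd.read_csv(train)["year"].tolist())
    prod_years = set(pd.read_csv(prod)["year"].tolist())
    return train_years, prod_years


with tempfile.TemporaryDirectory() as tmp:
    train_years, prod_years = check_year_boundary(Path(tmp))

# The boundary year belongs to train; the first year after it belongs to prod.
assert train_years == {2009, 2010}
assert prod_years == {2011, 2012}
```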