process.py splits the raw Spotify dataset into two temporally-separated subsets using the song release year as the boundary. Tracks released on or before 2010 — the pre-streaming, CD/iTunes era — go into data/train.csv. Tracks released after 2010 — the Spotify streaming era — go into data/prod_sim.csv. This intentional distribution shift across the 2010 boundary is the signal that the drift monitoring component will later detect: audio feature distributions changed measurably when streaming economics reshaped what music gets made and released.

Function Signature

def process_data(
    input_path: str,
    train_output: str,
    prod_output: str,
    year_threshold: int = 2010
)

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| input_path | str | required | Path to data/raw.csv produced by load.py |
| train_output | str | required | Destination path for the training split (year <= year_threshold) |
| prod_output | str | required | Destination path for the production simulation split (year > year_threshold) |
| year_threshold | int | 2010 | Year boundary; rows at exactly this year go to train |

What to Implement (TODO)

This function is a student exercise. The scaffold reads and logs the raw dataset; you must complete the two TODO blocks to produce the final CSV outputs.

Step 1. Split by year using boolean indexing:
train_df = df[df['year'] <= year_threshold]
prod_df  = df[df['year'] >  year_threshold]
logger.info(f"Train size: {len(train_df)}, Prod size: {len(prod_df)}")

Step 2. Create output directories and save both splits:
os.makedirs(os.path.dirname(train_output), exist_ok=True)
os.makedirs(os.path.dirname(prod_output), exist_ok=True)
train_df.to_csv(train_output, index=False)
prod_df.to_csv(prod_output, index=False)
Boundary condition: a track released in exactly year_threshold (e.g. 2010) must land in the train set (<=), not in the prod-sim set. The test test_process_data_year_boundary_condition verifies this explicitly — year=2010 → train, year=2011 → prod.
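
Putting the two steps together, a completed function might look roughly like the sketch below. It assumes the raw CSV has an integer year column and is read with pandas. The logger setup and the fallback for an empty os.path.dirname result are illustrative assumptions, not part of the homework scaffold.

```python
import logging
import os

import pandas as pd

logger = logging.getLogger(__name__)  # assumed; the scaffold's logger setup may differ


def process_data(input_path: str, train_output: str, prod_output: str,
                 year_threshold: int = 2010) -> None:
    """Split the raw dataset at year_threshold and write both halves to CSV."""
    df = pd.read_csv(input_path)

    # Step 1: boolean indexing; the threshold year itself belongs to train (<=).
    train_df = df[df["year"] <= year_threshold]
    prod_df = df[df["year"] > year_threshold]
    logger.info(f"Train size: {len(train_df)}, Prod size: {len(prod_df)}")

    # Step 2: create parent directories, then save without the index column.
    for path, frame in ((train_output, train_df), (prod_output, prod_df)):
        # os.path.dirname returns "" for a bare filename and os.makedirs("")
        # raises, so fall back to the current directory (an added safeguard).
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        frame.to_csv(path, index=False)
```

With the paths used elsewhere on this page (data/train.csv, data/prod_sim.csv) the dirname fallback never triggers; it only matters if an output path is a bare filename.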

DVC Stage

The process stage passes the year_threshold from params.yaml via ${data.train_year_threshold}:
process:
  cmd: python src/process.py --input_path data/raw.csv --train_output data/train.csv --prod_output data/prod_sim.csv --year_threshold ${data.train_year_threshold}
  deps:
    - data/raw.csv
    - src/process.py
  outs:
    - data/train.csv
    - data/prod_sim.csv
Both outputs are DVC-tracked with default caching enabled. Because the threshold is interpolated into cmd, changing train_year_threshold in params.yaml changes the resolved command, so dvc repro re-runs process and all downstream stages.
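
To see what the interpolation amounts to, the toy sketch below resolves a ${dotted.key} placeholder against a nested dict. It is an illustrative stand-in, not DVC's templating implementation, and the value 2010 for data.train_year_threshold is an assumption about params.yaml (it matches the function default).

```python
import re

# Assumed contents of params.yaml, written as the equivalent nested dict.
params = {"data": {"train_year_threshold": 2010}}

cmd_template = (
    "python src/process.py --input_path data/raw.csv "
    "--train_output data/train.csv --prod_output data/prod_sim.csv "
    "--year_threshold ${data.train_year_threshold}"
)


def resolve(template: str, values: dict) -> str:
    """Replace each ${dotted.key} with the value found by walking `values`."""
    def lookup(match: re.Match) -> str:
        node = values
        for key in match.group(1).split("."):
            node = node[key]
        return str(node)

    return re.sub(r"\$\{([^}]+)\}", lookup, template)


resolved = resolve(cmd_template, params)
print(resolved)
```

A different threshold yields a different command string, which is why editing the parameter invalidates the stage.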

CLI Usage

Run the stage directly without DVC:
python src/process.py \
  --input_path data/raw.csv \
  --train_output data/train.csv \
  --prod_output data/prod_sim.csv \
  --year_threshold 2010
--year_threshold is optional and defaults to 2010 if omitted.
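
One way the flags above could be wired up is with argparse, as in the sketch below. The actual parser in src/process.py is not shown on this page, so build_parser and its help text are hypothetical; only the flag names and the 2010 default come from the documentation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented for process.py."""
    parser = argparse.ArgumentParser(description="Split raw data by release year.")
    parser.add_argument("--input_path", required=True)
    parser.add_argument("--train_output", required=True)
    parser.add_argument("--prod_output", required=True)
    # Optional flag: parsed as int, falls back to 2010 when omitted.
    parser.add_argument("--year_threshold", type=int, default=2010)
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args([
    "--input_path", "data/raw.csv",
    "--train_output", "data/train.csv",
    "--prod_output", "data/prod_sim.csv",
])
print(args.year_threshold)
```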

Tests

Three unit tests in tests/test_process.py cover the splitting contract:

| Test | What it verifies |
| --- | --- |
| test_process_data_temporal_split | Given 5 rows spanning 2005–2015, exactly 3 land in train (≤ 2010) and 2 in prod (> 2010) |
| test_process_data_preserves_audio_features | All 12 expected audio feature columns (danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms) are present in both output files |
| test_process_data_year_boundary_condition | year=2010 appears in train, year=2011 appears in prod; the exact sets {2009, 2010} and {2011, 2012} are verified |

Run the process tests in isolation:
pytest tests/test_process.py
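
For a quick manual check outside pytest, the column requirement from test_process_data_preserves_audio_features can be approximated with the standard library alone. The helper name missing_audio_features and the in-memory sample are hypothetical; the 12 column names are the ones listed for that test.

```python
import csv
import io

EXPECTED_AUDIO_FEATURES = [
    "danceability", "energy", "key", "loudness", "mode", "speechiness",
    "acousticness", "instrumentalness", "liveness", "valence", "tempo",
    "duration_ms",
]


def missing_audio_features(csv_file) -> list:
    """Return the expected feature columns absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    return [name for name in EXPECTED_AUDIO_FEATURES if name not in header]


# In-memory stand-in for an output file, with one made-up row.
sample = io.StringIO(
    "year," + ",".join(EXPECTED_AUDIO_FEATURES) + "\n"
    "2009," + ",".join(["0"] * len(EXPECTED_AUDIO_FEATURES)) + "\n"
)
print(missing_audio_features(sample))  # an empty list means no column is missing
```

To check a real output, pass an open file handle instead, for example open("data/train.csv", newline="").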
