Documentation Index
Fetch the complete documentation index at: https://mintlify.com/characat0/mlops-fundamentals-homework/llms.txt
Use this file to discover all available pages before exploring further.
load.py has a single, deliberate responsibility: read the raw Kaggle CSV from disk and save it unchanged as data/raw.csv. No columns are dropped, no rows are filtered, and no transformations are applied at this stage. Keeping the load step pure makes it easy to swap the data source without touching any downstream logic — all filtering and temporal splitting are handled exclusively by process.py.
Function Signature
| Argument | Type | Description |
|---|---|---|
source_path | str | Path to the raw songs.csv downloaded from Kaggle |
output_path | str | Destination path where data/raw.csv will be written |
Implementation
The function follows three steps:- Creates any missing parent directories with
os.makedirs(os.path.dirname(output_path), exist_ok=True)so thedata/directory does not need to exist before running. - Reads the CSV with
pandas.read_csv(source_path)and logs the resulting shape and column list atINFOlevel so you can verify the correct file was loaded. - Writes the DataFrame back to disk with
df.to_csv(output_path, index=False), preserving every original column exactly as-is.
Complete Source
DVC Stage
Theload stage in dvc.yaml wires source_path from params.yaml directly into the CLI call:
cache: false means DVC tracks the file for change detection but does not store it in the DVC cache. This keeps the repository lightweight since data/raw.csv can always be regenerated from songs.csv.CLI Usage
The script can also be run directly outside of DVC:required=True in argparse). The script will exit with an error if either is missing.
Tests
Three unit tests intests/test_load.py cover the full contract of load_data:
| Test | What it verifies |
|---|---|
test_load_data_creates_output | The output CSV file is created on disk at the specified path |
test_load_data_preserves_all_columns | Every column from the source CSV is present in the output, with no additions or omissions |
test_load_data_row_count_unchanged | The number of rows in the output matches the input exactly |
load_data, and asserts the expected postcondition. No network access or the full Kaggle dataset is required.
Run the load tests in isolation: