The model in this homework is trained on the 550k Spotify Songs dataset published on Kaggle by Serkan Tysz. It contains approximately 550,000 tracks drawn from multiple decades of recorded music, each annotated with Spotify’s proprietary audio feature measurements and a manually assigned top-level genre label. The combination of scale, feature richness, and a wide temporal range makes it well suited to both multi-class genre classification and a realistic simulation of data drift across the 2010 boundary between the CD/iTunes era and the streaming era.

Dataset Overview

| Property | Value |
| --- | --- |
| Total tracks | ~550,000 |
| Audio features | 12 numeric columns |
| Genre classes | 10 |
| Temporal range | Multiple decades of music |
| Training split (year ≤ 2010) | CD/iTunes era tracks |
| Production split (year > 2010) | Streaming era tracks |
| File format | CSV (songs.csv) |
| Approximate download size | ~2–3 GB |

Audio Features

Your model must use the following 12 numeric columns as input features. All values are computed by Spotify’s audio analysis engine and represent objective measurements of the audio signal.

| Feature | Type | Range or unit | Description |
| --- | --- | --- | --- |
| danceability | float | 0–1 | How suitable the track is for dancing; combines tempo, rhythm stability, beat strength, and regularity |
| energy | float | 0–1 | Perceptual measure of intensity and activity; energetic tracks feel fast, loud, and noisy |
| key | int | 0–11 | Estimated overall pitch class of the track using standard Pitch Class notation (0 = C, 1 = C#/D♭, …) |
| loudness | float | dB | Overall loudness in decibels, averaged across the entire track; typical range is −60 to 0 dB |
| mode | int | 0 or 1 | Modality of the track: major (1) or minor (0) |
| speechiness | float | 0–1 | Presence of spoken words; values above 0.66 indicate almost entirely speech |
| acousticness | float | 0–1 | Confidence that the track is acoustic; 1.0 represents high confidence |
| instrumentalness | float | 0–1 | Predicts whether the track contains no vocals; values above 0.5 represent instrumental tracks |
| liveness | float | 0–1 | Detects the presence of an audience; values above 0.8 suggest a live recording |
| valence | float | 0–1 | Musical positiveness; high valence sounds happy and cheerful, low valence sounds sad or angry |
| tempo | float | BPM | Estimated overall tempo in beats per minute |
| duration_ms | int | ms | Duration of the track in milliseconds |

The dataset also contains metadata columns (id, name, album_name, artists, lyrics) and popularity metrics (popularity, total_artist_followers, avg_artist_popularity). These are not required as model input features but you may experiment with including them. The year column is used exclusively for the temporal split and must be dropped before training.
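
The temporal split and feature selection described here can be sketched with pandas. This is a minimal illustration and not the course's reference pipeline: the helper names `temporal_split` and `features_and_target` are hypothetical, and the sketch assumes the CSV exposes the `year` and `genre` columns exactly as named in this page.

```python
import pandas as pd

# The 12 audio feature columns listed in the Audio Features table.
FEATURE_COLUMNS = [
    "danceability", "energy", "key", "loudness", "mode", "speechiness",
    "acousticness", "instrumentalness", "liveness", "valence", "tempo",
    "duration_ms",
]


def temporal_split(songs: pd.DataFrame, cutoff_year: int = 2010):
    """Return (training, production) frames split on the year column.

    Rows with year <= cutoff_year go to training; all others go to
    production. A row with a missing year compares as False and would
    therefore land in the production frame.
    """
    is_training = songs["year"] <= cutoff_year
    return songs[is_training].copy(), songs[~is_training].copy()


def features_and_target(split: pd.DataFrame):
    """Return (X, y) for one split.

    Selecting the feature columns by name leaves out year and every
    metadata or popularity column, so nothing needs to be dropped
    explicitly.
    """
    return split[FEATURE_COLUMNS], split["genre"]
```

A typical call would load the file with `pd.read_csv("data_pipeline/songs.csv")`, pass the result to `temporal_split`, and then call `features_and_target` on the training frame. Selecting columns by name, rather than dropping `year` from the full frame, keeps the model input stable if extra columns appear in the CSV.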

Genre Classes

The genre column contains one of the following 10 string labels. Your training pipeline encodes these with LabelEncoder before fitting a model.

| # | Genre |
| --- | --- |
| 1 | Rock |
| 2 | Pop |
| 3 | Electronic |
| 4 | Folk |
| 5 | Country |
| 6 | Hip-Hop |
| 7 | R&B |
| 8 | Jazz |
| 9 | Blues |
| 10 | Classical |
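
As a rough illustration of the encoding step, the sketch below fits scikit-learn's `LabelEncoder` on the ten labels. It assumes the label strings appear in the CSV exactly as listed here; the variable names are illustrative only. Note that `LabelEncoder` assigns integer codes in sorted label order, so the codes do not follow the numbering in the list above.

```python
from sklearn.preprocessing import LabelEncoder

# The ten genre labels, in the order this page lists them.
GENRES = [
    "Rock", "Pop", "Electronic", "Folk", "Country",
    "Hip-Hop", "R&B", "Jazz", "Blues", "Classical",
]

encoder = LabelEncoder()
encoder.fit(GENRES)

# classes_ holds the labels in sorted order; a label's position in
# classes_ is its integer code.
label_to_code = {label: code for code, label in enumerate(encoder.classes_)}

# Encode a few labels and map the integer codes back to strings.
encoded = encoder.transform(["Rock", "Jazz", "Blues"])
decoded = encoder.inverse_transform(encoded)
```

Fitting the encoder once on the training labels and reusing the same fitted object for production data keeps the integer codes consistent between the two splits.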

Downloading the Dataset

The dataset requires a free Kaggle account and API credentials. Dataset page: https://www.kaggle.com/datasets/serkantysz/550k-spotify-songs-audio-lyrics-and-genres
1. Set up Kaggle credentials

Download your API key from kaggle.com/settings/account and save it as ~/.kaggle/kaggle.json, then authenticate:
pip install kaggle
kaggle auth
2. Download and extract the dataset

kaggle datasets download -d serkantysz/550k-spotify-songs-audio-lyrics-and-genres
unzip 550k-spotify-songs-audio-lyrics-and-genres.zip
3. Move to the expected location

The DVC pipeline reads the file from data_pipeline/songs.csv. The filename must be exactly songs.csv.
mv songs.csv data_pipeline/songs.csv
After placing the file, verify it with DVC to confirm you have the correct version:
cd data_pipeline
dvc status songs.csv.dvc
Run dvc repro and check that dvc.lock records the hash of your songs.csv. If the hash does not match the course-provided .dvc file, the grader’s dataset-integrity check will fail.
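
For a quick local sanity check before running DVC, the hash comparison can be approximated in plain Python. This is a sketch under stated assumptions, not a replacement for `dvc status`: it assumes the course-provided `songs.csv.dvc` records a 32-character `md5:` field for the file, and that the recorded value is the MD5 of the raw file bytes (older DVC versions normalized line endings in text files before hashing, so their recorded value can differ). The function names are hypothetical.

```python
import hashlib
import re
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recorded_md5(dvc_file):
    """Extract the first 32-hex-digit md5 value from a .dvc file.

    A regular expression is used instead of a YAML parser to keep the
    sketch dependency-free; it assumes a line such as '- md5: <hash>'.
    """
    text = Path(dvc_file).read_text(encoding="utf-8")
    match = re.search(
        r"^\s*-?\s*md5:\s*([0-9a-f]{32})\s*$", text, flags=re.MULTILINE
    )
    if match is None:
        raise ValueError(f"no md5 entry found in {dvc_file}")
    return match.group(1)


def hashes_match(data_file, dvc_file):
    """Return True if the file's MD5 equals the hash in the .dvc file."""
    return file_md5(data_file) == recorded_md5(dvc_file)
```

Calling `hashes_match("data_pipeline/songs.csv", "data_pipeline/songs.csv.dvc")` returns `True` when the two values agree. Reading in chunks keeps memory use low for a multi-gigabyte CSV. If the result disagrees with `dvc status`, treat the DVC output as authoritative.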
