train.py is the heart of the ML pipeline. It reads the temporal training split produced by process.py, encodes genre labels, scales features, and trains two classifiers (Logistic Regression and XGBoost), recording every run in MLflow so you can compare hyperparameter experiments side by side. The script is an intentionally incomplete student skeleton: the scaffold provides data loading, comments on feature selection, and pseudocode for the training loop; students fill in the TODO blocks to wire it all together.

Function Signature

def train(data_path: str, params: dict)

| Argument | Type | Description |
| --- | --- | --- |
| data_path | str | Path to the training CSV (data/train.csv) produced by process.py |
| params | dict | Parsed contents of params.yaml, including train hyperparameter blocks |

Implementation Steps

1. Load training data

Read data/train.csv into a pandas DataFrame using pd.read_csv(data_path). Log the resulting shape so you can sanity-check the split size.
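A minimal sketch of this step. The helper name load_training_data, the feature column names, and the three in-memory rows are hypothetical stand-ins for data/train.csv; only the genre and year columns come from this page.

```python
import io
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("train")


def load_training_data(data_path):
    """Read the training split and log its shape as a sanity check."""
    df = pd.read_csv(data_path)
    logger.info("Loaded %d rows and %d columns", *df.shape)
    return df


# Hypothetical stand-in for data/train.csv (feature names are assumptions).
sample_csv = io.StringIO(
    "feature_1,feature_2,year,genre\n"
    "120.0,0.8,2001,rock\n"
    "95.5,0.4,2003,jazz\n"
    "140.2,0.9,2005,pop\n"
)
df = load_training_data(sample_csv)
print(df.shape)  # (3, 4)
```

In the real script the same function would be called with the data_path argument instead of an in-memory buffer.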

2. Separate features from target

Drop the genre and year columns from the DataFrame to form the feature matrix X. The target vector y is the genre column. Use errors='ignore' on the drop in case a column is absent.
X = df.drop(["genre", "year"], axis=1, errors='ignore')
y = df["genre"]
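To illustrate what errors='ignore' buys you, here is a small sketch on a hypothetical frame that has no year column. The feature names and genre labels are invented for the example.

```python
import pandas as pd

# Hypothetical frame without a "year" column (feature names are assumptions).
df = pd.DataFrame(
    {
        "feature_1": [120.0, 95.5],
        "feature_2": [0.8, 0.4],
        "genre": ["rock", "jazz"],
    }
)

# errors="ignore" skips the absent "year" label instead of raising KeyError.
X = df.drop(["genre", "year"], axis=1, errors="ignore")
y = df["genre"]

print(list(X.columns))  # ['feature_1', 'feature_2']
print(y.tolist())       # ['rock', 'jazz']
```

Without errors='ignore', the same drop call would raise a KeyError because "year" is not present.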

3. Encode genre labels

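The dataset contains 10 distinct genre classes that must be converted to integers (0–9) before training. scikit-learn classifiers can accept string labels, but XGBoost's classifier expects integer class labels, so encoding once keeps both models consistent. Use LabelEncoder from sklearn.preprocessing: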
le = LabelEncoder()
y_encoded = le.fit_transform(y)
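The sketch below shows how the encoder assigns integers and how to map them back. The three genre labels are hypothetical; the real dataset has 10 classes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical genre labels; the real dataset has 10 classes.
y = ["rock", "jazz", "pop", "jazz", "rock"]

le = LabelEncoder()
y_encoded = le.fit_transform(y)

# classes_ is sorted, so jazz=0, pop=1, rock=2.
print(le.classes_.tolist())   # ['jazz', 'pop', 'rock']
print(y_encoded.tolist())     # [2, 0, 1, 0, 2]

# inverse_transform maps integer predictions back to genre names.
print(le.inverse_transform([0, 2]).tolist())  # ['jazz', 'rock']
```

Keeping the fitted encoder around is what lets you translate integer predictions back into readable genre names later.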

4. Scale features

Logistic Regression is sensitive to feature scale, so apply StandardScaler to produce X_scaled. XGBoost's tree-based splits are unaffected by feature scale (it does not rescale anything internally), so use the original X with that model.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
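As a quick check of what the scaler does, this sketch standardises two hypothetical features that live on very different scales. The values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
X = np.array([[100.0, 0.1], [200.0, 0.2], [300.0, 0.3]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has (approximately) zero mean and unit variance.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))

# For unseen rows, reuse the fitted scaler with transform, not fit_transform.
X_new_scaled = scaler.transform(np.array([[200.0, 0.2]]))
```

Any data later passed to the Logistic Regression model would need the same fitted scaler applied, since the model only ever saw standardised inputs.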

5. Train each model with MLflow logging

Iterate over each model configuration in params['train']. For each model, open an MLflow run, log its hyperparameters, fit the model, calculate accuracy, and log the trained artifact. See the training loop pseudocode in the source file for the full pattern.

Hyperparameters

All hyperparameters are read from params.yaml at runtime. Changing a value there and re-running dvc repro will trigger a new set of MLflow runs automatically.
| Model | Parameter | Default |
| --- | --- | --- |
| logistic_regression | C | 1.0 |
| logistic_regression | max_iter | 1000 |
| xgboost | max_depth | 6 |
| xgboost | learning_rate | 0.1 |
| xgboost | n_estimators | 100 |
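The defaults in the table suggest the parsed params dictionary has roughly the following shape. The exact nesting (one block per model under "train") is an assumption inferred from this page, not a copy of the repository's params.yaml.

```python
from sklearn.linear_model import LogisticRegression

# Assumed shape of the parsed params.yaml: one block per model under "train".
params = {
    "train": {
        "logistic_regression": {"C": 1.0, "max_iter": 1000},
        "xgboost": {"max_depth": 6, "learning_rate": 0.1, "n_estimators": 100},
    }
}

for model_name, model_params in params["train"].items():
    print(model_name, model_params)

# Each block can be unpacked directly into the estimator's constructor.
lr = LogisticRegression(**params["train"]["logistic_regression"])
print(lr.get_params()["C"], lr.get_params()["max_iter"])  # 1.0 1000
```

Keeping the keys identical to the estimator's keyword arguments is what makes the unpacking work without any translation layer.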

MLflow Logging

Each model type produces one MLflow run. The following is logged per run:
| Item | MLflow call | Value |
| --- | --- | --- |
| Hyperparameters | mlflow.log_params(model_params) | All key/value pairs from the model’s params.yaml block |
| Accuracy | mlflow.log_metric("accuracy", accuracy) | Training-set accuracy |
| Model artifact (LR) | mlflow.sklearn.log_model(model, artifact_path="model") | Serialised scikit-learn pipeline |
| Model artifact (XGB) | mlflow.xgboost.log_model(model, artifact_path="model") | Serialised XGBoost booster |
| Run name | mlflow.start_run(run_name=model_name) | "logistic_regression" or "xgboost" |

Key Implementation Notes

  • Use X_scaled when fitting LogisticRegression; use raw X when fitting xgb.XGBClassifier.
  • The artifact_path must be "model" for both log calls. evaluate.py constructs the model URI as runs:/<run_id>/model, so any other path will break champion registration.
  • Use mlflow.sklearn.log_model for Logistic Regression and mlflow.xgboost.log_model for XGBoost to preserve the correct model flavour in the registry.
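To see why the artifact path matters, the sketch below builds the URI in the runs:/<run_id>/model form mentioned above. The helper name model_uri_for and the run ID are hypothetical; how evaluate.py actually builds the string is not shown on this page.

```python
def model_uri_for(run_id, artifact_path="model"):
    """Build the runs:/ URI for an artifact logged under artifact_path."""
    return f"runs:/{run_id}/{artifact_path}"


# Hypothetical run ID; real IDs are generated by MLflow.
run_id = "0123456789abcdef"

expected = model_uri_for(run_id)
print(expected)  # runs:/0123456789abcdef/model

# A model logged under a different path yields a URI that no longer matches.
mismatched = model_uri_for(run_id, artifact_path="lr_model")
print(mismatched == expected)  # False
```

Because the URI is derived from the run ID and a fixed path, a model logged anywhere other than "model" simply would not be found at the expected location.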

DVC Stage

train:
  cmd: python src/train.py --data_path data/train.csv --params_path params.yaml
  deps:
    - data/train.csv
    - src/train.py
  params:
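    - train
  outs:
    - models/:
        cache: false

The train entry under params tells DVC to watch the entire train block in params.yaml, so any hyperparameter change marks this stage as stale. The entry is written without a trailing colon: DVC treats an entry written as a mapping key as the name of a parameters file rather than a key inside params.yaml.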

CLI Usage

python src/train.py --data_path data/train.csv --params_path params.yaml
--params_path defaults to params.yaml if omitted.
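A command-line interface with this behaviour could be sketched with argparse as follows. The function name parse_args and the choice to make --data_path required are assumptions; only the two flag names and the params.yaml default come from this page.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags used by the train stage."""
    parser = argparse.ArgumentParser(description="Train genre classifiers.")
    # Whether --data_path is required in the real script is an assumption.
    parser.add_argument("--data_path", required=True, help="Path to the training CSV")
    parser.add_argument("--params_path", default="params.yaml", help="Path to the params file")
    return parser.parse_args(argv)


# Omitting --params_path falls back to the default.
args = parse_args(["--data_path", "data/train.csv"])
print(args.data_path, args.params_path)  # data/train.csv params.yaml
```

Passing argv explicitly keeps the parser easy to exercise in tests; calling parse_args() with no arguments reads from the real command line.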
After training, run mlflow ui and open http://localhost:5000 to compare all experiment runs in a browser. You can sort by accuracy, diff hyperparameters between runs, and download any logged model artifact directly from the UI.
