Overview

ML Experiment Autopilot automatically logs all experiments to MLflow, providing a web UI for exploring metrics, parameters, and artifacts. This guide covers how to launch and navigate the MLflow interface.

Quick Start

1. Run Experiments

Complete at least one experiment run:
python -m src.main run \
  --data data/sample/california_housing.csv \
  --target MedHouseVal \
  --task regression \
  --max-iterations 3
2. Launch MLflow UI

Start the MLflow server with the correct backend URI:
mlflow ui --backend-store-uri file:./outputs/mlruns
The --backend-store-uri flag is required! Without it, MLflow looks in the default ./mlruns directory and shows no experiments.
3. Open in Browser

Navigate to the local server:
http://127.0.0.1:5000
You should see your experiments listed by name: autopilot_{dataset}_{session_id}

MLflow Storage Location

All tracking data is stored locally in the outputs/mlruns/ directory (defined in src/config.py:22):
outputs/
└── mlruns/
    ├── 0/                          # Default experiment (metadata)
    ├── 1/                          # autopilot_california_housing_a1b2c3d4
    │   ├── meta.yaml
    │   ├── {run_id}/               # Individual run directories
    │   │   ├── meta.yaml           # Run metadata
    │   │   ├── metrics/            # Metric values (RMSE, R², etc.)
    │   │   ├── params/             # Parameters (model_type, hyperparams)
    │   │   ├── tags/               # Tags (hypothesis, success)
    │   │   └── artifacts/          # Code, plots, reasoning.txt
    └── ...
Each experiment session creates a new MLflow experiment. Multiple runs are logged within each experiment.
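Because the tracking backend is just files on disk, the layout above can be inspected with nothing but the standard library. This sketch assumes MLflow's FileStore format, where each metric file holds one `timestamp value step` line per logged value; the directory built here is a throwaway fake, and `read_metric_history` is an illustrative helper, not part of the project:

```python
import os
import tempfile

def read_metric_history(metric_path):
    """Parse an MLflow FileStore metric file: one 'timestamp value step' line per entry."""
    history = []
    with open(metric_path) as f:
        for line in f:
            ts, value, step = line.split()
            history.append({"timestamp": int(ts), "value": float(value), "step": int(step)})
    return history

# Build a tiny fake run directory to demonstrate (paths mirror the tree above).
root = tempfile.mkdtemp()
metrics_dir = os.path.join(root, "mlruns", "1", "abc123", "metrics")
os.makedirs(metrics_dir)
with open(os.path.join(metrics_dir, "rmse"), "w") as f:
    f.write("1709370000000 0.7456 0\n1709370015000 0.4201 0\n")

history = read_metric_history(os.path.join(metrics_dir, "rmse"))
print(history[-1]["value"])  # latest logged rmse
```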

What Gets Logged

The MLflowTracker (defined in src/persistence/mlflow_tracker.py) logs comprehensive data for each iteration:

Parameters

Logged in log_experiment() at src/persistence/mlflow_tracker.py:103-112:
model_type
string
Model class name (e.g., XGBRegressor, RandomForestClassifier)
iteration
integer
Iteration number (0 = baseline)
model_{param}
any
All model hyperparameters with model_ prefix:
  • model_max_depth: 5
  • model_learning_rate: 0.05
  • model_n_estimators: 100
preprocessing_*
string
Preprocessing configuration:
  • preprocessing_missing: “median”
  • preprocessing_scaling: “standard”
  • preprocessing_encoding: “onehot”
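The prefixing scheme above (model_ for hyperparameters, preprocessing_ for pipeline settings) amounts to flattening nested config into a single parameter dict. A minimal sketch of that idea; the function name is illustrative, not the tracker's actual API:

```python
def flatten_params(model_type, iteration, hyperparams, preprocessing):
    """Flatten experiment config into MLflow-style prefixed parameters."""
    params = {"model_type": model_type, "iteration": iteration}
    # All model hyperparameters get a model_ prefix.
    params.update({f"model_{k}": v for k, v in hyperparams.items()})
    # Preprocessing settings get a preprocessing_ prefix.
    params.update({f"preprocessing_{k}": v for k, v in preprocessing.items()})
    return params

params = flatten_params(
    "XGBRegressor", 2,
    {"max_depth": 5, "learning_rate": 0.05, "n_estimators": 100},
    {"missing": "median", "scaling": "standard", "encoding": "onehot"},
)
print(params["model_max_depth"])        # 5
print(params["preprocessing_missing"])  # median
```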

Metrics

Logged in log_experiment() at src/persistence/mlflow_tracker.py:114-120:
Task-Specific Metrics
dict
All metrics from the experiment result:
  • Regression: rmse, mae, r2
  • Classification: accuracy, f1, precision, recall, roc_auc
execution_time
float
Time in seconds to train and evaluate the model
success
integer
1 if experiment succeeded, 0 if failed

Tags

Logged in log_experiment() at src/persistence/mlflow_tracker.py:122-126:
hypothesis
string
The hypothesis being tested (truncated to 250 chars)
success
string
“True” or “False” string representation

Artifacts

Logged files attached to each run:
Artifact           Path                         Description
reasoning.txt      artifacts/reasoning.txt      Gemini’s reasoning for experiment design
experiment_{N}.py  artifacts/experiment_{N}.py  Generated Python training script
error.txt          artifacts/error.txt          Error message if experiment failed
visualizations     artifacts/*.png              Charts (final_summary run only)

Experiments List

The home page shows all experiments:
Experiments
├── autopilot_california_housing_a1b2c3d4  (5 runs)
├── autopilot_bank_b2c3d4e5               (3 runs)
└── Default                                (0 runs)
Click an experiment name to view its runs.

Runs Table

The runs table displays all iterations:
Run Name          Start Time           Duration  rmse    r2      success
data_profile      2026-03-02 10:00:00  2s        -       -       -
baseline          2026-03-02 10:00:05  8s        0.7456  0.6012  1
log_transform_rf  2026-03-02 10:00:15  12s       0.4201  0.7834  1
xgboost_tuned     2026-03-02 10:00:30  15s       0.1332  0.8456  1
final_summary     2026-03-02 10:00:50  1s        -       -       -
Click column headers to sort by any metric. Use the search bar to filter runs.

Special Runs

data_profile
run
Logged at the start of each session (in log_data_profile() at src/persistence/mlflow_tracker.py:60-91).
Parameters:
  • n_rows, n_columns
  • n_numeric_features, n_categorical_features
  • target_column, target_type
Metrics:
  • total_missing_values
Artifacts:
  • data_profile.json — Full data profile
final_summary
run
Logged at the end of each session (in log_final_summary() at src/persistence/mlflow_tracker.py:151-185).
Metrics:
  • total_iterations
  • successful_experiments
  • total_time_seconds
  • best_metric
Tags:
  • best_experiment
  • termination_reason
  • phase
Artifacts:
  • final_state.json — Complete experiment state
  • Visualization plots (*.png)

Run Detail View

Click any run name to open the detail view, which shows high-level run information:
  • Run ID and name
  • Start time and duration
  • User and source

Comparing Runs

1. Select Runs

Check the boxes next to 2+ runs you want to compare.
2. Click Compare

Click the “Compare” button at the top of the runs table.
3. View Comparison

MLflow displays:
  • Parallel Coordinates Plot: Visualize parameter vs. metric relationships
  • Scatter Plot: Plot any two metrics against each other
  • Contour Plot: For hyperparameter tuning analysis
  • Table View: Side-by-side parameter and metric comparison

Example Comparison

Comparing iterations 1, 2, and 3:
Run               model_type             model_max_depth  rmse    r2
log_transform_rf  RandomForestRegressor  10               0.4201  0.7834
xgboost_initial   XGBRegressor           3                0.3567  0.8123
xgboost_tuned     XGBRegressor           5                0.1332  0.8456
Insights:
  • Deeper trees (max_depth 5 vs 3) improved XGBoost performance
  • XGBoost outperforms RandomForest on this dataset
  • RMSE reduced by 68% from iteration 1 to 3
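The 68% figure follows directly from the RMSE column of the comparison table:

```python
# RMSE at iteration 1 (log_transform_rf) vs. iteration 3 (xgboost_tuned)
rmse_iter1, rmse_iter3 = 0.4201, 0.1332
reduction = (rmse_iter1 - rmse_iter3) / rmse_iter1
print(f"{reduction:.1%}")  # 68.3%
```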

Searching and Filtering

Filter by Metric

Use the search bar with MLflow’s query syntax:
-- Runs with RMSE < 0.2
metrics.rmse < 0.2

-- Successful runs only
metrics.success = 1

-- Runs with high R²
metrics.r2 > 0.8

-- Combine conditions
metrics.rmse < 0.2 AND metrics.r2 > 0.8
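The same conditions can be reproduced in plain Python over run dictionaries, such as those returned by get_all_runs(); the run data below is illustrative:

```python
runs = [
    {"run_name": "baseline",         "metrics": {"rmse": 0.7456, "r2": 0.6012}},
    {"run_name": "log_transform_rf", "metrics": {"rmse": 0.4201, "r2": 0.7834}},
    {"run_name": "xgboost_tuned",    "metrics": {"rmse": 0.1332, "r2": 0.8456}},
]

# Equivalent of: metrics.rmse < 0.2 AND metrics.r2 > 0.8
matches = [
    r["run_name"] for r in runs
    if r["metrics"]["rmse"] < 0.2 and r["metrics"]["r2"] > 0.8
]
print(matches)  # ['xgboost_tuned']
```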

Filter by Parameter

-- Only XGBoost models
params.model_type = "XGBRegressor"

-- Max depth between 3 and 5
params.model_max_depth >= "3" AND params.model_max_depth <= "5"
Parameters are stored as strings in MLflow, so comparisons are lexicographic rather than numeric. Use string comparisons even for numeric values, and beware that multi-digit values may sort unexpectedly.
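The lexicographic pitfall is easy to demonstrate in Python, and string-typed param comparisons in filter queries behave the same way:

```python
# Params are stored as strings, so ordering is lexicographic, not numeric.
print("3" <= "5")      # True  — single-digit comparisons behave as expected
print("10" >= "3")     # False — "1" sorts before "3", so a depth of 10 is excluded
print(int("10") >= 3)  # True  — the numeric comparison you probably intended
```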

Filter by Tag

-- Successful experiments
tags.success = "True"

-- Specific hypothesis (partial match)
tags.hypothesis LIKE "%regularization%"

Downloading Artifacts

1. Navigate to Run

Click a run name to open the detail view.
2. Open Artifacts Tab

Select the “Artifacts” tab.
3. Download Files

  • Click any file to preview in browser
  • Right-click → “Save As” to download
  • Download entire artifact directory as ZIP

Useful Artifacts

reasoning.txt — Gemini’s full reasoning for designing the experiment:
Based on the previous experiments, I've observed that tree-based
models outperform linear models by 40%. The data profile shows
right-skewed target distribution, suggesting log transformation.

For this iteration, I'm testing RandomForestRegressor with:
- n_estimators: 100 (balance between performance and speed)
- max_depth: 10 (prevent overfitting observed in deeper trees)
- Log-transformed target (hypothesis: reduce skew impact)
experiment_{N}.py — The generated Python script that was executed:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import json

# Load data
data = pd.read_csv("data/sample/california_housing.csv")
# ... (full training script)
Useful for:
  • Reproducing results manually
  • Debugging preprocessing steps
  • Understanding exact model configuration
data_profile.json — Complete dataset analysis from the data_profile run:
{
  "n_rows": 20640,
  "n_columns": 9,
  "numeric_columns": ["MedInc", "HouseAge", ...],
  "missing_values": {"total_bedrooms": 207},
  "numeric_stats": {
    "MedInc": {"mean": 3.87, "std": 1.90, ...}
  }
}
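Once downloaded, the profile is plain JSON and easy to inspect with the standard library; this sketch reuses the example values above:

```python
import json

profile = json.loads("""
{
  "n_rows": 20640,
  "n_columns": 9,
  "missing_values": {"total_bedrooms": 207}
}
""")

# Fraction of rows missing a total_bedrooms value.
missing_ratio = profile["missing_values"]["total_bedrooms"] / profile["n_rows"]
print(f"{missing_ratio:.2%}")  # about 1%
```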

Programmatic Access

Query MLflow data programmatically using the MLflowTracker API:
from src.persistence.mlflow_tracker import MLflowTracker

# Create tracker
tracker = MLflowTracker(
    experiment_name="autopilot_california_housing_a1b2c3d4",
    tracking_uri="file:./outputs/mlruns"
)

# Get best run by RMSE (lower is better)
best_run = tracker.get_best_run(metric_name="rmse", ascending=True)
print(best_run)
# {
#   "run_id": "abc123",
#   "run_name": "xgboost_tuned",
#   "metrics": {"rmse": 0.1332, "r2": 0.8456},
#   "params": {"model_type": "XGBRegressor", ...}
# }

# Get all runs
all_runs = tracker.get_all_runs()
for run in all_runs:
    print(f"{run['run_name']}: RMSE={run['metrics'].get('rmse', 'N/A')}")
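Under the hood, get_best_run() presumably reduces to sorting runs by a metric. A minimal pure-Python sketch of that logic, independent of MLflow; the run data and the best_run helper are illustrative, not the tracker's implementation:

```python
def best_run(runs, metric_name, ascending=True):
    """Return the run with the best value for metric_name, skipping runs without it."""
    scored = [r for r in runs if metric_name in r["metrics"]]
    key = lambda r: r["metrics"][metric_name]
    return min(scored, key=key) if ascending else max(scored, key=key)

runs = [
    {"run_name": "baseline",      "metrics": {"rmse": 0.7456}},
    {"run_name": "xgboost_tuned", "metrics": {"rmse": 0.1332}},
    {"run_name": "data_profile",  "metrics": {}},  # no model metrics logged
]
print(best_run(runs, "rmse")["run_name"])  # xgboost_tuned
```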

Troubleshooting

UI shows no experiments
Cause: Incorrect --backend-store-uri.
Solution:
# Correct command
mlflow ui --backend-store-uri file:./outputs/mlruns

# Run from project root directory
cd /path/to/ml-experiment-autopilot
mlflow ui --backend-store-uri file:./outputs/mlruns
Port 5000 already in use
Cause: Another process is using port 5000.
Solution: Specify a different port:
mlflow ui --backend-store-uri file:./outputs/mlruns --port 5001
# Open http://127.0.0.1:5001
Runs have no metrics
Cause: The experiment failed before metrics were logged.
Solution: Check the success metric:
  • Filter for metrics.success = 0
  • Download error.txt artifact to see failure reason
  • Check outputs/experiments/{session_id}/ for generated code
UI loads slowly
Cause: Large number of runs in the experiment.
Solution:
  • Use search filters to reduce visible runs
  • Archive old experiments (move directories out of mlruns/)
  • Consider upgrading to SQLite/PostgreSQL backend for production

Next Steps

Understanding Results

Interpret metrics and analysis outputs

Troubleshooting

Resolve common issues
