Overview

ML Experiment Autopilot automatically logs all experiments to MLflow, providing a web UI for exploring metrics, parameters, and artifacts. This guide covers how to launch and navigate the MLflow interface.

Quick Start

1. Run Experiments

Complete at least one experiment run:
python -m src.main run \
  --data data/sample/california_housing.csv \
  --target MedHouseVal \
  --task regression \
  --max-iterations 3
2. Launch MLflow UI

Start the MLflow server with the correct backend URI:
mlflow ui --backend-store-uri file:./outputs/mlruns
The --backend-store-uri flag is required! Without it, MLflow looks in the default ./mlruns directory and shows no experiments.
3. Open in Browser

Navigate to the local server:
http://127.0.0.1:5000
You should see your experiments listed by name: autopilot_{dataset}_{session_id}

MLflow Storage Location

All tracking data is stored locally in the outputs/mlruns/ directory (defined in src/config.py:22):
outputs/
└── mlruns/
    ├── 0/                          # Default experiment (metadata)
    ├── 1/                          # autopilot_california_housing_a1b2c3d4
    │   ├── meta.yaml
    │   ├── {run_id}/               # Individual run directories
    │   │   ├── meta.yaml           # Run metadata
    │   │   ├── metrics/            # Metric values (RMSE, R², etc.)
    │   │   ├── params/             # Parameters (model_type, hyperparams)
    │   │   ├── tags/               # Tags (hypothesis, success)
    │   │   └── artifacts/          # Code, plots, reasoning.txt
    └── ...
Each experiment session creates a new MLflow experiment. Multiple runs are logged within each experiment.
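Because the tracking backend is just files on disk, the layout above can be inspected with nothing but the standard library. This sketch assumes MLflow's FileStore format, where each metric file holds one `timestamp value step` line per logged value; the directory built here is a throwaway fake, and `read_metric_history` is an illustrative helper, not part of the project:

```python
import os
import tempfile

def read_metric_history(metric_path):
    """Parse an MLflow FileStore metric file: one 'timestamp value step' line per entry."""
    history = []
    with open(metric_path) as f:
        for line in f:
            ts, value, step = line.split()
            history.append({"timestamp": int(ts), "value": float(value), "step": int(step)})
    return history

# Build a tiny fake run directory to demonstrate (paths mirror the tree above).
root = tempfile.mkdtemp()
metrics_dir = os.path.join(root, "mlruns", "1", "abc123", "metrics")
os.makedirs(metrics_dir)
with open(os.path.join(metrics_dir, "rmse"), "w") as f:
    f.write("1709370000000 0.7456 0\n1709370015000 0.4201 0\n")

history = read_metric_history(os.path.join(metrics_dir, "rmse"))
print(history[-1]["value"])  # latest logged rmse
```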

What Gets Logged

The MLflowTracker (defined in src/persistence/mlflow_tracker.py) logs comprehensive data for each iteration:

Parameters

Logged in log_experiment() at src/persistence/mlflow_tracker.py:103-112:
model_type
string
Model class name (e.g., XGBRegressor, RandomForestClassifier)
iteration
integer
Iteration number (0 = baseline)
model_{param}
any
All model hyperparameters with model_ prefix:
  • model_max_depth: 5
  • model_learning_rate: 0.05
  • model_n_estimators: 100
preprocessing_*
string
Preprocessing configuration:
  • preprocessing_missing: “median”
  • preprocessing_scaling: “standard”
  • preprocessing_encoding: “onehot”
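The prefixing scheme above (model_ for hyperparameters, preprocessing_ for pipeline settings) amounts to flattening nested config into a single parameter dict. A minimal sketch of that idea; the function name is illustrative, not the tracker's actual API:

```python
def flatten_params(model_type, iteration, hyperparams, preprocessing):
    """Flatten experiment config into MLflow-style prefixed parameters."""
    params = {"model_type": model_type, "iteration": iteration}
    # All model hyperparameters get a model_ prefix.
    params.update({f"model_{k}": v for k, v in hyperparams.items()})
    # Preprocessing settings get a preprocessing_ prefix.
    params.update({f"preprocessing_{k}": v for k, v in preprocessing.items()})
    return params

params = flatten_params(
    "XGBRegressor", 2,
    {"max_depth": 5, "learning_rate": 0.05, "n_estimators": 100},
    {"missing": "median", "scaling": "standard", "encoding": "onehot"},
)
print(params["model_max_depth"])        # 5
print(params["preprocessing_missing"])  # median
```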

Metrics

Logged in log_experiment() at src/persistence/mlflow_tracker.py:114-120:
Task-Specific Metrics
dict
All metrics from the experiment result:
  • Regression: rmse, mae, r2
  • Classification: accuracy, f1, precision, recall, roc_auc
execution_time
float
Time in seconds to train and evaluate the model
success
integer
1 if experiment succeeded, 0 if failed

Tags

Logged in log_experiment() at src/persistence/mlflow_tracker.py:122-126:
hypothesis
string
The hypothesis being tested (truncated to 250 chars)
success
string
“True” or “False” string representation

Artifacts

Logged files attached to each run:
Artifact           Path                         Description
reasoning.txt      artifacts/reasoning.txt      Gemini’s reasoning for experiment design
experiment_{N}.py  artifacts/experiment_{N}.py  Generated Python training script
error.txt          artifacts/error.txt          Error message if experiment failed
visualizations     artifacts/*.png              Charts (final_summary run only)

Experiments List

The home page shows all experiments:
Experiments
├── autopilot_california_housing_a1b2c3d4  (5 runs)
├── autopilot_bank_b2c3d4e5               (3 runs)
└── Default                                (0 runs)
Click an experiment name to view its runs.

Runs Table

The runs table displays all iterations:
Run Name          Start Time           Duration  rmse    r2      success
data_profile      2026-03-02 10:00:00  2s        -       -       -
baseline          2026-03-02 10:00:05  8s        0.7456  0.6012  1
log_transform_rf  2026-03-02 10:00:15  12s       0.4201  0.7834  1
xgboost_tuned     2026-03-02 10:00:30  15s       0.1332  0.8456  1
final_summary     2026-03-02 10:00:50  1s        -       -       -
Click column headers to sort by any metric. Use the search bar to filter runs.

Special Runs

data_profile
run
Logged at the start of each session (in log_data_profile() at src/persistence/mlflow_tracker.py:60-91).
Parameters:
  • n_rows, n_columns
  • n_numeric_features, n_categorical_features
  • target_column, target_type
Metrics:
  • total_missing_values
Artifacts:
  • data_profile.json — Full data profile
final_summary
run
Logged at the end of each session (in log_final_summary() at src/persistence/mlflow_tracker.py:151-185).
Metrics:
  • total_iterations
  • successful_experiments
  • total_time_seconds
  • best_metric
Tags:
  • best_experiment
  • termination_reason
  • phase
Artifacts:
  • final_state.json — Complete experiment state
  • Visualization plots (*.png)

Run Detail View

Click any run name to open the detail view, which shows high-level run information:
  • Run ID and name
  • Start time and duration
  • User and source

Comparing Runs

1. Select Runs

Check the boxes next to 2+ runs you want to compare.
2. Click Compare

Click the “Compare” button at the top of the runs table.
3. View Comparison

MLflow displays:
  • Parallel Coordinates Plot: Visualize parameter vs. metric relationships
  • Scatter Plot: Plot any two metrics against each other
  • Contour Plot: For hyperparameter tuning analysis
  • Table View: Side-by-side parameter and metric comparison

Example Comparison

Comparing iterations 1, 2, and 3:
Run               model_type             model_max_depth  rmse    r2
log_transform_rf  RandomForestRegressor  10               0.4201  0.7834
xgboost_initial   XGBRegressor           3                0.3567  0.8123
xgboost_tuned     XGBRegressor           5                0.1332  0.8456
Insights:
  • Deeper trees (max_depth 5 vs 3) improved XGBoost performance
  • XGBoost outperforms RandomForest on this dataset
  • RMSE reduced by 68% from iteration 1 to 3
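The 68% figure follows directly from the RMSE column of the comparison table:

```python
# RMSE at iteration 1 (log_transform_rf) vs. iteration 3 (xgboost_tuned)
rmse_iter1, rmse_iter3 = 0.4201, 0.1332
reduction = (rmse_iter1 - rmse_iter3) / rmse_iter1
print(f"{reduction:.1%}")  # 68.3%
```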

Searching and Filtering

Filter by Metric

Use the search bar with MLflow’s query syntax:
-- Runs with RMSE < 0.2
metrics.rmse < 0.2

-- Successful runs only
metrics.success = 1

-- Runs with high R²
metrics.r2 > 0.8

-- Combine conditions
metrics.rmse < 0.2 AND metrics.r2 > 0.8
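The same conditions can be reproduced in plain Python over run dictionaries, such as those returned by get_all_runs(); the run data below is illustrative:

```python
runs = [
    {"run_name": "baseline",         "metrics": {"rmse": 0.7456, "r2": 0.6012}},
    {"run_name": "log_transform_rf", "metrics": {"rmse": 0.4201, "r2": 0.7834}},
    {"run_name": "xgboost_tuned",    "metrics": {"rmse": 0.1332, "r2": 0.8456}},
]

# Equivalent of: metrics.rmse < 0.2 AND metrics.r2 > 0.8
matches = [
    r["run_name"] for r in runs
    if r["metrics"]["rmse"] < 0.2 and r["metrics"]["r2"] > 0.8
]
print(matches)  # ['xgboost_tuned']
```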

Filter by Parameter

-- Only XGBoost models
params.model_type = "XGBRegressor"

-- Max depth between 3 and 5
params.model_max_depth >= "3" AND params.model_max_depth <= "5"
Parameters are stored as strings in MLflow, so comparisons are lexicographic rather than numeric. Use string comparisons even for numeric values, and beware that multi-digit values may sort unexpectedly.
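The lexicographic pitfall is easy to demonstrate in Python, and string-typed param comparisons in filter queries behave the same way:

```python
# Params are stored as strings, so ordering is lexicographic, not numeric.
print("3" <= "5")      # True  — single-digit comparisons behave as expected
print("10" >= "3")     # False — "1" sorts before "3", so a depth of 10 is excluded
print(int("10") >= 3)  # True  — the numeric comparison you probably intended
```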

Filter by Tag

-- Successful experiments
tags.success = "True"

-- Specific hypothesis (partial match)
tags.hypothesis LIKE "%regularization%"

Downloading Artifacts

1. Navigate to Run

Click a run name to open the detail view.
2. Open Artifacts Tab

Select the “Artifacts” tab.
3. Download Files

  • Click any file to preview in browser
  • Right-click → “Save As” to download
  • Download entire artifact directory as ZIP

Useful Artifacts

reasoning.txt — Gemini’s full reasoning for designing the experiment:
Based on the previous experiments, I've observed that tree-based
models outperform linear models by 40%. The data profile shows
right-skewed target distribution, suggesting log transformation.

For this iteration, I'm testing RandomForestRegressor with:
- n_estimators: 100 (balance between performance and speed)
- max_depth: 10 (prevent overfitting observed in deeper trees)
- Log-transformed target (hypothesis: reduce skew impact)
experiment_{N}.py — The generated Python script that was executed:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import json

# Load data
data = pd.read_csv("data/sample/california_housing.csv")
# ... (full training script)
Useful for:
  • Reproducing results manually
  • Debugging preprocessing steps
  • Understanding exact model configuration
data_profile.json — Complete dataset analysis from the data_profile run:
{
  "n_rows": 20640,
  "n_columns": 9,
  "numeric_columns": ["MedInc", "HouseAge", ...],
  "missing_values": {"total_bedrooms": 207},
  "numeric_stats": {
    "MedInc": {"mean": 3.87, "std": 1.90, ...}
  }
}
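Once downloaded, the profile is plain JSON and easy to inspect with the standard library; this sketch reuses the example values above:

```python
import json

profile = json.loads("""
{
  "n_rows": 20640,
  "n_columns": 9,
  "missing_values": {"total_bedrooms": 207}
}
""")

# Fraction of rows missing a total_bedrooms value.
missing_ratio = profile["missing_values"]["total_bedrooms"] / profile["n_rows"]
print(f"{missing_ratio:.2%}")  # about 1%
```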

Programmatic Access

Query MLflow data programmatically using the MLflowTracker API:
from src.persistence.mlflow_tracker import MLflowTracker

# Create tracker
tracker = MLflowTracker(
    experiment_name="autopilot_california_housing_a1b2c3d4",
    tracking_uri="file:./outputs/mlruns"
)

# Get best run by RMSE (lower is better)
best_run = tracker.get_best_run(metric_name="rmse", ascending=True)
print(best_run)
# {
#   "run_id": "abc123",
#   "run_name": "xgboost_tuned",
#   "metrics": {"rmse": 0.1332, "r2": 0.8456},
#   "params": {"model_type": "XGBRegressor", ...}
# }

# Get all runs
all_runs = tracker.get_all_runs()
for run in all_runs:
    print(f"{run['run_name']}: RMSE={run['metrics'].get('rmse', 'N/A')}")
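Under the hood, get_best_run() presumably reduces to sorting runs by a metric. A minimal pure-Python sketch of that logic, independent of MLflow; the run data and the best_run helper are illustrative, not the tracker's implementation:

```python
def best_run(runs, metric_name, ascending=True):
    """Return the run with the best value for metric_name, skipping runs without it."""
    scored = [r for r in runs if metric_name in r["metrics"]]
    key = lambda r: r["metrics"][metric_name]
    return min(scored, key=key) if ascending else max(scored, key=key)

runs = [
    {"run_name": "baseline",      "metrics": {"rmse": 0.7456}},
    {"run_name": "xgboost_tuned", "metrics": {"rmse": 0.1332}},
    {"run_name": "data_profile",  "metrics": {}},  # no model metrics logged
]
print(best_run(runs, "rmse")["run_name"])  # xgboost_tuned
```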

Troubleshooting

UI shows no experiments
Cause: Incorrect --backend-store-uri.
Solution:
# Correct command
mlflow ui --backend-store-uri file:./outputs/mlruns

# Run from project root directory
cd /path/to/ml-experiment-autopilot
mlflow ui --backend-store-uri file:./outputs/mlruns
Port 5000 already in use
Cause: Another process is using port 5000.
Solution: Specify a different port:
mlflow ui --backend-store-uri file:./outputs/mlruns --port 5001
# Open http://127.0.0.1:5001
Runs have no metrics
Cause: The experiment failed before metrics were logged.
Solution: Check the success metric:
  • Filter for metrics.success = 0
  • Download error.txt artifact to see failure reason
  • Check outputs/experiments/{session_id}/ for generated code
UI loads slowly
Cause: Large number of runs in the experiment.
Solution:
  • Use search filters to reduce visible runs
  • Archive old experiments (move directories out of mlruns/)
  • Consider upgrading to SQLite/PostgreSQL backend for production

Next Steps

Understanding Results

Interpret metrics and analysis outputs

Troubleshooting

Resolve common issues
