
Overview

The MLflowTracker class provides MLflow integration for tracking experiments, logging metrics, parameters, and artifacts. It features graceful degradation if MLflow fails, ensuring experiments continue even without tracking.

Features

  • Local MLflow tracking - Stores data in local mlruns/ directory
  • Automatic experiment creation - Creates MLflow experiments automatically
  • Metric and parameter logging - Tracks all experiment metadata
  • Artifact storage - Saves code, profiles, and visualizations
  • Graceful degradation - Continues if MLflow fails

Class Definition

MLflowTracker

from src.persistence.mlflow_tracker import MLflowTracker

tracker = MLflowTracker(
    experiment_name="autopilot_housing_abc123",
    tracking_uri="file:///path/to/mlruns"
)
Parameters:
  • experiment_name (str, required) - Name of the MLflow experiment. Typically formatted as: autopilot_{dataset_name}_{session_id}
  • tracking_uri (Optional[str]) - MLflow tracking URI. Defaults to the local mlruns/ directory. Examples:
    • "file:///path/to/mlruns" - Local file storage
    • "http://localhost:5000" - Remote MLflow server
    • "databricks" - Databricks workspace

Properties

  • experiment_name (str) - MLflow experiment name
  • experiment_id (str) - MLflow experiment ID (auto-generated)
  • disabled (bool) - True if MLflow initialization failed
  • client (MlflowClient) - MLflow client instance for queries

Methods

log_data_profile()

Log dataset profile as experiment metadata.
tracker.log_data_profile(profile)
Parameters:
  • profile (DataProfile, required) - DataProfile from DataProfiler
Logs:
  • Parameters: Dataset dimensions, feature counts, target info
  • Metrics: Total missing values
  • Artifacts: Full profile as data_profile.json
Example logged parameters:
{
    "n_rows": 1000,
    "n_columns": 15,
    "n_numeric_features": 8,
    "n_categorical_features": 6,
    "target_column": "price",
    "target_type": "numeric"
}

log_experiment()

Log a single experiment run with all metadata.
tracker.log_experiment(result)
Parameters:
  • result (ExperimentResult, required) - ExperimentResult from ExperimentRunner
Logs:
  • Parameters: Model type, hyperparameters, preprocessing config
  • Metrics: All performance metrics, execution time, success flag
  • Tags: Hypothesis (truncated to 250 chars), success status
  • Artifacts:
    • Generated Python code
    • Reasoning text file
    • Error message (if failed)
Example logged parameters:
{
    "model_type": "RandomForestClassifier",
    "iteration": 3,
    "model_n_estimators": 200,
    "model_max_depth": 10,
    "model_min_samples_split": 5,
    "preprocessing_missing": "median",
    "preprocessing_scaling": "standard",
    "preprocessing_encoding": "onehot"
}
Example logged metrics:
{
    "accuracy": 0.87,
    "f1": 0.85,
    "precision": 0.86,
    "recall": 0.84,
    "execution_time": 45.3,
    "success": 1
}
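The prefixed keys above (model_*, preprocessing_*) suggest that nested experiment configuration is flattened into namespaced parameter names before logging. A minimal sketch of such flattening (the helper name and input structure are assumptions, not taken from the actual source):

```python
def flatten_params(spec: dict) -> dict:
    """Flatten nested config sections into prefixed MLflow parameter names."""
    flat = {}
    for section, values in spec.items():
        if isinstance(values, dict):
            # Namespace each nested key under its section, e.g. model_n_estimators
            for key, value in values.items():
                flat[f"{section}_{key}"] = value
        else:
            flat[section] = values
    return flat

spec = {
    "model_type": "RandomForestClassifier",
    "model": {"n_estimators": 200, "max_depth": 10},
    "preprocessing": {"missing": "median", "scaling": "standard"},
}
print(flatten_params(spec))
# {'model_type': 'RandomForestClassifier', 'model_n_estimators': 200,
#  'model_max_depth': 10, 'preprocessing_missing': 'median',
#  'preprocessing_scaling': 'standard'}
```

Flat, prefixed names keep related parameters grouped together when sorting columns in the MLflow UI.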

log_final_summary()

Log final experiment summary after all iterations.
tracker.log_final_summary(state)
Parameters:
  • state (ExperimentState, required) - Final ExperimentState after all iterations
Logs:
  • Metrics: Total iterations, successful experiments, total time, best metric
  • Tags: Best experiment name, termination reason, final phase
  • Artifacts: Complete state as final_state.json

log_visualizations()

Log visualization plots as artifacts.
tracker.log_visualizations(plot_paths)
Parameters:
  • plot_paths (list[Path], required) - Paths to the generated plot PNG files
Example:
plot_paths = [
    Path("plots/metric_progression.png"),
    Path("plots/model_comparison.png"),
    Path("plots/improvement_over_baseline.png")
]
tracker.log_visualizations(plot_paths)

get_best_run()

Query for the best run based on a metric.
best_run = tracker.get_best_run(
    metric_name="rmse",
    ascending=True
)
Parameters:
  • metric_name (str, required) - Name of the metric to optimize
  • ascending (bool) - Whether lower values are better. Default: True
    • True for RMSE, MAE, loss metrics
    • False for accuracy, F1, R² metrics
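Because the correct ascending flag depends on the metric, a small lookup helper can prevent optimizing in the wrong direction (purely illustrative; this helper is not part of the tracker's API):

```python
# Metrics where a smaller value means a better model
LOWER_IS_BETTER = {"rmse", "mae", "mse", "loss", "log_loss"}

def is_ascending(metric_name: str) -> bool:
    """Return True when lower values of this metric are better."""
    return metric_name.lower() in LOWER_IS_BETTER

print(is_ascending("RMSE"))  # True
print(is_ascending("f1"))    # False
```

Then `tracker.get_best_run(metric, ascending=is_ascending(metric))` picks the right direction automatically.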
Returns: Optional[dict] with:
  • run_id (str) - MLflow run ID
  • run_name (str) - Run name
  • metrics (dict) - All metrics
  • params (dict) - All parameters
Example:
best = tracker.get_best_run("f1", ascending=False)
if best:
    print(f"Best F1: {best['metrics']['f1']:.4f}")
    print(f"Model: {best['params']['model_type']}")

get_all_runs()

Retrieve all runs in the experiment.
runs = tracker.get_all_runs()
Returns: list[dict] - List of run dictionaries, each with:
  • run_id (str)
  • run_name (str)
  • metrics (dict)
  • params (dict)
  • status (str) - "FINISHED", "RUNNING", "FAILED", etc.
Example:
runs = tracker.get_all_runs()
for run in runs:
    print(f"{run['run_name']}: {run['status']}")
    if 'f1' in run['metrics']:
        print(f"  F1: {run['metrics']['f1']:.4f}")

Helper Functions

create_tracker()

Create a tracker with standardized naming.
from src.persistence.mlflow_tracker import create_tracker

tracker = create_tracker(
    session_id="abc123",
    dataset_name="housing"
)
# Creates experiment named: "autopilot_housing_abc123"
Parameters:
  • session_id (str, required) - Unique session identifier
  • dataset_name (str, required) - Name of the dataset being used

Complete Example

from pathlib import Path
from src.persistence.mlflow_tracker import create_tracker
from src.execution.data_profiler import DataProfiler
from src.execution.experiment_runner import ExperimentRunner
from src.orchestration.state import ExperimentSpec, ExperimentState

# Create tracker
tracker = create_tracker(
    session_id="abc123",
    dataset_name="titanic"
)

# Log data profile
profiler = DataProfiler(
    data_path=Path("titanic.csv"),
    target_column="survived",
    task_type="classification"
)
profile = profiler.profile()
tracker.log_data_profile(profile)

# Run and log experiments
runner = ExperimentRunner()
for spec in experiment_specs:
    # Generate and run experiment
    script_path = generator.generate(spec, ...)
    result = runner.run(script_path, spec, iteration=i)
    
    # Log to MLflow
    tracker.log_experiment(result)
    
    if result.success:
        print(f"✓ {result.experiment_name}: {result.metrics}")

# Log final summary
tracker.log_final_summary(state)

# Log visualizations
plot_paths = viz_generator.generate(state, output_dir)
tracker.log_visualizations(plot_paths)

# Query best result
best_run = tracker.get_best_run("f1", ascending=False)
if best_run:  # get_best_run returns None if no runs match
    print(f"Best F1: {best_run['metrics']['f1']:.4f}")
    print(f"Best model: {best_run['params']['model_type']}")

Viewing Results in MLflow UI

Start the MLflow UI to view tracked experiments:
mlflow ui --backend-store-uri ./mlruns
Then open http://localhost:5000 in your browser. UI features:
  • Compare experiments side-by-side
  • Visualize metric trends
  • Download artifacts (code, profiles, plots)
  • Search and filter runs
  • Export to CSV

Error Handling

The tracker gracefully handles MLflow failures:
tracker = MLflowTracker(experiment_name="test")

if tracker.disabled:
    print("MLflow tracking disabled due to initialization failure")
    # Experiments continue without tracking
else:
    print(f"Tracking to: {tracker.tracking_uri}")
    tracker.log_experiment(result)
All logging methods fail silently with warnings if MLflow is disabled. Experiments continue uninterrupted.
Example warning:
⚠ MLflow initialization failed, tracking disabled: No module named 'mlflow'
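The fail-silent behavior described above is a common pattern: wrap every logging method so that a disabled tracker becomes a no-op and logging errors are downgraded to warnings. A sketch of how this might look (hypothetical, not the actual source):

```python
import functools

def safe_log(method):
    """Skip the call when tracking is disabled; warn instead of raising."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if self.disabled:
            return None
        try:
            return method(self, *args, **kwargs)
        except Exception as exc:  # never let tracking break an experiment
            print(f"⚠ MLflow logging failed, continuing: {exc}")
            return None
    return wrapper

class Tracker:
    def __init__(self, disabled: bool = False):
        self.disabled = disabled

    @safe_log
    def log_metric(self, name, value):
        return (name, value)  # stand-in for a real MLflow call

print(Tracker(disabled=True).log_metric("f1", 0.9))   # None
print(Tracker(disabled=False).log_metric("f1", 0.9))  # ('f1', 0.9)
```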

Remote Tracking Server

Connect to a remote MLflow server:
tracker = MLflowTracker(
    experiment_name="autopilot_housing_abc123",
    tracking_uri="http://mlflow-server:5000"
)
Set environment variable:
export MLFLOW_TRACKING_URI=http://mlflow-server:5000
python main.py

Artifact Organization

Artifacts are organized per run:
mlruns/
└── {experiment_id}/
    ├── {run_id_1}/  # data_profile run
    │   └── artifacts/
    │       └── data_profile.json
    ├── {run_id_2}/  # experiment 1
    │   └── artifacts/
    │       ├── experiment_1.py
    │       └── reasoning.txt
    ├── {run_id_3}/  # experiment 2
    │   └── artifacts/
    │       ├── experiment_2.py
    │       └── reasoning.txt
    └── {run_id_final}/  # final_summary
        └── artifacts/
            ├── final_state.json
            ├── metric_progression.png
            ├── model_comparison.png
            └── improvement_over_baseline.png
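With the local file store, an artifact's on-disk location follows directly from the layout above, so a path can be assembled without going through the client (a sketch based on that layout; for remote stores, use MlflowClient.download_artifacts instead of path arithmetic):

```python
from pathlib import Path

def artifact_path(mlruns: Path, experiment_id: str, run_id: str, name: str) -> Path:
    """Build the on-disk location of an artifact in a local mlruns/ store."""
    return mlruns / experiment_id / run_id / "artifacts" / name

p = artifact_path(Path("mlruns"), "0", "abc123", "data_profile.json")
print(p.as_posix())  # mlruns/0/abc123/artifacts/data_profile.json
```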

Querying Experiments

Search by metric threshold

from mlflow.tracking import MlflowClient

client = tracker.client
runs = client.search_runs(
    experiment_ids=[tracker.experiment_id],
    filter_string="metrics.f1 > 0.85",
    order_by=["metrics.f1 DESC"]
)

for run in runs:
    print(f"{run.info.run_name}: F1={run.data.metrics['f1']:.4f}")

Get parameter values

runs = tracker.get_all_runs()
for run in runs:
    if run['params'].get('model_type') == 'RandomForestClassifier':
        n_est = run['params'].get('model_n_estimators')
        depth = run['params'].get('model_max_depth')
        f1 = run['metrics'].get('f1')
        print(f"RF(n={n_est}, depth={depth}): F1={f1:.4f}")

Comparing with Baseline

# Get baseline (first experiment)
runs = tracker.get_all_runs()
baseline = runs[-1]  # Oldest run
baseline_f1 = baseline['metrics']['f1']

# Compare all later runs
for run in runs[:-1]:
    current_f1 = run['metrics'].get('f1')
    if current_f1 is None:
        continue  # skip failed runs that logged no F1 score
    improvement = (current_f1 - baseline_f1) / baseline_f1 * 100
    print(f"{run['run_name']}: {improvement:+.1f}% vs baseline")

Source Location

~/workspace/source/src/persistence/mlflow_tracker.py
