
Overview

ML Experiment Autopilot generates comprehensive results for every experiment iteration. This guide explains how to interpret metrics, analysis outputs, trend patterns, and final reports.

Result Structure

Each experiment produces an ExperimentResult (defined in src/orchestration/state.py:70-88) containing:
class ExperimentResult:
    experiment_name: str          # e.g., "xgboost_tuned_lr"
    iteration: int                # 0 = baseline, 1+ = iterations
    model_type: str               # e.g., "XGBRegressor"
    model_params: dict            # Hyperparameters used
    preprocessing: PreprocessingConfig
    metrics: dict[str, float]     # All computed metrics
    hypothesis: str               # Hypothesis being tested
    reasoning: str                # Gemini's reasoning
    execution_time: float         # Seconds to train + evaluate
    success: bool                 # True if completed successfully
    error_message: str | None     # If failed
    code_path: str                # Path to generated script
    timestamp: datetime
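Because results are plain dataclasses, downstream tooling can consume them directly. The snippet below is an illustrative sketch using a trimmed stand-in for `ExperimentResult` (the real class in `state.py` has more fields) to pick the best successful run:

```python
from dataclasses import dataclass, field

# Trimmed, illustrative stand-in for ExperimentResult: only the fields
# needed to pick a winner (the real class in state.py has more).
@dataclass
class ExperimentResult:
    experiment_name: str
    iteration: int
    metrics: dict[str, float] = field(default_factory=dict)
    success: bool = True

def best_by_rmse(results: list[ExperimentResult]) -> ExperimentResult:
    """Return the successful result with the lowest RMSE."""
    candidates = [r for r in results if r.success and "rmse" in r.metrics]
    return min(candidates, key=lambda r: r.metrics["rmse"])

results = [
    ExperimentResult("baseline", 0, {"rmse": 0.7456}),
    ExperimentResult("xgboost_tuned", 2, {"rmse": 0.1332}),
    ExperimentResult("failed_run", 1, success=False),
]
print(best_by_rmse(results).experiment_name)  # xgboost_tuned
```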

Console Output

During execution, results are displayed with rich formatting:

Iteration Header

┌─────────────────────────────────────────────────────────────┐
│ ITERATION 3 / 20                                            │
│ Experiment: xgboost_regularized                             │
└─────────────────────────────────────────────────────────────┘

Results Panel

┌─────────────────────────────────────────────────────────────┐
│ RESULTS                                                     │
├─────────────────────────────────────────────────────────────┤
│ Success: ✓                                                  │
│ Execution time: 12.3 seconds                                │
│                                                             │
│ Metrics:                                                    │
│   rmse: 0.1332                                              │
│   mae: 0.0891                                               │
│   r2: 0.8456                                                │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ ★ NEW BEST                                                  │
│ xgboost_regularized                                         │
│ rmse: 0.1332 (82.1% better than baseline)                  │
└─────────────────────────────────────────────────────────────┘

Analysis Output

┌─────────────────────────────────────────────────────────────┐
│ RESULTS ANALYSIS                                            │
├─────────────────────────────────────────────────────────────┤
│ Trend: IMPROVING                                            │
│ RMSE: 0.1332   ★ NEW BEST                                   │
│   82.1% better than baseline                                │
│                                                             │
│ Key Observations:                                           │
│   - Boosting provided 10.3% improvement over bagging        │
│   - Log transformation remains critical for this target     │
│   - Diminishing returns suggest stopping after next round   │
└─────────────────────────────────────────────────────────────┘

Metrics Explained

Regression Metrics

For --task regression, the autopilot computes:
rmse (float)
Root Mean Squared Error — Lower is better. Average prediction error magnitude; penalizes large errors more than MAE.
rmse = sqrt(mean((y_true - y_pred)²))
Typical values: 0.1-1.0 for normalized targets, higher for raw scales.

mae (float)
Mean Absolute Error — Lower is better. Average absolute difference between predictions and actuals.
mae = mean(|y_true - y_pred|)
More interpretable than RMSE; less sensitive to outliers.

r2 (float)
R² Score (Coefficient of Determination) — Higher is better (max 1.0). Proportion of variance explained by the model.
r2 = 1 - (sum((y_true - y_pred)²) / sum((y_true - mean(y_true))²))
  • 1.0: Perfect predictions
  • 0.0: No better than predicting the mean
  • Negative: Worse than predicting the mean
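The three formulas above can be verified with a few lines of plain Python. This is a hand-rolled sketch for clarity; the autopilot would typically rely on the scikit-learn equivalents:

```python
import math

def regression_metrics(y_true: list[float], y_pred: list[float]) -> dict[str, float]:
    """Compute RMSE, MAE, and R² exactly as defined above."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e ** 2 for e in errors)  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }

m = regression_metrics([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 4.0, 8.0])
print({k: round(v, 4) for k, v in m.items()})
# {'rmse': 0.9354, 'mae': 0.75, 'r2': 0.7241}
```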

Classification Metrics

For --task classification, the autopilot computes:
accuracy (float)
Accuracy — Higher is better (0-1). Fraction of correct predictions.
accuracy = (TP + TN) / (TP + TN + FP + FN)
Caution: Misleading for imbalanced datasets.

f1 (float)
F1 Score — Higher is better (0-1). Harmonic mean of precision and recall.
f1 = 2 * (precision * recall) / (precision + recall)
Better for imbalanced datasets than accuracy.

precision (float)
Precision — Higher is better (0-1). Fraction of positive predictions that are correct.
precision = TP / (TP + FP)

recall (float)
Recall (Sensitivity) — Higher is better (0-1). Fraction of actual positives correctly identified.
recall = TP / (TP + FN)

roc_auc (float)
ROC AUC — Higher is better (0.5-1.0). Area under the ROC curve; measures class separability.
  • 1.0: Perfect separation
  • 0.5: Random guessing
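All four threshold metrics follow directly from confusion-matrix counts. The sketch below (illustrative, not the autopilot's code) also shows why accuracy can flatter an imbalanced dataset while F1 does not:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Derive accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Imbalanced example: 100 actual positives vs 900 actual negatives.
m = classification_metrics(tp=90, tn=850, fp=50, fn=10)
print(round(m["accuracy"], 2), round(m["f1"], 2))  # 0.94 0.75
```

Accuracy looks strong (0.94) mostly because the majority class dominates; the F1 of 0.75 reflects the weaker precision on the rare positive class.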

Metric Comparisons

The ResultsAnalyzer (defined in src/cognitive/results_analyzer.py:142-226) computes comparisons:
class MetricComparison:
    metric_name: str                    # e.g., "rmse"
    current_value: float                # Current iteration value
    baseline_value: float | None        # Iteration 0 value
    best_value: float | None            # Best across all iterations
    previous_value: float | None        # Previous iteration value
    change_from_baseline_pct: float     # % improvement from baseline
    change_from_best_pct: float         # % difference from best
    change_from_previous_pct: float     # % change from previous
    is_improvement: bool                # Exceeds improvement threshold
    is_new_best: bool                   # New best value achieved

Improvement Calculation

The autopilot correctly handles “lower is better” vs “higher is better” metrics.

Lower is better (RMSE, MAE, MSE, log_loss):
improvement_pct = ((baseline - current) / |baseline|) * 100
# Positive % = improvement (lower value)

Higher is better (accuracy, F1, R², precision, recall):
improvement_pct = ((current - baseline) / |baseline|) * 100
# Positive % = improvement (higher value)

The default improvement threshold is 0.5% relative change, configurable in src/config.py:53.
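Both formulas collapse into one helper once the metric direction is known. This is a sketch of the calculation described above, not the actual ResultsAnalyzer code; the threshold constant is assumed from the config default:

```python
def improvement_pct(baseline: float, current: float, *, lower_is_better: bool) -> float:
    """Signed relative change in percent; positive always means improvement."""
    delta = (baseline - current) if lower_is_better else (current - baseline)
    return delta / abs(baseline) * 100

IMPROVEMENT_THRESHOLD_PCT = 0.5  # assumed default, per src/config.py:53

# RMSE fell from the 0.7456 baseline to 0.1332:
gain = improvement_pct(0.7456, 0.1332, lower_is_better=True)
print(round(gain, 1), gain > IMPROVEMENT_THRESHOLD_PCT)  # 82.1 True
```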

Trend Patterns

The ResultsAnalyzer detects performance trends across iterations (defined in src/orchestration/state.py:38-45):
IMPROVING (trend)
Consistent improvement over the last 2+ iterations (each >0.5% better).
Action: Continue current strategy (exploitation).

DEGRADING (trend)
Consistent degradation over the last 2+ iterations.
Action: Explore different model families or preprocessing.

PLATEAU (trend)
Less than 0.5% change for 2+ consecutive iterations.
Action: Consider stopping or trying radical changes.

FLUCTUATING (trend)
Alternating improvement and degradation.
Action: Reduce exploration; focus on stable approaches.

INITIAL (trend)
Fewer than 3 experiments; insufficient data for pattern detection.
Action: Continue exploring.
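As a rough illustration, the rules above for a lower-is-better metric could be sketched like this (illustrative only; the real ResultsAnalyzer logic may differ in detail):

```python
def detect_trend(history: list[float], threshold_pct: float = 0.5) -> str:
    """Classify a lower-is-better metric history per the rules above."""
    if len(history) < 3:
        return "INITIAL"
    recent = history[-3:]
    # Signed % change between consecutive values; positive = improvement.
    changes = [(prev - curr) / abs(prev) * 100
               for prev, curr in zip(recent, recent[1:])]
    if all(c > threshold_pct for c in changes):
        return "IMPROVING"
    if all(c < -threshold_pct for c in changes):
        return "DEGRADING"
    if all(abs(c) <= threshold_pct for c in changes):
        return "PLATEAU"
    return "FLUCTUATING"

print(detect_trend([0.7456, 0.4201, 0.1332]))  # IMPROVING
```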

Analysis Result

Each iteration produces an AnalysisResult (defined in src/orchestration/state.py:152-162):
class AnalysisResult:
    experiment_name: str
    iteration: int
    success: bool
    primary_metric: MetricComparison | None
    trend_pattern: TrendPattern
    key_observations: list[str]      # Gemini-generated insights
    reasoning: str                   # Detailed analysis
    timestamp: datetime

Key Observations

Gemini generates 3-5 specific, actionable observations per iteration:
[
  "XGBoost with max_depth=5 achieved RMSE of 0.1332, 10.3% better than RandomForest",
  "Log transformation of target variable remains critical—untransformed models fail",
  "Learning rate 0.05 provides better generalization than 0.1 (validation RMSE)",
  "Feature importance shows 'MedInc' and 'AveRooms' dominate predictions",
  "Diminishing returns observed: last 2 iterations improved by <2% each"
]

Hypotheses

After analysis, Gemini generates ranked hypotheses for the next iteration (defined in src/orchestration/state.py:165-191):
class Hypothesis:
    hypothesis_id: str               # "h1", "h2", "h3"
    statement: str                   # Clear hypothesis statement
    rationale: str                   # Why worth testing
    suggested_model: str | None      # e.g., "LGBMRegressor"
    suggested_params: dict           # Recommended hyperparameters
    confidence_score: float          # 0-1 confidence in success
    priority: int                    # 1=highest, 3=lowest

class HypothesisSet:
    iteration: int
    analysis_summary: str
    hypotheses: list[Hypothesis]     # 1-3 hypotheses
    exploration_vs_exploitation: str # "explore", "exploit", "balanced"
    reasoning: str

Example Hypothesis Output

┌─────────────────────────────────────────────────────────────┐
│ HYPOTHESES FOR NEXT ITERATION                               │
├─────────────────────────────────────────────────────────────┤
│ Strategy: exploit                                           │
│                                                             │
│ 1. [Priority: 1] Fine-tune XGBoost regularization          │
│    Confidence: 0.72 | Models: XGBRegressor                  │
│    Rationale: XGBoost shows best performance; alpha/lambda  │
│    tuning may reduce overfitting observed in validation     │
│                                                             │
│ 2. [Priority: 2] Try LightGBM as alternative booster        │
│    Confidence: 0.65 | Models: LGBMRegressor                 │
│    Rationale: LightGBM often faster and comparable to XGB   │
└─────────────────────────────────────────────────────────────┘

Final Report

At completion, Gemini generates a narrative Markdown report in outputs/reports/ containing:

Report Sections

1. Executive Summary

High-level overview of the experimental journey:
  • Dataset characteristics (rows, features, task type)
  • Total iterations completed
  • Best model and performance
  • Key insights discovered
2. Methodology

  • Data profiling results
  • Baseline model selection
  • Experiment design approach
  • Termination criteria
3. Results Table

Sortable table with all experiments:
| Iteration | Experiment    | Model                 | RMSE   | R²     |
|-----------|---------------|-----------------------|--------|--------|
| 0         | baseline      | LinearRegression      | 0.7456 | 0.6012 |
| 1         | log_transform | RandomForestRegressor | 0.4201 | 0.7834 |
| 2         | xgboost_tuned | XGBRegressor          | 0.1332 | 0.8456 |
4. Best Model Details

  • Model type and hyperparameters
  • Preprocessing configuration
  • Final metrics with comparisons
  • Feature importance (if available)
5. Key Insights

Gemini’s narrative insights:
  • Why certain approaches worked
  • Failed experiments and learnings
  • Recommendations for production
  • Suggestions for further improvement
6. Visualizations

Embedded charts:
  • Metric progression across iterations (line chart)
  • Model comparison (bar chart)
  • Improvement over baseline (bar chart)
Generated by VisualizationGenerator in src/execution/visualization_generator.py.
7. Experiment Appendix

Per-experiment details:
  • Full hypothesis and reasoning
  • Complete hyperparameters
  • Preprocessing steps
  • Execution time
  • Error messages (if failed)

Session State

The complete experiment state is saved to outputs/state_{session_id}.json:
{
  "session_id": "a1b2c3d4",
  "config": {
    "data_path": "data/sample/california_housing.csv",
    "target_column": "MedHouseVal",
    "task_type": "regression",
    "max_iterations": 5,
    "primary_metric": "rmse"
  },
  "data_profile": { ... },
  "experiments": [ ... ],
  "current_iteration": 5,
  "phase": "completed",
  "best_metric": 0.1332,
  "best_experiment": "xgboost_regularized",
  "termination_reason": "Performance plateau detected"
}
This file can be used to:
  • Resume interrupted experiments (--resume)
  • Audit experimental history
  • Reproduce results
  • Analyze Gemini’s decision patterns
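For example, pulling the best run out of a saved state file takes a few lines of standard-library JSON handling. The inline document below is a trimmed, hypothetical state file matching the schema above:

```python
import json

# Hypothetical trimmed state file following the schema shown above.
state_json = """
{
  "session_id": "a1b2c3d4",
  "phase": "completed",
  "best_metric": 0.1332,
  "best_experiment": "xgboost_regularized",
  "termination_reason": "Performance plateau detected"
}
"""

state = json.loads(state_json)
if state["phase"] == "completed":
    print(f'{state["best_experiment"]}: rmse={state["best_metric"]}')
```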

Interpreting Common Patterns

Observation: Best model is 80%+ better than baseline.
Interpretation:
  • Dataset benefits significantly from sophisticated models
  • Baseline (linear model) was too simple
  • Gemini successfully identified effective strategies
Action: Trust the best model; consider even more iterations.

Observation: No improvement after 3-5 iterations.
Interpretation:
  • Dataset may be simple (baseline already near-optimal)
  • Limited feature information available
  • Different preprocessing (feature engineering) may be needed
Action: Review data quality; consider manual feature engineering.

Observation: Multiple experiments fail with errors.
Interpretation:
  • Check error_message in results
  • Common causes: memory issues, hyperparameter conflicts, data type mismatches
  • Generated code is saved in outputs/experiments/ for debugging
Action: See the Troubleshooting guide.

Observation: Metrics vary wildly between iterations.
Interpretation:
  • High-variance models (e.g., deep trees without regularization)
  • Train/test split instability
  • Data leakage or inconsistent preprocessing
Action: Check preprocessing consistency; consider cross-validation.

Next Steps

MLflow Tracking

Explore detailed metrics in the MLflow UI

Troubleshooting

Resolve common issues and errors
