## Overview

ML Experiment Autopilot generates comprehensive results for every experiment iteration. This guide explains how to interpret metrics, analysis outputs, trend patterns, and final reports.

## Result Structure
Each experiment produces an `ExperimentResult` (defined in src/orchestration/state.py:70-88).
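The authoritative field list lives in src/orchestration/state.py; the sketch below is an illustrative guess at the shape of the record (field names here are hypothetical, not verbatim):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExperimentResult:
    """Illustrative sketch only; see src/orchestration/state.py:70-88 for the real fields."""
    iteration: int
    experiment_name: str
    model_name: str
    metrics: dict = field(default_factory=dict)  # e.g. {"rmse": 0.42, "r2": 0.78}
    success: bool = True
    error_message: Optional[str] = None  # populated when the experiment fails
```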
## Console Output

During execution, results are displayed with rich formatting.

### Iteration Header

### Results Panel

### Analysis Output
## Metrics Explained

### Regression Metrics

For `--task regression`, the autopilot computes:
**Root Mean Squared Error (RMSE)** — Lower is better
Average prediction error magnitude. Penalizes large errors more than MAE. Typical values: 0.1-1.0 for normalized targets, higher for raw scales.

**Mean Absolute Error (MAE)** — Lower is better
Average absolute difference between predictions and actuals. More interpretable than RMSE; less sensitive to outliers.

**R² Score (Coefficient of Determination)** — Higher is better (max 1.0)
Proportion of variance explained by the model.
- 1.0: Perfect predictions
- 0.0: No better than predicting the mean
- Negative: Worse than predicting the mean
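For illustration, all three regression metrics can be computed with scikit-learn (a sketch, not the autopilot's internal code; the sample arrays are made up):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 2.5, 4.0, 5.5])
y_pred = np.array([2.8, 2.7, 4.2, 5.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors quadratically
mae = mean_absolute_error(y_true, y_pred)           # average absolute error
r2 = r2_score(y_true, y_pred)                       # 1.0 = perfect, 0.0 = mean predictor

print(f"RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")
```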
### Classification Metrics

For `--task classification`, the autopilot computes:
**Accuracy** — Higher is better (0-1)
Fraction of correct predictions. Caution: misleading for imbalanced datasets.

**F1 Score** — Higher is better (0-1)
Harmonic mean of precision and recall. Better suited to imbalanced datasets than accuracy.

**Precision** — Higher is better (0-1)
Fraction of positive predictions that are correct.

**Recall (Sensitivity)** — Higher is better (0-1)
Fraction of actual positives correctly identified.

**ROC AUC** — Higher is better (0.5-1.0)
Area under the ROC curve; measures class separability.
- 1.0: Perfect separation
- 0.5: Random guessing
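The classification metrics can likewise be computed with scikit-learn (an illustrative sketch with made-up labels and probabilities):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.2, 0.6, 0.8, 0.4, 0.9, 0.1])  # predicted P(class=1)
y_pred = (y_prob >= 0.5).astype(int)               # hard labels at a 0.5 threshold

acc = accuracy_score(y_true, y_pred)    # fraction correct
prec = precision_score(y_true, y_pred)  # correct among predicted positives
rec = recall_score(y_true, y_pred)      # found among actual positives
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)     # threshold-free separability
```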
## Metric Comparisons

The `ResultsAnalyzer` (defined in src/cognitive/results_analyzer.py:142-226) computes metric comparisons across iterations.

### Improvement Calculation
The autopilot handles "lower is better" vs "higher is better" metrics when computing improvement. For lower-is-better metrics (RMSE, MAE, MSE, log_loss), a decrease counts as improvement; for higher-is-better metrics (accuracy, F1, R², ROC AUC), an increase does.

## Trend Patterns
The `ResultsAnalyzer` detects performance trends across iterations (defined in src/orchestration/state.py:38-45):
- **Improving**: Consistent improvement over the last 2+ iterations (each >0.5% better). Action: continue the current strategy (exploitation).
- **Degrading**: Consistent degradation over the last 2+ iterations. Action: explore different model families or preprocessing.
- **Plateau**: Less than 0.5% change for 2+ consecutive iterations. Action: consider stopping or trying radical changes.
- **Oscillating**: Alternating improvement and degradation. Action: reduce exploration and focus on stable approaches.
- **Unknown**: Fewer than 3 experiments; insufficient data for pattern detection. Action: continue exploring.
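The direction-aware improvement and trend rules described above can be sketched as follows (a simplified illustration assuming plain lists of metric values; the real logic lives in `ResultsAnalyzer`, and the label strings here are illustrative):

```python
LOWER_IS_BETTER = {"rmse", "mae", "mse", "log_loss"}

def improvement(prev: float, curr: float, metric: str) -> float:
    """Signed % improvement from prev to curr, direction-aware."""
    if metric.lower() in LOWER_IS_BETTER:
        return (prev - curr) / abs(prev) * 100  # a decrease is an improvement
    return (curr - prev) / abs(prev) * 100      # an increase is an improvement

def detect_trend(history: list, metric: str, threshold: float = 0.5) -> str:
    """Classify the recent trend; needs at least 3 data points."""
    if len(history) < 3:
        return "unknown"  # insufficient data for pattern detection
    # Percentage change between the last three consecutive iterations
    deltas = [improvement(a, b, metric) for a, b in zip(history[-3:], history[-2:])]
    if all(d > threshold for d in deltas):
        return "improving"
    if all(d < -threshold for d in deltas):
        return "degrading"
    if all(abs(d) <= threshold for d in deltas):
        return "plateau"
    return "oscillating"
```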
## Analysis Result

Each iteration produces an `AnalysisResult` (defined in src/orchestration/state.py:152-162).

### Key Observations

Gemini generates 3-5 specific, actionable observations per iteration.

## Hypotheses

After analysis, Gemini generates ranked hypotheses for the next iteration (defined in src/orchestration/state.py:165-191).

### Example Hypothesis Output
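The hypothesis schema is defined in src/orchestration/state.py:165-191; the record below is a hypothetical illustration of what a ranked hypothesis might look like, with made-up field names and values:

```python
# Hypothetical example of a ranked hypothesis (field names and values are illustrative)
hypothesis = {
    "rank": 1,
    "experiment_name": "xgboost_tuned",
    "rationale": "Tree ensembles improved RMSE substantially; gradient boosting "
                 "with tuned depth and learning rate may capture remaining "
                 "non-linearity.",
    "expected_improvement": "5-15% RMSE reduction over RandomForestRegressor",
    "risk": "low",
}
print(hypothesis["experiment_name"])
```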
## Final Report

At completion, Gemini generates a narrative Markdown report in `outputs/reports/`.

### Report Sections

#### Executive Summary

High-level overview of the experimental journey:
- Dataset characteristics (rows, features, task type)
- Total iterations completed
- Best model and performance
- Key insights discovered
#### Methodology
- Data profiling results
- Baseline model selection
- Experiment design approach
- Termination criteria
#### Results Table
Sortable table with all experiments:
| Iteration | Experiment | Model | RMSE | R² | Success |
|---|---|---|---|---|---|
| 0 | baseline | LinearRegression | 0.7456 | 0.6012 | ✓ |
| 1 | log_transform | RandomForestRegressor | 0.4201 | 0.7834 | ✓ |
| 2 | xgboost_tuned | XGBRegressor | 0.1332 | 0.8456 | ✓ |
#### Best Model Details
- Model type and hyperparameters
- Preprocessing configuration
- Final metrics with comparisons
- Feature importance (if available)
#### Key Insights
Gemini’s narrative insights:
- Why certain approaches worked
- Failed experiments and learnings
- Recommendations for production
- Suggestions for further improvement
#### Visualizations
Embedded charts:
- Metric progression across iterations (line chart)
- Model comparison (bar chart)
- Improvement over baseline (bar chart)
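As a rough illustration, the metric-progression line chart could be produced with matplotlib like this (a sketch using the example RMSE values from the results table above, not the actual chart code):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

iterations = [0, 1, 2]
rmse = [0.7456, 0.4201, 0.1332]  # example values from the results table

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(iterations, rmse, marker="o")       # one line: metric per iteration
ax.set_xlabel("Iteration")
ax.set_ylabel("RMSE")
ax.set_title("Metric progression across iterations")
fig.savefig("metric_progression.png", dpi=150)
```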
Charts are generated by `VisualizationGenerator` in src/execution/visualization_generator.py.

## Session State
The complete experiment state is saved to `outputs/state_{session_id}.json`. Use it to:

- Resume interrupted experiments (`--resume`)
- Audit experimental history
- Reproduce results
- Analyze Gemini's decision patterns
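Because the state file is plain JSON, it can be inspected with standard tooling. The snippet below is a sketch; the file name and keys are stand-ins, and the real schema is whatever the autopilot's state classes serialize:

```python
import json
from pathlib import Path

# For illustration, write a toy state file; real files live in outputs/
state_path = Path("state_demo.json")
state_path.write_text(json.dumps({"session_id": "demo", "iterations": 3}))

# Load and explore the top-level keys before resuming or auditing a session
state = json.loads(state_path.read_text())
print(sorted(state.keys()))
```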
## Interpreting Common Patterns
### Large Baseline-to-Best Gap

**Observation:** Best model is 80%+ better than baseline.

**Interpretation:**
- Dataset benefits significantly from sophisticated models
- Baseline (linear model) was too simple
- Gemini successfully identified effective strategies
### Early Plateau

**Observation:** No improvement after 3-5 iterations.

**Interpretation:**
- Dataset may be simple (baseline already near-optimal)
- Limited feature information available
- Need different preprocessing (feature engineering)
### Failed Experiments

**Observation:** Multiple experiments fail with errors.

**Interpretation:**

- Check `error_message` in results
- Common causes: memory issues, hyperparameter conflicts, data type mismatches
- Generated code is saved in `outputs/experiments/` for debugging
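A triage pass over the results can surface failures quickly. This sketch assumes result records expose `success` and `error_message` fields (the record structure here is hypothetical):

```python
# Hypothetical result records; real ones come from the saved experiment state
results = [
    {"experiment_name": "baseline", "success": True, "error_message": None},
    {"experiment_name": "deep_mlp", "success": False,
     "error_message": "memory allocation failed"},
]

# Print one line per failed experiment for quick debugging
for r in results:
    if not r["success"]:
        print(f"{r['experiment_name']}: {r['error_message']}")
```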
### Fluctuating Performance

**Observation:** Metrics vary wildly between iterations.

**Interpretation:**
- High variance models (e.g., deep trees without regularization)
- Train/test split instability
- Data leakage or inconsistent preprocessing
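One way to check for split instability is to compare cross-validation score spread: a large standard deviation relative to the mean suggests a high-variance model or unstable splits. A sketch using scikit-learn on synthetic data (not part of the autopilot):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data just for illustration
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)

# Score the same model on 5 different splits; the spread is the diagnostic
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"R2 mean={scores.mean():.3f}  std={scores.std():.3f}")
```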
## Next Steps

- **MLflow Tracking**: explore detailed metrics in the MLflow UI.
- **Troubleshooting**: resolve common issues and errors.