The experiment loop is the core of ML Experiment Autopilot’s autonomous operation. It’s a continuous cycle where Gemini 3 designs experiments, the system executes them, analyzes results, and generates new hypotheses — all without human intervention.
High-Level Flow
Input: Dataset + Target + Task Type + (optional) Constraints
│
▼
┌────────────────┐
│ DATA PROFILING │ Analyze schema, distributions, missing values
└───────┬────────┘
│
▼
┌────────────────┐
│ BASELINE MODEL │ Simple model to establish performance floor
└───────┬────────┘
│
▼
┌────────────────────────────────────┐
│ ITERATION LOOP │
│ │
│ 1. Experiment Design (Gemini) │ ← hypothesis, model, params
│ 2. Code Generation (Jinja2) │ ← validated Python script
│ 3. Execution (subprocess) │ ← train, evaluate, capture metrics
│ 4. Results Analysis (Gemini) │ ← trends, comparisons, insights
│ 5. Hypothesis Generation (Gemini) │ ← ranked next steps
│ 6. Termination Check │ ← continue or stop?
│ │
│ Repeat until termination... │
└───────────────┬────────────────────┘
│
▼
┌───────────────────┐
│ REPORT GENERATION │ Gemini writes narrative Markdown report
└─────────┬─────────┘
│
▼
Output: Best Model + Report + Visualizations + MLflow Experiment + Code
Phase 1: Data Profiling
Location: src/orchestration/controller.py:160
Before any experiments run, the system analyzes the dataset to understand its characteristics.
What Gets Profiled
The DataProfiler (src/execution/data_profiler.py) extracts:
- Schema: Column names, data types (numeric vs categorical)
- Missing Values: Count and percentage per column
- Numeric Statistics: Mean, std, min, max, quartiles for each numeric column
- Categorical Statistics: Unique values, most common values
- Target Distribution:
- Regression: Mean, std, skewness, min/max
- Classification: Class counts, class balance
Example Profile Output
DataProfile(
n_rows=20640,
n_columns=9,
numeric_columns=["MedInc", "HouseAge", "AveRooms", ...],
categorical_columns=[],
target_column="MedHouseVal",
target_type="continuous",
missing_values={"total_bedrooms": 207},
target_stats={
"mean": 2.07,
"std": 1.15,
"skew": 0.98, # Right-skewed → consider log transform
}
)
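For concreteness, here is a minimal sketch of the kind of computation such a profiler performs, assuming the dataset is loaded as a pandas DataFrame. The `profile_dataframe` helper and its return shape are illustrative, not the actual DataProfiler API:

```python
import pandas as pd

def profile_dataframe(df: pd.DataFrame, target: str) -> dict:
    """Toy profiler: schema, missing values, and target statistics."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    categorical = [c for c in df.columns if c not in numeric]
    # Only report columns that actually have missing values
    missing = {c: int(n) for c, n in df.isna().sum().items() if n > 0}
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "numeric_columns": numeric,
        "categorical_columns": categorical,
        "missing_values": missing,
        "target_stats": {
            "mean": float(df[target].mean()),
            "std": float(df[target].std()),
            "skew": float(df[target].skew()),  # flags skewed regression targets
        },
    }
```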
Why This Matters
The data profile is sent to Gemini with every experiment design request. It informs:
- Missing value handling strategy (median vs mode vs drop)
- Scaling choices (standard vs minmax vs none)
- Target transformations (log for skewed regression targets)
- Model selection (tree models handle missing values better than linear models)
The data profile is also logged to MLflow and displayed in the console when you run with --verbose.
Phase 2: Baseline Model
Location: src/orchestration/controller.py:194
The baseline establishes a performance floor — the minimum acceptable result.
Baseline Models
| Task Type | Baseline Model | Reasoning |
|---|---|---|
| Regression | LinearRegression | Simplest possible model |
| Classification | LogisticRegression | Simplest possible model |
No hyperparameter tuning — just default sklearn settings.
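A baseline step along these lines can be sketched with plain sklearn defaults; the `fit_baseline` helper below is hypothetical, not the project's actual code:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error

def fit_baseline(X_train, y_train, X_test, y_test, task_type: str) -> dict:
    """Train an untuned baseline model with sklearn defaults (illustrative)."""
    model = LinearRegression() if task_type == "regression" else LogisticRegression()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    if task_type == "regression":
        return {"rmse": float(np.sqrt(mean_squared_error(y_test, preds)))}
    return {"accuracy": float((preds == y_test).mean())}
```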
Baseline Result Example
Iteration 0: baseline
Model: LinearRegression
RMSE: 0.7343
Execution Time: 0.8s
This becomes the baseline_value that all future experiments are compared against.
If the baseline fails (e.g., due to data issues), the loop stops immediately. A successful baseline is required to proceed.
Phase 3: Iteration Loop (Steps 1-6)
Location: src/orchestration/controller.py:245
This is where the magic happens. The loop repeats until a termination criterion is met.
Step 1: Experiment Design (Gemini)
Location: src/orchestration/controller.py:330
The ExperimentDesigner uses Gemini 3 to design the next experiment.
Input to Gemini:
- Data profile (schema, stats, missing values)
- Last 5 experiment results (name, model, metrics, hypothesis, success/failure)
- User constraints (if provided)
- Top hypothesis from previous iteration (added to constraints)
Gemini’s Task:
- Formulate a clear hypothesis for what to test
- Select a model (avoid repeating unless with different params)
- Choose hyperparameters
- Specify preprocessing (missing values, scaling, encoding, target transform)
- Explain the reasoning
Output:
{
"experiment_name": "xgboost_tuned_learning_rate",
"hypothesis": "Lower learning rate with more estimators may reduce overfitting",
"model_type": "XGBRegressor",
"model_params": {
"n_estimators": 200,
"learning_rate": 0.05,
"max_depth": 6
},
"preprocessing": {
"missing_values": "median",
"scaling": "standard",
"target_transform": "log"
},
"reasoning": "Previous iteration showed XGBoost overfit with learning_rate=0.1. Reducing to 0.05 and increasing n_estimators should improve generalization."
}
This JSON is parsed into an ExperimentSpec object (src/orchestration/state.py:58).
Hypothesis Context Injection: The top hypothesis from the previous iteration is automatically added to the constraints (src/orchestration/controller.py:336). This creates a feedback loop where each iteration builds on the last.
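Parsing that JSON into a spec object might look like the following sketch. This `ExperimentSpec` is a pared-down stand-in for illustration only, not the real class in src/orchestration/state.py:

```python
import json
from dataclasses import dataclass, field

@dataclass
class ExperimentSpec:  # illustrative stand-in, not the real class
    experiment_name: str
    hypothesis: str
    model_type: str
    model_params: dict = field(default_factory=dict)
    preprocessing: dict = field(default_factory=dict)
    reasoning: str = ""

def parse_design(raw: str) -> ExperimentSpec:
    """Parse Gemini's design JSON, tolerating missing optional fields."""
    data = json.loads(raw)
    return ExperimentSpec(
        experiment_name=data["experiment_name"],
        hypothesis=data["hypothesis"],
        model_type=data["model_type"],
        model_params=data.get("model_params", {}),
        preprocessing=data.get("preprocessing", {}),
        reasoning=data.get("reasoning", ""),
    )
```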
Step 2: Code Generation (Jinja2)
Location: src/orchestration/controller.py:271
The CodeGenerator (src/execution/code_generator.py) converts the ExperimentSpec into executable Python code using Jinja2 templates.
Template Selection:
- sklearn_regressor.py.jinja for LinearRegression, Ridge, RandomForestRegressor, etc.
- sklearn_classifier.py.jinja for LogisticRegression, RandomForestClassifier, etc.
- xgboost_model.py.jinja for XGBRegressor/XGBClassifier
- lightgbm_model.py.jinja for LGBMRegressor/LGBMClassifier
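A toy fragment shows the Jinja2 mechanism at work. The template text here is invented for illustration and far smaller than the real templates, which also render preprocessing, evaluation, and JSON output:

```python
from jinja2 import Template

# Hypothetical slice of a model template: render the model-instantiation block
MODEL_TEMPLATE = Template(
    "model = {{ model_type }}(\n"
    "{% for key, value in model_params.items() %}"
    "    {{ key }}={{ value }},\n"
    "{% endfor %}"
    ")\n"
)

code = MODEL_TEMPLATE.render(
    model_type="XGBRegressor",
    model_params={"n_estimators": 200, "learning_rate": 0.05, "max_depth": 6},
)
```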
Generated Script Structure:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from xgboost import XGBRegressor
import json
# Load data
df = pd.read_csv("data/sample/california_housing.csv")
# Handle missing values (median imputation)
# ... preprocessing code ...
# Apply target transform (log)
y_train_transformed = np.log1p(y_train)
y_test_transformed = np.log1p(y_test)
# Train model
model = XGBRegressor(
n_estimators=200,
learning_rate=0.05,
max_depth=6,
random_state=42,
n_jobs=-1
)
model.fit(X_train_scaled, y_train_transformed)
# Evaluate
y_pred_transformed = model.predict(X_test_scaled)
y_pred = np.expm1(y_pred_transformed) # Inverse log transform
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Output JSON
result = {
"metrics": {"rmse": rmse, "mae": mae, "r2": r2},
"model_path": "outputs/models/model_iter_3.pkl",
"success": True
}
print(json.dumps(result))
Validation: The script is parsed with ast.parse() to ensure syntactic correctness before execution.
Saved To: outputs/experiments/<session_id>/experiment_iter_3.py
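The syntax-validation step can be as small as this sketch (`is_valid_python` is an illustrative name):

```python
import ast

def is_valid_python(source: str) -> bool:
    """Syntax-check generated code before handing it to the runner."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False
```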
Step 3: Execution (Subprocess)
Location: src/orchestration/controller.py:294
The ExperimentRunner (src/execution/experiment_runner.py) executes the generated script as a subprocess.
Execution:
process = subprocess.run(
["python", script_path],
capture_output=True,
text=True,
timeout=600, # 10 minutes
)
Output Parsing: The script’s final line is JSON:
{"metrics": {"rmse": 0.1332, ...}, "success": true}
Error Handling: If the script fails (non-zero exit code, timeout, or invalid JSON), the runner creates an ExperimentResult with success=False and captures the error message.
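The run-and-parse behavior described above can be sketched as follows; `run_experiment` is a simplified stand-in for the real ExperimentRunner, returning a plain dict instead of an ExperimentResult:

```python
import json
import subprocess
import sys

def run_experiment(script_path: str, timeout: int = 600) -> dict:
    """Run a generated script; parse the JSON on its last stdout line."""
    try:
        proc = subprocess.run(
            [sys.executable, script_path],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"success": False, "error": "timeout"}
    if proc.returncode != 0:
        return {"success": False, "error": proc.stderr.strip()[-500:]}
    try:
        return json.loads(proc.stdout.strip().splitlines()[-1])
    except (json.JSONDecodeError, IndexError):
        return {"success": False, "error": "invalid JSON output"}
```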
Result: An ExperimentResult object (src/orchestration/state.py:70):
ExperimentResult(
experiment_name="xgboost_tuned_learning_rate",
iteration=3,
model_type="XGBRegressor",
model_params={"n_estimators": 200, "learning_rate": 0.05, "max_depth": 6},
metrics={"rmse": 0.1332, "mae": 0.0987, "r2": 0.8654},
execution_time=12.4,
success=True,
)
Graceful Failure: If an experiment fails, the loop continues to the next iteration. Failed experiments are logged with success=False and error messages.
Step 4: Results Analysis (Gemini)
Location: src/orchestration/controller.py:315
The ResultsAnalyzer (src/cognitive/results_analyzer.py:84) analyzes the experiment result.
Local Computation (No Gemini):
- Metric comparison: current vs baseline, best, and previous
- Percentage change: (baseline - current) / baseline * 100
- Trend detection: improving, degrading, plateau, or fluctuating (over the last 3 experiments)
- Best model tracking: updated whenever the current result beats the best so far (direction depends on whether the metric is minimized or maximized)
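These local computations might look like the sketch below; the helper names are illustrative and the plateau tolerance is an assumed parameter:

```python
def pct_improvement(baseline: float, current: float) -> float:
    """Percentage improvement for a lower-is-better metric such as RMSE."""
    return (baseline - current) / baseline * 100

def detect_trend(last_metrics: list, tol: float = 0.01) -> str:
    """Crude trend label over the last three results (lower is better)."""
    if len(last_metrics) < 3:
        return "insufficient_data"
    a, b, c = last_metrics[-3:]
    if c < b < a:
        return "improving"
    if c > b > a:
        return "degrading"
    if abs(c - b) / b < tol and abs(b - a) / a < tol:
        return "plateau"
    return "fluctuating"
```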
Gemini Analysis (with conversation history):
- Input: Current result + metric comparison + last 5 experiments
- Task: Generate 3-5 key observations explaining why the result occurred
- Output:
{
"key_observations": [
"XGBoost with lower learning rate (0.05) outperformed iteration 2's 0.1 by 10.3%",
"Log transformation of target variable remains critical — iteration 1 without it had RMSE 2x worse",
"Tree-based models (iterations 2-3) consistently outperform linear models (iteration 0)",
"Diminishing returns observed — improvement from iteration 2 to 3 is only 2.1%"
],
"reasoning": "The lower learning rate prevented overfitting while increased n_estimators maintained model capacity. However, we may be approaching the limit of this dataset's predictability."
}
Result: An AnalysisResult object (src/orchestration/state.py:152) with:
- primary_metric: MetricComparison with all percentage changes
- trend_pattern: TrendPattern enum (IMPROVING, PLATEAU, etc.)
- key_observations: List of 3-5 strings
- reasoning: Gemini’s detailed explanation
Step 5: Hypothesis Generation (Gemini)
Location: src/orchestration/controller.py:322
The HypothesisGenerator (src/cognitive/hypothesis_generator.py:79) generates 2-3 ranked hypotheses for the next iteration.
Input to Gemini:
- Analysis result (observations, trend, metric comparison)
- Last 5 experiments from history
- Current iteration number / max iterations
- Iterations without improvement count
- User constraints
Strategy Guidance: The prompt includes adaptive hints:
- If trend == "plateau" or iteration > 70% of max: “Consider more exploratory hypotheses”
- If trend == "improving" and early in the session: “Balance refining current approach with trying alternatives”
- Default: “Balance exploration and exploitation”
Output:
{
"analysis_summary": "XGBoost shows strong performance but diminishing returns suggest we're near optimal for this approach",
"exploration_vs_exploitation": "balanced",
"hypotheses": [
{
"hypothesis_id": "h1",
"statement": "Fine-tune XGBoost regularization (alpha/lambda) to squeeze out remaining 1-2% improvement",
"rationale": "Current model shows slight overfitting based on train/test gap",
"suggested_model": "XGBRegressor",
"suggested_params": {"reg_alpha": 0.5, "reg_lambda": 1.0},
"confidence_score": 0.72,
"priority": 1
},
{
"hypothesis_id": "h2",
"statement": "Try LightGBM as alternative gradient booster with better handling of categorical features",
"rationale": "LightGBM often matches XGBoost performance with faster training",
"suggested_model": "LGBMRegressor",
"confidence_score": 0.65,
"priority": 2
}
],
"reasoning": "Focus on exploitation (h1) since we have a strong baseline, but keep one exploratory option (h2) in case LightGBM surprises us."
}
Result: A HypothesisSet object (src/orchestration/state.py:177) with ranked hypotheses.
The top hypothesis (priority=1) is automatically fed back into the next iteration’s design constraints (src/orchestration/controller.py:337). This creates a hypothesis → experiment → analysis → new hypothesis feedback loop.
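The injection step could be as small as this sketch (`inject_top_hypothesis` is a hypothetical helper; the real logic lives in the controller):

```python
def inject_top_hypothesis(constraints: list, hypotheses: list) -> list:
    """Append the priority-1 hypothesis to the next design's constraints."""
    top = min(hypotheses, key=lambda h: h["priority"])
    return constraints + [f"Consider this hypothesis: {top['statement']}"]
```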
Step 6: Termination Check
Location: src/orchestration/controller.py:143
After each iteration, the system checks if it should stop (src/orchestration/state.py:251):
should_stop, reason = self.state.should_terminate()
if should_stop:
self.state.termination_reason = reason
self.state.phase = ExperimentPhase.COMPLETED
break
Termination Criteria (evaluated in order):
- Max Iterations: current_iteration >= max_iterations (default: 20)
- Time Budget: elapsed_time > time_budget (default: 3600s)
- Plateau: iterations_without_improvement >= plateau_threshold (default: 3)
- Target Achieved: best_metric >= target_metric_value (from constraints)
- Agent Recommendation: agent_recommends_stop == True (not currently used)
If none are met, the loop continues to the next iteration (back to Step 1).
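A trimmed-down version of such a check, evaluating criteria in the documented order (the target-metric and agent-recommendation criteria are omitted for brevity; the function signature is illustrative, not the real should_terminate):

```python
def should_terminate(
    current_iteration: int,
    max_iterations: int,
    elapsed: float,
    time_budget: float,
    iterations_without_improvement: int,
    plateau_threshold: int = 3,
) -> tuple:
    """Return (stop?, reason), checking criteria in priority order."""
    if current_iteration >= max_iterations:
        return True, "Maximum iterations reached"
    if elapsed > time_budget:
        return True, "Time budget exceeded"
    if iterations_without_improvement >= plateau_threshold:
        return True, "Performance plateau detected"
    return False, ""
```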
Phase 4: Report Generation
Location: src/orchestration/controller.py:406
After the loop terminates, the ReportGenerator (src/cognitive/report_generator.py:86) creates a final Markdown report.
Report Structure
- Executive Summary (Gemini-generated): 1 paragraph summarizing the session
- Dataset Overview (local): Table with data profile statistics
- Methodology (Gemini-generated): 2-3 paragraphs explaining the approach
- Experiment Results (local): Table with all experiments and metrics
- Best Model (local): Detailed breakdown of the best model’s hyperparameters and metrics
- Key Insights (Gemini-generated): 3-5 bullet points with observations
- Visualizations (local): Embedded PNG charts (metric progression, model comparison, improvement)
- Recommendations (Gemini-generated): 3-5 bullet points for future work
- Appendix (local): Detailed per-experiment logs
Example Report Snippet
# ML Experiment Report: california_housing
## Executive Summary
This report summarizes an automated ML experiment session on the California Housing dataset for regression. A total of 20 experiments were conducted (18 successful), optimizing for RMSE. The best performing model was xgboost_tuned_regularization with RMSE = 0.1289. This represents an 82.4% improvement over the baseline LinearRegression model.
## Best Model
**Model**: XGBRegressor
**Experiment**: xgboost_tuned_regularization
**Iteration**: 17
**Primary Metric (rmse)**: 0.128943
**All Metrics**:
| Metric | Value |
|--------|-------|
| mae | 0.0876 |
| r2 | 0.8854 |
| rmse | 0.1289 |
**Hyperparameters**:
| Parameter | Value |
|-----------|-------|
| learning_rate | 0.05 |
| max_depth | 6 |
| n_estimators | 200 |
| reg_alpha | 0.5 |
Saved To: outputs/reports/report_california_housing_20260302_143022.md
State Management and Resumability
Location: src/orchestration/controller.py:454
After each major phase (profiling, baseline, each iteration), the complete ExperimentState is saved to JSON:
def save_state(self):
state_path = self.output_dir / f"state_{self.state.session_id}.json"
self.state.save(state_path)
State Contents
The saved state includes (src/orchestration/state.py:194):
- Session ID, config, phase
- Data profile
- All experiment results (every iteration)
- Best metric and experiment name
- Iterations without improvement count
- Start time, elapsed time
- Gemini conversation history (optional)
- Termination reason
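A JSON save/load round-trip along these lines, using a deliberately tiny stand-in state (the real ExperimentState carries far more fields, as listed above):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class MiniState:  # illustrative stand-in for ExperimentState
    session_id: str
    current_iteration: int
    best_metric: float

def save_state(state: MiniState, path: str) -> None:
    """Serialize the state to JSON so a crashed run can be resumed."""
    with open(path, "w") as f:
        json.dump(asdict(state), f)

def load_state(path: str) -> MiniState:
    """Rebuild the state object from a saved JSON file."""
    with open(path) as f:
        return MiniState(**json.load(f))
```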
Resuming a Session
If the process crashes or is interrupted (Ctrl+C), you can resume:
python -m src.main run \
--resume outputs/state_a3f9c2d1.json \
--verbose
The controller loads the state (src/orchestration/controller.py:94) and continues from the last saved phase.
By default, Gemini conversation history is not saved to the state file (to keep it small). Resumed sessions therefore start with a fresh conversation, but all experiment results are preserved.
Real-World Example: 20-Iteration Session
Here’s what a typical session looks like:
[Phase 1: Data Profiling]
✓ Profiled 20,640 rows × 9 columns
✓ Target: MedHouseVal (continuous, right-skewed)
✓ Primary metric: rmse
[Phase 2: Baseline]
Iteration 0: baseline
Model: LinearRegression
RMSE: 0.7343 ← baseline
[Phase 3: Iteration Loop]
Iteration 1: random_forest_initial
Model: RandomForestRegressor
RMSE: 0.2456 (66.5% better than baseline) ← new best
Iteration 2: log_transform_target
Model: RandomForestRegressor
RMSE: 0.1421 (42.2% better than iteration 1) ← new best
Iteration 3: xgboost_initial
Model: XGBRegressor
RMSE: 0.1332 (6.3% better than iteration 2) ← new best
... iterations 4-16 ...
Iteration 17: xgboost_tuned_regularization
Model: XGBRegressor
RMSE: 0.1289 (3.2% better than previous best) ← new best
Iteration 18: lightgbm_alternative
Model: LGBMRegressor
RMSE: 0.1301 (no improvement)
Iteration 19: xgboost_final_tuning
Model: XGBRegressor
RMSE: 0.1291 (no improvement)
Iteration 20: ensemble_stacking
Model: StackingRegressor
RMSE: 0.1290 (no improvement) ← diminishing returns
✓ Termination: Maximum iterations reached
[Phase 4: Finalization]
✓ Best model: xgboost_tuned_regularization (RMSE: 0.1289)
✓ Generated 3 visualizations
✓ Report saved: outputs/reports/report_california_housing_20260302_143022.md
✓ MLflow tracking: mlflow ui --backend-store-uri file:./outputs/mlruns
Key Takeaways
- Fully Autonomous: Once started, the loop runs without human intervention
- Hypothesis-Driven: Every experiment tests a specific hypothesis generated by Gemini
- Adaptive: Each iteration builds on all previous iterations via Thought Signatures
- Resilient: Failed experiments don’t stop the loop; state is saved for crash recovery
- Observable: Rich console output, MLflow tracking, JSON state files, final report
- Intelligent Termination: Stops when plateau detected, not just max iterations
Run with --verbose to see Gemini’s full reasoning at each step. You’ll see how it references previous iterations and builds increasingly sophisticated hypotheses.