The experiment loop is the core of ML Experiment Autopilot’s autonomous operation. It’s a continuous cycle where Gemini 3 designs experiments, the system executes them, analyzes results, and generates new hypotheses — all without human intervention.
High-Level Flow
Input: Dataset + Target + Task Type + (optional) Constraints
│
▼
┌────────────────┐
│ DATA PROFILING │ Analyze schema, distributions, missing values
└───────┬────────┘
│
▼
┌────────────────┐
│ BASELINE MODEL │ Simple model to establish performance floor
└───────┬────────┘
│
▼
┌────────────────────────────────────┐
│ ITERATION LOOP │
│ │
│ 1. Experiment Design (Gemini) │ ← hypothesis, model, params
│ 2. Code Generation (Jinja2) │ ← validated Python script
│ 3. Execution (subprocess) │ ← train, evaluate, capture metrics
│ 4. Results Analysis (Gemini) │ ← trends, comparisons, insights
│ 5. Hypothesis Generation (Gemini) │ ← ranked next steps
│ 6. Termination Check │ ← continue or stop?
│ │
│ Repeat until termination... │
└───────────────┬────────────────────┘
│
▼
┌───────────────────┐
│ REPORT GENERATION │ Gemini writes narrative Markdown report
└─────────┬─────────┘
│
▼
Output: Best Model + Report + Visualizations + MLflow Experiment + Code
Phase 1: Data Profiling
Location: src/orchestration/controller.py:160
Before any experiments run, the system analyzes the dataset to understand its characteristics.
What Gets Profiled
The DataProfiler (src/execution/data_profiler.py) extracts:
- Schema: Column names, data types (numeric vs categorical)
- Missing Values: Count and percentage per column
- Numeric Statistics: Mean, std, min, max, quartiles for each numeric column
- Categorical Statistics: Unique values, most common values
- Target Distribution:
- Regression: Mean, std, skewness, min/max
- Classification: Class counts, class balance
Example Profile Output
DataProfile(
n_rows=20640,
n_columns=9,
numeric_columns=["MedInc", "HouseAge", "AveRooms", ...],
categorical_columns=[],
target_column="MedHouseVal",
target_type="continuous",
missing_values={"total_bedrooms": 207},
target_stats={
"mean": 2.07,
"std": 1.15,
"skew": 0.98, # Right-skewed → consider log transform
}
)
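For concreteness, here is a minimal sketch of the kind of computation such a profiler performs, assuming the dataset is loaded as a pandas DataFrame. The `profile_dataframe` helper and its return shape are illustrative, not the actual DataProfiler API:

```python
import pandas as pd

def profile_dataframe(df: pd.DataFrame, target: str) -> dict:
    """Toy profiler: schema, missing values, and target statistics."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    categorical = [c for c in df.columns if c not in numeric]
    # Only report columns that actually have missing values
    missing = {c: int(n) for c, n in df.isna().sum().items() if n > 0}
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "numeric_columns": numeric,
        "categorical_columns": categorical,
        "missing_values": missing,
        "target_stats": {
            "mean": float(df[target].mean()),
            "std": float(df[target].std()),
            "skew": float(df[target].skew()),  # flags skewed regression targets
        },
    }
```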
Why This Matters
The data profile is sent to Gemini with every experiment design request. It informs:
- Missing value handling strategy (median vs mode vs drop)
- Scaling choices (standard vs minmax vs none)
- Target transformations (log for skewed regression targets)
- Model selection (tree models handle missing values better than linear models)
The data profile is also logged to MLflow and displayed in the console when you run with --verbose.
Phase 2: Baseline Model
Location: src/orchestration/controller.py:194
The baseline establishes a performance floor — the minimum acceptable result.
Baseline Models
| Task Type | Baseline Model | Reasoning |
|---|---|---|
| Regression | LinearRegression | Simplest possible model |
| Classification | LogisticRegression | Simplest possible model |
No hyperparameter tuning — just default sklearn settings.
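A baseline step along these lines can be sketched with plain sklearn defaults; the `fit_baseline` helper below is hypothetical, not the project's actual code:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error

def fit_baseline(X_train, y_train, X_test, y_test, task_type: str) -> dict:
    """Train an untuned baseline model with sklearn defaults (illustrative)."""
    model = LinearRegression() if task_type == "regression" else LogisticRegression()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    if task_type == "regression":
        return {"rmse": float(np.sqrt(mean_squared_error(y_test, preds)))}
    return {"accuracy": float((preds == y_test).mean())}
```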
Baseline Result Example
Iteration 0: baseline
Model: LinearRegression
RMSE: 0.7343
Execution Time: 0.8s
This becomes the baseline_value that all future experiments are compared against.
If the baseline fails (e.g., due to data issues), the loop stops immediately. A successful baseline is required to proceed.
Phase 3: Iteration Loop (Steps 1-6)
Location: src/orchestration/controller.py:245
This is where the magic happens. The loop repeats until a termination criterion is met.
Step 1: Experiment Design (Gemini)
Location: src/orchestration/controller.py:330
The ExperimentDesigner uses Gemini 3 to design the next experiment.
Input to Gemini:
- Data profile (schema, stats, missing values)
- Last 5 experiment results (name, model, metrics, hypothesis, success/failure)
- User constraints (if provided)
- Top hypothesis from previous iteration (added to constraints)
Gemini’s Task:
- Formulate a clear hypothesis for what to test
- Select a model (avoid repeating unless with different params)
- Choose hyperparameters
- Specify preprocessing (missing values, scaling, encoding, target transform)
- Explain the reasoning
Output:
{
"experiment_name": "xgboost_tuned_learning_rate",
"hypothesis": "Lower learning rate with more estimators may reduce overfitting",
"model_type": "XGBRegressor",
"model_params": {
"n_estimators": 200,
"learning_rate": 0.05,
"max_depth": 6
},
"preprocessing": {
"missing_values": "median",
"scaling": "standard",
"target_transform": "log"
},
"reasoning": "Previous iteration showed XGBoost overfit with learning_rate=0.1. Reducing to 0.05 and increasing n_estimators should improve generalization."
}
This JSON is parsed into an ExperimentSpec object (src/orchestration/state.py:58).
Hypothesis Context Injection: The top hypothesis from the previous iteration is automatically added to the constraints (src/orchestration/controller.py:336). This creates a feedback loop where each iteration builds on the last.
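Parsing that JSON into a spec object might look like the following sketch. This `ExperimentSpec` is a pared-down stand-in for illustration only, not the real class in src/orchestration/state.py:

```python
import json
from dataclasses import dataclass, field

@dataclass
class ExperimentSpec:  # illustrative stand-in, not the real class
    experiment_name: str
    hypothesis: str
    model_type: str
    model_params: dict = field(default_factory=dict)
    preprocessing: dict = field(default_factory=dict)
    reasoning: str = ""

def parse_design(raw: str) -> ExperimentSpec:
    """Parse Gemini's design JSON, tolerating missing optional fields."""
    data = json.loads(raw)
    return ExperimentSpec(
        experiment_name=data["experiment_name"],
        hypothesis=data["hypothesis"],
        model_type=data["model_type"],
        model_params=data.get("model_params", {}),
        preprocessing=data.get("preprocessing", {}),
        reasoning=data.get("reasoning", ""),
    )
```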
Step 2: Code Generation (Jinja2)
Location: src/orchestration/controller.py:271
The CodeGenerator (src/execution/code_generator.py) converts the ExperimentSpec into executable Python code using Jinja2 templates.
Template Selection:
- sklearn_regressor.py.jinja for LinearRegression, Ridge, RandomForestRegressor, etc.
- sklearn_classifier.py.jinja for LogisticRegression, RandomForestClassifier, etc.
- xgboost_model.py.jinja for XGBRegressor/XGBClassifier
- lightgbm_model.py.jinja for LGBMRegressor/LGBMClassifier
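A toy fragment shows the Jinja2 mechanism at work. The template text here is invented for illustration and far smaller than the real templates, which also render preprocessing, evaluation, and JSON output:

```python
from jinja2 import Template

# Hypothetical slice of a model template: render the model-instantiation block
MODEL_TEMPLATE = Template(
    "model = {{ model_type }}(\n"
    "{% for key, value in model_params.items() %}"
    "    {{ key }}={{ value }},\n"
    "{% endfor %}"
    ")\n"
)

code = MODEL_TEMPLATE.render(
    model_type="XGBRegressor",
    model_params={"n_estimators": 200, "learning_rate": 0.05, "max_depth": 6},
)
```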
Generated Script Structure:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from xgboost import XGBRegressor
import json
# Load data
df = pd.read_csv("data/sample/california_housing.csv")
# Handle missing values (median imputation)
# ... preprocessing code ...
# Apply target transform (log)
y_train_transformed = np.log1p(y_train)
y_test_transformed = np.log1p(y_test)
# Train model
model = XGBRegressor(
n_estimators=200,
learning_rate=0.05,
max_depth=6,
random_state=42,
n_jobs=-1
)
model.fit(X_train_scaled, y_train_transformed)
# Evaluate
y_pred_transformed = model.predict(X_test_scaled)
y_pred = np.expm1(y_pred_transformed) # Inverse log transform
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Output JSON
result = {
"metrics": {"rmse": rmse, "mae": mae, "r2": r2},
"model_path": "outputs/models/model_iter_3.pkl",
"success": True
}
print(json.dumps(result))
Validation: The script is parsed with ast.parse() to ensure syntactic correctness before execution.
Saved To: outputs/experiments/<session_id>/experiment_iter_3.py
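The syntax-validation step can be as small as this sketch (`is_valid_python` is an illustrative name):

```python
import ast

def is_valid_python(source: str) -> bool:
    """Syntax-check generated code before handing it to the runner."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False
```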
Step 3: Execution (Subprocess)
Location: src/orchestration/controller.py:294
The ExperimentRunner (src/execution/experiment_runner.py) executes the generated script as a subprocess.
Execution:
process = subprocess.run(
["python", script_path],
capture_output=True,
text=True,
timeout=600, # 10 minutes
)
Output Parsing: The script’s final line is JSON:
{"metrics": {"rmse": 0.1332, ...}, "success": true}
Error Handling: If the script fails (non-zero exit code, timeout, or invalid JSON), the runner creates an ExperimentResult with success=False and captures the error message.
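The run-and-parse behavior described above can be sketched as follows; `run_experiment` is a simplified stand-in for the real ExperimentRunner, returning a plain dict instead of an ExperimentResult:

```python
import json
import subprocess
import sys

def run_experiment(script_path: str, timeout: int = 600) -> dict:
    """Run a generated script; parse the JSON on its last stdout line."""
    try:
        proc = subprocess.run(
            [sys.executable, script_path],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"success": False, "error": "timeout"}
    if proc.returncode != 0:
        return {"success": False, "error": proc.stderr.strip()[-500:]}
    try:
        return json.loads(proc.stdout.strip().splitlines()[-1])
    except (json.JSONDecodeError, IndexError):
        return {"success": False, "error": "invalid JSON output"}
```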
Result: An ExperimentResult object (src/orchestration/state.py:70):
ExperimentResult(
experiment_name="xgboost_tuned_learning_rate",
iteration=3,
model_type="XGBRegressor",
model_params={"n_estimators": 200, "learning_rate": 0.05, "max_depth": 6},
metrics={"rmse": 0.1332, "mae": 0.0987, "r2": 0.8654},
execution_time=12.4,
success=True,
)
Graceful Failure: If an experiment fails, the loop continues to the next iteration. Failed experiments are logged with success=False and error messages.
Step 4: Results Analysis (Gemini)
Location: src/orchestration/controller.py:315
The ResultsAnalyzer (src/cognitive/results_analyzer.py:84) analyzes the experiment result.
Local Computation (No Gemini):
- Metric comparison: current vs baseline, best, and previous
- Percentage change: (baseline - current) / baseline * 100
- Trend detection: improving, degrading, plateau, or fluctuating (over the last 3 experiments)
- Best model tracking: updated whenever the current result beats the best so far (direction depends on whether the metric is minimized or maximized)
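These local computations might look like the sketch below; the helper names are illustrative and the plateau tolerance is an assumed parameter:

```python
def pct_improvement(baseline: float, current: float) -> float:
    """Percentage improvement for a lower-is-better metric such as RMSE."""
    return (baseline - current) / baseline * 100

def detect_trend(last_metrics: list, tol: float = 0.01) -> str:
    """Crude trend label over the last three results (lower is better)."""
    if len(last_metrics) < 3:
        return "insufficient_data"
    a, b, c = last_metrics[-3:]
    if c < b < a:
        return "improving"
    if c > b > a:
        return "degrading"
    if abs(c - b) / b < tol and abs(b - a) / a < tol:
        return "plateau"
    return "fluctuating"
```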
Gemini Analysis (with conversation history):
- Input: Current result + metric comparison + last 5 experiments
- Task: Generate 3-5 key observations explaining why the result occurred
- Output:
{
"key_observations": [
"XGBoost with lower learning rate (0.05) outperformed iteration 2's 0.1 by 10.3%",
"Log transformation of target variable remains critical — iteration 1 without it had RMSE 2x worse",
"Tree-based models (iterations 2-3) consistently outperform linear models (iteration 0)",
"Diminishing returns observed — improvement from iteration 2 to 3 is only 2.1%"
],
"reasoning": "The lower learning rate prevented overfitting while increased n_estimators maintained model capacity. However, we may be approaching the limit of this dataset's predictability."
}
Result: An AnalysisResult object (src/orchestration/state.py:152) with:
- primary_metric: MetricComparison with all percentage changes
- trend_pattern: TrendPattern enum (IMPROVING, PLATEAU, etc.)
- key_observations: List of 3-5 strings
- reasoning: Gemini’s detailed explanation
Step 5: Hypothesis Generation (Gemini)
Location: src/orchestration/controller.py:322
The HypothesisGenerator (src/cognitive/hypothesis_generator.py:79) generates 2-3 ranked hypotheses for the next iteration.
Input to Gemini:
- Analysis result (observations, trend, metric comparison)
- Last 5 experiments from history
- Current iteration number / max iterations
- Iterations without improvement count
- User constraints
Strategy Guidance: The prompt includes adaptive hints:
- If trend == "plateau" or iteration > 70% of max: “Consider more exploratory hypotheses”
- If trend == "improving" and early in the session: “Balance refining current approach with trying alternatives”
- Default: “Balance exploration and exploitation”
Output:
{
"analysis_summary": "XGBoost shows strong performance but diminishing returns suggest we're near optimal for this approach",
"exploration_vs_exploitation": "balanced",
"hypotheses": [
{
"hypothesis_id": "h1",
"statement": "Fine-tune XGBoost regularization (alpha/lambda) to squeeze out remaining 1-2% improvement",
"rationale": "Current model shows slight overfitting based on train/test gap",
"suggested_model": "XGBRegressor",
"suggested_params": {"reg_alpha": 0.5, "reg_lambda": 1.0},
"confidence_score": 0.72,
"priority": 1
},
{
"hypothesis_id": "h2",
"statement": "Try LightGBM as alternative gradient booster with better handling of categorical features",
"rationale": "LightGBM often matches XGBoost performance with faster training",
"suggested_model": "LGBMRegressor",
"confidence_score": 0.65,
"priority": 2
}
],
"reasoning": "Focus on exploitation (h1) since we have a strong baseline, but keep one exploratory option (h2) in case LightGBM surprises us."
}
Result: A HypothesisSet object (src/orchestration/state.py:177) with ranked hypotheses.
The top hypothesis (priority=1) is automatically fed back into the next iteration’s design constraints (src/orchestration/controller.py:337). This creates a hypothesis → experiment → analysis → new hypothesis feedback loop.
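The injection step could be as small as this sketch (`inject_top_hypothesis` is a hypothetical helper; the real logic lives in the controller):

```python
def inject_top_hypothesis(constraints: list, hypotheses: list) -> list:
    """Append the priority-1 hypothesis to the next design's constraints."""
    top = min(hypotheses, key=lambda h: h["priority"])
    return constraints + [f"Consider this hypothesis: {top['statement']}"]
```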
Step 6: Termination Check
Location: src/orchestration/controller.py:143
After each iteration, the system checks if it should stop (src/orchestration/state.py:251):
should_stop, reason = self.state.should_terminate()
if should_stop:
self.state.termination_reason = reason
self.state.phase = ExperimentPhase.COMPLETED
break
Termination Criteria (evaluated in order):
- Max Iterations: current_iteration >= max_iterations (default: 20)
- Time Budget: elapsed_time > time_budget (default: 3600s)
- Plateau: iterations_without_improvement >= plateau_threshold (default: 3)
- Target Achieved: best_metric >= target_metric_value (from constraints)
- Agent Recommendation: agent_recommends_stop == True (not currently used)
If none are met, the loop continues to the next iteration (back to Step 1).
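A trimmed-down version of such a check, evaluating criteria in the documented order (the target-metric and agent-recommendation criteria are omitted for brevity; the function signature is illustrative, not the real should_terminate):

```python
def should_terminate(
    current_iteration: int,
    max_iterations: int,
    elapsed: float,
    time_budget: float,
    iterations_without_improvement: int,
    plateau_threshold: int = 3,
) -> tuple:
    """Return (stop?, reason), checking criteria in priority order."""
    if current_iteration >= max_iterations:
        return True, "Maximum iterations reached"
    if elapsed > time_budget:
        return True, "Time budget exceeded"
    if iterations_without_improvement >= plateau_threshold:
        return True, "Performance plateau detected"
    return False, ""
```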
Phase 4: Report Generation
Location: src/orchestration/controller.py:406
After the loop terminates, the ReportGenerator (src/cognitive/report_generator.py:86) creates a final Markdown report.
Report Structure
- Executive Summary (Gemini-generated): 1 paragraph summarizing the session
- Dataset Overview (local): Table with data profile statistics
- Methodology (Gemini-generated): 2-3 paragraphs explaining the approach
- Experiment Results (local): Table with all experiments and metrics
- Best Model (local): Detailed breakdown of the best model’s hyperparameters and metrics
- Key Insights (Gemini-generated): 3-5 bullet points with observations
- Visualizations (local): Embedded PNG charts (metric progression, model comparison, improvement)
- Recommendations (Gemini-generated): 3-5 bullet points for future work
- Appendix (local): Detailed per-experiment logs
Example Report Snippet
# ML Experiment Report: california_housing
## Executive Summary
This report summarizes an automated ML experiment session on the California Housing dataset for regression. A total of 20 experiments were conducted (18 successful), optimizing for RMSE. The best performing model was xgboost_tuned_regularization with RMSE = 0.1289. This represents an 82.4% improvement over the baseline LinearRegression model.
## Best Model
**Model**: XGBRegressor
**Experiment**: xgboost_tuned_regularization
**Iteration**: 17
**Primary Metric (rmse)**: 0.128943
**All Metrics**:
| Metric | Value |
|--------|-------|
| mae | 0.0876 |
| r2 | 0.8854 |
| rmse | 0.1289 |
**Hyperparameters**:
| Parameter | Value |
|-----------|-------|
| learning_rate | 0.05 |
| max_depth | 6 |
| n_estimators | 200 |
| reg_alpha | 0.5 |
Saved To: outputs/reports/report_california_housing_20260302_143022.md
State Management and Resumability
Location: src/orchestration/controller.py:454
After each major phase (profiling, baseline, each iteration), the complete ExperimentState is saved to JSON:
def save_state(self):
state_path = self.output_dir / f"state_{self.state.session_id}.json"
self.state.save(state_path)
State Contents
The saved state includes (src/orchestration/state.py:194):
- Session ID, config, phase
- Data profile
- All experiment results (every iteration)
- Best metric and experiment name
- Iterations without improvement count
- Start time, elapsed time
- Gemini conversation history (optional)
- Termination reason
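A JSON save/load round-trip along these lines, using a deliberately tiny stand-in state (the real ExperimentState carries far more fields, as listed above):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class MiniState:  # illustrative stand-in for ExperimentState
    session_id: str
    current_iteration: int
    best_metric: float

def save_state(state: MiniState, path: str) -> None:
    """Serialize the state to JSON so a crashed run can be resumed."""
    with open(path, "w") as f:
        json.dump(asdict(state), f)

def load_state(path: str) -> MiniState:
    """Rebuild the state object from a saved JSON file."""
    with open(path) as f:
        return MiniState(**json.load(f))
```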
Resuming a Session
If the process crashes or is interrupted (Ctrl+C), you can resume:
python -m src.main run \
--resume outputs/state_a3f9c2d1.json \
--verbose
The controller loads the state (src/orchestration/controller.py:94) and continues from the last saved phase.
By default, Gemini conversation history is not saved to the state file (to keep it small). Resumed sessions therefore start with a fresh conversation, but all experiment results are preserved.
Real-World Example: 20-Iteration Session
Here’s what a typical session looks like:
[Phase 1: Data Profiling]
✓ Profiled 20,640 rows × 9 columns
✓ Target: MedHouseVal (continuous, right-skewed)
✓ Primary metric: rmse
[Phase 2: Baseline]
Iteration 0: baseline
Model: LinearRegression
RMSE: 0.7343 ← baseline
[Phase 3: Iteration Loop]
Iteration 1: random_forest_initial
Model: RandomForestRegressor
RMSE: 0.2456 (66.5% better than baseline) ← new best
Iteration 2: log_transform_target
Model: RandomForestRegressor
RMSE: 0.1421 (42.2% better than iteration 1) ← new best
Iteration 3: xgboost_initial
Model: XGBRegressor
RMSE: 0.1332 (6.3% better than iteration 2) ← new best
... iterations 4-16 ...
Iteration 17: xgboost_tuned_regularization
Model: XGBRegressor
RMSE: 0.1289 (3.2% better than previous best) ← new best
Iteration 18: lightgbm_alternative
Model: LGBMRegressor
RMSE: 0.1301 (no improvement)
Iteration 19: xgboost_final_tuning
Model: XGBRegressor
RMSE: 0.1291 (no improvement)
Iteration 20: ensemble_stacking
Model: StackingRegressor
RMSE: 0.1290 (no improvement) ← diminishing returns
✓ Termination: Maximum iterations reached
[Phase 4: Finalization]
✓ Best model: xgboost_tuned_regularization (RMSE: 0.1289)
✓ Generated 3 visualizations
✓ Report saved: outputs/reports/report_california_housing_20260302_143022.md
✓ MLflow tracking: mlflow ui --backend-store-uri file:./outputs/mlruns
Key Takeaways
- Fully Autonomous: Once started, the loop runs without human intervention
- Hypothesis-Driven: Every experiment tests a specific hypothesis generated by Gemini
- Adaptive: Each iteration builds on all previous iterations via Thought Signatures
- Resilient: Failed experiments don’t stop the loop; state is saved for crash recovery
- Observable: Rich console output, MLflow tracking, JSON state files, final report
- Intelligent Termination: Stops when plateau detected, not just max iterations
Run with --verbose to see Gemini’s full reasoning at each step. You’ll see how it references previous iterations and builds increasingly sophisticated hypotheses.