
Overview

ML Experiment Autopilot autonomously designs, executes, and iterates on machine learning experiments using Gemini 3. This guide covers how to run experiments effectively, from basic usage to advanced configurations.

Basic Usage

1. Prepare Your Dataset

Ensure your dataset is in CSV or Parquet format with a clearly defined target column.
# Example dataset structure
data/
└── my_dataset.csv  # Must have headers and a target column
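As a quick sanity check before launching a run, you can verify the header row and target column yourself. A minimal sketch using only the standard library (the function name is illustrative, not part of the tool):

```python
import csv
import io

def has_target_column(csv_text: str, target: str) -> bool:
    """Check that the first row is a header containing the target column."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return target in header

# A tiny sample mirroring the expected structure
sample = "MedInc,HouseAge,MedHouseVal\n8.32,41,4.526\n"
print(has_target_column(sample, "MedHouseVal"))  # → True
```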
2. Set Up API Key

Configure your Gemini API key in the .env file:
cp .env.example .env
# Edit .env and add:
GEMINI_API_KEY=your_actual_key_here
Get a free API key from Google AI Studio. Tier 1 or higher is recommended for better rate limits.
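Under the hood, the key simply needs to be present in the process environment. A naive sketch of how a .env line could be loaded (real projects typically use python-dotenv; this parser is illustrative only):

```python
import os

def load_env_text(text: str) -> None:
    """Naive .env loader: put KEY=value lines into os.environ if unset."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

load_env_text("# comment\nGEMINI_API_KEY=your_actual_key_here\n")
```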
3. Run Your First Experiment

Use the run command with required arguments:
python -m src.main run \
  --data data/sample/california_housing.csv \
  --target MedHouseVal \
  --task regression \
  --max-iterations 5 \
  --verbose
Always run as a module using python -m src.main to avoid import errors.

Command-Line Arguments

Required Arguments

--data, -d (path, required)
  Path to your dataset (CSV or Parquet file).
  Example: --data data/sample/california_housing.csv

--target, -t (string, required)
  Target column name for prediction (case-sensitive).
  Example: --target MedHouseVal

--task (enum, required)
  Type of ML task: classification or regression.
  Example: --task regression

Optional Arguments

--constraints, -c (path)
  Path to a constraints file (Markdown format) to guide Gemini’s decisions.
  Example: --constraints data/sample/constraints.md

--max-iterations, -n (integer, default: 20)
  Maximum experiment iterations (1-100).
  Example: --max-iterations 10

--time-budget (integer, default: 3600)
  Time budget in seconds (60-86400).
  Example: --time-budget 7200  # 2 hours

--output-dir, -o (path)
  Custom output directory (auto-generated if not specified).
  Example: --output-dir ./my_results

--verbose, -v (boolean, default: false)
  Show detailed Gemini reasoning for each iteration.
  Example: --verbose

--resume (path)
  Resume from a saved state file (see Resuming Experiments).
  Example: --resume outputs/state_a1b2c3d4.json

Example: Classification Task

Running a classification experiment on the bank marketing dataset:
python -m src.main run \
  --data data/sample/bank.csv \
  --target deposit \
  --task classification \
  --max-iterations 3 \
  --verbose
What happens:
  1. Data profiling analyzes 11,162 samples
  2. Baseline model (LogisticRegression) establishes performance floor
  3. Gemini designs 3 experiments testing different hypotheses
  4. Each iteration improves on previous results
  5. Final report generated in outputs/reports/

Example: Regression Task

Running a regression experiment on California housing data:
python -m src.main run \
  --data data/sample/california_housing.csv \
  --target MedHouseVal \
  --task regression \
  --constraints data/sample/constraints.md \
  --max-iterations 5 \
  --verbose
What happens:
  1. Data profiling analyzes 20,640 samples with 8 features
  2. Constraints guide Gemini to prefer tree-based models and RMSE metric
  3. Baseline establishes RMSE ~0.75
  4. Iterations test hypotheses like log transforms, boosting methods
  5. Best model saved with metrics logged to MLflow

The Experiment Loop

Each iteration follows a structured process defined in src/orchestration/controller.py:245-328:
1. Experiment Design (ExperimentDesigner)

Gemini analyzes:
  • Data profile (schema, distributions, missing values)
  • All previous experiment results
  • User constraints and top hypotheses
Outputs a structured ExperimentSpec with:
  • Model type and hyperparameters
  • Preprocessing configuration
  • Hypothesis being tested
  • Reasoning for design choices
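The spec can be pictured as a small structured record. A hypothetical sketch (the field names are assumptions, not the project's actual class definition):

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentSpec:
    """Hypothetical shape of a designed experiment (field names are assumptions)."""
    model_type: str                                   # e.g. "xgboost"
    hyperparameters: dict = field(default_factory=dict)
    preprocessing: dict = field(default_factory=dict)
    hypothesis: str = ""                              # what this run is testing
    reasoning: str = ""                               # why this design was chosen

spec = ExperimentSpec(
    model_type="xgboost",
    hyperparameters={"max_depth": 6, "learning_rate": 0.1},
    preprocessing={"log_transform": ["MedInc"]},
    hypothesis="Gradient boosting captures residual nonlinearity",
    reasoning="Tree-based models have outperformed linear models so far",
)
```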
2. Code Generation (CodeGenerator)

Jinja2 templates generate validated Python scripts:
  • outputs/experiments/{session_id}/experiment_{iteration}.py
  • Code validated with ast.parse() before execution
  • Supports sklearn, XGBoost, LightGBM models
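Validating generated code with ast.parse is cheap and catches syntax errors before anything executes. A minimal sketch of the idea:

```python
import ast

def is_valid_python(source: str) -> bool:
    """Return True if the source parses as syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

print(is_valid_python("model.fit(X, y)"))  # → True (syntax only; names need not exist)
print(is_valid_python("def broken(:"))     # → False
```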
3. Experiment Execution (ExperimentRunner)

Script runs in isolated subprocess:
  • Timeout: 300 seconds (configurable)
  • Captures stdout/stderr
  • Parses JSON metrics output
  • Handles failures gracefully
4. Results Analysis (ResultsAnalyzer)

Gemini compares metrics:
  • Current vs. baseline (iteration 0)
  • Current vs. best across all iterations
  • Current vs. previous iteration
  • Detects trends: improving, degrading, plateau, fluctuating
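A simplified sketch of how such a trend label could be derived from a metric history (higher-is-better scores assumed; the tolerance is illustrative, not the project's actual threshold):

```python
def classify_trend(scores: list, tol: float = 0.005) -> str:
    """Label a higher-is-better metric history as improving, degrading, plateau, or fluctuating."""
    if len(scores) < 2:
        return "insufficient data"
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    if all(d > tol for d in deltas):
        return "improving"
    if all(d < -tol for d in deltas):
        return "degrading"
    if all(abs(d) <= tol for d in deltas):
        return "plateau"
    return "fluctuating"

print(classify_trend([0.70, 0.75, 0.80]))    # → improving
print(classify_trend([0.80, 0.801, 0.800]))  # → plateau
```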
5. Hypothesis Generation (HypothesisGenerator)

Gemini synthesizes insights into ranked hypotheses:
  • 1-3 hypotheses per iteration
  • Confidence scores (0-1)
  • Priority rankings (1=highest, 3=lowest)
  • Exploration vs. exploitation strategy
6. Termination Check

Loop continues unless:
  • Max iterations reached (--max-iterations)
  • Time budget exhausted (--time-budget)
  • Performance plateau (3 iterations without 0.5% improvement)
  • Target metric achieved (via constraints)
  • Agent recommends stopping
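The plateau condition can be sketched as a small check over the metric history. This is a simplified, higher-is-better illustration; the controller's actual logic may differ:

```python
def hit_plateau(history: list, patience: int = 3, min_gain: float = 0.005) -> bool:
    """True when the last `patience` scores fail to beat the earlier best by min_gain (0.5%)."""
    if len(history) <= patience:
        return False
    best_before = max(history[:-patience])
    recent_best = max(history[-patience:])
    return recent_best < best_before * (1 + min_gain)

print(hit_plateau([0.70, 0.71, 0.711, 0.712, 0.712]))  # → True: three stalled iterations
print(hit_plateau([0.50, 0.60, 0.70, 0.80]))           # → False: still improving
```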

Output Files

All outputs are saved to outputs/ directory (or custom --output-dir):
Output Type   Location                            Description
Experiments   outputs/experiments/{session_id}/   Generated Python scripts for each iteration
Reports       outputs/reports/                    Markdown report with full experimental journey
Plots         outputs/plots/                      Metric progression, model comparison charts
Models        outputs/models/                     Serialized best models (if saved)
MLflow        outputs/mlruns/                     MLflow tracking data (metrics, params, artifacts)
State         outputs/state_{session_id}.json     Session state for resuming

Verbose Mode Output

With --verbose, you see Gemini’s reasoning in real-time:
╔══════════════════════════════════════════════════════════════╗
║  ITERATION 3 - GEMINI'S REASONING                            ║
║  Thought Signature Active | Context: 12 turns                ║
╚══════════════════════════════════════════════════════════════╝

Based on the previous 2 experiments, I've observed that:
- Tree-based models consistently outperform linear models on this dataset
- Iteration 2's log-transform hypothesis improved RMSE by 80%
- Feature distributions suggest boosting may capture residual patterns

For this iteration, I'm testing XGBoost with tuned learning rate
and max_depth to see if gradient boosting further reduces error...
Verbose mode is excellent for understanding Gemini’s decision-making process and debugging unexpected results.

Best Practices

Choosing an iteration count:
  • Quick exploration: 3-5 iterations
  • Thorough search: 10-20 iterations
  • Production-ready: 20-50 iterations with constraints
More iterations allow Gemini to explore diverse strategies and find better solutions.
Choosing a time budget:
  • Small datasets (<10K rows): 1800-3600 seconds (30-60 min)
  • Medium datasets (10K-100K rows): 3600-7200 seconds (1-2 hours)
  • Large datasets (>100K rows): 7200+ seconds (2+ hours)
The time budget covers profiling, all iterations, and report generation.
Writing constraints:
Constraints guide Gemini without being overly restrictive. See Custom Constraints for examples.

Good constraints:
  • Specify primary metric (RMSE, F1, accuracy)
  • Suggest model families (tree-based, linear, ensemble)
  • Define preprocessing preferences
Avoid:
  • Overly specific hyperparameters (let Gemini explore)
  • Contradictory requirements
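A short constraints file following these guidelines might look like this (contents are illustrative; see Custom Constraints for real examples):

```markdown
# Constraints

- Primary metric: RMSE (lower is better)
- Prefer tree-based models (RandomForest, XGBoost, LightGBM)
- Try log-transforming skewed numeric features
- Avoid models that take more than 5 minutes to train
```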
Resource usage:
Each experiment runs in a subprocess with:
  • Timeout: 300 seconds per experiment (configurable in src/config.py:54)
  • Memory: Depends on dataset size and model
  • Disk: Generated code + MLflow artifacts (~10-100 MB per session)

Next Steps

Understanding Results

Learn how to interpret metrics, trends, and analyses

MLflow Tracking

View experiments in the MLflow UI

Custom Constraints

Guide Gemini with natural language preferences

Resuming Experiments

Resume interrupted sessions
