Overview
ML Experiment Autopilot autonomously designs, executes, and iterates on machine learning experiments using Gemini 3. This guide covers how to run experiments effectively, from basic usage to advanced configurations.
Basic Usage
Prepare Your Dataset
Ensure your dataset is in CSV or Parquet format with a clearly defined target column.
Set Up API Key
Configure your Gemini API key in the `.env` file. Get a free API key from Google AI Studio. Tier 1 or higher is recommended for better rate limits.
Command-Line Arguments
Required Arguments
Path to your dataset (CSV or Parquet file)
Target column name for prediction (case-sensitive)
Type of ML task:
classification or regression
Optional Arguments
Path to constraints file (Markdown format) to guide Gemini’s decisions
Maximum experiment iterations (1-100)
Time budget in seconds (60-86400)
Custom output directory (auto-generated if not specified)
Show detailed Gemini reasoning for each iteration
Resume from a saved state file (see Resuming Experiments)
Example: Classification Task
Running a classification experiment on the bank marketing dataset:
- Data profiling analyzes 11,162 samples
- Baseline model (LogisticRegression) establishes performance floor
- Gemini designs 3 experiments testing different hypotheses
- Each iteration improves on previous results
- Final report generated in `outputs/reports/`
Example: Regression Task
Running a regression experiment on California housing data:
- Data profiling analyzes 20,640 samples with 8 features
- Constraints guide Gemini to prefer tree-based models and RMSE metric
- Baseline establishes RMSE ~0.75
- Iterations test hypotheses like log transforms, boosting methods
- Best model saved with metrics logged to MLflow
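Each generated experiment script reports its results as machine-readable metrics that are parsed after the run. A minimal sketch of what a regression script might print at the end (the exact key names and schema are assumptions, not the project's actual contract):

```python
import json

# A generated script is expected to finish by printing one JSON object
# of metrics on stdout so the runner can parse it.
# Key names below (rmse, baseline_rmse, hypothesis) are illustrative.
metrics = {
    "iteration": 1,
    "hypothesis": "log-transforming the target reduces RMSE",
    "rmse": 0.71,
    "baseline_rmse": 0.75,
}
print(json.dumps(metrics))
```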
The Experiment Loop
Each iteration follows a structured process defined in `src/orchestration/controller.py:245-328`:
Experiment Design (ExperimentDesigner)
Gemini analyzes:
- Data profile (schema, distributions, missing values)
- All previous experiment results
- User constraints and top hypotheses
The output is an `ExperimentSpec` with:
- Model type and hyperparameters
- Preprocessing configuration
- Hypothesis being tested
- Reasoning for design choices
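The fields above can be sketched as a small dataclass. This is an illustrative shape only; the real `ExperimentSpec` in the codebase may differ:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an ExperimentSpec carrying the four elements
# listed above: model choice, preprocessing, hypothesis, and reasoning.
@dataclass
class ExperimentSpec:
    model_type: str                          # e.g. "LightGBMClassifier"
    hyperparameters: dict = field(default_factory=dict)
    preprocessing: dict = field(default_factory=dict)
    hypothesis: str = ""                     # what this experiment tests
    reasoning: str = ""                      # why this design was chosen

spec = ExperimentSpec(
    model_type="LightGBMClassifier",
    hyperparameters={"num_leaves": 31},
    hypothesis="tree ensembles capture non-linear feature interactions",
    reasoning="baseline LogisticRegression underfits on skewed features",
)
```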
Code Generation (CodeGenerator)
Jinja2 templates generate validated Python scripts:
- Scripts saved to `outputs/experiments/{session_id}/experiment_{iteration}.py`
- Code validated with `ast.parse()` before execution
- Supports sklearn, XGBoost, and LightGBM models
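Validating with `ast.parse()` catches syntax errors before a broken script ever reaches a subprocess. A minimal sketch of that check (the function name here is illustrative):

```python
import ast

def is_valid_python(source: str) -> bool:
    """Return True if the generated script is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# A well-formed generated snippet passes; a malformed one is rejected.
good = "import json\nprint(json.dumps({'accuracy': 0.9}))\n"
bad = "def train(:\n"
```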
Experiment Execution (ExperimentRunner)
Script runs in isolated subprocess:
- Timeout: 300 seconds (configurable)
- Captures stdout/stderr
- Parses JSON metrics output
- Handles failures gracefully
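The execution step above can be sketched with the standard library: run the script in a subprocess with a timeout, capture its output, and parse the JSON metrics line. Here the "script" is inlined with `-c` for brevity; the real runner executes a file from the experiments directory:

```python
import json
import subprocess
import sys

# Illustrative sketch of isolated execution with a 300 s timeout.
script = "import json; print(json.dumps({'f1': 0.82}))"
try:
    proc = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True, text=True, timeout=300,
    )
    # Take the last stdout line as the metrics payload.
    metrics = json.loads(proc.stdout.strip().splitlines()[-1])
except subprocess.TimeoutExpired:
    # Handle failures gracefully instead of crashing the loop.
    metrics = {"error": "timeout"}
```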
Results Analysis (ResultsAnalyzer)
Gemini compares metrics:
- Current vs. baseline (iteration 0)
- Current vs. best across all iterations
- Current vs. previous iteration
- Detects trends: improving, degrading, plateau, fluctuating
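The four trend labels can be illustrated with a simple classifier over the metric history. This is a sketch with an assumed tolerance, not the project's actual analysis logic:

```python
def detect_trend(scores, tol=0.01):
    """Classify a higher-is-better metric history as one of:
    "improving", "degrading", "plateau", or "fluctuating".
    The tolerance is illustrative, not the project's real threshold."""
    if len(scores) < 2:
        return "plateau"
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    if all(d > tol for d in deltas):
        return "improving"
    if all(d < -tol for d in deltas):
        return "degrading"
    if all(abs(d) <= tol for d in deltas):
        return "plateau"
    return "fluctuating"
```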
Hypothesis Generation (HypothesisGenerator)
Gemini synthesizes insights into ranked hypotheses:
- 1-3 hypotheses per iteration
- Confidence scores (0-1)
- Priority rankings (1=highest, 3=lowest)
- Exploration vs. exploitation strategy
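Ranking by priority (1 = highest) with confidence as a tiebreaker can be sketched as a sort. The hypothesis texts and field names below are made up for illustration:

```python
# Hypothetical hypotheses produced in one iteration (1-3 per iteration).
hypotheses = [
    {"text": "target encoding helps high-cardinality categoricals",
     "confidence": 0.6, "priority": 2},
    {"text": "dropping near-constant columns reduces overfitting",
     "confidence": 0.9, "priority": 3},
    {"text": "gradient boosting beats the linear baseline",
     "confidence": 0.8, "priority": 1},
]

# Sort by priority ascending (1 = highest), then by confidence descending.
ranked = sorted(hypotheses, key=lambda h: (h["priority"], -h["confidence"]))
```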
Output Files
All outputs are saved to the `outputs/` directory (or a custom `--output-dir`):
| Output Type | Location | Description |
|---|---|---|
| Experiments | outputs/experiments/{session_id}/ | Generated Python scripts for each iteration |
| Reports | outputs/reports/ | Markdown report with full experimental journey |
| Plots | outputs/plots/ | Metric progression, model comparison charts |
| Models | outputs/models/ | Serialized best models (if saved) |
| MLflow | outputs/mlruns/ | MLflow tracking data (metrics, params, artifacts) |
| State | outputs/state_{session_id}.json | Session state for resuming |
Verbose Mode Output
With `--verbose`, you see Gemini’s reasoning for each iteration in real time.
Best Practices
Choose Appropriate Iterations
- Quick exploration: 3-5 iterations
- Thorough search: 10-20 iterations
- Production-ready: 20-50 iterations with constraints
Set Realistic Time Budgets
- Small datasets (<10K rows): 1800-3600 seconds (30-60 min)
- Medium datasets (10K-100K rows): 3600-7200 seconds (1-2 hours)
- Large datasets (>100K rows): 7200+ seconds (2+ hours)
Use Constraints Wisely
Constraints guide Gemini without being overly restrictive. See Custom Constraints for examples.
Good constraints:
- Specify primary metric (RMSE, F1, accuracy)
- Suggest model families (tree-based, linear, ensemble)
- Define preprocessing preferences
Avoid:
- Overly specific hyperparameters (let Gemini explore)
- Contradictory requirements
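A constraints file following this guidance might look like the sketch below (contents are purely illustrative; see Custom Constraints for real examples):

```markdown
# Constraints

- Primary metric: RMSE (lower is better)
- Prefer tree-based models (XGBoost, LightGBM)
- Apply standard scaling to numeric features; avoid dropping rows
```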
Monitor Resource Usage
Each experiment runs in a subprocess with:
- Timeout: 300 seconds per experiment (configurable in `src/config.py:54`)
- Memory: depends on dataset size and model
- Disk: generated code + MLflow artifacts (~10-100 MB per session)
Next Steps
Understanding Results
Learn how to interpret metrics, trends, and analyses
MLflow Tracking
View experiments in the MLflow UI
Custom Constraints
Guide Gemini with natural language preferences
Resuming Experiments
Resume interrupted sessions