
Prerequisites

Before you begin, ensure you have:
  • Python 3.9 or higher installed
  • A Gemini API key (available free from Google AI Studio)
  • A dataset in CSV or Parquet format
The free tier works, but Tier 1 or higher is recommended for production use to avoid rate limits during long-running experiments.
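If you stay on the free tier and hit rate limits mid-run, retrying with exponential backoff is the usual workaround. The sketch below is illustrative and not part of the repository; in real code you would catch the API client's specific rate-limit exception rather than a bare `Exception`:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn with exponential backoff plus jitter.

    Illustrative sketch: replace the bare Exception with the
    client library's rate-limit error in real use.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Delay doubles each attempt; jitter avoids retry stampedes.
            time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)
```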

Quick Setup

1. Clone the repository

git clone https://github.com/srikar161720/ml-experiment-autopilot.git
cd ml-experiment-autopilot
2. Create a virtual environment

python3 -m venv venv
source venv/bin/activate
3. Install dependencies

pip install -r requirements.txt
This will install:
  • google-generativeai - Gemini 3 API client
  • pandas, numpy - Data processing
  • scikit-learn, xgboost, lightgbm - ML models
  • mlflow - Experiment tracking
  • typer, rich - CLI interface
  • pydantic - Data validation
4. Configure the API key

Copy the example environment file:
cp .env.example .env
Edit .env and add your Gemini API key:
.env
GEMINI_API_KEY=your_actual_gemini_api_key_here
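The project presumably loads this file with a dotenv-style helper. If you are curious what that involves, here is a minimal stdlib sketch of the same idea (real projects typically use the python-dotenv package instead):

```python
import os

def load_env_file(path=".env"):
    """Minimal .env parser: KEY=VALUE lines, blank lines and
    '#' comments ignored. Existing environment variables win."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```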

Run Your First Experiment

The repository includes sample datasets to help you get started immediately.

Regression Example

Run an experiment on the California Housing dataset (20,640 samples):
python -m src.main run \
  --data data/sample/california_housing.csv \
  --target MedHouseVal \
  --task regression \
  --max-iterations 3 \
  --verbose
Use --verbose (or -v) to see Gemini’s detailed reasoning at each step. This helps you understand how the agent makes decisions.

Classification Example

Run an experiment on the Bank Marketing dataset (11,162 samples):
python -m src.main run \
  --data data/sample/bank.csv \
  --target deposit \
  --task classification \
  --max-iterations 3 \
  --verbose

Understanding the Output

Console Output

With --verbose enabled, you’ll see Gemini’s reasoning process in real-time:
╔══════════════════════════════════════════════════════════════╗
║  ITERATION 3 - GEMINI'S REASONING                            ║
║  Thought Signature Active | Context: 12 turns                ║
╚══════════════════════════════════════════════════════════════╝

Based on the previous 2 experiments, I've observed that:
- Tree-based models consistently outperform linear models on this dataset
- Iteration 2's log-transform hypothesis improved RMSE by 80%
- Feature distributions suggest boosting may capture residual patterns

For this iteration, I'm testing XGBoost with tuned learning rate
and max_depth to see if gradient boosting further reduces error...

┌─────────────────────────────────────────────────────────────┐
│ RESULTS ANALYSIS                                            │
├─────────────────────────────────────────────────────────────┤
│ Trend: IMPROVING                                            │
│ RMSE: 0.1332   ★ NEW BEST                                   │
│   82.1% better than baseline                                │
│                                                             │
│ Key Observations:                                           │
│   - Boosting provided 10.3% improvement over bagging        │
│   - Log transformation remains critical for this target     │
│   - Diminishing returns suggest stopping after next round   │
└─────────────────────────────────────────────────────────────┘

Generated Outputs

All experiment artifacts are saved to the outputs/ directory:
  • Markdown reports (outputs/reports/): Publication-ready narrative reports
  • Visualizations (outputs/plots/): Metric progression and model comparison charts
  • Generated code (outputs/experiments/): Python scripts for each experiment
  • MLflow data (outputs/mlruns/): Experiment tracking database
  • Saved models (outputs/models/): Serialized best models
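The serialization format of the saved models isn't stated here; joblib is the usual choice for scikit-learn estimators, so a round-trip would look roughly like this (the filename is hypothetical; check outputs/models/ for the actual names and format):

```python
import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Train a toy model, save it, and reload it -- the same round-trip
# applies to whichever estimator the autopilot serializes.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 2.0, 4.0])
model = LinearRegression().fit(X, y)

dump(model, "toy_model.joblib")       # filename is illustrative
reloaded = load("toy_model.joblib")
print(reloaded.predict([[3.0]]))      # close to 6.0 for this toy fit
```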

View Results in MLflow

MLflow provides a web dashboard to explore all experiments:
mlflow ui --backend-store-uri file:./outputs/mlruns
Then open http://127.0.0.1:5000 in your browser. The MLflow UI shows:
  • All experiment runs with metrics
  • Parameter comparisons
  • Artifact downloads (models, plots, code)
  • Run metadata and timing

Experiment Loop

The autopilot follows this iterative process:
1. Data Profiling

Analyzes schema, distributions, missing values, and statistical properties.
# From src/execution/data_profiler.py
profiler = DataProfiler(data_path, target_column, task_type)
profile = profiler.profile()
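The DataProfiler internals aren't shown here, but the kind of profile this step produces can be sketched in a few lines of pandas (field names below are illustrative, not the project's actual schema):

```python
import pandas as pd

def profile_dataframe(df: pd.DataFrame, target: str) -> dict:
    """Illustrative mini-profile: schema, missingness, target summary."""
    return {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_pct": (df.isna().mean() * 100).round(2).to_dict(),
        "target_summary": df[target].describe().to_dict(),
    }

df = pd.DataFrame({"MedInc": [8.3, 7.2, None],
                   "MedHouseVal": [4.5, 3.6, 2.7]})
profile = profile_dataframe(df, target="MedHouseVal")
print(profile["missing_pct"])  # MedInc has one missing value out of three
```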
2. Baseline Model

Establishes a performance floor with a simple model (e.g., mean prediction for regression).
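A mean-prediction baseline like the one described can be built with scikit-learn's DummyRegressor; this sketch shows the idea (the autopilot's actual baseline code may differ):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Mean prediction: every later experiment must beat this RMSE.
y_train = np.array([1.0, 2.0, 3.0, 4.0])
y_test = np.array([2.0, 3.0])

baseline = DummyRegressor(strategy="mean").fit(np.zeros((4, 1)), y_train)
preds = baseline.predict(np.zeros((2, 1)))   # predicts the training mean, 2.5
rmse = mean_squared_error(y_test, preds) ** 0.5
print(rmse)  # 0.5
```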
3. Experiment Design

Gemini designs the next experiment based on:
  • Data profile
  • Previous experiment results
  • User constraints (if provided)
  • Hypothesis to test
4. Code Generation

Generates validated Python code using Jinja2 templates:
# From src/execution/code_generator.py
generator = CodeGenerator()
code = generator.generate(
    spec=experiment_spec,
    data_path=data_path,
    output_dir=output_dir
)
5. Execution

Runs the generated code in a subprocess with timeout protection.
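Subprocess execution with a timeout can be sketched with the stdlib alone; the function below is illustrative (the real runner also captures metrics from the script's output):

```python
import subprocess
import sys

def run_experiment_script(script_path: str, timeout_s: int = 600):
    """Run a generated script in a child process, killing it on timeout.

    Returns (returncode, stdout), or (None, "timed out") if the
    script exceeded the time limit.
    """
    try:
        result = subprocess.run(
            [sys.executable, script_path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return result.returncode, result.stdout
    except subprocess.TimeoutExpired:
        return None, "timed out"
```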
6. Results Analysis

Gemini analyzes results, compares against baseline and best, and detects trends.
7. Hypothesis Generation

Generates ranked hypotheses for the next iteration with confidence scores.
8. Termination Check

Decides whether to continue based on:
  • Max iterations reached
  • Performance plateau (3 consecutive non-improvements by default)
  • Time budget exceeded
  • Target metric achieved (if specified in constraints)
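The plateau rule (three consecutive non-improvements by default) amounts to a simple check over the metric history. A sketch, assuming a minimized metric such as RMSE (the project's actual implementation may differ):

```python
def should_stop(metric_history, patience=3, minimize=True):
    """Stop when the last `patience` iterations failed to beat the
    best score seen before them."""
    if len(metric_history) <= patience:
        return False
    earlier = metric_history[:-patience]
    best_before = min(earlier) if minimize else max(earlier)
    recent = metric_history[-patience:]
    if minimize:
        return all(m >= best_before for m in recent)
    return all(m <= best_before for m in recent)

# 0.20 was the best; the next three rounds never improved on it.
print(should_stop([0.30, 0.20, 0.21, 0.22, 0.25]))  # True
```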

Using Your Own Dataset

To run the autopilot on your own data:
python -m src.main run \
  --data /path/to/your/dataset.csv \
  --target your_target_column \
  --task regression \
  --max-iterations 10 \
  --verbose
Ensure your target column name is spelled exactly as it appears in the dataset (case-sensitive).

Supported File Formats

  • CSV: .csv files (automatically detected)
  • Parquet: .parquet files (automatically detected)
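The "automatically detected" part presumably means dispatching on the file extension, which in pandas looks roughly like this (a sketch; reading Parquet additionally requires pyarrow or fastparquet):

```python
from pathlib import Path

import pandas as pd

def load_dataset(path: str) -> pd.DataFrame:
    """Load a dataset by file extension, rejecting unsupported formats."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".parquet":
        return pd.read_parquet(path)
    raise ValueError(f"Unsupported file format: {suffix}")
```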

Task Types

  • regression - For continuous target variables (house prices, temperature, etc.)
  • classification - For categorical target variables (yes/no, spam/ham, etc.)

Adding Constraints

Guide Gemini’s experimentation with natural language preferences:
python -m src.main run \
  --data data/sample/california_housing.csv \
  --target MedHouseVal \
  --task regression \
  --constraints my_constraints.md \
  --max-iterations 5 \
  --verbose
Create a constraints file in Markdown:
my_constraints.md
# Experiment Constraints

## Metrics
- Primary metric: RMSE

## Models
- Prefer tree-based models
- Prefer boosting methods

## Preprocessing
- Log-transform the target variable
- Use median imputation for missing values

## Termination
- Stop if no improvement for 3 iterations
Constraints are natural language guidelines, not hard rules. Gemini will consider them when designing experiments but may deviate if it has good reasoning to do so.

Resuming Interrupted Experiments

If an experiment is interrupted (Ctrl+C or crash), the state is automatically saved:
python -m src.main run \
  --resume outputs/state_20260302_143022.json
The autopilot will continue from where it left off, preserving all conversation context.
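A state file like the one named above is typically just a JSON checkpoint. The sketch below shows the general save/load pattern only; the actual schema of the project's state files (its field names and contents) is defined by the project, not by this example:

```python
import json

def save_state(path, iteration, history, conversation):
    """Checkpoint the run to JSON. Field names here are illustrative."""
    with open(path, "w") as f:
        json.dump({"iteration": iteration,
                   "history": history,
                   "conversation": conversation}, f)

def load_state(path):
    """Reload a checkpoint so a run can resume where it left off."""
    with open(path) as f:
        return json.load(f)
```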

Next Steps

CLI Reference

Complete list of all command-line options

Advanced Constraints

Learn how to write powerful constraint files

Understanding Reports

How to interpret generated Markdown reports

Troubleshooting

Common issues and solutions
