
Overview

ML Experiment Autopilot automatically saves session state throughout execution. If an experiment is interrupted (Ctrl+C, system crash, timeout), you can resume from the last saved checkpoint without losing progress.

How State Saving Works

State is saved automatically at key points (defined in src/orchestration/controller.py:454-460):
1. After Data Profiling

State saved when profiling completes:
self.save_state()  # Line 187
2. After Baseline Model

State saved after baseline experiment:
self.save_state()  # Line 239
3. After Each Iteration

State saved after every experiment iteration:
self.save_state()  # Line 328
4. On Keyboard Interrupt

State saved when you press Ctrl+C:
except KeyboardInterrupt:
    controller.save_state()  # src/main.py:195
5. At Completion

Final state saved when loop completes:
self.save_state()  # Line 450
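
All five checkpoints call the same save routine. As a rough illustration of what such a routine could look like, the sketch below writes the state dictionary to outputs/state_{session_id}.json through a temporary file, so a crash in the middle of a save leaves the previous checkpoint intact. The function name save_state_atomic and the temp-file strategy are assumptions for illustration, not the project's actual save_state() implementation.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state_atomic(state: dict, output_dir: str = "outputs") -> Path:
    """Write state to <output_dir>/state_<session_id>.json without partial files.

    Hypothetical helper: the real controller's save_state() may differ.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"state_{state['session_id']}.json"

    # Write to a temp file in the same directory, then rename it over the target.
    # os.replace is atomic when both paths are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=out_dir, prefix=".state_", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh, indent=2)
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the temp file on any failure, including Ctrl+C, then re-raise.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return target
```

With this approach a reader of the state file sees either the old checkpoint or the new one, never a half-written file.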

State File Format

State is saved as JSON in outputs/state_{session_id}.json:
{
  "session_id": "a1b2c3d4",
  "config": {
    "data_path": "data/sample/california_housing.csv",
    "target_column": "MedHouseVal",
    "task_type": "regression",
    "constraints": "# Experiment Constraints\n...",
    "max_iterations": 20,
    "time_budget": 3600,
    "plateau_threshold": 3,
    "improvement_threshold": 0.005,
    "primary_metric": "rmse",
    "output_dir": "outputs"
  },
  "data_profile": {
    "n_rows": 20640,
    "n_columns": 9,
    "columns": ["MedInc", "HouseAge", ...],
    "numeric_stats": { ... },
    "target_stats": { ... }
  },
  "experiments": [
    {
      "experiment_name": "baseline",
      "iteration": 0,
      "model_type": "LinearRegression",
      "metrics": {"rmse": 0.7456, "r2": 0.6012},
      "success": true,
      "timestamp": "2026-03-02T10:00:05.123456"
    },
    {
      "experiment_name": "log_transform_rf",
      "iteration": 1,
      "model_type": "RandomForestRegressor",
      "metrics": {"rmse": 0.4201, "r2": 0.7834},
      "hypothesis": "Log transformation will reduce target skewness",
      "success": true,
      "timestamp": "2026-03-02T10:00:18.456789"
    }
  ],
  "current_iteration": 2,
  "phase": "experiment_design",
  "best_metric": 0.4201,
  "best_experiment": "log_transform_rf",
  "iterations_without_improvement": 0,
  "start_time": 1709377200.5,
  "end_time": null,
  "gemini_conversation_history": [
    {
      "role": "user",
      "content": "Design the next experiment...",
      "timestamp": "2026-03-02T10:00:10.123456"
    },
    {
      "role": "model",
      "content": "{\"experiment_name\": \"log_transform_rf\", ...}",
      "timestamp": "2026-03-02T10:00:12.789012"
    }
  ],
  "agent_recommends_stop": false,
  "termination_reason": null
}
State includes the complete Gemini conversation history, maintaining Thought Signature continuity when resuming.
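
Because the state is plain JSON, it can be inspected with a few lines of Python before resuming. The sketch below is illustrative only: summarize_state is a hypothetical helper, and the keys it reads are the ones shown in the example state file above.

```python
import json
from pathlib import Path


def summarize_state(path: str) -> dict:
    """Return a few headline fields from a saved state file.

    Key names are taken from the example state shown in this guide.
    """
    state = json.loads(Path(path).read_text(encoding="utf-8"))
    return {
        "session_id": state["session_id"],
        "phase": state["phase"],
        "current_iteration": state["current_iteration"],
        "n_experiments": len(state["experiments"]),
        "best_experiment": state["best_experiment"],
        "best_metric": state["best_metric"],
        "primary_metric": state["config"]["primary_metric"],
        "history_messages": len(state["gemini_conversation_history"]),
    }


if __name__ == "__main__":
    import sys

    # Usage: python summarize_state.py outputs/state_a1b2c3d4.json
    if len(sys.argv) > 1:
        for key, value in summarize_state(sys.argv[1]).items():
            print(f"{key}: {value}")
```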

Resuming a Session

Basic Resume

1. Locate State File

Find the state file in the outputs directory:
ls outputs/state_*.json
# outputs/state_a1b2c3d4.json
2. Resume with --resume Flag

Use the --resume flag with the state file path:
python -m src.main run \
  --data data/sample/california_housing.csv \
  --target MedHouseVal \
  --task regression \
  --resume outputs/state_a1b2c3d4.json \
  --verbose
You must still provide --data, --target, and --task arguments, but they are validated against the saved state.
3. Verify Resume

The autopilot prints confirmation:
✓ Resumed from outputs/state_a1b2c3d4.json
Session ID: a1b2c3d4
Current iteration: 2
Best metric (rmse): 0.4201

What Gets Restored

When resuming (implemented in src/orchestration/controller.py:92-95):
Configuration (restored)
All original settings:
  • Data path, target column, task type
  • Constraints text
  • Max iterations, time budget
  • Primary metric
Data Profile (restored)
Complete dataset analysis (skips re-profiling)
Experiment History (restored)
All completed experiments:
  • Model types, parameters, preprocessing
  • Metrics, hypotheses, reasoning
  • Success/failure status
  • Timestamps
Best Tracking (restored)
Current best:
  • Best metric value
  • Best experiment name
  • Iterations without improvement
Gemini Context (restored)
Full conversation history for Thought Signature continuity:
  • All user prompts
  • All model responses
  • Timestamps
This is crucial! Gemini maintains reasoning continuity across the interruption.
Phase (restored)
Exact phase when interrupted:
  • data_profiling, baseline_modeling, experiment_design, etc.
  • Resumes from next logical step

What Does NOT Get Restored

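The following are not restored from the state file:
  • --verbose flag (defaults to False; pass it again on resume if you want verbose output)
  • MLflow tracker state (not saved; the tracker auto-reconnects to the same experiment)
By contrast, --output-dir does not need to be re-specified: the original value is loaded from state.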

Resuming After Different Interruptions

Scenario 1: Keyboard Interrupt (Ctrl+C)

# Original run
python -m src.main run \
  --data data/sample/california_housing.csv \
  --target MedHouseVal \
  --task regression \
  --max-iterations 20 \
  --verbose

# ... (3 iterations complete)
# Press Ctrl+C

^C
Interrupted by user. Saving state...
State saved. You can resume with --resume
Resume:
python -m src.main run \
  --data data/sample/california_housing.csv \
  --target MedHouseVal \
  --task regression \
  --resume outputs/state_a1b2c3d4.json \
  --verbose

# Continues from iteration 4

Scenario 2: System Crash or Kill

If the process crashes or is killed:
# State is saved after each iteration
# Resume from last completed iteration

python -m src.main run \
  --data data/sample/california_housing.csv \
  --target MedHouseVal \
  --task regression \
  --resume outputs/state_a1b2c3d4.json
If the crash occurred during an iteration (not between iterations), that iteration’s results are lost and will be re-run.

Scenario 3: Time Budget Exhausted

If the time budget runs out:
# Original run with 30-minute budget
python -m src.main run \
  --data data/sample/california_housing.csv \
  --target MedHouseVal \
  --task regression \
  --time-budget 1800 \
  --max-iterations 20

# ... (time expires after 5 iterations)
Time budget exhausted
Results saved to outputs
Resume with extended budget:
python -m src.main run \
  --data data/sample/california_housing.csv \
  --target MedHouseVal \
  --task regression \
  --resume outputs/state_a1b2c3d4.json \
  --time-budget 3600  # Additional 1 hour
The --time-budget for resumed runs is additional time, not total time from the original start.

Modifying Resume Behavior

Extend Iterations

Increase max iterations when resuming:
# Original: 10 iterations
python -m src.main run ... --max-iterations 10

# Resume with 20 total iterations
python -m src.main run ... --resume state.json --max-iterations 20
The --max-iterations value when resuming sets a new maximum, not additional iterations.
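
The two flags behave differently on resume: --time-budget adds time counted from the moment of resuming, while --max-iterations is a total. The arithmetic can be sketched as follows. Both function names are hypothetical, and the sketch assumes the maximum is compared against the number of iterations already completed; the controller's exact counting (for example, whether the baseline counts) may differ by one.

```python
import time
from typing import Optional


def remaining_iterations(max_iterations: int, completed_iterations: int) -> int:
    """--max-iterations is a total, so only the difference is left to run."""
    return max(0, max_iterations - completed_iterations)


def resume_deadline(time_budget: float, resumed_at: Optional[float] = None) -> float:
    """--time-budget on resume is additional time, counted from the resume moment.

    Returns the deadline as a Unix timestamp in seconds.
    """
    start = time.time() if resumed_at is None else resumed_at
    return start + time_budget
```

Under these assumptions, a session interrupted after 3 completed iterations and resumed with --max-iterations 20 has 17 iterations left, not 20, and resuming with --time-budget 3600 gives a full hour from the moment the resumed run starts.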

Change Verbosity

# Original run without verbose
python -m src.main run ... --max-iterations 10

# Resume with verbose output
python -m src.main run ... --resume state.json --verbose

Cannot Change

These settings are locked to the original values (enforced by state validation):
  • --data (must match original)
  • --target (must match original)
  • --task (must match original)
  • --constraints (loaded from state)
  • --output-dir (loaded from state)
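
To check locked arguments before launching a resume, the comparison can be approximated in a few lines. This is a sketch rather than the project's validation code: find_mismatches is a hypothetical helper, and the mapping from CLI flags to config keys is inferred from the example state file in this guide.

```python
from typing import Dict, List

# Assumed mapping from CLI flag to the key under "config" in the state file.
LOCKED_FLAGS = {
    "--data": "data_path",
    "--target": "target_column",
    "--task": "task_type",
}


def find_mismatches(cli_args: Dict[str, str], saved_config: Dict[str, object]) -> List[str]:
    """Return one message per locked flag whose value differs from the saved state."""
    problems = []
    for flag, key in LOCKED_FLAGS.items():
        given = cli_args.get(flag)
        saved = saved_config.get(key)
        if given != saved:
            problems.append(f"{flag}: got {given!r}, state has {saved!r}")
    return problems
```

An empty list means the three locked arguments agree with the "config" block of the state file.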

State File Management

Finding State Files

State files are named by session ID:
# List all state files
ls -lh outputs/state_*.json

# Find state files modified in the last 24 hours
find outputs -name "state_*.json" -mtime -1

# Find state by session ID (if you know it)
ls outputs/state_a1b2c3d4.json
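
When the outputs directory holds many state files, it can help to list only the sessions that are still resumable. The sketch below is a hypothetical helper (list_resumable is not part of the tool). It assumes a finished session is marked with "phase": "completed", as described in the troubleshooting section of this guide, and it skips files that cannot be parsed.

```python
import json
from pathlib import Path
from typing import List


def list_resumable(output_dir: str = "outputs") -> List[tuple]:
    """Return (file name, phase, current_iteration) for unfinished sessions.

    Results are ordered from most to least recently modified.
    """
    results = []
    files = sorted(
        Path(output_dir).glob("state_*.json"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    for path in files:
        try:
            state = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or corrupted file: not resumable
        if not isinstance(state, dict):
            continue
        if state.get("phase") != "completed":
            results.append((path.name, state.get("phase"), state.get("current_iteration")))
    return results
```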

Cleaning Up Old States

State files persist indefinitely. Clean up manually:
# Remove states older than 30 days
find outputs -name "state_*.json" -mtime +30 -delete

# Or archive all state files instead of deleting them
mkdir -p archive
mv outputs/state_*.json archive/

Backing Up State

For long-running experiments, back up state periodically:
# Copy state to backup location (create the directory first)
mkdir -p backups
cp outputs/state_a1b2c3d4.json backups/state_a1b2c3d4_$(date +%Y%m%d_%H%M%S).json

# Or sync entire outputs directory
rsync -av outputs/ backups/outputs_backup/

Troubleshooting Resume Issues

Error: FileNotFoundError: outputs/state_a1b2c3d4.json
Causes:
  • State file moved or deleted
  • Incorrect path provided to --resume
  • State not saved (process killed before first save)
Solution:
# Check if state exists
ls -lh outputs/state_*.json

# Use correct path
python -m src.main run ... --resume outputs/state_a1b2c3d4.json

# If lost, restart from scratch (no --resume)
Error: AssertionError: Data path mismatch
Cause: CLI arguments don’t match saved state
Solution: Use exact same --data, --target, --task as original:
# Check state file for original values
cat outputs/state_a1b2c3d4.json | grep -A5 '"config"'

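# Match CLI arguments to state: --data, --target, and --task all come from the "config" block.
# (A comment after a trailing backslash would break the line continuation, so none are used here.)
python -m src.main run \
  --data data/sample/california_housing.csv \
  --target MedHouseVal \
  --task regression \
  --resume outputs/state_a1b2c3d4.json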
Observation: Resumed session seems to “forget” previous reasoning
Cause: State file corrupted or conversation history truncated
Solution:
# Validate state JSON
python -c "import json; print(json.load(open('outputs/state_a1b2c3d4.json')))" > /dev/null

# Check conversation history length
cat outputs/state_a1b2c3d4.json | jq '.gemini_conversation_history | length'

# If corrupted, restart from scratch
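
The same checks can be bundled into a small script. The sketch below is illustrative: check_state_file is a hypothetical helper, the list of required keys is taken from the example state file in this guide, and the empty-history check assumes that only experiments after the baseline involve Gemini.

```python
import json
from pathlib import Path
from typing import List

# Top-level keys taken from the example state file; the real schema may have more.
REQUIRED_KEYS = (
    "session_id",
    "config",
    "experiments",
    "current_iteration",
    "phase",
    "gemini_conversation_history",
)


def check_state_file(path: str) -> List[str]:
    """Return a list of problems found in a state file (empty list means none found)."""
    try:
        state = json.loads(Path(path).read_text(encoding="utf-8"))
    except FileNotFoundError:
        return [f"{path}: file not found"]
    except json.JSONDecodeError as exc:
        return [f"{path}: not valid JSON ({exc.msg} at line {exc.lineno})"]
    if not isinstance(state, dict):
        return [f"{path}: top-level JSON value is not an object"]

    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in state]

    history = state.get("gemini_conversation_history")
    if isinstance(history, list):
        # Assumption: experiments beyond the baseline were designed through Gemini,
        # so an empty history alongside them suggests truncation.
        if len(state.get("experiments", [])) > 1 and not history:
            problems.append("conversation history is empty but experiments beyond the baseline exist")
    elif history is not None:
        problems.append("gemini_conversation_history is not a list")
    return problems
```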
Observation: Resumed session creates new MLflow experiment
Cause: MLflow experiment name mismatch
Solution: MLflow auto-reconnects using session_id. If this fails:
# Check MLflow experiments
mlflow ui --backend-store-uri file:./outputs/mlruns

# Look for experiment named: autopilot_{dataset}_{session_id}
# e.g., autopilot_california_housing_a1b2c3d4
Resumed runs append to the same MLflow experiment.
Observation: Resume seems to restart from beginning
Cause: State file from a completed (not interrupted) session
Solution: Check state phase:
cat outputs/state_a1b2c3d4.json | jq '.phase'
# "completed" means session finished naturally

# Cannot resume completed sessions
# Start a new session instead

Advanced: Manual State Editing

Editing state files manually is risky and can corrupt the session. Only do this if you understand the state schema.
You can manually edit state JSON for advanced use cases:

Extend Max Iterations

# Edit state file
vim outputs/state_a1b2c3d4.json

# Change "max_iterations" from 10 to 20 (JSON does not allow comments, so add none to the file):
"config": {
  "max_iterations": 20,
  ...
}

# Save and resume
python -m src.main run ... --resume outputs/state_a1b2c3d4.json
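
Hand-editing JSON in an editor makes it easy to leave a stray comma or comment behind. A scripted edit that first writes a backup is one way to lower that risk. The sketch below is a hypothetical helper (set_max_iterations is not part of the tool) and assumes the "config" layout from the example state file.

```python
import json
import shutil
from pathlib import Path


def set_max_iterations(state_path: str, new_max: int) -> Path:
    """Set config.max_iterations in a state file, keeping a .bak copy of the original.

    Returns the path of the backup file.
    """
    path = Path(state_path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # keep the untouched original next to the state file

    state = json.loads(path.read_text(encoding="utf-8"))
    state["config"]["max_iterations"] = new_max
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return backup
```

The backup ends in .json.bak, so it is not matched by the state_*.json patterns used elsewhere in this guide.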

Remove Failed Experiments

If an experiment failed due to transient error:
# Edit state file
vim outputs/state_a1b2c3d4.json

# Remove the failed experiment from "experiments" array
# Decrement "current_iteration"

# Resume to retry that iteration
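
The two manual steps above can also be expressed on the loaded state dictionary. The sketch below handles only the simplest case, where the most recent experiment is the failed one. drop_last_failed is a hypothetical helper, and it assumes each experiment carries the "success" flag shown in the example state file.

```python
def drop_last_failed(state: dict) -> bool:
    """Remove the most recent experiment if it failed and step the counter back.

    Modifies the state dictionary in place. Returns True if an experiment was removed.
    """
    experiments = state["experiments"]
    if not experiments or experiments[-1].get("success", True):
        return False  # nothing to remove: list is empty or last experiment succeeded
    experiments.pop()
    state["current_iteration"] = max(0, state["current_iteration"] - 1)
    return True
```

The modified dictionary still has to be written back to the state file before resuming, ideally after making a backup copy.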

Change Primary Metric

# Edit state file
vim outputs/state_a1b2c3d4.json

# Change "primary_metric" from "rmse" to "r2" (JSON does not allow comments, so add none to the file):
"config": {
  "primary_metric": "r2",
  ...
}

# This affects how "best" is determined going forward

Best Practices

Session IDs are auto-generated (8-char UUIDs). For easier management:
# After run completes, rename state file
mv outputs/state_a1b2c3d4.json outputs/state_housing_exp1.json

# Resume with descriptive name
python -m src.main run ... --resume outputs/state_housing_exp1.json
For multi-hour experiments:
# Run in screen/tmux for persistence
screen -S autopilot
python -m src.main run ... --max-iterations 50 --time-budget 14400

# Detach: Ctrl+A, D
# Reattach: screen -r autopilot
State saves every iteration, so you can safely detach/reattach.
For reproducibility:
# After each resume (create the directory first)
mkdir -p versioned_states
cp outputs/state_a1b2c3d4.json versioned_states/state_v2.json

# Track in git (if not too large)
git add versioned_states/state_v2.json
git commit -m "Experiment checkpoint after 10 iterations"

Next Steps

Running Experiments

Learn about the full experiment loop

Troubleshooting

Resolve common issues
