ML Experiment Autopilot automatically saves session state throughout execution. If an experiment is interrupted (Ctrl+C, system crash, timeout), you can resume from the last saved checkpoint without losing progress.
# State is saved after each iteration
# Resume from last completed iteration
python -m src.main run \
  --data data/sample/california_housing.csv \
  --target MedHouseVal \
  --task regression \
  --resume outputs/state_a1b2c3d4.json
If the crash occurred during an iteration (not between iterations), that iteration’s results are lost and will be re-run.
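The behaviour described above can be pictured with a small sketch. This is not the project's actual loop: `run_with_checkpoints` and the `run_iteration` callback are hypothetical names, and only the `current_iteration` and `experiments` keys are taken from the state file as it appears elsewhere in this guide. It shows why a crash inside an iteration costs only that iteration: the state on disk is rewritten only after an iteration finishes.

```python
import json
from pathlib import Path
from typing import Callable


def run_with_checkpoints(
    state_path: Path,
    max_iterations: int,
    run_iteration: Callable[[int], dict],  # hypothetical: runs one experiment
) -> dict:
    """Run iterations, saving state to disk after each completed one."""
    # Resume from the saved state if it exists, otherwise start fresh.
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        state = {"current_iteration": 0, "experiments": []}

    while state["current_iteration"] < max_iterations:
        # A crash inside this call loses only the iteration in progress,
        # because the file on disk still describes the previous iteration.
        result = run_iteration(state["current_iteration"])
        state["experiments"].append(result)
        state["current_iteration"] += 1

        # Write to a temporary file, then rename it over the real one so a
        # crash during the write cannot leave a half-written state file.
        tmp_path = state_path.with_suffix(".tmp")
        tmp_path.write_text(json.dumps(state, indent=2))
        tmp_path.replace(state_path)

    return state
```

Calling the function a second time with the same `state_path` picks up at the saved `current_iteration`, which mirrors what `--resume` is described as doing.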
# Original run without verbose
python -m src.main run ... --max-iterations 10
# Resume with verbose output
python -m src.main run ... --resume state.json --verbose
# List all state files
ls -lh outputs/state_*.json
# Find state files modified in the last 24 hours
find outputs -name "state_*.json" -mtime -1
# Find state by session ID (if you know it)
ls outputs/state_a1b2c3d4.json
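If you would rather locate the most recent state file from a script, a minimal sketch along these lines should work. `latest_state_file` is a hypothetical helper, not part of the project; it assumes only the `outputs/state_*.json` naming shown above.

```python
from pathlib import Path
from typing import Optional


def latest_state_file(output_dir: str = "outputs") -> Optional[Path]:
    """Return the most recently modified state_*.json file, or None if there is none."""
    candidates = list(Path(output_dir).glob("state_*.json"))
    if not candidates:
        return None
    # Pick the file with the newest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The returned path can then be passed to `--resume`.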
State not saved (process killed before first save)
Solution:
# Check if state exists
ls -lh outputs/state_*.json
# Use correct path
python -m src.main run ... --resume outputs/state_a1b2c3d4.json
# If lost, restart from scratch (no --resume)
State validation fails
Error: AssertionError: Data path mismatch
Cause: CLI arguments don’t match the saved state
Solution: Use exactly the same --data, --target, and --task values as the original run:
# Check state file for original values
grep -A5 '"config"' outputs/state_a1b2c3d4.json
# Match CLI arguments to state (--data, --target, and --task all come from the state file)
python -m src.main run \
  --data data/sample/california_housing.csv \
  --target MedHouseVal \
  --task regression \
  --resume outputs/state_a1b2c3d4.json
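To compare your arguments against the saved state before resuming, a check like the following may help. It is only a sketch: the `"config"` block exists in the state file, but the key names used in the usage note (`data_path`, `target_column`, `task_type`) are assumptions, so inspect your own state file and substitute the real ones. `find_config_mismatches` is a hypothetical helper.

```python
import json
from pathlib import Path


def find_config_mismatches(state_path, cli_args: dict) -> dict:
    """Map each differing key to a (saved_value, cli_value) pair."""
    config = json.loads(Path(state_path).read_text()).get("config", {})
    return {
        key: (config.get(key), value)
        for key, value in cli_args.items()
        if config.get(key) != value
    }
```

For example, `find_config_mismatches("outputs/state_a1b2c3d4.json", {"data_path": "data/sample/california_housing.csv", "target_column": "MedHouseVal", "task_type": "regression"})` returns an empty dict when everything matches (again, those three key names are placeholders).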
Gemini context lost
Observation: Resumed session seems to “forget” previous reasoning
Cause: State file corrupted or conversation history truncated
Solution:
# Validate state JSON
python -c "import json; print(json.load(open('outputs/state_a1b2c3d4.json')))" > /dev/null
# Check conversation history length
jq '.gemini_conversation_history | length' outputs/state_a1b2c3d4.json
# If corrupted, restart from scratch
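If `jq` is not installed, the same two checks can be done in Python. This is a sketch that assumes only the `gemini_conversation_history` key shown above and that its value is a list; `inspect_state` is a hypothetical helper.

```python
import json
from pathlib import Path


def inspect_state(state_path) -> dict:
    """Report whether the state file parses as JSON and how long the history is."""
    report = {"valid_json": False, "history_length": None}
    try:
        state = json.loads(Path(state_path).read_text())
    except (OSError, ValueError):
        # Missing or unreadable file, or malformed JSON
        # (json.JSONDecodeError is a subclass of ValueError).
        return report
    report["valid_json"] = True
    history = state.get("gemini_conversation_history") if isinstance(state, dict) else None
    if isinstance(history, list):
        report["history_length"] = len(history)
    return report
```

A result with `valid_json` set to `False`, or a `history_length` much shorter than the number of completed iterations would suggest, points to the corruption or truncation case described above.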
MLflow tracking disconnected
Observation: Resumed session creates new MLflow experiment
Cause: MLflow experiment name mismatch
Solution: MLflow auto-reconnects using session_id. If this fails:
# Edit state file
vim outputs/state_a1b2c3d4.json
# Change:
"config": {
  "max_iterations": 10,  # Change to 20
  ...
}
# Save and resume
python -m src.main run ... --resume outputs/state_a1b2c3d4.json
# Edit state file
vim outputs/state_a1b2c3d4.json
# Remove the failed experiment from "experiments" array
# Decrement "current_iteration"
# Resume to retry that iteration
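The manual edit above can also be scripted. The sketch below assumes the failed experiment is the last entry in the `"experiments"` array, which may not hold for your run, so check the file first. `drop_last_experiment` is a hypothetical helper; it writes a `.bak` copy before changing anything.

```python
import json
import shutil
from pathlib import Path


def drop_last_experiment(state_path) -> dict:
    """Remove the last experiment and step current_iteration back by one."""
    path = Path(state_path)
    # Keep a backup next to the original in case the edit needs undoing.
    shutil.copy2(path, path.with_name(path.name + ".bak"))

    state = json.loads(path.read_text())
    if not state.get("experiments"):
        raise ValueError("no experiments to remove")

    state["experiments"].pop()  # assumption: the failed run is the last entry
    state["current_iteration"] = max(0, state["current_iteration"] - 1)
    path.write_text(json.dumps(state, indent=2))
    return state
```

After running it, resume with `--resume` as usual to retry that iteration.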
# Edit state file
vim outputs/state_a1b2c3d4.json
# Change:
"config": {
  "primary_metric": "rmse",  # Change to "r2"
  ...
}
# This affects how "best" is determined going forward
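Hand-editing JSON in an editor makes it easy to leave a stray comma behind, so a scripted edit may be safer. This sketch changes one existing key inside the `"config"` block and refuses to add keys that are not already there. `set_config_value` is a hypothetical helper, and whether the application accepts a given value on resume is not something this guide confirms.

```python
import json
import shutil
from pathlib import Path


def set_config_value(state_path, key: str, value):
    """Update an existing config key in the state file and return the old value."""
    path = Path(state_path)
    state = json.loads(path.read_text())
    config = state.get("config", {})
    if key not in config:
        raise KeyError(f"{key!r} is not present in the saved config")

    # Back up the original before overwriting it.
    shutil.copy2(path, path.with_name(path.name + ".bak"))

    old_value = config[key]
    config[key] = value
    path.write_text(json.dumps(state, indent=2))
    return old_value
```

For instance, `set_config_value("outputs/state_a1b2c3d4.json", "primary_metric", "r2")` performs the same change as the edit shown above.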
Session IDs are auto-generated (8-character UUID-style strings). For easier management, give the state file a descriptive name:
# After run completes, rename state file
mv outputs/state_a1b2c3d4.json outputs/state_housing_exp1.json
# Resume with descriptive name
python -m src.main run ... --resume outputs/state_housing_exp1.json
Checkpoint Long Runs
For multi-hour experiments:
# Run in screen/tmux for persistence
screen -S autopilot
python -m src.main run ... --max-iterations 50 --time-budget 14400
# Detach: Ctrl+A, D
# Reattach: screen -r autopilot
State is saved after every iteration, so you can safely detach and reattach.
Version Control State Files
For reproducibility:
# After each resume (create the target directory if it does not exist yet)
mkdir -p versioned_states
cp outputs/state_a1b2c3d4.json versioned_states/state_v2.json
# Track in git (if not too large)
git add versioned_states/state_v2.json
git commit -m "Experiment checkpoint after 10 iterations"