prepare.py is the one-time setup script that must be run before starting an auto-harness optimization experiment. It validates the environment, creates the workspace directory structure, copies the correct agent scaffold into agent/agent.py, composes PROGRAM.md, and runs a baseline benchmark to establish iteration 0. Re-running it on an existing workspace is safe — each step is idempotent and skips work that has already been done.
CLI usage
prepare.py takes no command-line arguments. All configuration is read from experiment_config.yaml in the current working directory.
Execution order
prepare.py runs the following steps in sequence:
1. Check environment: Calls the benchmark-specific environment check function (check_env_tau_bench, check_env_terminal_bench, or check_env_bird_interact). For tau-bench, also validates that the tau2 data directory is populated via check_tau2_data.
2. Initialize workspace: Calls init_workspace(cfg) to create workspace/suite.json, workspace/learnings.md, workspace/results.tsv, and workspace/train_results.json.
3. Copy agent template: Calls copy_agent_template(benchmark) to install the benchmark-specific starting scaffold at agent/agent.py.
4. Compose PROGRAM.md: Calls copy_program_template(benchmark) to write PROGRAM.md from program_templates/base.md plus the benchmark supplement.
Environment check functions
These functions validate that the required API keys and tools are present before any benchmark infrastructure is started. Each exits gracefully with an error message rather than raising an exception.
check_env_tau_bench
Checks for the API key required by the agent_model in cfg:
- Models starting with "gemini" → GEMINI_API_KEY
- Models starting with "claude" → ANTHROPIC_API_KEY
- All other models → OPENAI_API_KEY
Returns: True if all required environment variables are present, False otherwise.
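The prefix rules above can be sketched as a small lookup. This is a minimal illustration, not the code in prepare.py: the helper names required_api_key and has_required_key are hypothetical, and whether the real check is case-sensitive or treats an empty value as missing is an assumption.

```python
import os


def required_api_key(agent_model: str) -> str:
    """Return the name of the environment variable needed for agent_model."""
    if agent_model.startswith("gemini"):
        return "GEMINI_API_KEY"
    if agent_model.startswith("claude"):
        return "ANTHROPIC_API_KEY"
    return "OPENAI_API_KEY"  # every other model name falls through to OpenAI


def has_required_key(agent_model: str, env=None) -> bool:
    """Report whether the required key is set and non-empty in env."""
    env = os.environ if env is None else env
    key = required_api_key(agent_model)
    if not env.get(key):
        # Assumption: an empty string counts as missing.
        print(f"Missing environment variable: {key}")
        return False
    return True
```

Passing env explicitly keeps the check testable without touching the real process environment.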
check_env_terminal_bench
Checks the model API key (same logic as check_env_tau_bench) and the sandbox provider key:
- env_provider="e2b" → E2B_API_KEY
- env_provider="daytona" → DAYTONA_API_KEY
- env_provider="docker" → no key required
Also verifies that the harbor CLI is on PATH.
Returns: True if all requirements are met, False otherwise.
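A rough sketch of the provider-key and harbor checks follows. It omits the model API key check, the function name missing_terminal_bench_requirements is hypothetical, and rejecting an unknown env_provider is an assumption rather than documented behavior. shutil.which is the standard-library way to test whether a command is on PATH.

```python
import shutil

# Provider -> required environment variable (None means no key is needed).
PROVIDER_KEYS = {"e2b": "E2B_API_KEY", "daytona": "DAYTONA_API_KEY", "docker": None}


def missing_terminal_bench_requirements(env_provider, env, which=shutil.which):
    """Return a list of problems; an empty list means the check passes."""
    problems = []
    if env_provider not in PROVIDER_KEYS:
        # Assumption: unknown providers are treated as an error.
        problems.append(f"unknown env_provider: {env_provider!r}")
    else:
        key = PROVIDER_KEYS[env_provider]
        if key is not None and not env.get(key):
            problems.append(f"missing environment variable: {key}")
    if which("harbor") is None:
        problems.append("harbor CLI not found on PATH")
    return problems
```

A caller could print each problem and return False when the list is non-empty, which matches the True/False contract described above.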
init_workspace
Creates the workspace/ directory and initializes its files if they do not already exist. All operations are idempotent.
Parameters:
- cfg: Parsed experiment_config.yaml dict. Used to read the threshold value for suite.json.
| File | Initial content |
|---|---|
| workspace/suite.json | {"tasks": [], "threshold": <cfg threshold or 0.8>, "last_results": {}} |
| workspace/learnings.md | # Learnings\n\n |
| workspace/results.tsv | TSV header: iteration\tval_score\tcommit\tevals_passed\tevals_total\ttimestamp |
| workspace/train_results.json | {"split": null, "timestamp": null, "results": {}} |
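The table above can be read as a create-if-missing loop. The sketch below assumes the threshold lives at the top level of cfg under the key "threshold" and that files are written with two-space JSON indentation; both are guesses, and init_workspace_sketch with its root parameter is a hypothetical stand-in for the real function.

```python
import json
from pathlib import Path

RESULTS_HEADER = "iteration\tval_score\tcommit\tevals_passed\tevals_total\ttimestamp\n"


def init_workspace_sketch(cfg: dict, root: Path) -> None:
    """Create the four workspace files under root, leaving existing ones untouched."""
    workspace = root / "workspace"
    workspace.mkdir(parents=True, exist_ok=True)
    initial = {
        "suite.json": json.dumps(
            # Assumption: threshold is a top-level cfg key, defaulting to 0.8.
            {"tasks": [], "threshold": cfg.get("threshold", 0.8), "last_results": {}},
            indent=2,
        ),
        "learnings.md": "# Learnings\n\n",
        "results.tsv": RESULTS_HEADER,
        "train_results.json": json.dumps(
            {"split": None, "timestamp": None, "results": {}}, indent=2
        ),
    }
    for name, content in initial.items():
        path = workspace / name
        if not path.exists():  # idempotent: never overwrite an existing file
            path.write_text(content, encoding="utf-8")
```

Because each file is only written when absent, calling the function twice leaves earlier learnings and results in place.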
copy_agent_template
Copies the benchmark-specific agent template to agent/agent.py.
Parameters:
- benchmark: Benchmark identifier from experiment_config.yaml. Must be one of "tau-bench", "terminal-bench", or "bird-interact".
| benchmark | Source |
|---|---|
| "tau-bench" | agent/templates/tau_bench.py |
| "terminal-bench" | agent/templates/terminal_bench.py |
| "bird-interact" | agent/templates/bird_interact.py |
copy_program_template
Composes PROGRAM.md by concatenating program_templates/base.md with the benchmark-specific supplement.
Parameters:
- benchmark: Benchmark identifier. Must be one of "tau-bench", "terminal-bench", or "bird-interact".
| benchmark | Source |
|---|---|
| "tau-bench" | program_templates/tau_bench.md |
| "terminal-bench" | program_templates/terminal_bench.md |
| "bird-interact" | program_templates/bird_interact.md |
PROGRAM.md contains the full base content followed by the benchmark-specific section. PROGRAM.md is listed in ALLOWED_AGENT_FILES and can be modified by the agent during iterations.
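The composition step amounts to reading two files and writing their concatenation. In the sketch below, compose_program_sketch and its root parameter are hypothetical, and the blank-line separator between the two parts is an assumption; the source only states that the base content comes first.

```python
from pathlib import Path

PROGRAM_SUPPLEMENTS = {
    "tau-bench": "program_templates/tau_bench.md",
    "terminal-bench": "program_templates/terminal_bench.md",
    "bird-interact": "program_templates/bird_interact.md",
}


def compose_program_sketch(benchmark: str, root: Path) -> str:
    """Write root/PROGRAM.md as base.md followed by the benchmark supplement."""
    if benchmark not in PROGRAM_SUPPLEMENTS:
        raise ValueError(f"unsupported benchmark: {benchmark!r}")
    base = (root / "program_templates" / "base.md").read_text(encoding="utf-8")
    supplement = (root / PROGRAM_SUPPLEMENTS[benchmark]).read_text(encoding="utf-8")
    # Assumption: the two parts are separated by exactly one blank line.
    text = base.rstrip("\n") + "\n\n" + supplement
    (root / "PROGRAM.md").write_text(text, encoding="utf-8")
    return text
```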
run_baseline
Runs the baseline benchmark and records the result in workspace/results.tsv. Skips execution entirely if results.tsv already contains data rows (i.e., a baseline has already been recorded).
Parameters:
- Parsed experiment_config.yaml dict.
Behavior by benchmark:
- tau-bench: Runs the test split directly and records the result.
- terminal-bench: If tbench_data/task_split.json does not yet exist, runs all tasks first to generate the split (via generate_terminal_bench_split), then records the test-split score. If the split already exists, runs only the test split.
- bird-interact: Same logic as terminal-bench but using bird_data/task_split.json and generate_bird_interact_split.
The baseline row in results.tsv uses "baseline" as the commit identifier and 0 for both evals_passed and evals_total.
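The skip-if-recorded rule and the shape of the baseline row might look roughly like this. The helpers has_data_rows and record_baseline are hypothetical, the ISO-8601 UTC timestamp format is an assumption, and the sketch expects the header line written by init_workspace to already be present and newline-terminated.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path


def has_data_rows(results_path: Path) -> bool:
    """True if results.tsv holds at least one non-empty line after the header."""
    if not results_path.exists():
        return False
    lines = results_path.read_text(encoding="utf-8").splitlines()
    return any(line.strip() for line in lines[1:])


def record_baseline(results_path: Path, val_score: float) -> bool:
    """Append the iteration-0 row unless a data row is already recorded."""
    if has_data_rows(results_path):
        return False  # baseline already present, nothing to do
    timestamp = datetime.now(timezone.utc).isoformat()  # assumed format
    with results_path.open("a", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Columns: iteration, val_score, commit, evals_passed, evals_total, timestamp
        writer.writerow([0, val_score, "baseline", 0, 0, timestamp])
    return True
```

Checking for data rows before running anything is what makes a repeated call cheap: the expensive benchmark run is never started when a baseline exists.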
generate_terminal_bench_split
Generates the task split from the all-tasks baseline results.
Tasks are split separately within the passing (reward >= 0.5) and failing (reward < 0.5) groups so that both splits have roughly the same pass/fail ratio. Tasks that timed out (reward None) are excluded by run_baseline before this function is called.
Parameters:
- Mapping of task ID to reward from the all-tasks baseline run. None values should be excluded before passing.
- Random seed for the shuffle, ensuring a reproducible split.
The split is written to tbench_data/task_split.json.
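To illustrate the per-group shuffling, here is a minimal stratified split. It returns a dict instead of writing a file, and several details are assumptions not confirmed by this page: the function name stratified_split, the "train" and "test" keys, the default seed, and the 50/50 test_fraction.

```python
import random


def stratified_split(rewards: dict, seed: int = 0, test_fraction: float = 0.5) -> dict:
    """Split task IDs into train/test while keeping the pass/fail ratio similar."""
    rng = random.Random(seed)
    # Sorting first makes the result independent of dict insertion order.
    passing = sorted(t for t, r in rewards.items() if r >= 0.5)
    failing = sorted(t for t, r in rewards.items() if r < 0.5)
    split = {"train": [], "test": []}
    for group in (passing, failing):
        rng.shuffle(group)  # seeded shuffle keeps the split reproducible
        n_test = round(len(group) * test_fraction)  # assumed ratio
        split["test"].extend(group[:n_test])
        split["train"].extend(group[n_test:])
    return split
```

Rewards of None would raise a TypeError in the comparisons, which is consistent with the requirement that timed-out tasks be filtered out before the call.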
generate_bird_interact_split
Same logic as generate_terminal_bench_split but writes output to bird_data/task_split.json.
Parameters:
- Mapping of instance ID to reward from the all-tasks BIRD-Interact baseline run.
- Random seed for the shuffle.
fetch_tau2_data
Clones the tau2-bench repository and copies its data/tau2/ subdirectory into tau2_data_dir. Skips the clone if the data is already present.
Parameters:
- tau2_data_dir: Destination directory. The clone writes to a temporary _tau2-bench-tmp subdirectory and then renames data/tau2/ into place, so interruptions do not leave a partially populated destination.
Returns: True on success or if data was already present, False if the clone or copy fails.
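The stage-then-rename pattern described here can be sketched without any network access by passing the clone step in as a callable. Everything below is illustrative: fetch_into_place is a hypothetical name, the final location dest_dir/tau2 is an assumption, and the clone callable is a stand-in that is expected to raise OSError on failure.

```python
import shutil
from pathlib import Path
from typing import Callable


def fetch_into_place(dest_dir: Path, clone: Callable[[Path], None]) -> bool:
    """Stage a checkout in a temp subdirectory, then rename its data into place."""
    target = dest_dir / "tau2"  # assumed final location of the data
    if target.is_dir() and any(target.iterdir()):
        return True  # data already present, skip the clone
    dest_dir.mkdir(parents=True, exist_ok=True)
    tmp = dest_dir / "_tau2-bench-tmp"
    shutil.rmtree(tmp, ignore_errors=True)  # clear leftovers from an interrupted run
    try:
        clone(tmp)  # stand-in for the real clone step
        if target.exists():
            target.rmdir()  # only an empty directory can reach this point
        (tmp / "data" / "tau2").rename(target)
        return True
    except OSError as exc:
        print(f"fetch failed: {exc}")
        return False
    finally:
        shutil.rmtree(tmp, ignore_errors=True)  # always remove the staging area
```

Because the target only appears through a single rename, a failure during the clone leaves no half-filled data directory behind.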
fetch_tau2_data is called automatically by check_tau2_data, which is called by prepare.py for tau-bench experiments. You do not need to call it directly in normal usage.
When to re-run prepare.py
| Situation | Action |
|---|---|
| Starting a fresh experiment | Run prepare.py once |
| Switching to a different benchmark | Delete workspace/ and run prepare.py again |
| Resetting after a corrupted workspace | Delete workspace/ and run prepare.py again |
| Continuing an in-progress experiment | Do not re-run — existing results and workspace files are preserved |
| tau2 data directory is missing or corrupted | Delete tau2_data/ and re-run — prepare.py will re-clone it |