prepare.py is the one-time setup script that must be run before starting an auto-harness optimization experiment. It validates the environment, creates the workspace directory structure, copies the correct agent scaffold into agent/agent.py, composes PROGRAM.md, and runs a baseline benchmark to establish iteration 0. Re-running it on an existing workspace leaves existing workspace files and a recorded baseline untouched, but it rewrites agent/agent.py and PROGRAM.md from their templates.

CLI usage

python prepare.py
prepare.py takes no command-line arguments. All configuration is read from experiment_config.yaml in the current working directory.
Run prepare.py only once per experiment. Running it again after iterations have been recorded will skip the workspace initialization and baseline run (because the files already exist), but it will overwrite agent/agent.py and PROGRAM.md with fresh templates. Only re-run it if you are intentionally resetting the experiment.

Execution order

prepare.py runs the following steps in sequence:
1. Load configuration
   Reads experiment_config.yaml. Exits with an error if the file is missing.

2. Check environment
   Calls the benchmark-specific environment check function (check_env_tau_bench, check_env_terminal_bench, or check_env_bird_interact). For tau-bench, also validates that the tau2 data directory is populated via check_tau2_data.

3. Initialize workspace
   Calls init_workspace(cfg) to create workspace/suite.json, workspace/learnings.md, workspace/results.tsv, and workspace/train_results.json.

4. Copy agent template
   Calls copy_agent_template(benchmark) to install the benchmark-specific starting scaffold at agent/agent.py.

5. Compose PROGRAM.md
   Calls copy_program_template(benchmark) to write PROGRAM.md from program_templates/base.md plus the benchmark supplement.

6. Run baseline benchmark
   Calls run_baseline(cfg) to execute the full benchmark and record iteration 0 in workspace/results.tsv.

Environment check functions

These functions validate that the required API keys and tools are present before any benchmark infrastructure is started. Each exits gracefully with an error message rather than raising an exception.

check_env_tau_bench

def check_env_tau_bench(cfg: dict) -> bool
Checks that the required LLM API key is set based on agent_model in cfg:
  • Models starting with "gemini" → GEMINI_API_KEY
  • Models starting with "claude" → ANTHROPIC_API_KEY
  • All other models → OPENAI_API_KEY
Returns: True if all required environment variables are present, False otherwise.
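
The prefix rule above can be sketched as a small lookup. This is an illustration, not the code in prepare.py: the helper names required_llm_key and llm_key_present are hypothetical, and case-sensitive prefix matching is an assumption.

```python
def required_llm_key(agent_model: str) -> str:
    """Return the name of the environment variable a model needs.

    Hypothetical helper: mirrors the documented prefix rule and assumes
    the match is case-sensitive.
    """
    if agent_model.startswith("gemini"):
        return "GEMINI_API_KEY"
    if agent_model.startswith("claude"):
        return "ANTHROPIC_API_KEY"
    return "OPENAI_API_KEY"


def llm_key_present(cfg: dict, env: dict) -> bool:
    """True if the key required by cfg["agent_model"] is set and non-empty."""
    return bool(env.get(required_llm_key(cfg["agent_model"])))
```

Passing the environment in as a plain dict (for example os.environ) keeps the check easy to exercise without touching the real process environment.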

check_env_terminal_bench

def check_env_terminal_bench(cfg: dict) -> bool
Checks the LLM API key (same logic as check_env_tau_bench) and the sandbox provider key:
  • env_provider="e2b" → E2B_API_KEY
  • env_provider="daytona" → DAYTONA_API_KEY
  • env_provider="docker" → no key required
Also verifies that the harbor CLI is on PATH. Returns: True if all requirements are met, False otherwise.
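
A minimal sketch of the sandbox-provider and harbor checks follows. The function name missing_sandbox_requirements is hypothetical, as is returning a list of problems instead of a bool, and treating an unrecognized env_provider as an error is an assumption. The only library call it relies on is shutil.which, which returns None when an executable is not on PATH.

```python
import shutil

# Provider -> required environment variable (None means no key is needed).
PROVIDER_KEYS = {"e2b": "E2B_API_KEY", "daytona": "DAYTONA_API_KEY", "docker": None}


def missing_sandbox_requirements(cfg: dict, env: dict, which=shutil.which) -> list[str]:
    """Return human-readable problems; an empty list means the checks pass."""
    problems = []
    provider = cfg.get("env_provider")
    if provider not in PROVIDER_KEYS:
        # Assumption: unknown providers are rejected rather than ignored.
        problems.append(f"unknown env_provider: {provider!r}")
    else:
        key = PROVIDER_KEYS[provider]
        if key is not None and not env.get(key):
            problems.append(f"{key} is not set")
    if which("harbor") is None:
        problems.append("harbor CLI not found on PATH")
    return problems
```

The which parameter exists only so the PATH lookup can be replaced in tests; in normal use the default shutil.which is enough.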

init_workspace

def init_workspace(cfg: dict) -> None
Create the workspace/ directory and initialize files if they do not already exist. All operations are idempotent.
cfg (dict, required): Parsed experiment_config.yaml dict. Used to read the threshold value for suite.json.
Files created (only if missing):

| File | Initial content |
| --- | --- |
| workspace/suite.json | {"tasks": [], "threshold": <cfg threshold or 0.8>, "last_results": {}} |
| workspace/learnings.md | # Learnings\n\n |
| workspace/results.tsv | TSV header: iteration\tval_score\tcommit\tevals_passed\tevals_total\ttimestamp |
| workspace/train_results.json | {"split": null, "timestamp": null, "results": {}} |
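
The create-only-if-missing behavior can be sketched as below. This is an illustration under stated assumptions, not the actual implementation: the function name init_workspace_sketch and its root parameter are hypothetical, the config key is assumed to be named threshold, and the JSON indentation and the trailing newline after the TSV header are guesses.

```python
import json
from pathlib import Path

RESULTS_HEADER = "iteration\tval_score\tcommit\tevals_passed\tevals_total\ttimestamp\n"


def init_workspace_sketch(cfg: dict, root: Path = Path("workspace")) -> None:
    """Create the workspace files, leaving any existing file untouched."""
    root.mkdir(parents=True, exist_ok=True)
    defaults = {
        "suite.json": json.dumps(
            # Assumption: the config key is "threshold"; 0.8 is the fallback.
            {"tasks": [], "threshold": cfg.get("threshold", 0.8), "last_results": {}},
            indent=2,
        ),
        "learnings.md": "# Learnings\n\n",
        "results.tsv": RESULTS_HEADER,
        "train_results.json": json.dumps(
            {"split": None, "timestamp": None, "results": {}}, indent=2
        ),
    }
    for name, content in defaults.items():
        path = root / name
        if not path.exists():  # idempotent: never overwrite existing files
            path.write_text(content)
```

Because every write is guarded by an existence check, calling the function a second time is a no-op for files that already hold experiment data.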

copy_agent_template

def copy_agent_template(benchmark: str) -> None
Copy the benchmark-specific agent scaffold into agent/agent.py.
benchmark (str, required): Benchmark identifier from experiment_config.yaml. Must be one of "tau-bench", "terminal-bench", or "bird-interact".
Template source files:

| benchmark | Source |
| --- | --- |
| "tau-bench" | agent/templates/tau_bench.py |
| "terminal-bench" | agent/templates/terminal_bench.py |
| "bird-interact" | agent/templates/bird_interact.py |
Exits with an error if the template file does not exist.
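
One way to picture the copy step is a lookup followed by a file copy. The sketch below is an assumption-laden illustration: copy_agent_template_sketch and its repo_root parameter are hypothetical, and exiting via sys.exit on an unknown benchmark is a guess (the source only states that a missing template file causes an exit).

```python
import shutil
import sys
from pathlib import Path

AGENT_TEMPLATES = {
    "tau-bench": "agent/templates/tau_bench.py",
    "terminal-bench": "agent/templates/terminal_bench.py",
    "bird-interact": "agent/templates/bird_interact.py",
}


def copy_agent_template_sketch(benchmark: str, repo_root: Path = Path(".")) -> Path:
    """Copy the scaffold for `benchmark` to agent/agent.py and return its path."""
    if benchmark not in AGENT_TEMPLATES:
        sys.exit(f"Unknown benchmark: {benchmark}")  # assumption, see lead-in
    src = repo_root / AGENT_TEMPLATES[benchmark]
    if not src.exists():
        sys.exit(f"Agent template not found: {src}")
    dest = repo_root / "agent" / "agent.py"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dest)  # overwrites any existing agent.py
    return dest
```

Note that the copy is unconditional, which matches the warning that re-running prepare.py replaces agent/agent.py with a fresh template.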

copy_program_template

def copy_program_template(benchmark: str) -> None
Compose PROGRAM.md by concatenating program_templates/base.md with the benchmark-specific supplement.
benchmark (str, required): Benchmark identifier. Must be one of "tau-bench", "terminal-bench", or "bird-interact".
Supplement source files:

| benchmark | Source |
| --- | --- |
| "tau-bench" | program_templates/tau_bench.md |
| "terminal-bench" | program_templates/terminal_bench.md |
| "bird-interact" | program_templates/bird_interact.md |
The output PROGRAM.md contains the full base content followed by the benchmark-specific section. PROGRAM.md is listed in ALLOWED_AGENT_FILES and can be modified by the agent during iterations.
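
The composition amounts to base text followed by the supplement. A minimal sketch, assuming the two parts are joined with a single blank line (the source does not specify the separator); compose_program_md and repo_root are hypothetical names.

```python
from pathlib import Path

PROGRAM_SUPPLEMENTS = {
    "tau-bench": "program_templates/tau_bench.md",
    "terminal-bench": "program_templates/terminal_bench.md",
    "bird-interact": "program_templates/bird_interact.md",
}


def compose_program_md(benchmark: str, repo_root: Path = Path(".")) -> str:
    """Write PROGRAM.md as base.md followed by the benchmark supplement."""
    base = (repo_root / "program_templates" / "base.md").read_text()
    supplement = (repo_root / PROGRAM_SUPPLEMENTS[benchmark]).read_text()
    # Assumption: exactly one blank line separates the two parts.
    text = base.rstrip("\n") + "\n\n" + supplement
    (repo_root / "PROGRAM.md").write_text(text)
    return text
```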

run_baseline

def run_baseline(cfg: dict) -> None
Run the baseline benchmark and record iteration 0 in workspace/results.tsv. Skips execution entirely if results.tsv already contains data rows (i.e., a baseline has already been recorded).
cfg (dict, required): Parsed experiment_config.yaml dict.
Behavior varies by benchmark:
  • tau-bench: Runs the test split directly and records the result.
  • terminal-bench: If tbench_data/task_split.json does not yet exist, runs all tasks first to generate the split (via generate_terminal_bench_split), then records the test-split score. If the split already exists, runs only the test split.
  • bird-interact: Same logic as terminal-bench but using bird_data/task_split.json and generate_bird_interact_split.
The baseline row written to results.tsv uses "baseline" as the commit identifier and 0 for both evals_passed and evals_total.
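
The skip rule and the shape of the baseline row can be sketched as follows. The helper names has_data_rows and record_baseline are hypothetical, and the ISO 8601 UTC timestamp format is an assumption; the column order follows the results.tsv header.

```python
from datetime import datetime, timezone
from pathlib import Path


def has_data_rows(results_tsv: Path) -> bool:
    """True if the file has at least one non-empty line after the header."""
    lines = [ln for ln in results_tsv.read_text().splitlines() if ln.strip()]
    return len(lines) > 1


def record_baseline(results_tsv: Path, val_score: float) -> bool:
    """Append the iteration-0 row unless a baseline is already recorded."""
    if has_data_rows(results_tsv):
        return False  # a baseline (or later iteration) already exists
    # Columns: iteration, val_score, commit, evals_passed, evals_total, timestamp
    timestamp = datetime.now(timezone.utc).isoformat()  # format is an assumption
    row = [0, val_score, "baseline", 0, 0, timestamp]
    with results_tsv.open("a") as fh:
        fh.write("\t".join(str(v) for v in row) + "\n")
    return True
```

The boolean return value is only a convenience for callers that want to know whether a row was written.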

generate_terminal_bench_split

def generate_terminal_bench_split(results: dict[str, float], seed: int = 42) -> None
Generate a stratified 70/30 train/test split from baseline results and write it to tbench_data/task_split.json. Tasks are split separately within the passing (reward >= 0.5) and failing (reward < 0.5) groups so that both splits have roughly the same pass/fail ratio. Tasks that timed out (reward None) are excluded by run_baseline before this function is called.
results (dict[str, float], required): Mapping of task ID to reward from the all-tasks baseline run. None values should be excluded before passing.
seed (int, default: 42): Random seed for the shuffle, ensuring a reproducible split.
Output written to tbench_data/task_split.json:
{
  "train": ["task-a", "task-b", "..."],
  "test":  ["task-c", "task-d", "..."],
  "metadata": {
    "created": "2024-11-05T14:00:00+00:00",
    "total_tasks": 120,
    "seed": 42
  }
}
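
A stratified split of this kind can be sketched in a few lines. This is not the function in prepare.py: stratified_split is a hypothetical name that returns the dict instead of writing a file, and the rounding of the 70% cut, the sorting before the shuffle, and the timestamp format are all assumptions.

```python
import random
from datetime import datetime, timezone


def stratified_split(results: dict[str, float], seed: int = 42,
                     train_frac: float = 0.7) -> dict:
    """Split task IDs 70/30 within the passing and failing groups separately."""
    rng = random.Random(seed)
    # Sort first so the shuffle does not depend on dict insertion order.
    passing = sorted(t for t, r in results.items() if r >= 0.5)
    failing = sorted(t for t, r in results.items() if r < 0.5)
    train, test = [], []
    for group in (passing, failing):
        rng.shuffle(group)
        cut = round(len(group) * train_frac)  # rounding rule is an assumption
        train.extend(group[:cut])
        test.extend(group[cut:])
    return {
        "train": sorted(train),
        "test": sorted(test),
        "metadata": {
            "created": datetime.now(timezone.utc).isoformat(),
            "total_tasks": len(results),
            "seed": seed,
        },
    }
```

Splitting each group on its own is what keeps the pass/fail ratio roughly equal between train and test, and a seeded random.Random instance makes the result reproducible without touching the global random state.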

generate_bird_interact_split

def generate_bird_interact_split(results: dict[str, float], seed: int = 42) -> None
Identical logic to generate_terminal_bench_split but writes output to bird_data/task_split.json.
results (dict[str, float], required): Mapping of instance ID to reward from the all-tasks BIRD-Interact baseline run.
seed (int, default: 42): Random seed for the shuffle.

fetch_tau2_data

def fetch_tau2_data(tau2_data_dir: str) -> bool
Clone the tau2-bench repository and copy the data/tau2/ subdirectory into tau2_data_dir. Skips the clone if the data is already present.
tau2_data_dir (str, required): Destination directory. The clone writes to a temporary _tau2-bench-tmp subdirectory and then renames data/tau2/ into place, so interruptions do not leave a partially-populated destination.
Returns: True on success or if data was already present, False if the clone or copy fails.
fetch_tau2_data is called automatically by check_tau2_data, which is called by prepare.py for tau-bench experiments. You do not need to call it directly in normal usage.

When to re-run prepare.py

| Situation | Action |
| --- | --- |
| Starting a fresh experiment | Run prepare.py once |
| Switching to a different benchmark | Delete workspace/ and run prepare.py again |
| Resetting after a corrupted workspace | Delete workspace/ and run prepare.py again |
| Continuing an in-progress experiment | Do not re-run; existing results and workspace files are preserved |
| tau2 data directory is missing or corrupted | Delete tau2_data/ and re-run; prepare.py will re-clone it |
