prepare.py CLI Reference — auto-harness Experiment Setup

prepare.py is the one-time setup script that must be run before starting an auto-harness optimization experiment. It validates the environment, creates the workspace directory structure, copies the correct agent scaffold into agent/agent.py, composes PROGRAM.md, and runs a baseline benchmark to establish iteration 0. Re-running it on an existing workspace is only partly idempotent: workspace initialization and the baseline run skip work that has already been done, but agent/agent.py and PROGRAM.md are overwritten with fresh templates.

CLI usage

python prepare.py

prepare.py takes no command-line arguments. All configuration is read from experiment_config.yaml in the current working directory.

Run prepare.py only once per experiment. Running it again after iterations have been recorded skips workspace initialization and the baseline run (because the files already exist), but it overwrites agent/agent.py and PROGRAM.md with fresh templates, discarding any changes made to them during iterations. Re-run it only if you are intentionally resetting the experiment.

Execution order

prepare.py runs the following steps in sequence:

Load configuration

Reads experiment_config.yaml. Exits with an error if the file is missing.

Check environment

Calls the benchmark-specific environment check function (check_env_tau_bench, check_env_terminal_bench, or check_env_bird_interact). For tau-bench, also validates that the tau2 data directory is populated via check_tau2_data.

Initialize workspace

Calls init_workspace(cfg) to create workspace/suite.json, workspace/learnings.md, workspace/results.tsv, and workspace/train_results.json.

Copy agent template

Calls copy_agent_template(benchmark) to install the benchmark-specific starting scaffold at agent/agent.py.

Compose PROGRAM.md

Calls copy_program_template(benchmark) to write PROGRAM.md from program_templates/base.md plus the benchmark supplement.

Run baseline benchmark

Calls run_baseline(cfg) to execute the full benchmark and record iteration 0 in workspace/results.tsv.

Environment check functions

These functions validate that the required API keys and tools are present before any benchmark infrastructure is started. Each one reports a missing requirement with an error message and returns False instead of raising an exception.

`check_env_tau_bench`

def check_env_tau_bench(cfg: dict) -> bool

Checks that the required LLM API key is set based on agent_model in cfg:

Models starting with "gemini" → GEMINI_API_KEY
Models starting with "claude" → ANTHROPIC_API_KEY
All other models → OPENAI_API_KEY

Returns: True if all required environment variables are present, False otherwise.
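The prefix rules above can be sketched as a small lookup. This is a minimal illustration, not the script's actual code: the helper names required_llm_key and llm_key_present are hypothetical, and it assumes that prefix matching is case-sensitive and that an empty variable counts as missing.

```python
import os


def required_llm_key(agent_model: str) -> str:
    """Return the environment variable name that the model family needs."""
    if agent_model.startswith("gemini"):
        return "GEMINI_API_KEY"
    if agent_model.startswith("claude"):
        return "ANTHROPIC_API_KEY"
    return "OPENAI_API_KEY"  # every other model name falls through to OpenAI


def llm_key_present(cfg: dict, env=None) -> bool:
    """True when the key required by cfg["agent_model"] is set and non-empty."""
    env = os.environ if env is None else env  # injectable mapping for testing
    return bool(env.get(required_llm_key(cfg["agent_model"])))
```

Passing the environment as a parameter keeps the check easy to exercise without touching the real process environment.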

`check_env_terminal_bench`

def check_env_terminal_bench(cfg: dict) -> bool

Checks the LLM API key (same logic as check_env_tau_bench) and the sandbox provider key:

env_provider="e2b" → E2B_API_KEY
env_provider="daytona" → DAYTONA_API_KEY
env_provider="docker" → no key required

Also verifies that the harbor CLI is on PATH.

Returns: True if all requirements are met, False otherwise.
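As a rough sketch of the sandbox-provider and harbor checks, the function below collects whatever is missing. The name missing_terminal_bench_requirements is hypothetical, the LLM key check is left out for brevity, and the handling of an unrecognized env_provider is an assumption rather than documented behavior.

```python
import shutil

# Sandbox provider -> required environment variable (None means no key needed).
PROVIDER_KEYS = {
    "e2b": "E2B_API_KEY",
    "daytona": "DAYTONA_API_KEY",
    "docker": None,
}


def missing_terminal_bench_requirements(
    cfg: dict, env: dict, which=shutil.which
) -> list[str]:
    """Return a list of missing requirements; an empty list means all are met."""
    missing = []
    provider = cfg.get("env_provider")
    if provider not in PROVIDER_KEYS:
        # Assumption: an unknown provider is treated as a failed check.
        missing.append(f"unknown env_provider: {provider!r}")
    else:
        key = PROVIDER_KEYS[provider]
        if key is not None and not env.get(key):
            missing.append(key)
    if which("harbor") is None:  # shutil.which returns None when not on PATH
        missing.append("harbor CLI on PATH")
    return missing
```

A caller could print each entry as an error message and return False when the list is non-empty.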

`init_workspace`

def init_workspace(cfg: dict) -> None

Create the workspace/ directory and initialize files if they do not already exist. All operations are idempotent.

cfg (dict, required): Parsed experiment_config.yaml dict. Used to read the threshold value for suite.json.

Files created (only if missing):

File	Initial content
`workspace/suite.json`	`{"tasks": [], "threshold": <cfg threshold or 0.8>, "last_results": {}}`
`workspace/learnings.md`	`# Learnings\n\n`
`workspace/results.tsv`	TSV header: `iteration\tval_score\tcommit\tevals_passed\tevals_total\ttimestamp`
`workspace/train_results.json`	`{"split": null, "timestamp": null, "results": {}}`
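The create-only-if-missing behavior can be sketched as follows. This is an illustration under assumptions: the function name init_workspace_sketch and its root parameter are hypothetical, the config key is assumed to be "threshold", and the TSV header is assumed to end with a newline.

```python
import json
from pathlib import Path

RESULTS_HEADER = "iteration\tval_score\tcommit\tevals_passed\tevals_total\ttimestamp\n"


def init_workspace_sketch(cfg: dict, root: Path) -> list[str]:
    """Create the four workspace files under root/workspace; return names created."""
    ws = root / "workspace"
    ws.mkdir(parents=True, exist_ok=True)
    defaults = {
        "suite.json": json.dumps(
            {"tasks": [], "threshold": cfg.get("threshold", 0.8), "last_results": {}}
        ),
        "learnings.md": "# Learnings\n\n",
        "results.tsv": RESULTS_HEADER,
        "train_results.json": json.dumps(
            {"split": None, "timestamp": None, "results": {}}
        ),
    }
    created = []
    for name, content in defaults.items():
        path = ws / name
        if not path.exists():  # never overwrite an existing file
            path.write_text(content, encoding="utf-8")
            created.append(name)
    return created
```

Calling it a second time on the same root returns an empty list and leaves existing content untouched, which is the idempotence the table describes.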

`copy_agent_template`

def copy_agent_template(benchmark: str) -> None

Copy the benchmark-specific agent scaffold into agent/agent.py.

benchmark (str, required): Benchmark identifier from experiment_config.yaml. Must be one of "tau-bench", "terminal-bench", or "bird-interact".

Template source files:

`benchmark`	Source
`"tau-bench"`	`agent/templates/tau_bench.py`
`"terminal-bench"`	`agent/templates/terminal_bench.py`
`"bird-interact"`	`agent/templates/bird_interact.py`

Exits with an error if the template file does not exist.
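A minimal sketch of the template lookup and copy, using the paths from the table above. The name copy_agent_template_sketch and the root parameter are hypothetical, and exiting on an unknown benchmark identifier is an assumption; only the exit on a missing template file is stated in this reference.

```python
import shutil
import sys
from pathlib import Path

AGENT_TEMPLATES = {
    "tau-bench": "agent/templates/tau_bench.py",
    "terminal-bench": "agent/templates/terminal_bench.py",
    "bird-interact": "agent/templates/bird_interact.py",
}


def copy_agent_template_sketch(benchmark: str, root: Path = Path(".")) -> Path:
    """Copy the scaffold for `benchmark` to root/agent/agent.py and return that path."""
    if benchmark not in AGENT_TEMPLATES:
        sys.exit(f"Unknown benchmark: {benchmark!r}")  # assumption, see lead-in
    source = root / AGENT_TEMPLATES[benchmark]
    if not source.is_file():
        sys.exit(f"Agent template not found: {source}")
    target = root / "agent" / "agent.py"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)  # overwrites any existing agent/agent.py
    return target
```

Note that shutil.copyfile replaces the destination unconditionally, which matches the warning that re-running prepare.py overwrites agent/agent.py.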

`copy_program_template`

def copy_program_template(benchmark: str) -> None

Compose PROGRAM.md by concatenating program_templates/base.md with the benchmark-specific supplement.

benchmark (str, required): Benchmark identifier. Must be one of "tau-bench", "terminal-bench", or "bird-interact".

Supplement source files:

`benchmark`	Source
`"tau-bench"`	`program_templates/tau_bench.md`
`"terminal-bench"`	`program_templates/terminal_bench.md`
`"bird-interact"`	`program_templates/bird_interact.md`

The output PROGRAM.md contains the full base content followed by the benchmark-specific section. PROGRAM.md is listed in ALLOWED_AGENT_FILES and can be modified by the agent during iterations.
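The concatenation could look roughly like this. The function name compose_program_md and the root parameter are hypothetical, and the blank line placed between the base and the supplement is an assumption; the reference only says that the base content comes first.

```python
from pathlib import Path

SUPPLEMENTS = {
    "tau-bench": "program_templates/tau_bench.md",
    "terminal-bench": "program_templates/terminal_bench.md",
    "bird-interact": "program_templates/bird_interact.md",
}


def compose_program_md(benchmark: str, root: Path) -> str:
    """Write root/PROGRAM.md as base.md followed by the benchmark supplement."""
    base = (root / "program_templates" / "base.md").read_text(encoding="utf-8")
    supplement = (root / SUPPLEMENTS[benchmark]).read_text(encoding="utf-8")
    # Assumption: exactly one blank line separates the two parts.
    text = base.rstrip("\n") + "\n\n" + supplement
    (root / "PROGRAM.md").write_text(text, encoding="utf-8")
    return text
```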

`run_baseline`

def run_baseline(cfg: dict) -> None

Run the baseline benchmark and record iteration 0 in workspace/results.tsv. Skips execution entirely if results.tsv already contains data rows (i.e., a baseline has already been recorded).

cfg (dict, required): Parsed experiment_config.yaml dict.

Behavior varies by benchmark:

tau-bench: Runs the test split directly and records the result.
terminal-bench: If tbench_data/task_split.json does not yet exist, runs all tasks first to generate the split (via generate_terminal_bench_split), then records the test-split score. If the split already exists, runs only the test split.
bird-interact: Same logic as terminal-bench but using bird_data/task_split.json and generate_bird_interact_split.

The baseline row written to results.tsv uses "baseline" as the commit identifier and 0 for both evals_passed and evals_total.
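The skip check and the baseline row can be sketched as below. The helper names has_data_rows and record_baseline are hypothetical, the sketch assumes init_workspace has already written a header line ending in a newline, and the ISO 8601 UTC timestamp format is an assumption.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

RESULTS_HEADER = "iteration\tval_score\tcommit\tevals_passed\tevals_total\ttimestamp\n"


def has_data_rows(results_tsv: Path) -> bool:
    """True when the file holds at least one non-empty line after the header."""
    if not results_tsv.exists():
        return False
    text = results_tsv.read_text(encoding="utf-8")
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return len(lines) > 1  # the first non-empty line is the header


def record_baseline(results_tsv: Path, val_score: float) -> bool:
    """Append the iteration 0 row unless one exists; return True if a row was written."""
    if has_data_rows(results_tsv):
        return False  # a baseline is already recorded, so skip
    timestamp = datetime.now(timezone.utc).isoformat()
    with results_tsv.open("a", encoding="utf-8", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        # Columns: iteration, val_score, commit, evals_passed, evals_total, timestamp
        writer.writerow([0, val_score, "baseline", 0, 0, timestamp])
    return True
```

A second call with a different score returns False and leaves the first baseline row in place.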

`generate_terminal_bench_split`

def generate_terminal_bench_split(results: dict[str, float], seed: int = 42) -> None

Generate a stratified 70/30 train/test split from baseline results and write it to tbench_data/task_split.json. Tasks are split separately within the passing (reward >= 0.5) and failing (reward < 0.5) groups so that both splits have roughly the same pass/fail ratio. Tasks that timed out (reward None) are excluded by run_baseline before this function is called.

results (dict[str, float], required): Mapping of task ID to reward from the all-tasks baseline run. None values should be excluded before the mapping is passed in.

seed (int, default: 42): Random seed for the shuffle, ensuring a reproducible split.

Output written to tbench_data/task_split.json:

{
  "train": ["task-a", "task-b", "..."],
  "test":  ["task-c", "task-d", "..."],
  "metadata": {
    "created": "2024-11-05T14:00:00+00:00",
    "total_tasks": 120,
    "seed": 42
  }
}
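A stratified 70/30 split of this kind can be sketched as follows. This is an illustration only: the function name stratified_split and the train_frac parameter are hypothetical, the rounding rule and the sort before shuffling are assumptions, and writing the JSON file with its metadata block is omitted.

```python
import random


def stratified_split(
    results: dict[str, float], seed: int = 42, train_frac: float = 0.7
) -> dict[str, list[str]]:
    """Split task IDs into train/test, preserving the pass/fail ratio in each."""
    rng = random.Random(seed)  # local generator keeps the split reproducible
    # Sort first so the outcome does not depend on dict insertion order.
    passing = sorted(task for task, reward in results.items() if reward >= 0.5)
    failing = sorted(task for task, reward in results.items() if reward < 0.5)
    train, test = [], []
    for group in (passing, failing):
        rng.shuffle(group)
        cut = round(len(group) * train_frac)  # assumption: round to nearest
        train.extend(group[:cut])
        test.extend(group[cut:])
    return {"train": train, "test": test}
```

Because each group is cut separately, a baseline with ten passing and ten failing tasks yields seven of each in train and three of each in test.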

`generate_bird_interact_split`

def generate_bird_interact_split(results: dict[str, float], seed: int = 42) -> None

Identical logic to generate_terminal_bench_split but writes output to bird_data/task_split.json.

results (dict[str, float], required): Mapping of instance ID to reward from the all-tasks BIRD-Interact baseline run.

seed (int, default: 42): Random seed for the shuffle.

`fetch_tau2_data`

def fetch_tau2_data(tau2_data_dir: str) -> bool

Clone the tau2-bench repository and copy the data/tau2/ subdirectory into tau2_data_dir. Skips the clone if the data is already present.

tau2_data_dir (str, required): Destination directory. The clone writes to a temporary _tau2-bench-tmp subdirectory and then renames data/tau2/ into place, so interruptions do not leave a partially-populated destination.

Returns: True on success or if data was already present, False if the clone or copy fails.

fetch_tau2_data is called automatically by check_tau2_data, which is called by prepare.py for tau-bench experiments. You do not need to call it directly in normal usage.

When to re-run prepare.py

Situation	Action
Starting a fresh experiment	Run `prepare.py` once
Switching to a different benchmark	Delete `workspace/` and run `prepare.py` again
Resetting after a corrupted workspace	Delete `workspace/` and run `prepare.py` again
Continuing an in-progress experiment	Do not re-run. Recorded results and workspace files would be preserved, but `agent/agent.py` and `PROGRAM.md` would be overwritten
tau2 data directory is missing or corrupted	Delete `tau2_data/` and re-run — `prepare.py` will re-clone it
