auto-harness ships with first-class support for three benchmarks that cover distinct problem classes: structured tool-call agents for customer service, bash-command agents for real-world terminal tasks, and multi-turn SQL agents backed by a live Postgres database. Each benchmark exposes per-task rewards and a train/test split, which are the two properties the optimization loop depends on.
Supported benchmarks
| Benchmark | Domain | Tasks | Agent interface |
|---|---|---|---|
| tau-bench | Customer service (retail, airline, telecom) | retail: 114, airline: 50, telecom: 114 | Structured tool calls via tau2 |
| Terminal-Bench 2.0 | Terminal tasks (coding, sysadmin, security) | 89 | Bash commands via Harbor containers |
| BIRD-Interact | Interactive text-to-SQL (multi-turn CRUD over Postgres) | lite: 300, full: 600 | Google ADK agent against a 3-service environment |
Terminal-Bench 2.0
89 real-world terminal tasks across coding, sysadmin, and security. The agent executes bash commands in Harbor containers.
tau-bench
Customer service simulation across retail, airline, and telecom domains using structured tool calls via the tau2 API.
BIRD-Interact
Interactive text-to-SQL benchmark with multi-turn CRUD over Postgres. Runs a 3-service ADK environment per run.
Plug in your own
Subclass BenchmarkRunner and implement run() to add any benchmark that returns per-task rewards.
What makes a good benchmark for auto-harness
Not every benchmark is a good fit for automated optimization. auto-harness is designed around benchmarks that have two properties:
- Per-task rewards. The benchmark must return a scalar reward (0.0–1.0) for each task independently. This lets the harness calculate a val_score, identify exactly which tasks fail, and measure whether a change actually helped.
- A stable train/test split. The optimization loop trains on the train split and gates every proposed change against the test split. Without this separation, the coding agent could overfit to known tasks and the gating step would have no signal.
BenchmarkRunner: the common abstraction
All three benchmarks are implemented as subclasses of BenchmarkRunner in benchmark.py. The abstract base class has two methods, run() and val_score().
run() accepts an optional list of task IDs; passing None runs all tasks in the configured split. val_score() computes the mean reward and treats timed-out tasks (None) as failures. Both gating.py and the benchmark.py CLI use this interface directly, so the rest of the loop never needs to know which benchmark is active.
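The interface described above can be pictured with a minimal sketch. This is not the actual benchmark.py source: the return type of run(), the results argument to val_score(), and the ConstantRunner toy subclass are all assumptions made for illustration.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Optional


class BenchmarkRunner(ABC):
    """Illustrative sketch of the abstraction; signatures are assumed."""

    @abstractmethod
    def run(self, task_ids: Optional[list[str]] = None) -> dict[str, Optional[float]]:
        """Run the given tasks, or every task in the configured split when None.

        Returns a mapping of task ID to reward in [0.0, 1.0]; None marks a timeout.
        """

    def val_score(self, results: dict[str, Optional[float]]) -> float:
        """Mean reward, counting timed-out tasks (None) as 0.0."""
        if not results:
            return 0.0
        return sum(r or 0.0 for r in results.values()) / len(results)


class ConstantRunner(BenchmarkRunner):
    """Hypothetical toy benchmark used only to exercise the interface."""

    REWARDS = {"task-a": 1.0, "task-b": 0.0, "task-c": None}  # task-c timed out

    def run(self, task_ids=None):
        ids = list(self.REWARDS) if task_ids is None else task_ids
        return {tid: self.REWARDS[tid] for tid in ids}
```

Because callers only touch run() and val_score(), a gating step written against this shape would not need to know which benchmark produced the rewards.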
How the train/test split is generated
When you run python prepare.py for the first time on a fresh workspace, it executes the benchmark over all tasks with no split filter. After that baseline run, it generates a 70/30 train/test split using a stratified shuffle.
A fixed seed (42) makes splits reproducible: deleting the split file and re-running prepare.py with the same baseline results produces the same split.
Tasks that time out during the baseline run are excluded from the split entirely. Including them would permanently drag down val_score with infrastructure noise rather than agent-quality signal.
tau-bench takes its split from a named split in its run configuration (task_split_name in TextRunConfig) rather than a local JSON file. Terminal-Bench stores its split at tbench_data/task_split.json and BIRD-Interact stores its split at bird_data/task_split.json.
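The split procedure described in this section can be sketched roughly as follows. The text does not say what the shuffle is stratified on, so this sketch assumes baseline pass/fail outcome as the stratum; the function name stratified_split and the returned dictionary layout are also hypothetical, not the format prepare.py writes.

```python
from __future__ import annotations

import random
from collections import defaultdict
from typing import Optional


def stratified_split(
    baseline: dict[str, Optional[float]],
    train_fraction: float = 0.7,
    seed: int = 42,
) -> dict[str, list[str]]:
    # Drop tasks that timed out (reward is None) during the baseline run.
    completed = {tid: r for tid, r in baseline.items() if r is not None}

    # Assumption: stratify by baseline outcome (full reward vs. anything less).
    strata: dict[str, list[str]] = defaultdict(list)
    for tid, reward in completed.items():
        strata["pass" if reward >= 1.0 else "fail"].append(tid)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train: list[str] = []
    test: list[str] = []
    for key in sorted(strata):
        ids = sorted(strata[key])  # sort first so input order does not matter
        rng.shuffle(ids)
        cut = round(len(ids) * train_fraction)
        train.extend(ids[:cut])
        test.extend(ids[cut:])
    return {"train": sorted(train), "test": sorted(test)}
```

Splitting each stratum separately keeps the share of baseline failures roughly equal in train and test, so a score change on the test split is less likely to be an artifact of how the hard tasks happened to be distributed.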
Anti-cheating by design
The optimization loop enforces a strict information boundary: train traces are copied to workspace/traces/latest/ and workspace/traces/baseline/ after each run; test traces are never saved to disk. This is controlled by the HARNESS_SAVE_TRACE environment variable.
The coding agent reads traces only from workspace/traces/latest/. It has no path to test task traces, so gating on the test split is a genuine held-out evaluation.
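One way to picture an environment-variable gate on trace saving is the sketch below. Only the variable name HARNESS_SAVE_TRACE comes from the text; the accepted value ("1"), the helper name maybe_save_traces, and the one-JSON-file-per-task layout are assumptions for illustration and may differ from what auto-harness actually does.

```python
from __future__ import annotations

import json
import os
from pathlib import Path


def maybe_save_traces(traces: dict[str, list], dest: Path) -> bool:
    """Write one JSON file per task only when trace saving is enabled.

    Returns True if traces were written, False if saving was skipped.
    """
    # Assumption: "1" means enabled; the real accepted values are not documented here.
    if os.environ.get("HARNESS_SAVE_TRACE") != "1":
        return False
    dest.mkdir(parents=True, exist_ok=True)
    for task_id, trace in traces.items():
        (dest / f"{task_id}.json").write_text(json.dumps(trace))
    return True
```

Under this shape, a test-split run would simply execute with the variable unset, so no test trace ever reaches a directory the coding agent can read.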