Benchmarks overview: tau-bench, Terminal-Bench, BIRD

auto-harness ships with first-class support for three benchmarks that cover distinct problem classes: structured tool-call agents for customer service, bash-command agents for real-world terminal tasks, and multi-turn SQL agents backed by a live Postgres database. Each benchmark exposes per-task rewards and a train/test split, which are the two properties the optimization loop depends on.

Supported benchmarks

Benchmark	Domain	Tasks	Agent interface
tau-bench	Customer service (retail, airline, telecom)	retail: 114, airline: 50, telecom: 114	Structured tool calls via tau2
Terminal-Bench 2.0	Terminal tasks (coding, sysadmin, security)	89	Bash commands via Harbor containers
BIRD-Interact	Interactive text-to-SQL (multi-turn CRUD over Postgres)	lite: 300, full: 600	Google ADK agent against a 3-service environment

Terminal-Bench 2.0

89 real-world terminal tasks across coding, sysadmin, and security. The agent executes bash commands inside Harbor containers.

tau-bench

Customer service simulation across retail, airline, and telecom domains using structured tool calls via the tau2 API.

BIRD-Interact

Interactive text-to-SQL benchmark with multi-turn CRUD over Postgres. Each run starts a 3-service ADK environment.

Plug in your own

Subclass BenchmarkRunner and implement run() to add any benchmark that returns per-task rewards.

What makes a good benchmark for auto-harness

Not every benchmark is a good fit for automated optimization. auto-harness is designed around benchmarks that have two properties:

Per-task rewards. The benchmark must return a scalar reward (0.0–1.0) for each task independently. This lets the harness calculate a val_score, identify exactly which tasks fail, and measure whether a change actually helped.

A stable train/test split. The optimization loop trains on the train split and gates every proposed change against the test split. Without this separation, the coding agent could overfit to known tasks and the gating step would have no signal.
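To make the value of per-task rewards concrete, the sketch below shows how a `{task_id: reward}` mapping can be used to list failing tasks and to compare two runs task by task. The helper names `failing_tasks` and `compare_runs` are hypothetical and are not part of auto-harness; the 0.5 pass threshold is borrowed from the split logic in prepare.py and is assumed here to be a reasonable cut-off.

```python
from typing import Optional

Results = dict[str, Optional[float]]


def failing_tasks(results: Results, threshold: float = 0.5) -> list[str]:
    """Task IDs that timed out (None) or scored below the threshold."""
    return sorted(t for t, r in results.items() if r is None or r < threshold)


def compare_runs(before: Results, after: Results) -> dict[str, float]:
    """Per-task reward change for tasks present in both runs (None counts as 0.0)."""
    return {
        task_id: (after[task_id] or 0.0) - (before[task_id] or 0.0)
        for task_id in before.keys() & after.keys()
    }


# Illustrative data only, not real benchmark output.
before = {"task-a": 1.0, "task-b": 0.0, "task-c": None}
after = {"task-a": 1.0, "task-b": 1.0, "task-c": 0.0}

still_failing = failing_tasks(after)
deltas = compare_runs(before, after)
```

A single aggregate score would hide which task moved; the per-task mapping shows that only `task-b` improved.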

BenchmarkRunner: the common abstraction

All three benchmarks are implemented as subclasses of BenchmarkRunner in benchmark.py. The base class defines two methods: the abstract run(), which each benchmark implements, and the shared val_score():

from abc import ABC, abstractmethod

class BenchmarkRunner(ABC):
    @abstractmethod
    def run(self, task_ids: list[str] | None = None) -> dict[str, float | None]:
        """Return {task_id: reward}. None means the task timed out."""

    def val_score(self, results: dict[str, float | None]) -> float:
        """Mean reward. None values count as 0.0."""

run() accepts an optional list of task IDs. Passing None runs all tasks in the configured split. val_score() computes the mean reward and treats timed-out tasks (None) as failures. Both gating.py and benchmark.py’s CLI use this interface directly, so the rest of the loop never needs to know which benchmark is active.
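As an illustration of the interface, here is a minimal, self-contained sketch of a custom runner. The base class is re-declared locally as a stand-in so the example runs on its own, and the `val_score` body is one possible implementation consistent with the docstring, not necessarily the code in benchmark.py. `ToyRunner` and its hard-coded rewards are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Optional


class BenchmarkRunner(ABC):
    """Local stand-in for the base class described above."""

    @abstractmethod
    def run(self, task_ids: Optional[list[str]] = None) -> dict[str, Optional[float]]:
        """Return {task_id: reward}. None means the task timed out."""

    def val_score(self, results: dict[str, Optional[float]]) -> float:
        """Mean reward. None values count as 0.0."""
        if not results:
            return 0.0  # assumption: an empty result set scores 0.0
        return sum(r or 0.0 for r in results.values()) / len(results)


class ToyRunner(BenchmarkRunner):
    """Hypothetical benchmark with fixed rewards, for illustration only."""

    REWARDS: dict[str, Optional[float]] = {"t1": 1.0, "t2": 0.0, "t3": None}

    def run(self, task_ids: Optional[list[str]] = None) -> dict[str, Optional[float]]:
        # None means "run every task"; otherwise run only the requested IDs.
        ids = list(self.REWARDS) if task_ids is None else task_ids
        return {task_id: self.REWARDS[task_id] for task_id in ids}


runner = ToyRunner()
results = runner.run()             # all three tasks
score = runner.val_score(results)  # the timed-out task t3 counts as 0.0
subset = runner.run(["t1"])        # a single task
```

A real runner would replace the dictionary lookup with whatever launches the agent and scores each task, while keeping the same return shape.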

How the train/test split is generated

When you run python prepare.py for the first time on a fresh workspace, it executes the benchmark over all tasks with no split filter. After that baseline run, it generates a 70/30 train/test split using a stratified shuffle:

# From prepare.py — same logic for both TerminalBench and BirdInteract
passed = sorted(k for k, v in results.items() if v >= 0.5)
failed = sorted(k for k, v in results.items() if v < 0.5)

random.seed(42)
random.shuffle(passed)
random.shuffle(failed)

train_pass_n = int(len(passed) * 0.7)
train_fail_n = int(len(failed) * 0.7)
train = sorted(passed[:train_pass_n] + failed[:train_fail_n])
test  = sorted(passed[train_pass_n:] + failed[train_fail_n:])

The stratification ensures both splits have a representative mix of passing and failing tasks. Because int() truncates, each group's train share is rounded down, so a small group can land slightly below 70%. The fixed seed (42) makes splits reproducible: deleting the split file and re-running prepare.py with the same baseline results produces the same split.

Tasks that time out during the baseline run are excluded from the split entirely. Including them would permanently drag down val_score with infrastructure noise rather than agent-quality signal.
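The split logic and the timeout exclusion can be combined into one function, sketched below. This is a reconstruction under stated assumptions, not the code in prepare.py: the function name `stratified_split` is hypothetical, timed-out tasks are assumed to appear as None in the results, and a local `random.Random` instance is used instead of seeding the global generator.

```python
import random
from typing import Optional


def stratified_split(
    results: dict[str, Optional[float]],
    train_frac: float = 0.7,
    seed: int = 42,
) -> tuple[list[str], list[str]]:
    """Split task IDs into (train, test), stratified by pass/fail at 0.5."""
    # Drop timed-out tasks first so they appear in neither split.
    scored = {k: v for k, v in results.items() if v is not None}

    passed = sorted(k for k, v in scored.items() if v >= 0.5)
    failed = sorted(k for k, v in scored.items() if v < 0.5)

    rng = random.Random(seed)  # local RNG, leaves global random state untouched
    rng.shuffle(passed)
    rng.shuffle(failed)

    train_pass_n = int(len(passed) * train_frac)
    train_fail_n = int(len(failed) * train_frac)
    train = sorted(passed[:train_pass_n] + failed[:train_fail_n])
    test = sorted(passed[train_pass_n:] + failed[train_fail_n:])
    return train, test


# Illustrative baseline: 10 passing, 10 failing, 2 timed-out tasks.
baseline: dict[str, Optional[float]] = {f"pass-{i}": 1.0 for i in range(10)}
baseline.update({f"fail-{i}": 0.0 for i in range(10)})
baseline.update({"timeout-0": None, "timeout-1": None})

train, test = stratified_split(baseline)
```

Sorting before shuffling matters: it makes the result depend only on the set of task IDs and the seed, not on the order in which the benchmark happened to return results.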

tau-bench uses the split mechanism built into tau2 (task_split_name in TextRunConfig) rather than a local JSON file. Terminal-Bench stores its split at tbench_data/task_split.json and BIRD-Interact stores its split at bird_data/task_split.json.

Anti-cheating by design

The optimization loop enforces a strict information boundary: train traces are copied to workspace/traces/latest/ and workspace/traces/baseline/ after each run, while test traces are never saved to disk. Trace saving is controlled by the HARNESS_SAVE_TRACE environment variable.

# From benchmark.py — TerminalBenchRunner.run()
if self.split != "train":
    env["HARNESS_SAVE_TRACE"] = "0"

The coding agent can only read workspace/traces/latest/. It has no path to test task traces, so gating on the test split is a genuine held-out evaluation.
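The environment-variable switch can be sketched as a small helper that prepares the environment for a benchmark subprocess. Only the `HARNESS_SAVE_TRACE` name and the `split != "train"` condition come from the excerpt above; the helper `build_env` is hypothetical, and how the agent process reads the variable is not shown in the source.

```python
import os
from typing import Mapping, Optional


def build_env(split: str, base_env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Copy the environment and disable trace saving for any non-train split."""
    env = dict(os.environ if base_env is None else base_env)
    if split != "train":
        # Held-out runs must not leave traces the coding agent could read.
        env["HARNESS_SAVE_TRACE"] = "0"
    return env


test_env = build_env("test", base_env={})
train_env = build_env("train", base_env={})
```

Working on a copy keeps the parent process environment unchanged, so a later train run in the same process is not affected by an earlier test run.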
