auto-harness ships with first-class support for three benchmarks that cover distinct problem classes: structured tool-call agents for customer service, bash-command agents for real-world terminal tasks, and multi-turn SQL agents backed by a live Postgres database. Each benchmark exposes per-task rewards and a train/test split, which are the two properties the optimization loop depends on.

Supported benchmarks

| Benchmark | Domain | Tasks | Agent interface |
| --- | --- | --- | --- |
| tau-bench | Customer service (retail, airline, telecom) | retail: 114, airline: 50, telecom: 114 | Structured tool calls via tau2 |
| Terminal-Bench 2.0 | Terminal tasks (coding, sysadmin, security) | 89 | Bash commands via Harbor containers |
| BIRD-Interact | Interactive text-to-SQL (multi-turn CRUD over Postgres) | lite: 300, full: 600 | Google ADK agent against a 3-service environment |

Terminal-Bench 2.0

89 real-world terminal tasks across coding, sysadmin, and security. The agent executes bash commands in Harbor containers.

tau-bench

Customer service simulation across retail, airline, and telecom domains using structured tool calls via the tau2 API.

BIRD-Interact

Interactive text-to-SQL benchmark with multi-turn CRUD over Postgres. Runs a 3-service ADK environment per run.

Plug in your own

Subclass BenchmarkRunner and implement run() to add any benchmark that returns per-task rewards.

What makes a good benchmark for auto-harness

Not every benchmark is a good fit for automated optimization. auto-harness is designed around benchmarks that have two properties:

Per-task rewards. The benchmark must return a scalar reward (0.0–1.0) for each task independently. This lets the harness calculate a val_score, identify exactly which tasks fail, and measure whether a change actually helped.

A stable train/test split. The optimization loop trains on the train split and gates every proposed change against the test split. Without this separation, the coding agent could overfit to known tasks and the gating step would have no signal.
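
To illustrate why per-task rewards matter, here is a minimal sketch of comparing two runs task by task. The function name diff_rewards and the task IDs are hypothetical and not part of auto-harness; the only thing assumed from the text above is that each run yields a {task_id: reward} mapping with rewards between 0.0 and 1.0.

```python
def diff_rewards(
    before: dict[str, float], after: dict[str, float]
) -> tuple[list[str], list[str]]:
    """Return (improved, regressed) task IDs between two runs.

    Hypothetical helper for illustration; only tasks present in both
    runs are compared.
    """
    improved: list[str] = []
    regressed: list[str] = []
    for task_id in sorted(before.keys() & after.keys()):
        if after[task_id] > before[task_id]:
            improved.append(task_id)
        elif after[task_id] < before[task_id]:
            regressed.append(task_id)
    return improved, regressed


# Made-up rewards for three tasks, before and after a harness change.
before = {"task-1": 1.0, "task-2": 0.0, "task-3": 0.5}
after = {"task-1": 1.0, "task-2": 1.0, "task-3": 0.0}

improved, regressed = diff_rewards(before, after)
print("improved:", improved)
print("regressed:", regressed)
```

An aggregate score alone would hide the fact that one task improved while another regressed; the per-task view exposes both.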

BenchmarkRunner: the common abstraction

All three benchmarks are implemented as subclasses of BenchmarkRunner in benchmark.py. The abstract base class has two methods:
class BenchmarkRunner(ABC):
    @abstractmethod
    def run(self, task_ids: list[str] | None = None) -> dict[str, float | None]:
        """Return {task_id: reward}. None means the task timed out."""

    def val_score(self, results: dict[str, float | None]) -> float:
        """Mean reward. None values count as 0.0."""
run() accepts an optional list of task IDs. Passing None runs all tasks in the configured split. val_score() computes the mean reward and treats timed-out tasks (None) as failures. Both gating.py and benchmark.py’s CLI use this interface directly, so the rest of the loop never needs to know which benchmark is active.

How the train/test split is generated

When you run python prepare.py for the first time on a fresh workspace, it executes the benchmark over all tasks with no split filter. After that baseline run, it generates a 70/30 train/test split using a stratified shuffle:
# From prepare.py — same logic for both TerminalBench and BirdInteract
passed = sorted(k for k, v in results.items() if v >= 0.5)
failed = sorted(k for k, v in results.items() if v < 0.5)

random.seed(42)
random.shuffle(passed)
random.shuffle(failed)

train_pass_n = int(len(passed) * 0.7)
train_fail_n = int(len(failed) * 0.7)
train = sorted(passed[:train_pass_n] + failed[:train_fail_n])
test  = sorted(passed[train_pass_n:] + failed[train_fail_n:])
The stratification ensures both splits have a representative mix of passing and failing tasks. The fixed seed (42) makes splits reproducible: deleting the split file and re-running prepare.py with the same baseline results produces the same split.
Tasks that time out during the baseline run are excluded from the split entirely. Including them would permanently drag down val_score with infrastructure noise rather than agent-quality signal.
tau-bench uses the split mechanism built into tau2 (task_split_name in TextRunConfig) rather than a local JSON file. Terminal-Bench stores its split at tbench_data/task_split.json and BIRD-Interact stores its split at bird_data/task_split.json.

Anti-cheating by design

The optimization loop enforces a strict information boundary: train traces are copied to workspace/traces/latest/ and workspace/traces/baseline/ after each run; test traces are never saved to disk. This is controlled by the HARNESS_SAVE_TRACE environment variable.
# From benchmark.py — TerminalBenchRunner.run()
if self.split != "train":
    env["HARNESS_SAVE_TRACE"] = "0"
The coding agent can only read workspace/traces/latest/. It has no path to test task traces, so gating on the test split is a genuine held-out evaluation.
