auto-harness is built around a single abstract class: BenchmarkRunner. Every supported benchmark (tau-bench, Terminal-Bench 2.0, BIRD-Interact) is a concrete subclass that implements one method. If your benchmark can run tasks and produce per-task rewards, you can integrate it in four focused steps without touching any of the harness logic that drives the optimization loop.
The BenchmarkRunner abstract base class
BenchmarkRunner lives in benchmark.py and defines the entire contract between your benchmark and the harness:
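As a rough sketch of the shape of that contract, the class below is a local stand-in rather than the actual code in benchmark.py. The run() parameter name task_ids and its list-of-strings type are assumptions; only the return type is taken from this page.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class BenchmarkRunner(ABC):
    """Illustrative stand-in for the abstract base class in benchmark.py."""

    @abstractmethod
    def run(self, task_ids: list[str]) -> dict[str, float | None]:
        """Run the given tasks and return one reward (or None) per task ID.

        The signature is assumed; check benchmark.py for the real one.
        """
```

Because run() is abstract, instantiating BenchmarkRunner directly raises TypeError; only subclasses that implement run() can be created.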
What run() must return
The return type is dict[str, float | None]:
- Keys are task ID strings. They must match the IDs used throughout the harness (suite.json, train_results.json, trace directories).
- Values are reward floats in the range [0.0, 1.0], or None if the task did not produce a verifier result (typically a timeout or infrastructure error).
None is a meaningful sentinel — it signals that the task ran but the verifier never produced a score, which is distinct from the agent actively failing. The harness reports timed-out tasks separately and they count as 0.0 in val_score.
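To make the contract concrete, here is a small sketch of a well-formed return value together with a checker. The helper check_results and the task IDs are hypothetical and not part of the harness.

```python
from __future__ import annotations


def check_results(results: dict[str, float | None]) -> None:
    """Raise ValueError if results break the contract described above."""
    for task_id, reward in results.items():
        if not isinstance(task_id, str):
            raise ValueError(f"task ID must be a string: {task_id!r}")
        if reward is None:
            continue  # task ran but the verifier produced no score
        if not 0.0 <= reward <= 1.0:
            raise ValueError(f"reward out of range for {task_id}: {reward}")


# Hypothetical task IDs: one pass, one failure, one timeout.
example = {"task-001": 1.0, "task-002": 0.0, "task-003": None}
check_results(example)  # passes silently
```

Note that task-002 and task-003 are different outcomes: the first is a scored failure, the second has no score at all.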
How val_score() is computed
val_score is the mean reward across all results. None values are converted to 0.0 before averaging:
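A minimal sketch of that calculation follows. It is written as a standalone function for illustration; the real harness may structure it differently, and returning 0.0 for an empty result set is an assumption.

```python
from __future__ import annotations


def val_score(results: dict[str, float | None]) -> float:
    """Mean reward across all results, counting None as 0.0."""
    if not results:
        return 0.0  # assumption: an empty run scores zero
    rewards = [0.0 if reward is None else reward for reward in results.values()]
    return sum(rewards) / len(rewards)


# Four tasks: 1.0 + 0.0 + 0.0 (from None) + 0.5 = 1.5, divided by 4.
print(val_score({"a": 1.0, "b": 0.0, "c": None, "d": 0.5}))  # 0.375
```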
gating.py compares this value against the best recorded score in workspace/results.tsv during Step 2 of the gate.
Subclassing BenchmarkRunner
The minimal implementation from the README:
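A subclass of this kind might look like the sketch below. It is illustrative rather than the README's exact listing: BenchmarkRunner is redefined locally as a stand-in, and the evaluate callable, the split argument, and the run() signature are assumptions.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Callable


class BenchmarkRunner(ABC):
    """Local stand-in for the base class in benchmark.py."""

    @abstractmethod
    def run(self, task_ids: list[str]) -> dict[str, float | None]:
        ...


class MyBenchmarkRunner(BenchmarkRunner):
    def __init__(self, evaluate: Callable[[str], float], split: str = "train"):
        # evaluate is a hypothetical callable that scores one task in [0.0, 1.0].
        self.evaluate = evaluate
        self.split = split

    def run(self, task_ids: list[str]) -> dict[str, float | None]:
        # Pre-fill every requested task with None so none is silently dropped.
        results: dict[str, float | None] = {task_id: None for task_id in task_ids}
        for task_id in task_ids:
            try:
                results[task_id] = float(self.evaluate(task_id))
            except (TimeoutError, ConnectionError):
                pass  # no verifier result: leave the None sentinel in place
        return results
```

Pre-filling the dictionary means the returned keys always equal the requested task IDs, whatever happens during evaluation.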
Always fill in None for tasks that were requested but produced no result. The gating step uses task_ids as the denominator, so silently dropping a task is different from returning None: a dropped task disappears from the pass-rate calculation, while None counts as a failure.
Integration steps
Subclass BenchmarkRunner in benchmark.py
Add your class to benchmark.py. Follow the same pattern as TauBenchRunner, TerminalBenchRunner, or BirdInteractRunner: accept configuration in __init__, implement run(), and optionally copy train traces to workspace/traces/ so the coding agent can read failure traces.
Import the new class at the top of gating.py.
Add a branch in gating.py's _create_runners()
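As a rough sketch of the import and branch this step calls for, the function below returns a train runner and a gate runner for a given benchmark name. Everything specific here is an assumption: the config is taken to be an already-parsed dict with a hypothetical "benchmark" key, the runner is a local stand-in with a hypothetical split argument, and the real _create_runners() may be organized differently.

```python
from __future__ import annotations

# In gating.py the class would come from benchmark.py instead:
#     from benchmark import MyBenchmarkRunner


class MyBenchmarkRunner:
    """Local stand-in so the sketch runs on its own."""

    def __init__(self, split: str):
        self.split = split


def create_runners(config: dict) -> tuple[MyBenchmarkRunner, MyBenchmarkRunner]:
    """Return (train_runner, gate_runner) for the configured benchmark."""
    name = config.get("benchmark")  # hypothetical config key
    if name == "my_benchmark":
        train_runner = MyBenchmarkRunner(split="train")
        gate_runner = MyBenchmarkRunner(split="test")
        return train_runner, gate_runner
    raise ValueError(f"unknown benchmark: {name!r}")
```

Raising on an unknown name keeps a typo in the config from falling through to another benchmark's runners.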
_create_runners() reads experiment_config.yaml and instantiates a train runner and a gate runner. Add a branch for your benchmark name. The train runner is used to produce workspace/train_results.json and to run the regression suite (Step 1). The gate runner runs the test split to produce the val_score checked in Step 2.
Add a branch in prepare.py's __main__
prepare.py handles environment checks, workspace initialization, template copying, and the baseline run. Add your benchmark to each relevant section.
... generate_terminal_bench_split() does in the terminal-bench path.
Create templates in agent/templates/ and program_templates/
The coding agent needs a starting-point implementation and benchmark-specific loop instructions:
- agent/templates/my_benchmark.py: the HarnessAgent class tailored to your benchmark’s interface. See the next section for what HarnessAgent must look like.
- program_templates/my_benchmark.md: guidance appended to PROGRAM.md, covering trace file paths, task ID format, known techniques for your benchmark, and a diff command to compare the current agent/agent.py against the template.
Register both templates in copy_agent_template() and copy_program_template() in prepare.py.
What HarnessAgent must implement
The coding agent edits agent/agent.py every iteration. benchmark.py imports HarnessAgent directly from that file, so the interface your runner expects is the interface HarnessAgent must satisfy.
The exact interface depends on which framework your benchmark uses. Looking at the three existing templates:
- tau-bench (tau_bench.py): HarnessAgent extends LLMAgent from the tau2 library. It implements system_prompt, get_init_state(), and generate_next_message(). The tau-bench runner receives HarnessAgent via the tau2 registry and calls these methods.
- Terminal-Bench 2.0 (terminal_bench.py): HarnessAgent extends BaseAgent from harbor.agents.base. It implements name(), version(), setup(), and run(instruction, environment, context). The Harbor framework instantiates the class via --agent-import-path agent.agent:HarnessAgent and calls run() per task.
- BIRD-Interact (bird_interact.py): HarnessAgent is not a class the runner instantiates directly. Instead, agent.py exports a build_agent(mode) function that returns a Google ADK Agent. The harness wraps that agent as a FastAPI service.
For your own benchmark, HarnessAgent should expose whatever MyBenchmarkRunner.run() imports and calls. The coding agent optimizes the internals (system prompt, tool definitions, loop logic) without being required to know about benchmark.py or gating.py.
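The sketch below illustrates that division of labor with an entirely hypothetical interface: a HarnessAgent with a single solve() method, and a runner that calls only that method and scores by exact match. None of these names or the scoring rule come from the existing templates.

```python
from __future__ import annotations


class HarnessAgent:
    """Hypothetical agent. The coding agent would rewrite these internals."""

    system_prompt = "Answer in upper case."

    def solve(self, task: str) -> str:
        # Placeholder logic standing in for an LLM call.
        return task.upper()


class MyBenchmarkRunner:
    def __init__(self, tasks: dict[str, tuple[str, str]]):
        # Maps task ID -> (prompt, expected answer); a made-up task format.
        self.tasks = tasks

    def run(self, task_ids: list[str]) -> dict[str, float | None]:
        # In the harness this class would be imported from agent/agent.py.
        agent = HarnessAgent()
        results: dict[str, float | None] = {}
        for task_id in task_ids:
            prompt, expected = self.tasks[task_id]
            results[task_id] = 1.0 if agent.solve(prompt) == expected else 0.0
        return results
```

The runner depends only on solve(), so the agent's prompt and logic can change freely between iterations as long as that method keeps its signature.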
The loop, gating, and workspace are benchmark-agnostic
Once your runner, gating branch, and templates are in place, the entire harness works as-is for your benchmark:
The optimization loop (PROGRAM.md)
prepare.py composes PROGRAM.md from program_templates/base.md + your benchmark-specific supplement. The coding agent reads this file to understand the run → analyze → improve → gate → record → repeat loop. The loop itself never changes between benchmarks.
Three-step gating (gating.py)
run_gate() calls your train and gate runners through the BenchmarkRunner interface. Step 0 checks for disallowed file edits. Step 1 re-runs the regression suite tasks. Step 2 compares val_score to the best seen in results.tsv. Step 3 promotes newly-passing tasks. None of this logic is benchmark-specific.
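To show why none of this depends on the benchmark, here is a simplified sketch of Steps 1 to 3 operating purely on result dictionaries. It is not the code in gating.py: the function name, the pass threshold of 1.0, the "at least as good as the best" comparison, and the rule for which tasks get promoted are all assumptions, and Step 0 is omitted.

```python
from __future__ import annotations


def gate_sketch(
    suite: list[str],
    train_results: dict[str, float | None],
    gate_results: dict[str, float | None],
    best_score: float,
    pass_threshold: float = 1.0,  # assumed definition of "passing"
) -> tuple[bool, list[str]]:
    """Return (gate_passed, updated_suite)."""

    def passed(reward: float | None) -> bool:
        return (reward or 0.0) >= pass_threshold

    # Step 1: every task already in the regression suite must still pass.
    if any(not passed(train_results.get(task_id)) for task_id in suite):
        return False, list(suite)

    # Step 2: val_score (None counted as 0.0) must not fall below the best seen.
    rewards = [reward or 0.0 for reward in gate_results.values()]
    score = sum(rewards) / len(rewards) if rewards else 0.0
    if score < best_score:
        return False, list(suite)

    # Step 3: promote newly passing train tasks into the suite.
    newly_passing = {t for t, reward in train_results.items() if passed(reward)}
    return True, sorted(set(suite) | newly_passing)
```

The function only ever sees task IDs and rewards, which is the same information every BenchmarkRunner returns.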
Result recording (record.py)
record.py appends a row to workspace/results.tsv. It never calls run(); it just records the val_score and commit that were passed to it after a successful gate. Format is identical regardless of benchmark.
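Appending a row like this can be sketched with the standard csv module. The column names, their order, and the four-decimal formatting are hypothetical; the actual layout of results.tsv is defined by record.py.

```python
import csv
from pathlib import Path


def record_result(results_path: Path, commit: str, val_score: float) -> None:
    """Append one tab-separated row, writing a header if the file is new."""
    is_new = not results_path.exists()
    with results_path.open("a", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        if is_new:
            writer.writerow(["commit", "val_score"])  # hypothetical columns
        writer.writerow([commit, f"{val_score:.4f}"])
```

Like the step it illustrates, this function takes the score and commit as arguments and never runs a benchmark itself.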
Workspace structure
workspace/suite.json, workspace/results.tsv, workspace/train_results.json, and workspace/learnings.md have fixed schemas that are written and read by the harness infrastructure. Your runner writes per-task results; the harness does everything else.
Structural anti-cheating
Test traces are never saved to disk. TerminalBenchRunner checks self.split != "train" before copying traces. Follow the same pattern in your runner to prevent the coding agent from reading test traces and overfitting to the gate split.
Next steps
Harbor benchmarks
If your benchmark runs via harbor run, you may not need a custom runner at all; just point TerminalBenchRunner at your dataset.
Agent templates
Learn how to write an agent template and a benchmark-specific PROGRAM.md supplement that gives the coding agent the right context to optimize effectively.