Integrate any custom benchmark into the auto-harness loop

auto-harness is built around a single abstract class: BenchmarkRunner. Every supported benchmark — tau-bench, Terminal-Bench 2.0, BIRD-Interact — is just a concrete subclass that implements one method. If your benchmark can run tasks and produce per-task rewards, you can integrate it in four focused steps without touching any of the harness logic that drives the optimization loop.

The `BenchmarkRunner` abstract base class

BenchmarkRunner lives in benchmark.py and defines the entire contract between your benchmark and the harness:

from abc import ABC, abstractmethod

class BenchmarkRunner(ABC):
    """Abstract benchmark runner. Subclass and implement `run` to plug in your own benchmark."""

    @abstractmethod
    def run(self, task_ids: list[str] | None = None) -> dict[str, float | None]:
        """
        Run the benchmark on the given tasks.

        Args:
            task_ids: specific task IDs to run. None runs the full benchmark.

        Returns:
            Mapping of task_id -> reward (float in [0.0, 1.0]). ``None`` means
            the task did not produce a verifier result — most often the agent
            timed out. ``None`` counts as ``0.0`` in :meth:`val_score`.
        """

    def val_score(self, results: dict[str, float | None]) -> float:
        """Mean reward across all results. ``None`` rewards count as ``0.0``."""
        if not results:
            return 0.0
        return sum(0.0 if v is None else v for v in results.values()) / len(results)

What `run()` must return

The return type is dict[str, float | None]:

Keys are task ID strings — they must match the IDs used throughout the harness (suite.json, train_results.json, trace directories).
Values are reward floats in the range [0.0, 1.0], or None if the task did not produce a verifier result (typically a timeout or infrastructure error).

None is a meaningful sentinel — it signals that the task ran but the verifier never produced a score, which is distinct from the agent actively failing. The harness reports timed-out tasks separately and they count as 0.0 in val_score.
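A small check like the following can catch contract violations before results reach the harness. It is only a sketch of the rules listed above; `validate_results` is a hypothetical helper, not part of auto-harness.

```python
from __future__ import annotations


def validate_results(results: dict[str, float | None]) -> None:
    """Raise ValueError if `results` violates the run() contract (hypothetical helper)."""
    for task_id, reward in results.items():
        if not isinstance(task_id, str):
            raise ValueError(f"task ID must be a string, got {task_id!r}")
        if reward is None:
            continue  # no verifier result (e.g. a timeout); allowed by the contract
        # bool is a subclass of int, so reject it explicitly
        if isinstance(reward, bool) or not isinstance(reward, (int, float)):
            raise ValueError(f"reward for {task_id!r} must be a float or None")
        if not 0.0 <= reward <= 1.0:
            raise ValueError(f"reward for {task_id!r} is outside [0.0, 1.0]: {reward}")


# A failed task (0.0) and a task with no verifier result (None) are both valid.
example = {"task-001": 1.0, "task-002": 0.0, "task-003": None}
validate_results(example)  # returns without raising
```

Calling it at the end of `run()` during development is one way to surface malformed rewards early.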

How `val_score()` is computed

val_score is the mean reward across all results. None values are converted to 0.0 before averaging:

def val_score(self, results: dict[str, float | None]) -> float:
    if not results:
        return 0.0
    return sum(0.0 if v is None else v for v in results.values()) / len(results)

This is the number that gating.py compares against the best recorded score in workspace/results.tsv during Step 2 of the gate.
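As a worked example of the averaging rule, the snippet below applies a standalone copy of `val_score` to four made-up task results. The task IDs and rewards are invented for illustration.

```python
from __future__ import annotations


def val_score(results: dict[str, float | None]) -> float:
    """Standalone copy of the averaging rule: None counts as 0.0."""
    if not results:
        return 0.0
    return sum(0.0 if v is None else v for v in results.values()) / len(results)


results = {"t1": 1.0, "t2": 0.5, "t3": None, "t4": 0.0}
score = val_score(results)   # (1.0 + 0.5 + 0.0 + 0.0) / 4 = 0.375
empty_score = val_score({})  # an empty result set scores 0.0
```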

Subclassing `BenchmarkRunner`

The minimal implementation from the README:

class MyBenchmarkRunner(BenchmarkRunner):
    def run(self, task_ids=None):
        # call your benchmark CLI or API
        # return {task_id: reward} where reward is 0.0–1.0
        ...

A more complete starting point that follows the pattern of the existing runners:

import os


class MyBenchmarkRunner(BenchmarkRunner):
    def __init__(self, split: str | None = "train", agent_model: str | None = None):
        self.split = split
        # Fall back to the AGENT_MODEL environment variable, then to a default.
        self.agent_model = agent_model or os.getenv("AGENT_MODEL", "gpt-5.4")

    def run(self, task_ids: list[str] | None = None) -> dict[str, float | None]:
        # _load_split_tasks() and _run_single_task() are benchmark-specific
        # helpers that you implement yourself.
        if task_ids is None:
            task_ids = self._load_split_tasks()

        results: dict[str, float | None] = {}
        for tid in task_ids:
            try:
                reward = self._run_single_task(tid)
                results[tid] = float(reward)
            except TimeoutError:
                results[tid] = None   # timeout → counts as 0.0 in val_score
        return results

Always fill in None for tasks that were requested but produced no result. The gating step uses task_ids as the denominator, so silently dropping a task is different from returning None — a dropped task disappears from the pass-rate calculation, while None counts as a failure.
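One way to guarantee that every requested task appears in the output is to rebuild the dictionary from the requested IDs. The sketch below assumes a hypothetical `fill_missing` helper and uses a local `mean_reward` function that mirrors the averaging rule, so the effect of a dropped task is visible.

```python
from __future__ import annotations


def fill_missing(
    task_ids: list[str], partial: dict[str, float | None]
) -> dict[str, float | None]:
    """Return one entry per requested task ID; tasks with no result map to None."""
    return {tid: partial.get(tid) for tid in task_ids}


def mean_reward(results: dict[str, float | None]) -> float:
    """Mean reward with None counted as 0.0 (mirrors the averaging rule)."""
    return sum(v or 0.0 for v in results.values()) / len(results)


requested = ["t1", "t2", "t3"]
partial = {"t1": 1.0, "t2": 1.0}            # t3 was silently dropped
complete = fill_missing(requested, partial)  # t3 is now present with value None

inflated = mean_reward(partial)   # averages over 2 tasks
honest = mean_reward(complete)    # averages over all 3 requested tasks
```

Note that this helper also discards results for task IDs that were not requested, which may or may not be what you want.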

Integration steps

Subclass BenchmarkRunner in benchmark.py

Add your class to benchmark.py. Follow the same pattern as TauBenchRunner, TerminalBenchRunner, or BirdInteractRunner: accept configuration in __init__, implement run(), and optionally copy train traces to workspace/traces/ so the coding agent can read failure traces.

Import the new class at the top of gating.py:

from benchmark import BenchmarkRunner, MyBenchmarkRunner, TauBenchRunner, TerminalBenchRunner

Add a branch in gating.py's _create_runners()

_create_runners() reads experiment_config.yaml and instantiates a train runner and a gate runner. Add a branch for your benchmark name:

def _create_runners(cfg: dict) -> tuple[BenchmarkRunner, BenchmarkRunner]:
    benchmark = cfg.get("benchmark", "tau-bench")

    # ... existing branches ...

    elif benchmark == "my-benchmark":
        train_runner = MyBenchmarkRunner(
            split=cfg.get("split", "train"),
            agent_model=cfg.get("agent_model"),
        )
        gate_runner = MyBenchmarkRunner(
            split=cfg.get("gate_split", "test"),
            agent_model=cfg.get("agent_model"),
        )
    else:
        print(f"ERROR: unknown benchmark '{benchmark}'")
        sys.exit(1)

    return train_runner, gate_runner

The train runner runs on the training split to populate workspace/train_results.json and to run the regression suite (Step 1). The gate runner runs the test split to produce the val_score checked in Step 2.

Add a branch in prepare.py's __main__

prepare.py handles environment checks, workspace initialization, template copying, and the baseline run. Add your benchmark to each relevant section:

# Environment check
if benchmark == "my-benchmark":
    if not check_env_my_benchmark(cfg):
        sys.exit(1)

# Baseline run — inside run_baseline()
elif benchmark == "my-benchmark":
    from benchmark import MyBenchmarkRunner
    runner = MyBenchmarkRunner(
        split=cfg.get("gate_split", "test"),
        agent_model=cfg.get("agent_model"),
    )
    test_results = runner.run()
    val = runner.val_score(test_results)

If your benchmark requires a train/test split (recommended), generate it during the baseline run just as generate_terminal_bench_split() does in the terminal-bench path.
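A deterministic split could look like the sketch below. This is not the logic of generate_terminal_bench_split(), which is not shown here; `generate_split`, its parameters, and the returned dictionary layout are all assumptions made for illustration.

```python
from __future__ import annotations

import random


def generate_split(
    task_ids: list[str], train_fraction: float = 0.5, seed: int = 0
) -> dict[str, list[str]]:
    """Partition task IDs into disjoint train/test lists (hypothetical helper)."""
    ordered = sorted(task_ids)             # sort first so input order does not matter
    random.Random(seed).shuffle(ordered)   # seeded shuffle keeps the split reproducible
    cut = int(len(ordered) * train_fraction)
    return {"train": sorted(ordered[:cut]), "test": sorted(ordered[cut:])}


all_tasks = [f"task-{i:03d}" for i in range(10)]
split = generate_split(all_tasks, train_fraction=0.5, seed=42)
```

Fixing the seed matters because the gate score is only comparable across iterations if the test split never changes.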

Create templates in agent/templates/ and program_templates/

The coding agent needs a starting-point implementation and benchmark-specific loop instructions:

agent/templates/my_benchmark.py — the HarnessAgent class tailored to your benchmark’s interface. See the next section for what HarnessAgent must look like.
program_templates/my_benchmark.md — guidance appended to PROGRAM.md, covering: trace file paths, task ID format, known techniques for your benchmark, and a diff command to compare the current agent/agent.py against the template.

# In copy_agent_template()
templates = {
    "tau-bench": "agent/templates/tau_bench.py",
    "terminal-bench": "agent/templates/terminal_bench.py",
    "bird-interact": "agent/templates/bird_interact.py",
    "my-benchmark": "agent/templates/my_benchmark.py",   # add this
}

# In copy_program_template()
templates = {
    "tau-bench": "program_templates/tau_bench.md",
    "terminal-bench": "program_templates/terminal_bench.md",
    "bird-interact": "program_templates/bird_interact.md",
    "my-benchmark": "program_templates/my_benchmark.md",  # add this
}

What HarnessAgent must implement

The coding agent edits agent/agent.py every iteration. benchmark.py imports HarnessAgent directly from that file, so the interface your runner expects is the interface HarnessAgent must satisfy. The exact interface depends on which framework your benchmark uses. Looking at the three existing templates:

tau-bench (tau_bench.py) — HarnessAgent extends LLMAgent from the tau2 library. It implements system_prompt, get_init_state(), and generate_next_message(). The tau-bench runner receives HarnessAgent via the tau2 registry and calls these methods.
Terminal-Bench 2.0 (terminal_bench.py) — HarnessAgent extends BaseAgent from harbor.agents.base. It implements name(), version(), setup(), and run(instruction, environment, context). The Harbor framework instantiates the class via --agent-import-path agent.agent:HarnessAgent and calls run() per task.
BIRD-Interact (bird_interact.py) — HarnessAgent is not a class the runner instantiates directly. Instead, agent.py exports a build_agent(mode) function that returns a Google ADK Agent. The harness wraps that agent as a FastAPI service.

For a custom benchmark, your template should define whatever interface your MyBenchmarkRunner.run() imports and calls. The coding agent optimizes the internals (system prompt, tool definitions, loop logic) without being required to know about benchmark.py or gating.py.

Keep the HarnessAgent interface surface small. The more behavior lives in the template, the more freedom the coding agent has to improve it. Thin wrappers that delegate to the benchmark framework are harder to optimize than agents that own their own loop.
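To make the relationship between runner and agent concrete, here is a toy sketch in which the runner depends on a single `solve()` method. Everything in it is hypothetical: the `solve()` interface, the `ToyRunner` class, the arithmetic tasks, and the stand-in `BenchmarkRunner` are invented for illustration and do not match any existing template.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class BenchmarkRunner(ABC):
    """Stand-in for the real base class so the sketch runs on its own."""

    @abstractmethod
    def run(self, task_ids: list[str] | None = None) -> dict[str, float | None]: ...


class HarnessAgent:
    """Hypothetical agent; the coding agent would rewrite the body of solve()."""

    def solve(self, prompt: str) -> str:
        left, right = prompt.split("+")
        return str(int(left) + int(right))  # int() tolerates surrounding spaces


class ToyRunner(BenchmarkRunner):
    # task_id -> (prompt, expected answer); made-up tasks
    TASKS = {"add-1": ("2 + 3", "5"), "add-2": ("10 + 7", "17"), "bad-1": ("two + 3", "5")}

    def __init__(self, agent: HarnessAgent):
        self.agent = agent

    def run(self, task_ids: list[str] | None = None) -> dict[str, float | None]:
        ids = list(self.TASKS) if task_ids is None else task_ids
        results: dict[str, float | None] = {}
        for tid in ids:
            prompt, expected = self.TASKS[tid]
            try:
                answer = self.agent.solve(prompt)
            except ValueError:
                results[tid] = 0.0  # the agent ran and failed, so this is not None
                continue
            results[tid] = 1.0 if answer == expected else 0.0
        return results


toy_results = ToyRunner(HarnessAgent()).run()
```

The runner only knows about `solve()`, so the agent's internals can change freely between iterations without touching the runner.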

The loop, gating, and workspace are benchmark-agnostic

Once your runner, gating branch, and templates are in place, the entire harness works as-is for your benchmark:

The optimization loop (PROGRAM.md)

prepare.py composes PROGRAM.md from program_templates/base.md + your benchmark-specific supplement. The coding agent reads this file to understand the run → analyze → improve → gate → record → repeat loop. The loop itself never changes between benchmarks.

Three-step gating (gating.py)

run_gate() calls your train and gate runners through the BenchmarkRunner interface. Step 0 checks for disallowed file edits. Step 1 re-runs the regression suite tasks. Step 2 compares val_score to the best seen in results.tsv. Step 3 promotes newly-passing tasks. None of this logic is benchmark-specific.
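The decision flow of Steps 1 to 3 can be pictured with the sketch below. It is an illustration, not the code in gating.py: the `gate` function, the pass threshold of 1.0, and the use of `>=` against the best score are assumptions, and Step 0 (the file-edit check) is left out.

```python
from __future__ import annotations


def gate(
    suite: list[str],
    train_results: dict[str, float | None],
    val: float,
    best_val: float,
    pass_threshold: float = 1.0,  # assumed definition of "passing"
) -> tuple[bool, list[str]]:
    """Return (passed, updated_suite). Illustrative only."""
    # Step 1: every task already in the regression suite must still pass.
    regressions = [t for t in suite if (train_results.get(t) or 0.0) < pass_threshold]
    if regressions:
        return False, list(suite)
    # Step 2: the validation score must not fall below the best recorded score.
    if val < best_val:
        return False, list(suite)
    # Step 3: promote tasks that pass now but were not yet in the suite.
    promoted = sorted(
        t for t, r in train_results.items()
        if t not in suite and r is not None and r >= pass_threshold
    )
    return True, list(suite) + promoted


train = {"t1": 1.0, "t2": 1.0, "t3": None}
ok, new_suite = gate(["t1"], train, val=0.6, best_val=0.5)
regressed, same_suite = gate(["t3"], train, val=0.9, best_val=0.5)
```

Because the function only consumes reward dictionaries and scores, nothing in it depends on which benchmark produced them.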

Result recording (record.py)

record.py appends a row to workspace/results.tsv. It never calls run() — it just records the val_score and commit that were passed to it after a successful gate. Format is identical regardless of benchmark.

Workspace structure

workspace/suite.json, workspace/results.tsv, workspace/train_results.json, and workspace/learnings.md have fixed schemas that are written and read by the harness infrastructure. Your runner writes per-task results; the harness does everything else.

Structural anti-cheating

Test traces are never saved to disk. TerminalBenchRunner checks self.split != "train" before copying traces, so traces from any other split are skipped. Follow the same pattern in your runner to prevent the coding agent from reading test traces and overfitting to the gate split.
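A guard of this kind might look like the following sketch. The `copy_traces` function, the `*.json` trace file pattern, and the directory names are assumptions for illustration; the real runners may store traces differently.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


def copy_traces(split: str | None, src_dir: Path, dest_dir: Path) -> int:
    """Copy trace files for the train split only; return the number copied."""
    if split != "train":
        return 0  # never persist traces from the gate/test split
    dest_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in src_dir.glob("*.json"):  # assumed trace file format
        shutil.copy2(path, dest_dir / path.name)
        count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "raw"
    src.mkdir()
    (src / "task-001.json").write_text("{}")
    traces_dir = Path(tmp) / "workspace" / "traces"
    copied_test = copy_traces("test", src, traces_dir)    # skipped
    test_dir_created = traces_dir.exists()                # still absent
    copied_train = copy_traces("train", src, traces_dir)  # copied
```

Putting the check at the top of the function means a test-split runner never even creates the traces directory.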

Next steps

Harbor benchmarks

If your benchmark runs via harbor run, you may not need a custom runner at all — just point TerminalBenchRunner at your dataset.

Agent templates

Learn how to write an agent template and a benchmark-specific PROGRAM.md supplement that gives the coding agent the right context to optimize effectively.
