Run a Different Harbor Benchmark with auto-harness

auto-harness ships with TerminalBenchRunner, which delegates task execution to the harbor run CLI. Harbor provides isolated, reproducible benchmark environments — each task runs in its own sandboxed container. Because TerminalBenchRunner is driven entirely by configuration, you can point it at any Harbor-compatible dataset without writing a new runner class. The train/test split generation, gating logic, trace copying, and optimization loop all work unchanged.

How `TerminalBenchRunner` calls Harbor

TerminalBenchRunner.run() in benchmark.py builds a harbor run command from the configuration and parses the per-task result.json files that Harbor writes to an output directory:

harbor run \
  -d <dataset> \
  --agent-import-path agent.agent:HarnessAgent \
  --model <agent_model> \
  --env <env_provider> \
  --agent-timeout-multiplier <multiplier> \
  --jobs-dir workspace/tbench_jobs \
  -y

After the run, the runner scans the most recently created job subdirectory and reads each task’s result.json to extract the reward. This parsing logic is the only part that depends on your verifier’s output format.
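As a rough illustration of the flow above, the sketch below assembles the same harbor run arguments from a config dict and picks the newest job subdirectory. The function names (build_harbor_command, latest_job_dir), the agent_timeout_multiplier config key, and its default of 1.0 are assumptions made for illustration, not names taken from benchmark.py. Using modification time to find the "most recently created" job directory is likewise a guess at how the real runner decides.

```python
from pathlib import Path
from typing import Optional


def build_harbor_command(config: dict, jobs_dir: str = "workspace/tbench_jobs") -> list[str]:
    """Assemble the `harbor run` argument list shown above from a config dict."""
    return [
        "harbor", "run",
        "-d", config["dataset"],
        "--agent-import-path", "agent.agent:HarnessAgent",
        "--model", config["agent_model"],
        "--env", config["env_provider"],
        # Hypothetical config key and default; the real runner may differ.
        "--agent-timeout-multiplier", str(config.get("agent_timeout_multiplier", 1.0)),
        "--jobs-dir", jobs_dir,
        "-y",
    ]


def latest_job_dir(jobs_dir: str) -> Optional[Path]:
    """Return the most recently modified subdirectory of jobs_dir, or None."""
    root = Path(jobs_dir)
    if not root.is_dir():
        return None
    subdirs = [p for p in root.iterdir() if p.is_dir()]
    return max(subdirs, key=lambda p: p.stat().st_mtime, default=None)
```

The returned list can be passed to subprocess.run() as-is, which avoids shell quoting issues with dataset identifiers.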

Four steps to use a different Harbor benchmark

Point to your dataset in experiment_config.yaml

Set benchmark to "terminal-bench" to reuse TerminalBenchRunner, and set dataset to your Harbor dataset identifier:

benchmark: "terminal-bench"
dataset: "my-harbor-dataset@1.0"
agent_model: "gpt-4o"
env_provider: "e2b"
split: "train"
gate_split: "test"

The benchmark key selects which runner class gating.py and prepare.py instantiate. Keeping it "terminal-bench" means TerminalBenchRunner is used — only the dataset field changes which Harbor dataset is run.

env_provider can be "e2b", "daytona", or "docker". E2B and Daytona require API keys (E2B_API_KEY or DAYTONA_API_KEY). Docker runs locally and needs no key.
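A small pre-flight check can catch a missing key before a long run starts. The sketch below encodes only the provider-to-variable mapping stated above; the helper name missing_api_key is hypothetical and is not part of auto-harness.

```python
import os
from typing import Mapping, Optional

# Environment variable each provider needs, per the paragraph above.
REQUIRED_KEYS = {
    "e2b": "E2B_API_KEY",
    "daytona": "DAYTONA_API_KEY",
    "docker": None,  # local Docker needs no key
}


def missing_api_key(env_provider: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the missing API key variable, or None if nothing is missing."""
    if env_provider not in REQUIRED_KEYS:
        raise ValueError(f"unknown env_provider: {env_provider!r}")
    key = REQUIRED_KEYS[env_provider]
    if key is not None and not env.get(key):
        return key
    return None
```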

Verify your verifier's result.json schema

TerminalBenchRunner parses each task’s output using this exact schema:

{"task_name": "<id>", "verifier_result": {"rewards": {"reward": 0.85}}}

The relevant parser in TerminalBenchRunner.run() from benchmark.py:

task_name = data.get("task_name", trial_name)
vr = data.get("verifier_result")
if vr and isinstance(vr, dict):
    rewards = vr.get("rewards", {})
    reward = float(rewards.get("reward", 0.0)) if isinstance(rewards, dict) else 0.0
else:
    reward = None  # verifier did not run — infra error

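If your verifier writes the reward at a different path in result.json, update this parser in TerminalBenchRunner.run(). For example, if your verifier writes {"score": 0.85} at the top level, the reward can be read as shown below. Note that this one-liner maps a missing score to 0.0 rather than None, so infra errors are no longer distinguished from genuine failures unless you keep a separate check: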

reward = float(data.get("score", 0.0))

A missing or malformed result.json produces reward = None, which counts as 0.0 in val_score. If your baseline scores look unexpectedly low, check whether Harbor is writing result.json at the expected path inside each task’s job subdirectory.
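If you need to support a custom reward path while still distinguishing infra errors from real zeros, one option is a helper along these lines. This is a sketch under assumptions: extract_reward is a hypothetical name, and the top-level "score" key is simply the example layout used above, not something Harbor itself defines.

```python
from typing import Optional


def extract_reward(data: dict) -> Optional[float]:
    """Return the task reward, or None when no verifier output is present."""
    # Default layout: verifier_result.rewards.reward
    vr = data.get("verifier_result")
    if vr and isinstance(vr, dict):
        rewards = vr.get("rewards", {})
        return float(rewards.get("reward", 0.0)) if isinstance(rewards, dict) else 0.0
    # Hypothetical custom layout from the example above: a top-level "score".
    if "score" in data:
        return float(data["score"])
    # Neither layout found: treat as an infra error, as the original parser does.
    return None
```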

Update the split file path (optional)

TerminalBenchRunner saves the train/test split to tbench_data/task_split.json, controlled by the class constant:

class TerminalBenchRunner(BenchmarkRunner):
    SPLIT_FILE = "tbench_data/task_split.json"

If you want a separate split file per benchmark, which is useful when running multiple Harbor datasets against the same workspace, override SPLIT_FILE in a subclass or edit the constant directly, then update the matching path in prepare.py:

# In prepare.py
SPLIT_FILE = "my_benchmark_data/task_split.json"

The generate_terminal_bench_split() function in prepare.py creates the split during the baseline run. It performs a 70/30 stratified split (by pass/fail) with a fixed seed, so the split is reproducible.
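To make the idea of a stratified, seeded split concrete, here is a minimal sketch. It is not the implementation of generate_terminal_bench_split(); the function name stratified_split, the seed value 42, and the rounding rule are all assumptions chosen for illustration.

```python
import random


def stratified_split(results: dict[str, bool], train_frac: float = 0.7, seed: int = 42) -> dict[str, list[str]]:
    """Split task IDs into train/test, keeping the pass/fail ratio in each part."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    split: dict[str, list[str]] = {"train": [], "test": []}
    for passed in (True, False):
        # Sort first so the result does not depend on dict insertion order.
        stratum = sorted(task for task, ok in results.items() if ok == passed)
        rng.shuffle(stratum)
        cut = round(len(stratum) * train_frac)
        split["train"].extend(stratum[:cut])
        split["test"].extend(stratum[cut:])
    return split
```

Splitting passing and failing tasks separately keeps the baseline pass rate roughly equal in both parts, so a gate score on the test split is comparable to the train score.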

Add a PROGRAM.md supplement

Create program_templates/<your_benchmark>.md with guidance specific to your dataset. Follow the same structure as program_templates/terminal_bench.md:

Trace file paths (where to read trace.json and result.json)
Task ID format (string names, integers, or something else)
Known techniques that improve scores on your benchmark
A diff command to compare the current agent against the template

Then register it in copy_program_template() in prepare.py:

def copy_program_template(benchmark: str) -> None:
    templates = {
        "tau-bench": "program_templates/tau_bench.md",
        "terminal-bench": "program_templates/terminal_bench.md",
        "bird-interact": "program_templates/bird_interact.md",
        "my-benchmark": "program_templates/my_benchmark.md",  # add this
    }

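copy_program_template() composes PROGRAM.md by concatenating program_templates/base.md and your supplement. The coding agent reads the combined file as its loop instructions. Because the supplement is looked up by the benchmark value, the "my-benchmark" entry is selected only when benchmark is set to "my-benchmark"; with benchmark: "terminal-bench", as in the first step, program_templates/terminal_bench.md is used. Since the same key also selects the runner class in gating.py and prepare.py, a new key would need to resolve to TerminalBenchRunner there as well.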

Example configuration for a custom Harbor benchmark

A complete experiment_config.yaml for a custom Harbor dataset:

benchmark: "terminal-bench"
dataset: "my-harbor-dataset@1.0"
agent_model: "gpt-4o"
env_provider: "e2b"
split: "train"
gate_split: "test"

And the expected result.json schema that TerminalBenchRunner parses:

{"task_name": "<id>", "verifier_result": {"rewards": {"reward": 0.85}}}

Expected result.json schema

TerminalBenchRunner reads one result.json per task from the Harbor job output directory. The full schema it expects:

| Field | Type | Description |
| --- | --- | --- |
| `task_name` | string | Task identifier. Falls back to the trial directory name if absent. |
| `verifier_result` | object | Verifier output. If missing or `null`, the task is recorded as `None` (infra error). |
| `verifier_result.rewards` | object | Reward container. |
| `verifier_result.rewards.reward` | float | Task reward in `[0.0, 1.0]`. Defaults to `0.0` if the key is absent. |

If verifier_result is absent entirely, reward is set to None, which signals that the verifier did not run. This is treated as 0.0 in val_score and reported separately in the benchmark output.
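The sketch below shows one way the None handling could feed into a score. It assumes val_score is the plain mean of per-task rewards, which the text above does not state explicitly; the summarize helper and its output keys are hypothetical.

```python
from typing import Optional


def summarize(rewards: dict[str, Optional[float]]) -> dict:
    """Average rewards with None counted as 0.0, and list infra errors separately."""
    infra_errors = sorted(name for name, r in rewards.items() if r is None)
    total = sum(r if r is not None else 0.0 for r in rewards.values())
    # Assumption: val_score is the mean reward over all tasks, including infra errors.
    val_score = total / len(rewards) if rewards else 0.0
    return {"val_score": val_score, "infra_errors": infra_errors}
```

Keeping the infra error list alongside the score makes it easier to tell a weak agent from a broken environment.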

How `copy_program_template()` composes PROGRAM.md

prepare.py calls copy_program_template(benchmark) during setup. The function reads program_templates/base.md (the shared loop instructions) and appends your benchmark-specific supplement:

def copy_program_template(benchmark: str) -> None:
    """Compose PROGRAM.md from the shared base and the benchmark-specific section."""
    templates = {
        "tau-bench": "program_templates/tau_bench.md",
        "terminal-bench": "program_templates/terminal_bench.md",
        "bird-interact": "program_templates/bird_interact.md",
    }
    template = templates.get(benchmark)
    # ...
    with open("program_templates/base.md") as f:
        base = f.read()
    with open(template) as f:
        benchmark_content = f.read()

    with open("PROGRAM.md", "w") as f:
        f.write(base.rstrip("\n") + "\n\n" + benchmark_content)

base.md covers the universal loop (run → analyze → improve → gate → record → repeat), file ownership rules, and workspace file formats. Your supplement adds benchmark-specific context on top. The coding agent receives the combined file as PROGRAM.md and never needs to know the two parts were separate.
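The join expression is easy to check in isolation. The sketch below lifts it into a pure function (compose_program is a hypothetical name, not part of prepare.py) to show that trailing newlines in base.md are normalized to exactly one blank line before the supplement.

```python
def compose_program(base: str, supplement: str) -> str:
    """Join the shared base and a benchmark supplement with exactly one blank line."""
    return base.rstrip("\n") + "\n\n" + supplement


# Trailing newlines in the base are collapsed before the supplement is appended.
combined = compose_program("# Base loop\n\n\n", "# Terminal-Bench notes\n")
print(combined)
```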

What belongs in a benchmark-specific PROGRAM.md supplement

Judging by program_templates/terminal_bench.md and program_templates/bird_interact.md, a good supplement covers the following:

Trace file paths

Tell the coding agent exactly where to read failure traces. For Harbor-based benchmarks this is:

workspace/traces/latest/<task_name>/trace.json    ← full conversation
workspace/traces/latest/<task_name>/result.json   ← reward, duration, config

Specify what to look for when analyzing a trace: which commands were run, whether the agent understood the task, whether it explored the environment, whether it verified its solution.
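For triage, it can help to list which task directories are worth opening first. The sketch below walks the trace layout shown above and returns tasks whose reward is missing or below a threshold. It assumes the copied result.json keeps the verifier_result.rewards.reward schema; the failing_tasks helper and the threshold of 1.0 (meaning "anything short of full reward") are assumptions for illustration.

```python
import json
from pathlib import Path


def failing_tasks(traces_dir: str = "workspace/traces/latest", threshold: float = 1.0) -> list[str]:
    """Return task directory names whose reward is missing or below threshold."""
    failing: list[str] = []
    root = Path(traces_dir)
    if not root.is_dir():
        return failing
    for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        try:
            data = json.loads((task_dir / "result.json").read_text())
        except (OSError, json.JSONDecodeError):
            failing.append(task_dir.name)  # unreadable result: worth inspecting too
            continue
        vr = data.get("verifier_result") if isinstance(data, dict) else None
        rewards = vr.get("rewards") if isinstance(vr, dict) else None
        reward = rewards.get("reward") if isinstance(rewards, dict) else None
        if reward is None or float(reward) < threshold:
            failing.append(task_dir.name)
    return failing
```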

Task ID format

Clarify how task IDs are formatted so the coding agent can pass them correctly to --task-ids. For Terminal-Bench, task IDs are string names such as cobol-modernization. For tau-bench, they are integers. For BIRD-Interact, they are instance_id strings.

Known improvement techniques

Include benchmark-specific techniques that have been shown to improve scores. terminal_bench.md lists six: environment bootstrapping, enforced TODO planning, non-interactive mode, double-confirmation, progressive reasoning, and forced reasoning in tool schema. Document equivalent techniques for your benchmark so the coding agent has a starting hypothesis for each iteration.

What the agent owns in agent/agent.py

Enumerate the specific variables and functions the coding agent should focus on. For Terminal-Bench this is AGENT_INSTRUCTION, TOOLS, MAX_STEPS, MAX_OUTPUT_CHARS, _truncate(), HarnessAgent.run(), and HarnessAgent.setup(). For your benchmark, list the equivalent optimization targets.

Diff command

Always include the diff command so the coding agent can review its accumulated changes:

diff agent/templates/my_benchmark.py agent/agent.py

Next steps

Custom benchmark runner

If Harbor doesn’t cover your benchmark, subclass BenchmarkRunner directly to integrate any CLI or API.

Agent templates

Learn how to write the agent template and PROGRAM.md supplement that the coding agent starts from.
