auto-harness ships with TerminalBenchRunner, which delegates task execution to the harbor run CLI. Harbor provides isolated, reproducible benchmark environments: each task runs in its own sandboxed container. Because TerminalBenchRunner is driven entirely by configuration, you can point it at any Harbor-compatible dataset without writing a new runner class. The train/test split generation, gating logic, trace copying, and optimization loop all work unchanged.
How TerminalBenchRunner calls Harbor
TerminalBenchRunner.run() in benchmark.py builds a harbor run command from the configuration, then parses the per-task result.json files that Harbor writes to an output directory. It reads each result.json to extract the reward. This parsing logic is the only part that depends on your verifier’s output format.
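The parsing step can be sketched as follows. This is a minimal illustration, not the code in benchmark.py: the function names read_reward and collect_rewards are hypothetical, and the layout of one trial directory per task directly under the job directory is an assumption. The field names and fallbacks follow the result.json schema documented on this page.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def read_reward(result: dict) -> Optional[float]:
    """Extract the reward from one parsed result.json (hypothetical helper)."""
    verifier_result = result.get("verifier_result")
    if verifier_result is None:
        # Missing or null verifier output: the verifier did not run.
        return None
    rewards = verifier_result.get("rewards") or {}
    # An absent "reward" key defaults to 0.0.
    return float(rewards.get("reward", 0.0))


def collect_rewards(job_dir: Path) -> Dict[str, Optional[float]]:
    """Map task name to reward for every result.json under job_dir.

    Assumes one sub-directory per trial, each holding a result.json.
    """
    rewards: Dict[str, Optional[float]] = {}
    for result_path in sorted(Path(job_dir).glob("*/result.json")):
        result = json.loads(result_path.read_text())
        # Fall back to the trial directory name when task_name is absent.
        task_name = result.get("task_name") or result_path.parent.name
        rewards[task_name] = read_reward(result)
    return rewards
```

Keeping the reward extraction in its own small function means a verifier with a different output format only requires changing that one function.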
Four steps to use a different Harbor benchmark
Point to your dataset in experiment_config.yaml
Set benchmark to "terminal-bench" to reuse TerminalBenchRunner, and set dataset to your Harbor dataset identifier.
The benchmark key selects which runner class gating.py and prepare.py instantiate. Keeping it "terminal-bench" means TerminalBenchRunner is used; only the dataset field changes which Harbor dataset is run.
env_provider can be "e2b", "daytona", or "docker". E2B and Daytona require API keys (E2B_API_KEY and DAYTONA_API_KEY, respectively). Docker runs locally and needs no key.
Verify your verifier's result.json schema
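TerminalBenchRunner parses each task’s output against a fixed schema: the reward is read from verifier_result.rewards.reward. The parser lives in TerminalBenchRunner.run() in benchmark.py. If your verifier writes a different result.json, update this parser in TerminalBenchRunner.run(). For example, if your verifier writes {"score": 0.85} at the top level, read the reward from the score key instead.
Update the split directory name (optional)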
TerminalBenchRunner saves the train/test split to tbench_data/task_split.json, controlled by the class constant SPLIT_FILE. Override SPLIT_FILE in your subclass or update the constant directly, then update prepare.py accordingly. The generate_terminal_bench_split() function in prepare.py creates the split during the baseline run. It performs a 70/30 stratified split (by pass/fail) with a fixed seed, so the split is reproducible.
Add a PROGRAM.md supplement
Create program_templates/<your_benchmark>.md with guidance specific to your dataset. Follow the same structure as program_templates/terminal_bench.md:
- Trace file paths (where to read trace.json and result.json)
- Task ID format (string names, integers, or something else)
- Known techniques that improve scores on your benchmark
- A diff command to compare the current agent against the template
Then update copy_program_template() in prepare.py so that it picks up your supplement. copy_program_template() composes PROGRAM.md by concatenating program_templates/base.md and your supplement. The coding agent reads the combined file as its loop instructions.
Example configuration for a custom Harbor benchmark
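A complete experiment_config.yaml for a custom Harbor dataset sets benchmark, dataset, and env_provider as described in the first step. Each task's result.json must then follow the schema that TerminalBenchRunner parses.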
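The constraints on these settings can be checked before a run. The sketch below is an illustration only: check_config is a hypothetical helper, not part of auto-harness, the configuration is represented as a plain dict rather than loaded from YAML, and the dataset identifier is a placeholder. Only the three keys and the API key names mentioned on this page are assumed.

```python
from typing import List, Mapping, Optional

# Environment providers named in the docs, with the API key each one needs.
# None means no key is required (Docker runs locally).
PROVIDER_API_KEYS: Mapping[str, Optional[str]] = {
    "e2b": "E2B_API_KEY",
    "daytona": "DAYTONA_API_KEY",
    "docker": None,
}


def check_config(config: Mapping[str, str], environ: Mapping[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    if config.get("benchmark") != "terminal-bench":
        problems.append('benchmark must be "terminal-bench" to reuse TerminalBenchRunner')
    if not config.get("dataset"):
        problems.append("dataset must name a Harbor dataset")
    provider = config.get("env_provider")
    if provider not in PROVIDER_API_KEYS:
        problems.append(f"unknown env_provider: {provider!r}")
    else:
        api_key = PROVIDER_API_KEYS[provider]
        if api_key and not environ.get(api_key):
            problems.append(f"{api_key} is not set")
    return problems


# Placeholder values; "my-harbor-dataset" is not a real dataset identifier.
example_config = {
    "benchmark": "terminal-bench",
    "dataset": "my-harbor-dataset",
    "env_provider": "docker",
}
```

In practice this would be called as check_config(loaded_config, os.environ), so a missing key is caught before any sandbox is started.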
Expected result.json schema
TerminalBenchRunner reads one result.json per task from the Harbor job output directory. The full schema it expects:
| Field | Type | Description |
|---|---|---|
| task_name | string | Task identifier. Falls back to the trial directory name if absent. |
| verifier_result | object | Verifier output. If missing or null, the task is recorded as None (infra error). |
| verifier_result.rewards | object | Reward container. |
| verifier_result.rewards.reward | float | Task reward in [0.0, 1.0]. Defaults to 0.0 if the key is absent. |
If verifier_result is absent entirely, reward is set to None, which signals that the verifier did not run. This is treated as 0.0 in val_score and reported separately in the benchmark output.
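The handling of None rewards can be illustrated with a small aggregation sketch. The summarize function is hypothetical, and computing val_score as the mean reward over all tasks is an assumption made for the example; the page only states that None counts as 0.0 and is reported separately.

```python
from typing import Dict, List, Optional, Tuple


def summarize(rewards: Dict[str, Optional[float]]) -> Tuple[float, List[str]]:
    """Return (val_score, infra_error_tasks) for a task -> reward mapping.

    Assumption: val_score is the mean reward across all tasks.
    """
    if not rewards:
        return 0.0, []
    # Tasks whose verifier did not run are listed separately...
    infra_errors = sorted(name for name, r in rewards.items() if r is None)
    # ...but still count as 0.0 toward the score.
    total = sum(0.0 if r is None else r for r in rewards.values())
    return total / len(rewards), infra_errors
```

Under this assumption an infra error lowers the score just like a failed task, which is why reporting it separately is useful when diagnosing a regression.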
How copy_program_template() composes PROGRAM.md
prepare.py calls copy_program_template(benchmark) during setup. The function reads program_templates/base.md (the shared loop instructions) and appends your benchmark-specific supplement.
base.md covers the universal loop (run → analyze → improve → gate → record → repeat), file ownership rules, and workspace file formats. Your supplement adds benchmark-specific context on top. The coding agent receives the combined file as PROGRAM.md and never needs to know the two parts were separate.
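The composition amounts to a file concatenation. The following is a minimal sketch, not the actual copy_program_template(): the name compose_program, its parameters, and the blank line placed between the two parts are assumptions. The supplement is passed by file stem (for example terminal_bench) because how the benchmark key maps to a file name is not shown here.

```python
from pathlib import Path


def compose_program(template_dir: Path, supplement_stem: str, dest: Path) -> Path:
    """Write base.md followed by <supplement_stem>.md to dest (hypothetical)."""
    base = (template_dir / "base.md").read_text()
    supplement = (template_dir / f"{supplement_stem}.md").read_text()
    # Separate the two parts with exactly one blank line.
    dest.write_text(base.rstrip("\n") + "\n\n" + supplement)
    return dest
```

A call such as compose_program(Path("program_templates"), "terminal_bench", Path("PROGRAM.md")) would produce the combined file that the coding agent reads.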
What belongs in a benchmark-specific PROGRAM.md supplement
Using program_templates/terminal_bench.md and program_templates/bird_interact.md as references, a good supplement covers:
Trace file paths
Tell the coding agent exactly where to read failure traces. For Harbor-based benchmarks, give the location of each task's trace.json and result.json. Specify what to look for when analyzing a trace: which commands were run, whether the agent understood the task, whether it explored the environment, and whether it verified its solution.
Task ID format
Clarify how task IDs are formatted so the coding agent can pass them correctly to --task-ids. For Terminal-Bench, task IDs are string names (for example, cobol-modernization). For tau-bench, they are integers. For BIRD-Interact, they are instance_id strings.
Known improvement techniques
Include benchmark-specific techniques that have been shown to improve scores. terminal_bench.md lists six: environment bootstrapping, enforced TODO planning, non-interactive mode, double-confirmation, progressive reasoning, and forced reasoning in tool schema. Document equivalent techniques for your benchmark so the coding agent has a starting hypothesis for each iteration.
What the agent owns in agent/agent.py
Enumerate the specific variables and functions the coding agent should focus on. For Terminal-Bench these are AGENT_INSTRUCTION, TOOLS, MAX_STEPS, MAX_OUTPUT_CHARS, _truncate(), HarnessAgent.run(), and HarnessAgent.setup(). For your benchmark, list the equivalent optimization targets.
Diff command
Always include the diff command that compares the current agent against the template, so the coding agent can review its accumulated changes.
Next steps
Custom benchmark runner
If Harbor doesn’t cover your benchmark, subclass BenchmarkRunner directly to integrate any CLI or API.
Agent templates
Learn how to write the agent template and PROGRAM.md supplement that the coding agent starts from.