When you runDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/neosigmaai/auto-harness/llms.txt
Use this file to discover all available pages before exploring further.
python prepare.py, it copies two files that define the entire starting point for the optimization loop: the agent implementation and the coding agent’s loop instructions. Both come from template directories that are read-only during optimization — the coding agent edits only agent/agent.py and workspace/learnings.md, never the templates themselves. Understanding this template system is the key to wiring up a new benchmark correctly.
The template system
prepare.py copies the correct agent/templates/<benchmark>.py to agent/agent.py, then composes PROGRAM.md from program_templates/base.md concatenated with the benchmark-specific supplement. After that, the templates are not touched again — only agent/agent.py changes from iteration to iteration.
The templates in
agent/templates/ are read-only reference points. To see what the coding agent has changed across all iterations, run:How prepare.py copies templates
copy_agent_template() in prepare.py handles the agent file:
copy_program_template() handles PROGRAM.md:
benchmark value comes from the benchmark key in experiment_config.yaml.
The three agent templates
tau_bench.py — tau-bench
The tau-bench template integrates with the tau2 Python API. HarnessAgent extends LLMAgent from tau2.agent.llm_agent:
HarnessAgent with the tau2 registry using a factory function. The domain tools are injected by tau2 at runtime — the coding agent cannot add new tools for tau-bench runs, so optimization focuses on AGENT_INSTRUCTION, the system_prompt property, generate_next_message(), and state management.
terminal_bench.py — Terminal-Bench 2.0
The Terminal-Bench template extends Harbor’s BaseAgent. HarnessAgent receives the task as a string instruction and has access to a BaseEnvironment for executing bash commands:
run() method owns the full agentic loop: it calls the LLM with litellm.acompletion, executes bash tool calls via environment.exec(), manages the message history, and saves the conversation trace to self.logs_dir / "trace.json". Every part of this loop is the coding agent’s optimization target.
bird_interact.py — BIRD-Interact
The BIRD-Interact template is structured differently. Rather than a class the runner instantiates, agent.py exports a build_agent(mode) function that returns a Google ADK Agent. The harness wraps this agent as a FastAPI service via agent/helpers/bird_interact/bird_service.py:
AINTERACT_INSTRUCTION, CINTERACT_INSTRUCTION, and the build_agent() configuration. The external BIRD-Interact-ADK repo is treated as read-only benchmark infrastructure — it is never edited during the optimization loop.
What HarnessAgent must implement
The required interface is determined by the benchmark framework:| Benchmark | Base class | Required methods |
|---|---|---|
| tau-bench | tau2.agent.llm_agent.LLMAgent | system_prompt (property), get_init_state(), generate_next_message() |
| Terminal-Bench 2.0 | harbor.agents.base.BaseAgent | name() (static), version(), setup(), run() |
| BIRD-Interact | — | build_agent(mode) module-level function returning an ADK Agent |
MyBenchmarkRunner. Whatever run() imports from agent.agent is what HarnessAgent must provide.
Creating a custom agent template
To create a new template for a custom benchmark:Copy an existing template as a starting point
Adapt the HarnessAgent interface
Update the imports, base class, and method signatures to match what your
MyBenchmarkRunner expects. Keep the class name HarnessAgent — the runner imports it by that exact name.Preserve the optimization targets as top-level variables: system prompt strings as constants (e.g., AGENT_INSTRUCTION), tool definitions as module-level lists, and loop parameters as named constants. This makes them easy for the coding agent to find and edit.Register the template in prepare.py
Add an entry to both
copy_agent_template() and copy_program_template():The program_templates/ structure
program_templates/base.md is the benchmark-agnostic core of PROGRAM.md. It defines:
- What the coding agent is doing (run → analyze → improve → gate → record → repeat)
- Which files the agent owns (
agent/agent.py,workspace/learnings.md) - The command table (
benchmark.py,gating.py,record.py,prepare.py) - The full loop with exact steps and exit conditions
- The seven rules (only edit allowed files, never skip the gate, one hypothesis per iteration, etc.)
- Workspace file formats (
suite.json,train_results.json,results.tsv)
base.md cannot know:
What belongs in a benchmark-specific supplement
Trace file paths
Trace file paths
Where are
trace.json and result.json written? For Terminal-Bench and BIRD-Interact this is workspace/traces/latest/<task_name>/. For tau-bench there are no file-based traces (the simulation results are returned in memory). Always specify exactly what fields are in the trace and what to look for when diagnosing failures.Task ID format and ad-hoc commands
Task ID format and ad-hoc commands
Specify how task IDs are formatted and how to pass them to
benchmark.py --task-ids. Terminal-Bench uses string names (cobol-modernization), tau-bench uses integers (0 1 42), BIRD-Interact uses instance_id strings.Optimization targets in agent/agent.py
Optimization targets in agent/agent.py
Name the specific variables and methods the coding agent should focus on. For Terminal-Bench:
AGENT_INSTRUCTION, TOOLS, MAX_STEPS, MAX_OUTPUT_CHARS, _truncate(), HarnessAgent.run(), HarnessAgent.setup(). For BIRD-Interact: AINTERACT_INSTRUCTION, CINTERACT_INSTRUCTION, build_agent().Known techniques that improve scores
Known techniques that improve scores
Include empirically validated improvements for your benchmark. The
terminal_bench.md supplement lists six: environment bootstrapping (+5–10%), enforced TODO planning (+10–20%), non-interactive mode (+3–5%), double-confirmation (+3–5%), progressive reasoning (+2–5%), and forced reasoning in the tool schema. Document equivalent findings for your benchmark so each iteration starts from a reasonable prior.Diff command
Diff command
Always include the one-liner to compare the current agent against its starting template:This is how the coding agent reviews the accumulated effect of all its changes between sessions.
What the agent must never do
What the agent must never do
List benchmark-specific prohibitions. For Terminal-Bench this includes: never hardcode
MODEL or AGENT_REASONING_EFFORT, never read traces from workspace/tbench_jobs/, never modify files in agent/templates/ or tbench_data/. For BIRD-Interact: never edit the external BIRD-Interact-ADK repo.Next steps
Custom benchmark runner
Subclass BenchmarkRunner to integrate any benchmark that is not Harbor-based.
Harbor benchmarks
Use a different Harbor dataset without writing a custom runner class.