Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/neosigmaai/auto-harness/llms.txt

Use this file to discover all available pages before exploring further.

When you run python prepare.py, it copies two files that define the entire starting point for the optimization loop: the agent implementation and the coding agent’s loop instructions. Both come from template directories that are read-only during optimization — the coding agent edits only agent/agent.py and workspace/learnings.md, never the templates themselves. Understanding this template system is the key to wiring up a new benchmark correctly.

The template system

agent/templates/
├── tau_bench.py           # tau-bench agent starting point
├── terminal_bench.py      # Terminal-Bench 2.0 agent starting point
└── bird_interact.py       # BIRD-Interact system agent starting point

program_templates/
├── base.md                # shared loop instructions (benchmark-agnostic)
├── tau_bench.md           # tau-bench supplement
├── terminal_bench.md      # Terminal-Bench supplement
└── bird_interact.md       # BIRD-Interact supplement
prepare.py copies the correct agent/templates/<benchmark>.py to agent/agent.py, then composes PROGRAM.md from program_templates/base.md concatenated with the benchmark-specific supplement. After that, the templates are not touched again — only agent/agent.py changes from iteration to iteration.
The templates in agent/templates/ are read-only reference points. To see what the coding agent has changed across all iterations, run:
diff agent/templates/terminal_bench.py agent/agent.py

How prepare.py copies templates

copy_agent_template() in prepare.py handles the agent file:
def copy_agent_template(benchmark: str) -> None:
    """Copy the correct agent template into agent/agent.py."""
    templates = {
        "tau-bench": "agent/templates/tau_bench.py",
        "terminal-bench": "agent/templates/terminal_bench.py",
        "bird-interact": "agent/templates/bird_interact.py",
    }
    template = templates.get(benchmark)
    if not template or not os.path.exists(template):
        print(f"[prepare] ERROR: no agent template for benchmark '{benchmark}'")
        sys.exit(1)

    shutil.copy2(template, "agent/agent.py")
    print(f"[prepare] copied {template} → agent/agent.py")
copy_program_template() handles PROGRAM.md:
def copy_program_template(benchmark: str) -> None:
    """Compose PROGRAM.md from the shared base and the benchmark-specific section."""
    templates = {
        "tau-bench": "program_templates/tau_bench.md",
        "terminal-bench": "program_templates/terminal_bench.md",
        "bird-interact": "program_templates/bird_interact.md",
    }
    template = templates.get(benchmark)
    # ...
    with open("program_templates/base.md") as f:
        base = f.read()
    with open(template) as f:
        benchmark_content = f.read()

    with open("PROGRAM.md", "w") as f:
        f.write(base.rstrip("\n") + "\n\n" + benchmark_content)
    print(f"[prepare] composed PROGRAM.md from program_templates/base.md + {template}")
The benchmark value comes from the benchmark key in experiment_config.yaml.

The three agent templates

tau_bench.py — tau-bench

The tau-bench template integrates with the tau2 Python API. HarnessAgent extends LLMAgent from tau2.agent.llm_agent:
from tau2.agent.llm_agent import LLMAgent
from tau2.data_model.message import AssistantMessage, Message, MultiToolMessage, SystemMessage
from tau2.utils.llm_utils import generate

AGENT_MODEL: str = os.environ.get("AGENT_MODEL", "")

AGENT_INSTRUCTION = """
You are a helpful assistant that completes tasks according to the <policy> provided below.
""".strip()

@dataclass
class HarnessState:
    messages: list[Message] = field(default_factory=list)

class HarnessAgent(LLMAgent):
    """Agent under optimization."""

    @property
    def system_prompt(self) -> str:
        if self.domain_policy:
            return (
                "<instructions>\n"
                f"{AGENT_INSTRUCTION}\n"
                "</instructions>\n"
                "<policy>\n"
                f"{self.domain_policy}\n"
                "</policy>"
            )
        return AGENT_INSTRUCTION

    def get_init_state(
        self, message_history: list[Message] | None = None
    ) -> HarnessState:
        ...

    def generate_next_message(
        self,
        message: ValidAgentInputMessage,
        state: HarnessState,
    ) -> tuple[AssistantMessage, HarnessState]:
        ...
The tau-bench runner registers HarnessAgent with the tau2 registry using a factory function. The domain tools are injected by tau2 at runtime — the coding agent cannot add new tools for tau-bench runs, so optimization focuses on AGENT_INSTRUCTION, the system_prompt property, generate_next_message(), and state management.

terminal_bench.py — Terminal-Bench 2.0

The Terminal-Bench template extends Harbor’s BaseAgent. HarnessAgent receives the task as a string instruction and has access to a BaseEnvironment for executing bash commands:
from harbor.agents.base import BaseAgent
from harbor.environments.base import BaseEnvironment
from harbor.models.agent.context import AgentContext

MAX_STEPS = 80
MAX_OUTPUT_CHARS = 8000
MODEL = os.environ.get("AGENT_MODEL", "gpt-5.4")

AGENT_INSTRUCTION = """\
You are an autonomous terminal agent. You are given a task and a Linux container.
You solve tasks by executing bash commands. Work step by step.
...
"""

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "bash",
            "description": "Execute a bash command in the container. Returns stdout and stderr.",
            ...
        },
    }
]

class HarnessAgent(BaseAgent):
    """Agent under optimization for Terminal-Bench 2.0."""

    @staticmethod
    def name() -> str:
        return "harness-agent"

    def version(self) -> str | None:
        return "0.1.0"

    async def setup(self, environment: BaseEnvironment) -> None:
        pass

    async def run(
        self,
        instruction: str,
        environment: BaseEnvironment,
        context: AgentContext,
    ) -> None:
        ...
The run() method owns the full agentic loop: it calls the LLM with litellm.acompletion, executes bash tool calls via environment.exec(), manages the message history, and saves the conversation trace to self.logs_dir / "trace.json". Every part of this loop is the coding agent’s optimization target.

bird_interact.py — BIRD-Interact

The BIRD-Interact template is structured differently. Rather than a class the runner instantiates, agent.py exports a build_agent(mode) function that returns a Google ADK Agent. The harness wraps this agent as a FastAPI service via agent/helpers/bird_interact/bird_service.py:
from google.adk import Agent
from google.adk.tools import FunctionTool

AINTERACT_INSTRUCTION = """You are a helpful PostgreSQL agent that interacts with a user and a
database to solve the user's question..."""

CINTERACT_INSTRUCTION = """You are a data scientist with great PostgreSQL writing ability..."""

def build_agent(mode: str = "a-interact") -> Agent:
    """Build the BIRD-Interact system agent for the requested mode."""
    if mode == "a-interact":
        return Agent(
            **_agent_kwargs(),
            instruction=AINTERACT_INSTRUCTION,
            tools=get_ainteract_tools(),
            ...
        )
    # c-interact path
    return Agent(
        **_agent_kwargs(),
        instruction=CINTERACT_INSTRUCTION,
        tools=[FunctionTool(ask_user), FunctionTool(submit_sql)],
        ...
    )
Optimization targets are AINTERACT_INSTRUCTION, CINTERACT_INSTRUCTION, and the build_agent() configuration. The external BIRD-Interact-ADK repo is treated as read-only benchmark infrastructure — it is never edited during the optimization loop.

What HarnessAgent must implement

The required interface is determined by the benchmark framework:
BenchmarkBase classRequired methods
tau-benchtau2.agent.llm_agent.LLMAgentsystem_prompt (property), get_init_state(), generate_next_message()
Terminal-Bench 2.0harbor.agents.base.BaseAgentname() (static), version(), setup(), run()
BIRD-Interactbuild_agent(mode) module-level function returning an ADK Agent
For a custom benchmark, you define this interface when you write MyBenchmarkRunner. Whatever run() imports from agent.agent is what HarnessAgent must provide.

Creating a custom agent template

To create a new template for a custom benchmark:
1

Copy an existing template as a starting point

cp agent/templates/terminal_bench.py agent/templates/my_benchmark.py
Choose the existing template whose benchmark framework is closest to yours. The Terminal-Bench template is a good general starting point for any benchmark where the agent runs a tool loop.
2

Adapt the HarnessAgent interface

Update the imports, base class, and method signatures to match what your MyBenchmarkRunner expects. Keep the class name HarnessAgent — the runner imports it by that exact name.Preserve the optimization targets as top-level variables: system prompt strings as constants (e.g., AGENT_INSTRUCTION), tool definitions as module-level lists, and loop parameters as named constants. This makes them easy for the coding agent to find and edit.
3

Register the template in prepare.py

Add an entry to both copy_agent_template() and copy_program_template():
# In copy_agent_template()
templates = {
    "tau-bench": "agent/templates/tau_bench.py",
    "terminal-bench": "agent/templates/terminal_bench.py",
    "bird-interact": "agent/templates/bird_interact.py",
    "my-benchmark": "agent/templates/my_benchmark.py",
}

# In copy_program_template()
templates = {
    "tau-bench": "program_templates/tau_bench.md",
    "terminal-bench": "program_templates/terminal_bench.md",
    "bird-interact": "program_templates/bird_interact.md",
    "my-benchmark": "program_templates/my_benchmark.md",
}
4

Run prepare.py to copy the template into agent/agent.py

python prepare.py
This copies your template to agent/agent.py, composes PROGRAM.md, and runs the baseline benchmark. After this completes, the coding agent has a starting point and loop instructions.

The program_templates/ structure

program_templates/base.md is the benchmark-agnostic core of PROGRAM.md. It defines:
  • What the coding agent is doing (run → analyze → improve → gate → record → repeat)
  • Which files the agent owns (agent/agent.py, workspace/learnings.md)
  • The command table (benchmark.py, gating.py, record.py, prepare.py)
  • The full loop with exact steps and exit conditions
  • The seven rules (only edit allowed files, never skip the gate, one hypothesis per iteration, etc.)
  • Workspace file formats (suite.json, train_results.json, results.tsv)
Each benchmark-specific supplement adds context that base.md cannot know:
program_templates/
├── base.md              # universal loop (never benchmark-specific)
├── tau_bench.md         # tau-bench supplement: task ID format, trace locations, editing guidance
├── terminal_bench.md    # Terminal-Bench supplement: trace paths, techniques, diff command
└── bird_interact.md     # BIRD-Interact supplement: modes, trace format, editing targets

What belongs in a benchmark-specific supplement

Where are trace.json and result.json written? For Terminal-Bench and BIRD-Interact this is workspace/traces/latest/<task_name>/. For tau-bench there are no file-based traces (the simulation results are returned in memory). Always specify exactly what fields are in the trace and what to look for when diagnosing failures.
Specify how task IDs are formatted and how to pass them to benchmark.py --task-ids. Terminal-Bench uses string names (cobol-modernization), tau-bench uses integers (0 1 42), BIRD-Interact uses instance_id strings.
Name the specific variables and methods the coding agent should focus on. For Terminal-Bench: AGENT_INSTRUCTION, TOOLS, MAX_STEPS, MAX_OUTPUT_CHARS, _truncate(), HarnessAgent.run(), HarnessAgent.setup(). For BIRD-Interact: AINTERACT_INSTRUCTION, CINTERACT_INSTRUCTION, build_agent().
Include empirically validated improvements for your benchmark. The terminal_bench.md supplement lists six: environment bootstrapping (+5–10%), enforced TODO planning (+10–20%), non-interactive mode (+3–5%), double-confirmation (+3–5%), progressive reasoning (+2–5%), and forced reasoning in the tool schema. Document equivalent findings for your benchmark so each iteration starts from a reasonable prior.
Always include the one-liner to compare the current agent against its starting template:
diff agent/templates/my_benchmark.py agent/agent.py
This is how the coding agent reviews the accumulated effect of all its changes between sessions.
List benchmark-specific prohibitions. For Terminal-Bench this includes: never hardcode MODEL or AGENT_REASONING_EFFORT, never read traces from workspace/tbench_jobs/, never modify files in agent/templates/ or tbench_data/. For BIRD-Interact: never edit the external BIRD-Interact-ADK repo.

Next steps

Custom benchmark runner

Subclass BenchmarkRunner to integrate any benchmark that is not Harbor-based.

Harbor benchmarks

Use a different Harbor dataset without writing a custom runner class.

Build docs developers (and LLMs) love