Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/neosigmaai/auto-harness/llms.txt

Use this file to discover all available pages before exploring further.

tau-bench is a customer service simulation benchmark where an agent must complete realistic service tasks — issuing refunds, changing flights, updating account plans — by making structured tool calls against a domain-specific policy and database. auto-harness integrates it through the tau2 Python API directly (no subprocess), registering HarnessAgent as a custom agent factory in the tau2 registry. The three supported domains together cover 278 tasks: retail (114), airline (50), and telecom (114).

Agent interface

Unlike Terminal-Bench, the tau-bench agent does not control its own tool list. tau2 injects a fixed set of domain tools at runtime — order lookup, flight rebooking, plan change, and similar operations depending on the domain. Your agent/agent.py implements HarnessAgent, which receives those tools and must decide when and how to call them in response to user messages. The optimization loop can improve the system prompt (AGENT_INSTRUCTION), the message construction logic in generate_next_message(), and the state management in HarnessState. It cannot add new tools for tau-bench runs.

Domains and task counts

DomainTasksDescription
retail114E-commerce orders, returns, and account management
airline50Flight changes, cancellations, and upgrades
telecom114Plan changes, billing disputes, and service requests
Set the domain key in experiment_config.yaml to run one domain at a time.

TauBenchRunner

TauBenchRunner in benchmark.py uses the tau2 Python API directly. It registers HarnessAgent as a custom agent factory under the name "custom_agent" in the tau2 registry, then calls run_domain() with a TextRunConfig.

Constructor

TauBenchRunner(
    domain: str,                          # "retail", "airline", or "telecom"
    agent_model: str | None = None,       # default: env AGENT_MODEL or "gpt-5.4"
    split: str = "test",                  # tau2 split name
    max_concurrency: int = 3,             # simultaneous simulations
    seed: int = 300,                      # random seed for reproducibility
    reasoning_effort: str | None = None,  # passed as AGENT_REASONING_EFFORT
    user_model: str | None = None,        # model for the user simulator; defaults to agent_model
)

How it works

The runner uses a thread lock (_registry_lock) to safely register HarnessAgent in the tau2 registry once per process, even when called from concurrent contexts:
from tau2.data_model.simulation import TextRunConfig
from tau2 import registry
from tau2.run import run_domain
from agent.agent import HarnessAgent

def _create_harness_agent(tools, domain_policy, **kwargs):
    return HarnessAgent(
        tools=tools,
        domain_policy=domain_policy,
        llm=kwargs.get("llm"),
        llm_args=kwargs.get("llm_args"),
    )

with _registry_lock:
    if registry.get_agent_factory("custom_agent") is None:
        registry.register_agent_factory(_create_harness_agent, "custom_agent")

config = TextRunConfig(
    domain=self.domain,
    agent="custom_agent",
    llm_agent=self.agent_model,
    llm_user=self.user_model,
    task_split_name=self.split,
    task_ids=task_ids,
    max_concurrency=self.max_concurrency,
    seed=self.seed,
)
results = run_domain(config)
The return value is a {task_id: reward} dict built from results.simulations:
return {
    str(sim.task_id): float(sim.reward_info.reward) if sim.reward_info else 0.0
    for sim in results.simulations
}

Running specific tasks

tau-bench task IDs are integers. Pass them as strings to run():
python benchmark.py --task-ids 0 1 42
runner = TauBenchRunner(domain="retail", split="train")
results = runner.run(task_ids=["0", "1", "42"])

Data directory

tau2 reads TAU2_DATA_DIR at import time. TauBenchRunner sets this automatically to ./tau2_data/ if the variable is not already set. prepare.py clones the tau2 data repo into that directory on first run.

Configuration

Uncomment and edit the tau-bench block in experiment_config.yaml:
benchmark: "tau-bench"
agent_model: "gpt-5.4"
domain: "retail"               # "retail", "airline", or "telecom"
split: "train"
gate_split: "test"
max_concurrency: 3
threshold: 0.8
reasoning_effort: "medium"     # optional
Required environment variables:
  • OPENAI_API_KEY (or ANTHROPIC_API_KEY for Claude models, GEMINI_API_KEY for Gemini)

Quick start

tau-bench requires Docker for data provisioning. The recommended workflow is via docker compose.
1

Set environment variables

cp .env.example .env
# Set OPENAI_API_KEY in .env
2

Configure the experiment

cp experiment_config.yaml.template experiment_config.yaml
# Uncomment the tau-bench section and set your domain and model
3

Build the Docker image

docker compose build
This installs tau-bench and all dependencies via uv inside the container.
4

Run prepare.py

docker compose run autoeval python prepare.py
This clones tau2 data, copies agent and program templates, and records the baseline score.
5

Start the optimization loop

Point your coding agent at the repo and prompt:
Read PROGRAM.md and start the optimization loop.
The baseline is already recorded. Start from step 2 (analyze failures).
tau-bench uses the split mechanism built into tau2 (task_split_name in TextRunConfig) rather than a local split file. There is no tbench_data/task_split.json equivalent for tau-bench.

Editing agent/agent.py

The tau-bench agent template at agent/templates/tau_bench.py is the starting point. The coding agent can improve:
  • AGENT_INSTRUCTION — the system prompt describing policy adherence, tool usage, and conversation strategy
  • generate_next_message() — how the agent constructs its next message given conversation history
  • HarnessState — state management across multi-turn conversations
AGENT_MODEL and AGENT_REASONING_EFFORT are set by the harness from experiment_config.yaml. Do not hardcode these values in agent/agent.py.

Build docs developers (and LLMs) love