tau-bench is a customer service simulation benchmark where an agent must complete realistic service tasks — issuing refunds, changing flights, updating account plans — by making structured tool calls against a domain-specific policy and database. auto-harness integrates it through the tau2 Python API directly (no subprocess), registeringDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/neosigmaai/auto-harness/llms.txt
Use this file to discover all available pages before exploring further.
HarnessAgent as a custom agent factory in the tau2 registry. The three supported domains together cover 278 tasks: retail (114), airline (50), and telecom (114).
Agent interface
Unlike Terminal-Bench, the tau-bench agent does not control its own tool list. tau2 injects a fixed set of domain tools at runtime — order lookup, flight rebooking, plan change, and similar operations depending on the domain. Youragent/agent.py implements HarnessAgent, which receives those tools and must decide when and how to call them in response to user messages.
The optimization loop can improve the system prompt (AGENT_INSTRUCTION), the message construction logic in generate_next_message(), and the state management in HarnessState. It cannot add new tools for tau-bench runs.
Domains and task counts
| Domain | Tasks | Description |
|---|---|---|
retail | 114 | E-commerce orders, returns, and account management |
airline | 50 | Flight changes, cancellations, and upgrades |
telecom | 114 | Plan changes, billing disputes, and service requests |
domain key in experiment_config.yaml to run one domain at a time.
TauBenchRunner
TauBenchRunner in benchmark.py uses the tau2 Python API directly. It registers HarnessAgent as a custom agent factory under the name "custom_agent" in the tau2 registry, then calls run_domain() with a TextRunConfig.
Constructor
How it works
The runner uses a thread lock (_registry_lock) to safely register HarnessAgent in the tau2 registry once per process, even when called from concurrent contexts:
{task_id: reward} dict built from results.simulations:
Running specific tasks
tau-bench task IDs are integers. Pass them as strings torun():
Data directory
tau2 readsTAU2_DATA_DIR at import time. TauBenchRunner sets this automatically to ./tau2_data/ if the variable is not already set. prepare.py clones the tau2 data repo into that directory on first run.
Configuration
Uncomment and edit the tau-bench block inexperiment_config.yaml:
OPENAI_API_KEY(orANTHROPIC_API_KEYfor Claude models,GEMINI_API_KEYfor Gemini)
Quick start
tau-bench requires Docker for data provisioning. The recommended workflow is viadocker compose.
Run prepare.py
tau-bench uses the split mechanism built into tau2 (
task_split_name in TextRunConfig) rather than a local split file. There is no tbench_data/task_split.json equivalent for tau-bench.Editing agent/agent.py
The tau-bench agent template atagent/templates/tau_bench.py is the starting point. The coding agent can improve:
AGENT_INSTRUCTION— the system prompt describing policy adherence, tool usage, and conversation strategygenerate_next_message()— how the agent constructs its next message given conversation historyHarnessState— state management across multi-turn conversations
AGENT_MODEL and AGENT_REASONING_EFFORT are set by the harness from experiment_config.yaml. Do not hardcode these values in agent/agent.py.