tau-bench: customer service agent simulation benchmark

tau-bench is a customer service simulation benchmark where an agent must complete realistic service tasks — issuing refunds, changing flights, updating account plans — by making structured tool calls against a domain-specific policy and database. auto-harness integrates it through the tau2 Python API directly (no subprocess), registering HarnessAgent as a custom agent factory in the tau2 registry. The three supported domains together cover 278 tasks: retail (114), airline (50), and telecom (114).

Agent interface

Unlike Terminal-Bench, the tau-bench agent does not control its own tool list. tau2 injects a fixed set of domain tools at runtime — order lookup, flight rebooking, plan change, and similar operations depending on the domain. Your agent/agent.py implements HarnessAgent, which receives those tools and must decide when and how to call them in response to user messages. The optimization loop can improve the system prompt (AGENT_INSTRUCTION), the message construction logic in generate_next_message(), and the state management in HarnessState. It cannot add new tools for tau-bench runs.

Domains and task counts

Domain	Tasks	Description
`retail`	114	E-commerce orders, returns, and account management
`airline`	50	Flight changes, cancellations, and upgrades
`telecom`	114	Plan changes, billing disputes, and service requests

Set the domain key in experiment_config.yaml to run one domain at a time.

TauBenchRunner

TauBenchRunner in benchmark.py uses the tau2 Python API directly. It registers HarnessAgent as a custom agent factory under the name "custom_agent" in the tau2 registry, then calls run_domain() with a TextRunConfig.

Constructor

TauBenchRunner(
    domain: str,                          # "retail", "airline", or "telecom"
    agent_model: str | None = None,       # default: env AGENT_MODEL or "gpt-5.4"
    split: str = "test",                  # tau2 split name
    max_concurrency: int = 3,             # simultaneous simulations
    seed: int = 300,                      # random seed for reproducibility
    reasoning_effort: str | None = None,  # passed as AGENT_REASONING_EFFORT
    user_model: str | None = None,        # model for the user simulator; defaults to agent_model
)

How it works

The runner uses a thread lock (_registry_lock) to safely register HarnessAgent in the tau2 registry once per process, even when called from concurrent contexts:

from tau2.data_model.simulation import TextRunConfig
from tau2 import registry
from tau2.run import run_domain
from agent.agent import HarnessAgent

def _create_harness_agent(tools, domain_policy, **kwargs):
    return HarnessAgent(
        tools=tools,
        domain_policy=domain_policy,
        llm=kwargs.get("llm"),
        llm_args=kwargs.get("llm_args"),
    )

with _registry_lock:
    if registry.get_agent_factory("custom_agent") is None:
        registry.register_agent_factory(_create_harness_agent, "custom_agent")

config = TextRunConfig(
    domain=self.domain,
    agent="custom_agent",
    llm_agent=self.agent_model,
    llm_user=self.user_model,
    task_split_name=self.split,
    task_ids=task_ids,
    max_concurrency=self.max_concurrency,
    seed=self.seed,
)
results = run_domain(config)

The return value is a {task_id: reward} dict built from results.simulations:

return {
    str(sim.task_id): float(sim.reward_info.reward) if sim.reward_info else 0.0
    for sim in results.simulations
}

Running specific tasks

tau-bench task IDs are integers. Pass them as strings to run():

python benchmark.py --task-ids 0 1 42

runner = TauBenchRunner(domain="retail", split="train")
results = runner.run(task_ids=["0", "1", "42"])

Data directory

tau2 reads TAU2_DATA_DIR at import time. TauBenchRunner sets this automatically to ./tau2_data/ if the variable is not already set. prepare.py clones the tau2 data repo into that directory on first run.

Configuration

Uncomment and edit the tau-bench block in experiment_config.yaml:

benchmark: "tau-bench"
agent_model: "gpt-5.4"
domain: "retail"               # "retail", "airline", or "telecom"
split: "train"
gate_split: "test"
max_concurrency: 3
threshold: 0.8
reasoning_effort: "medium"     # optional

Required environment variables:

OPENAI_API_KEY (or ANTHROPIC_API_KEY for Claude models, GEMINI_API_KEY for Gemini)

Quick start

tau-bench requires Docker for data provisioning. The recommended workflow is via docker compose.

Set environment variables

cp .env.example .env
# Set OPENAI_API_KEY in .env

Configure the experiment

cp experiment_config.yaml.template experiment_config.yaml
# Uncomment the tau-bench section and set your domain and model

Build the Docker image

docker compose build

This installs tau-bench and all dependencies via uv inside the container.

Run prepare.py

docker compose run autoeval python prepare.py

This clones tau2 data, copies agent and program templates, and records the baseline score.

Start the optimization loop

Point your coding agent at the repo and prompt:

Read PROGRAM.md and start the optimization loop.
The baseline is already recorded. Start from step 2 (analyze failures).

tau-bench uses the split mechanism built into tau2 (task_split_name in TextRunConfig) rather than a local split file. There is no tbench_data/task_split.json equivalent for tau-bench.

Editing agent/agent.py

The tau-bench agent template at agent/templates/tau_bench.py is the starting point. The coding agent can improve:

AGENT_INSTRUCTION — the system prompt describing policy adherence, tool usage, and conversation strategy
generate_next_message() — how the agent constructs its next message given conversation history
HarnessState — state management across multi-turn conversations

AGENT_MODEL and AGENT_REASONING_EFFORT are set by the harness from experiment_config.yaml. Do not hardcode these values in agent/agent.py.

Get Started

Core Concepts

Benchmarks

Extending

tau-bench: customer service agent simulation benchmark

Agent interface

Domains and task counts

TauBenchRunner

Constructor

How it works

Running specific tasks

Data directory

Configuration

Quick start

Editing agent/agent.py

Build docs developers (and LLMs) love

Get Started

Core Concepts

Benchmarks

Extending

Documentation Index

​Agent interface

​Domains and task counts

​TauBenchRunner

​Constructor

​How it works

​Running specific tasks

​Data directory

​Configuration

​Quick start

​Editing agent/agent.py

Build docs developers (and LLMs) love

Agent interface

Domains and task counts

TauBenchRunner

Constructor

How it works

Running specific tasks

Data directory

Configuration

Quick start

Editing agent/agent.py