Run auto-harness on tau-bench: setup and quickstart

tau-bench is a customer service benchmark built on the tau2 framework, where agents must complete tasks through structured tool calls against a simulated retail, airline, or telecom backend. auto-harness wraps tau-bench inside Docker so its dependencies are fully isolated, then runs its usual optimization loop: baseline → analyze → improve → gate → record → repeat. On this benchmark, the loop improved agent score from 0.56 to 0.78 (a gain of 0.22, or roughly 39% relative to the baseline) through automated failure mining. This page walks through the complete setup.
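
The size of the quoted improvement can be checked with a few lines of arithmetic. The sketch below uses only the two scores given above; the variable names are illustrative.

```python
# Scores quoted in the paragraph above.
baseline = 0.56
improved = 0.78

absolute_gain = improved - baseline        # difference in score
relative_gain = absolute_gain / baseline   # gain as a fraction of the baseline

print(f"absolute gain: {absolute_gain:.2f}")   # absolute gain: 0.22
print(f"relative gain: {relative_gain:.1%}")   # relative gain: 39.3%
```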

Requirements

Docker (and Docker Compose) — tau-bench and all its dependencies run inside a container
OPENAI_API_KEY (or ANTHROPIC_API_KEY / GEMINI_API_KEY depending on your model)
A coding agent — Claude Code, Codex CLI, or any agent that can read files and run shell commands

tau-bench data is fetched automatically. On first run, prepare.py clones the tau2-bench repository and extracts the task files for your configured domain into tau2_data/. No manual download is required.

Domains

Domain	Tasks	Description
`retail`	114	E-commerce customer service — orders, returns, product queries
`airline`	50	Airline customer service — bookings, cancellations, seat changes
`telecom`	114	Telecom customer service — plans, billing, account management

Set the domain in experiment_config.yaml. To run experiments on several domains, use a separate checkout or a separate config file for each one.

Setup

Clone the repository

git clone https://github.com/neosigmaai/auto-harness
cd auto-harness

Set up environment variables

cp .env.example .env

Open .env and set your LLM API key:

# LLM API keys — set whichever your agent_model needs
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=      # set if using a claude model
GEMINI_API_KEY=         # set if using a gemini model

tau-bench does not require a sandbox provider key — tasks run inside Docker.
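
As a rough illustration of "set whichever your agent_model needs", the sketch below maps a model name to the key variable it implies and checks that the variable is non-empty. The prefix table and both function names are assumptions made for this example; the check that prepare.py actually performs may work differently.

```python
import os

# Hypothetical prefix-to-variable mapping, for illustration only.
KEY_FOR_PREFIX = {
    "gpt": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}

def required_key(agent_model: str) -> str:
    """Return the name of the environment variable a model name implies."""
    for prefix, env_var in KEY_FOR_PREFIX.items():
        if agent_model.lower().startswith(prefix):
            return env_var
    raise ValueError(f"no known API key variable for model {agent_model!r}")

def key_is_set(agent_model: str, env=os.environ) -> bool:
    """True when the implied variable exists and is not blank."""
    return bool(env.get(required_key(agent_model), "").strip())
```

Calling key_is_set("gpt-5.4") before building the image is a cheap way to catch an empty .env entry early.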

Configure the experiment

cp experiment_config.yaml.template experiment_config.yaml

Open experiment_config.yaml and uncomment the tau-bench section:

benchmark: "tau-bench"
agent_model: "gpt-5.4"
domain: "retail"               # "retail", "airline", or "telecom"
split: "train"
gate_split: "test"
max_concurrency: 3
threshold: 0.8
reasoning_effort: "medium"     # optional

Start with domain: "retail". With 114 tasks it is tied with telecom for the largest task set, which gives the optimization loop more training signal than airline (50 tasks). The max_concurrency: 3 default is conservative; tau-bench tasks are sequential tool-call conversations and run quickly.
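
Before building, it can help to sanity-check the values you entered. The following is a minimal sketch that validates a config already loaded into a dict. The function name is hypothetical, and the rules it applies (threshold between 0 and 1, max_concurrency at least 1) are assumptions rather than documented constraints.

```python
VALID_DOMAINS = {"retail", "airline", "telecom"}

def validate_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the config looks sane."""
    problems = []
    if cfg.get("benchmark") != "tau-bench":
        problems.append("benchmark should be 'tau-bench'")
    if cfg.get("domain") not in VALID_DOMAINS:
        problems.append(f"domain must be one of {sorted(VALID_DOMAINS)}")
    threshold = cfg.get("threshold")
    # Assumed range: a pass-rate style threshold between 0 and 1.
    if not isinstance(threshold, (int, float)) or not 0 <= threshold <= 1:
        problems.append("threshold should be a number between 0 and 1")
    concurrency = cfg.get("max_concurrency")
    if not isinstance(concurrency, int) or concurrency < 1:
        problems.append("max_concurrency should be a positive integer")
    return problems

# The values from the config shown above.
example = {
    "benchmark": "tau-bench",
    "agent_model": "gpt-5.4",
    "domain": "retail",
    "split": "train",
    "gate_split": "test",
    "max_concurrency": 3,
    "threshold": 0.8,
}
print(validate_config(example))  # []
```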

Build the Docker image

docker compose build

This installs tau-bench and all its Python dependencies via uv inside the container. The build only needs to run once (or when you update dependencies).

Initialize the workspace and run the baseline

docker compose run autoeval python prepare.py

prepare.py does the following in order:

Validates your LLM API key is set in the environment
Clones tau2-bench and extracts task data for your configured domain into tau2_data/ (if not already present)
Creates workspace/ and initializes suite.json, learnings.md, results.tsv, and train_results.json
Copies agent/templates/tau_bench.py into agent/agent.py as the starting point
Composes PROGRAM.md from program_templates/base.md + program_templates/tau_bench.md
Runs the test split tasks and records the baseline score as iteration 0

Once complete, you will see output like:

[prepare] tau2 data OK: tau2_data (domain=retail)
[prepare] baseline val_score=0.5600 (20/35 passed) — recorded as iteration 0

[prepare] done. Ready to start the optimization loop.
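
If you want to capture the baseline programmatically, the score line can be parsed with a regular expression. This is a sketch that assumes the log format matches the sample output above; parse_baseline is a hypothetical helper, not part of auto-harness.

```python
import re

# Matches the "val_score=... (passed/total passed)" portion of the sample line.
BASELINE_RE = re.compile(r"val_score=(\d+\.\d+) \((\d+)/(\d+) passed\)")

def parse_baseline(line: str):
    """Return (val_score, passed, total), or None if the line does not match."""
    match = BASELINE_RE.search(line)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2)), int(match.group(3))

# Start of the sample line shown above.
sample = "[prepare] baseline val_score=0.5600 (20/35 passed)"
print(parse_baseline(sample))  # (0.56, 20, 35)
```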

Start the optimization loop

Point your coding agent at the repository and use the following prompt:

Read PROGRAM.md and start the optimization loop.
The baseline is already recorded. Start from step 2 (analyze failures).

The agent will:

Run python benchmark.py (inside the Docker container via docker compose run autoeval) to get train-split results
Read train-split traces to diagnose root causes
Edit agent/agent.py with one focused improvement
Run python gating.py to gate the change — three steps: regression suite, full test score, suite promotion
If the gate passes: commit, run python record.py, update workspace/learnings.md
If the gate fails: revert with git checkout agent/agent.py and try a different approach
Repeat
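
The control flow of these steps can be sketched as a small driver. Everything here is illustrative: run_loop and its callback parameters are hypothetical stand-ins for the real commands, and max_iterations is an added bound so the example terminates (the loop described above simply repeats).

```python
from typing import Callable, List

def run_loop(
    improve: Callable[[], None],
    gate: Callable[[], bool],
    keep: Callable[[], None],
    revert: Callable[[], None],
    max_iterations: int,
) -> int:
    """Run improve -> gate -> keep/revert cycles; return how many changes were kept."""
    kept = 0
    for _ in range(max_iterations):
        improve()        # analyze train-split failures, edit agent/agent.py
        if gate():       # stands in for `python gating.py`
            keep()       # commit, run record.py, update workspace/learnings.md
            kept += 1
        else:
            revert()     # stands in for `git checkout agent/agent.py`
    return kept

# Dry run with canned gate outcomes instead of real commands.
log: List[str] = []
outcomes = iter([True, False, True])
kept = run_loop(
    improve=lambda: log.append("improve"),
    gate=lambda: next(outcomes),
    keep=lambda: log.append("keep"),
    revert=lambda: log.append("revert"),
    max_iterations=3,
)
print(kept, log)
```

The point of the structure is that a failed gate leaves agent/agent.py exactly as it was, so each iteration starts from the last accepted version.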

Running individual tasks

To test specific task IDs during development:

docker compose run autoeval python benchmark.py --task-ids <task-id> <task-id>

Running the loop commands

All Python commands run inside the Docker container. The agent/ and workspace/ directories are mounted as volumes, so edits to agent/agent.py on the host are immediately visible inside the container.

# Run full train benchmark
docker compose run autoeval python benchmark.py

# Gate a change
docker compose run autoeval python gating.py

# Record a result
docker compose run autoeval python record.py --val-score 0.6200 --evals-passed 4 --evals-total 5

If your coding agent runs commands directly on the host (not inside the container), configure it to prefix all python commands with docker compose run autoeval. Most agents support this via a wrapper script or a custom tool.
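
One way to build such a wrapper is a small function that rewrites a command before it is executed. This is a minimal sketch; containerize is a hypothetical name, and it only handles commands whose first word is python.

```python
import shlex

# Prefix taken from the commands shown above.
CONTAINER_PREFIX = ["docker", "compose", "run", "autoeval"]

def containerize(command: str) -> list[str]:
    """Prefix a `python ...` command so it runs in the autoeval service."""
    args = shlex.split(command)
    if not args or args[0] != "python":
        return args  # leave non-Python commands (git, diff, ...) untouched
    return CONTAINER_PREFIX + args

print(containerize("python gating.py"))
# ['docker', 'compose', 'run', 'autoeval', 'python', 'gating.py']
```

The returned list can be passed straight to subprocess.run(..., check=True), which avoids shell quoting issues.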

Tracking progress

After each successful gate pass, check these files:

workspace/results.tsv — iteration history; compare val_score across iterations
workspace/learnings.md — what the agent tried, what worked, requests to the human
workspace/suite.json — the growing regression suite of tasks the agent must always pass
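
Comparing val_score across iterations is easy to script. The sketch below assumes results.tsv is tab-separated with a header row containing iteration and val_score columns; those column names are a guess, so check the header of your own file first. The sample rows reuse the two scores that appear in the examples on this page.

```python
import csv
import io

# Illustrative stand-in for workspace/results.tsv (column names are assumed).
SAMPLE_TSV = "iteration\tval_score\n0\t0.5600\n1\t0.6200\n"

def score_deltas(tsv_text: str) -> list[float]:
    """Return the change in val_score between consecutive iterations."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    scores = [float(row["val_score"]) for row in rows]
    return [round(later - earlier, 4) for earlier, later in zip(scores, scores[1:])]

print(score_deltas(SAMPLE_TSV))  # [0.06]
```

To run it on the real file, pass the result of open("workspace/results.tsv").read() instead of SAMPLE_TSV.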

To see what changed from the starting template:

diff agent/templates/tau_bench.py agent/agent.py
