tau-bench is a customer service benchmark built on the tau2 framework, where agents must complete tasks through structured tool calls against a simulated retail, airline, or telecom backend. auto-harness wraps tau-bench inside Docker so its dependencies are fully isolated, then runs the same optimization loop: baseline → analyze → improve → gate → record → repeat. On this benchmark, the loop improved agent score from 0.56 to 0.78 (~40% jump) through automated failure mining. This page walks through the complete setup.
Requirements
- Docker (and Docker Compose): tau-bench and all its dependencies run inside a container
- `OPENAI_API_KEY` (or `ANTHROPIC_API_KEY` / `GEMINI_API_KEY`, depending on your model)
- A coding agent: Claude Code, Codex CLI, or any agent that can read files and run shell commands
tau-bench data is fetched automatically. On first run, `prepare.py` clones the tau2-bench repository and extracts the task files for your configured domain into `tau2_data/`. No manual download is required.
Domains
| Domain | Tasks | Description |
|---|---|---|
| `retail` | 114 | E-commerce customer service: orders, returns, product queries |
| `airline` | 50 | Airline customer service: bookings, cancellations, seat changes |
| `telecom` | 114 | Telecom customer service: plans, billing, account management |
The domain is set in `experiment_config.yaml`. To run experiments for different domains, use separate checkouts or separate config files.
Setup
Set up environment variables
In `.env`, set your LLM API key.
Build the Docker image
The image build installs dependencies with `uv` inside the container. The build only needs to run once (or again when you update dependencies).
Initialize the workspace and run the baseline
`prepare.py` does the following in order:
- Validates your LLM API key is set in the environment
- Clones `tau2-bench` and extracts task data for your configured domain into `tau2_data/` (if not already present)
- Creates `workspace/` and initializes `suite.json`, `learnings.md`, `results.tsv`, and `train_results.json`
- Copies `agent/templates/tau_bench.py` into `agent/agent.py` as the starting point
- Composes `PROGRAM.md` from `program_templates/base.md` + `program_templates/tau_bench.md`
- Runs the test split tasks and records the baseline score as iteration 0
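The first and third steps above can be illustrated with a minimal sketch. This is not the actual `prepare.py`: the function names `find_api_key` and `init_workspace`, the placeholder file contents, and the choice to skip files that already exist are all assumptions made for illustration. Only the key names and file names come from this page.

```python
from pathlib import Path

# Key names come from the Requirements list; which one applies depends on the model.
API_KEY_NAMES = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY")

# File names come from the step list; the initial contents are placeholders (assumption).
WORKSPACE_FILES = {
    "suite.json": "[]\n",
    "learnings.md": "",
    "results.tsv": "",
    "train_results.json": "{}\n",
}


def find_api_key(env):
    """Return the name of the first LLM API key set in `env`, or raise."""
    for name in API_KEY_NAMES:
        if env.get(name):
            return name
    raise RuntimeError("no LLM API key set; expected one of " + ", ".join(API_KEY_NAMES))


def init_workspace(root):
    """Create workspace/ under `root` plus any missing files; return the names created."""
    workspace = Path(root) / "workspace"
    workspace.mkdir(parents=True, exist_ok=True)
    created = []
    for name, content in WORKSPACE_FILES.items():
        path = workspace / name
        if not path.exists():  # assumption: leave files from an earlier run untouched
            path.write_text(content)
            created.append(name)
    return created
```

Calling `find_api_key(os.environ)` before any other work makes a missing key fail fast, before the repository clone or the baseline run starts.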
Start the optimization loop
Point your coding agent at the repository and give it the starting prompt.
The agent will:
- Run `python benchmark.py` (inside the Docker container via `docker compose run autoeval`) to get train-split results
- Read train-split traces to diagnose root causes
- Edit `agent/agent.py` with one focused improvement
- Run `python gating.py` to gate the change in three steps: regression suite, full test score, suite promotion
- If the gate passes: commit, run `python record.py`, update `workspace/learnings.md`
- If the gate fails: revert with `git checkout agent/agent.py` and try a different approach
- Repeat
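The three gate steps can be sketched as a single decision function. This is a hedged illustration, not the logic of `gating.py`: the function name `gate`, its arguments, the rule that the full test score must not drop below the best recorded score, and the rule that every passing task is promoted into the suite are all assumptions.

```python
def gate(suite, test_results, best_score):
    """Decide whether a candidate change passes; return (passed, new_suite, score).

    suite: set of task IDs that must always pass (the regression suite)
    test_results: dict mapping task ID -> True/False for the full test split
    best_score: best full-test score recorded so far
    """
    if not test_results:
        raise ValueError("test_results must not be empty")

    # Step 1: regression suite. Every task already in the suite must still pass.
    if any(not test_results.get(task, False) for task in suite):
        return False, set(suite), None

    # Step 2: full test score. Assumed rule: the pass rate must not drop.
    score = sum(test_results.values()) / len(test_results)
    if score < best_score:
        return False, set(suite), score

    # Step 3: suite promotion. Assumed rule: passing tasks join the suite.
    promoted = set(suite) | {task for task, ok in test_results.items() if ok}
    return True, promoted, score
```

A caller would commit and record on `passed == True`, and revert `agent/agent.py` otherwise, mirroring the last steps of the list above.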
Running individual tasks
Specific task IDs can be tested individually during development.
Running the loop commands
All Python commands run inside the Docker container. The `agent/` and `workspace/` directories are mounted as volumes, so edits to `agent/agent.py` on the host are immediately visible inside the container.
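When commands are issued from the host, a small helper can route Python commands into the container. The sketch below is one possible shape for such a helper. The function name `route_command` is hypothetical, and passing non-Python commands (such as `git`) through unchanged is an assumption. The `docker compose run autoeval` prefix is the one used on this page.

```python
import shlex

# Prefix taken from this page; `autoeval` is the Compose service name it mentions.
CONTAINER_PREFIX = ["docker", "compose", "run", "autoeval"]


def route_command(command):
    """Return the argv list to execute for `command` (a string or a list of args)."""
    args = shlex.split(command) if isinstance(command, str) else list(command)
    if not args:
        raise ValueError("empty command")
    if args[0] != "python":
        return args  # assumption: non-Python commands run directly on the host
    return CONTAINER_PREFIX + args
```

The result can be passed straight to `subprocess.run(route_command("python gating.py"), check=True)`. Returning a list rather than a string avoids shell quoting issues.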
If your coding agent runs commands directly on the host (not inside the container), configure it to prefix all `python` commands with `docker compose run autoeval`. Most agents support this via a wrapper script or a custom tool.
Tracking progress
After each successful gate pass:
- `workspace/results.tsv`: iteration history; compare `val_score` across iterations
- `workspace/learnings.md`: what the agent tried, what worked, requests to the human
- `workspace/suite.json`: the growing regression suite of tasks the agent must always pass
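Comparing `val_score` across iterations can be done with a few lines of Python. The sketch assumes `results.tsv` has a header row with an `iteration` column alongside `val_score`. Only `val_score` is named on this page, so the other column name, like the helper names `score_history` and `score_deltas`, is hypothetical and may need adjusting to the real file.

```python
import csv
import io


def score_history(tsv_text):
    """Parse tab-separated text into a list of (iteration, val_score) pairs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Assumed columns: `iteration` (int) and `val_score` (float).
    return [(int(row["iteration"]), float(row["val_score"])) for row in reader]


def score_deltas(history):
    """Return (iteration, change in val_score since the previous iteration)."""
    return [
        (iteration, round(score - previous, 4))
        for (_, previous), (iteration, score) in zip(history, history[1:])
    ]
```

For a file on disk, pass `Path("workspace/results.tsv").read_text()` to `score_history` and feed the result to `score_deltas` to see which iterations moved the score.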