tau-bench is a customer service benchmark built on the tau2 framework, in which agents must complete tasks through structured tool calls against a simulated retail, airline, or telecom backend. auto-harness wraps tau-bench inside Docker so its dependencies are fully isolated, then runs the same optimization loop: baseline → analyze → improve → gate → record → repeat. On this benchmark, the loop improved the agent score from 0.56 to 0.78 (a relative gain of roughly 40%) through automated failure mining. This page walks through the complete setup.
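The quoted gain is relative rather than absolute. A quick check of the arithmetic, using only the two scores stated above (the helper name is illustrative, not part of auto-harness):

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the change from baseline to improved as a fraction of baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


gain = relative_improvement(0.56, 0.78)
# Absolute gain is 0.22 points; relative gain is about 39.3%.
print(f"absolute: {0.78 - 0.56:.2f}, relative: {gain:.1%}")
```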

Requirements

  • Docker (and Docker Compose) — tau-bench and all its dependencies run inside a container
  • OPENAI_API_KEY (or ANTHROPIC_API_KEY / GEMINI_API_KEY depending on your model)
  • A coding agent — Claude Code, Codex CLI, or any agent that can read files and run shell commands
tau-bench data is fetched automatically. On first run, prepare.py clones the tau2-bench repository and extracts the task files for your configured domain into tau2_data/. No manual download is required.

Domains

Domain    Tasks   Description
retail    114     E-commerce customer service: orders, returns, product queries
airline   50      Airline customer service: bookings, cancellations, seat changes
telecom   114     Telecom customer service: plans, billing, account management
You configure which domain to run in experiment_config.yaml. You can run separate experiments for different domains by using separate checkouts or config files.

Setup

1. Clone the repository

git clone https://github.com/neosigmaai/auto-harness
cd auto-harness
2. Set up environment variables

cp .env.example .env
Open .env and set your LLM API key:
# LLM API keys — set whichever your agent_model needs
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=      # set if using a claude model
GEMINI_API_KEY=         # set if using a gemini model
tau-bench does not require a sandbox provider key — tasks run inside Docker.
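Before building anything, it can help to confirm that the key matching your model is actually set. The sketch below is not taken from prepare.py; the prefix-to-key mapping is an assumption made for illustration, based only on the three key names listed above.

```python
import os

# Hypothetical mapping from a model-name prefix to the key it needs.
# auto-harness may resolve this differently.
KEY_FOR_PREFIX = {
    "gpt": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def required_key(agent_model: str) -> str:
    """Return the environment variable name assumed for this model."""
    for prefix, key in KEY_FOR_PREFIX.items():
        if agent_model.lower().startswith(prefix):
            return key
    raise ValueError(f"no known API key for model {agent_model!r}")


def key_is_set(agent_model: str, env=os.environ) -> bool:
    """True if the assumed key exists and is not blank."""
    return bool(env.get(required_key(agent_model), "").strip())
```

For example, `key_is_set("gpt-5.4")` reports whether OPENAI_API_KEY is present in the current environment.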
3. Configure the experiment

cp experiment_config.yaml.template experiment_config.yaml
Open experiment_config.yaml and uncomment the tau-bench section:
benchmark: "tau-bench"
agent_model: "gpt-5.4"
domain: "retail"               # "retail", "airline", or "telecom"
split: "train"
gate_split: "test"
max_concurrency: 3
threshold: 0.8
reasoning_effort: "medium"     # optional
Start with domain: "retail". At 114 tasks it is tied with telecom as the largest domain, which gives the optimization loop more training signal than airline (50 tasks). The max_concurrency: 3 default is conservative; tau-bench tasks are sequential tool-call conversations and run quickly.
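A small sanity check on the configuration values can catch typos before a long run. This is a sketch that operates on an already-parsed dictionary; the rules it applies (for instance, that threshold lies between 0 and 1) are assumptions inferred from the example values above, not documented constraints.

```python
VALID_DOMAINS = {"retail", "airline", "telecom"}


def check_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    if cfg.get("benchmark") != "tau-bench":
        problems.append("benchmark should be 'tau-bench'")
    if cfg.get("domain") not in VALID_DOMAINS:
        problems.append("domain should be one of: " + ", ".join(sorted(VALID_DOMAINS)))
    threshold = cfg.get("threshold")
    # Assumption: threshold is a fraction in [0, 1].
    if not isinstance(threshold, (int, float)) or not 0 <= threshold <= 1:
        problems.append("threshold should be a number between 0 and 1")
    concurrency = cfg.get("max_concurrency")
    if not isinstance(concurrency, int) or concurrency < 1:
        problems.append("max_concurrency should be a positive integer")
    return problems


example = {
    "benchmark": "tau-bench",
    "domain": "retail",
    "threshold": 0.8,
    "max_concurrency": 3,
}
print(check_config(example))
```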
4. Build the Docker image

docker compose build
This installs tau-bench and all its Python dependencies via uv inside the container. The build only needs to run once (or when you update dependencies).
5. Initialize the workspace and run the baseline

docker compose run autoeval python prepare.py
prepare.py does the following in order:
  1. Validates your LLM API key is set in the environment
  2. Clones tau2-bench and extracts task data for your configured domain into tau2_data/ (if not already present)
  3. Creates workspace/ and initializes suite.json, learnings.md, results.tsv, and train_results.json
  4. Copies agent/templates/tau_bench.py into agent/agent.py as the starting point
  5. Composes PROGRAM.md from program_templates/base.md + program_templates/tau_bench.md
  6. Runs the test split tasks and records the baseline score as iteration 0
Once complete, you will see output like:
[prepare] tau2 data OK: tau2_data (domain=retail)
[prepare] baseline val_score=0.5600 (20/35 passed) — recorded as iteration 0

[prepare] done. Ready to start the optimization loop.
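If you want to capture the baseline from this output in a script, a regular expression over the log line is enough. The pattern below is written against the sample output shown above; the real log format may vary between versions, so treat it as a sketch.

```python
import re

# Matches e.g. "baseline val_score=0.5600 (20/35 passed)".
BASELINE_RE = re.compile(
    r"baseline val_score=(?P<score>\d+\.\d+) "
    r"\((?P<passed>\d+)/(?P<total>\d+) passed\)"
)


def parse_baseline(line: str):
    """Return (score, passed, total), or None if the line does not match."""
    match = BASELINE_RE.search(line)
    if match is None:
        return None
    return float(match["score"]), int(match["passed"]), int(match["total"])


print(parse_baseline("[prepare] baseline val_score=0.5600 (20/35 passed)"))
```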
6. Start the optimization loop

Point your coding agent at the repository and use the following prompt:
Read PROGRAM.md and start the optimization loop.
The baseline is already recorded. Start from step 2 (analyze failures).
The agent will:
  1. Run python benchmark.py (inside the Docker container via docker compose run autoeval) to get train-split results
  2. Read train-split traces to diagnose root causes
  3. Edit agent/agent.py with one focused improvement
  4. Run python gating.py to gate the change in three steps: regression suite, full test score, suite promotion
  5. If the gate passes: commit, run python record.py, update workspace/learnings.md
  6. If the gate fails: revert with git checkout agent/agent.py and try a different approach
  7. Repeat
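To make the three gate steps concrete, here is one way such a decision could be structured. This is not how gating.py is implemented: every rule below (the suite must meet the threshold, the test score must not drop below the previous best, newly passing tasks are promoted) is an assumption chosen for illustration, and all names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    reason: str
    promoted: list[str] = field(default_factory=list)


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks that passed; an empty set counts as fully passing."""
    return sum(results.values()) / len(results) if results else 1.0


def gate(suite: dict[str, bool], test: dict[str, bool],
         best_score: float, threshold: float) -> GateResult:
    # Step 1 (assumed rule): the regression suite must meet the threshold.
    if pass_rate(suite) < threshold:
        return GateResult(False, "regression suite below threshold")
    # Step 2 (assumed rule): the full test score must not fall below the best so far.
    if pass_rate(test) < best_score:
        return GateResult(False, "test score regressed")
    # Step 3 (assumed rule): newly passing test tasks are promoted into the suite.
    promoted = sorted(t for t, ok in test.items() if ok and t not in suite)
    return GateResult(True, "ok", promoted)
```

Returning a reason alongside the verdict makes it easier for the coding agent to decide what to try next after a failed gate.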

Running individual tasks

To test specific task IDs during development:
docker compose run autoeval python benchmark.py --task-ids <task-id> <task-id>

Running the loop commands

All Python commands run inside the Docker container. The agent/ and workspace/ directories are mounted as volumes, so edits to agent/agent.py on the host are immediately visible inside the container.
# Run full train benchmark
docker compose run autoeval python benchmark.py

# Gate a change
docker compose run autoeval python gating.py

# Record a result
docker compose run autoeval python record.py --val-score 0.6200 --evals-passed 4 --evals-total 5
If your coding agent runs commands directly on the host (not inside the container), configure it to prefix all python commands with docker compose run autoeval. Most agents support this via a wrapper script or a custom tool.
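One possible shape for such a wrapper is a small host-side helper that prepends the container prefix to whatever script the agent wants to run. The function names are hypothetical; only the docker compose run autoeval python prefix comes from the commands above.

```python
import subprocess

# Prefix taken from the commands shown on this page.
PREFIX = ["docker", "compose", "run", "autoeval", "python"]


def build_command(args: list[str]) -> list[str]:
    """Return the full argv that runs a Python script inside the container."""
    return PREFIX + list(args)


def run_in_container(args: list[str]) -> int:
    """Run the command on the host and return its exit code."""
    return subprocess.run(build_command(args)).returncode


# Example: the argv for a full train benchmark run.
print(build_command(["benchmark.py"]))
```

Calling `run_in_container(["gating.py"])` would then execute the gate inside the container, provided Docker is available on the host.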

Tracking progress

After each successful gate pass, review these files:
  • workspace/results.tsv — iteration history; compare val_score across iterations
  • workspace/learnings.md — what the agent tried, what worked, requests to the human
  • workspace/suite.json — the growing regression suite of tasks the agent must always pass
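Comparing val_score across iterations can be scripted with the standard csv module. The sketch below assumes results.tsv is tab-separated with a header row containing a val_score column; the actual column names are not documented here, so adjust them to match your file. The sample rows are made up for the example.

```python
import csv
import io


def score_history(tsv_text: str, score_column: str = "val_score") -> list[float]:
    """Return the score of each recorded iteration, in file order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [float(row[score_column]) for row in reader]


def best_iteration(scores: list[float]) -> int:
    """Return the index of the highest score (first one on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical file contents, for illustration only.
sample = "iteration\tval_score\n0\t0.56\n1\t0.62\n"
scores = score_history(sample)
print(scores, "best iteration:", best_iteration(scores))
```

To use it on the real file, pass the contents of workspace/results.tsv read with `open(...).read()`.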
To see what changed from the starting template:
diff agent/templates/tau_bench.py agent/agent.py
