experiment_config.yaml is the single file that controls how auto-harness runs a benchmark. It sets the benchmark name, model, split strategy, concurrency, and all benchmark-specific options. prepare.py, benchmark.py, and gating.py all read this file at startup. Copy the template to get started:
Shared parameters
These keys apply to all three benchmarks.
The benchmark to run. Must be one of "tau-bench", "terminal-bench", or "bird-interact". This key determines which runner and agent template prepare.py installs.
The model identifier passed to the agent. Examples: "gpt-5.4", "anthropic/claude-sonnet-4-20250514". Falls back to the AGENT_MODEL environment variable if not set, which itself defaults to "gpt-5.4".
The benchmark split used for training runs (python benchmark.py). Normally "train". The train/test split is generated automatically by prepare.py on first run using a 70/30 stratified random split.
The benchmark split used for the gate’s Step 2 full benchmark run (python gating.py). Normally "test". Test traces are never saved to disk to prevent the coding agent from reading them.
Maximum number of tasks run in parallel. Defaults differ by benchmark: 50 for Terminal-Bench, 3 for tau-bench and BIRD-Interact. Set lower if you hit rate limits or sandbox quotas.
The regression suite pass rate threshold for gating Step 1. A value of 0.8 means 80% of tasks in suite.json must pass. Lowering this makes the gate more permissive; raising it makes it stricter.
Optional. Controls the model’s reasoning depth. Accepted values: "low", "medium", "high". When set, the value is written to the AGENT_REASONING_EFFORT environment variable before each run. Not supported by all models.
When true (the default), gating.py and record.py reject iterations that touch any tracked file outside the allowlist (agent/agent.py, PROGRAM.md). Set to false to disable the guard, for example in a fresh repo with no git history.
Terminal-Bench parameters
These keys apply only when benchmark: "terminal-bench".
The sandbox provider used to execute tasks. Must be one of "e2b", "daytona", or "docker". "e2b" and "daytona" require their respective API keys. "docker" requires no key but needs Docker installed locally.
Seconds allowed per task before it is treated as a timeout. Tasks that exceed this limit score 0.0 in val_score. The value is converted to a Harbor timeout multiplier internally (Harbor’s default is 180 seconds).
The Harbor dataset identifier. The default "terminal-bench@2.0" runs the standard 89-task Terminal-Bench 2.0 suite. Override only if you are targeting a custom Harbor dataset.
Terminal-Bench requires the harbor CLI. Install it with uv tool install harbor. prepare.py checks for the binary at startup and exits with an error if it is not found.
BIRD-Interact parameters
These keys apply only when benchmark: "bird-interact".
The interaction mode. "a-interact" runs the autonomous tool-using SQL agent. "c-interact" runs the clarification-first conversational agent. Defaults to "a-interact" when not set.
Which dataset size to use. "lite" runs 300 tasks; "full" runs 600 tasks. The dataset is downloaded automatically on first run (requires git-lfs).
Maximum number of clarification turns allowed per task in c-interact mode. Passed directly to the BIRD-Interact-ADK orchestrator.
Seconds allowed per task. BIRD-Interact tasks involve multi-turn SQL dialogue, so the default is higher than Terminal-Bench. Tasks that time out score 0.0.
Local port for the BIRD-Interact system agent service (started by BirdInteractRunner). Change this if port 6100 is already in use on your machine.
Local port for the user simulator service.
Local port for the database environment service.
Optional model identifier for the BIRD-Interact user simulator. Defaults to agent_model when not set. Example: "anthropic/claude-haiku-4-5-20251001".
Advanced overrides
By default, prepare.py auto-provisions the BIRD-Interact-ADK repo, its virtualenv, and the dataset into ./bird_interact_adk/. Use the following keys only if you want to point at an existing installation instead.
Absolute path to an existing BIRD-Interact repo root or BIRD-Interact-ADK directory. When unset, prepare.py clones the repo into ./bird_interact_adk/ automatically.
Absolute path to a Python interpreter that has the BIRD-Interact-ADK dependencies installed. When unset, the runner searches for .venv-adk/bin/python, .venv/bin/python, and .conda-py310/bin/python inside the ADK directory.
Absolute path to the bird_interact_data.jsonl file. When unset, the runner resolves this from bird_repo and dataset automatically.
Postgres connection
BIRD-Interact uses a Dockerized Postgres container provisioned by prepare.py. Override the connection settings only if you are pointing at an existing Postgres instance.
Postgres host. Defaults to 127.0.0.1.
Postgres port. Defaults to 5432.
Postgres username. Defaults to root.
Postgres password. Defaults to 123123.
tau-bench parameters
These keys apply only when benchmark: "tau-bench".
The tau-bench domain to run. Must be one of "retail", "airline", or "telecom". This key is required; gating.py exits with an error if it is not set.
Optional model identifier for the tau-bench user simulator. When not set, defaults to the value of agent_model.
Config examples
- Terminal-Bench
- BIRD-Interact
- tau-bench
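
As a rough illustration of how a few rules documented on this page fit together, the Python sketch below takes an already-parsed config, checks the benchmark name, resolves the agent model through the documented fallback chain (config value, then the AGENT_MODEL environment variable, then "gpt-5.4"), and picks the documented default concurrency. The function names and the plain-dict config are assumptions made for illustration; this is not auto-harness's actual code.

```python
import os

# Values documented on this page.
ALLOWED_BENCHMARKS = {"tau-bench", "terminal-bench", "bird-interact"}
DEFAULT_MODEL = "gpt-5.4"


def check_benchmark(config):
    """Return the benchmark name, or raise if it is not a documented one."""
    name = config.get("benchmark")
    if name not in ALLOWED_BENCHMARKS:
        raise ValueError(f"unsupported benchmark: {name!r}")
    return name


def resolve_agent_model(config, env=None):
    """Config value first, then AGENT_MODEL, then the documented default."""
    env = os.environ if env is None else env
    return config.get("agent_model") or env.get("AGENT_MODEL") or DEFAULT_MODEL


def default_concurrency(benchmark):
    """Documented defaults: 50 for Terminal-Bench, 3 for the other two."""
    return 50 if benchmark == "terminal-bench" else 3


# Hypothetical parsed config; the real file is YAML and has more keys.
config = {"benchmark": "tau-bench"}
benchmark = check_benchmark(config)
print(benchmark, resolve_agent_model(config, env={}), default_concurrency(benchmark))
```

Passing env explicitly keeps the fallback easy to test without touching the real process environment.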