

experiment_config.yaml is the single file that controls how auto-harness runs a benchmark. It sets the benchmark name, model, split strategy, concurrency, and all benchmark-specific options. prepare.py, benchmark.py, and gating.py all read this file at startup. Copy the template to get started:
cp experiment_config.yaml.template experiment_config.yaml

Shared parameters

These keys apply to all three benchmarks.
benchmark
string
required
The benchmark to run. Must be one of "tau-bench", "terminal-bench", or "bird-interact". This key determines which runner and agent template prepare.py installs.
agent_model
string
The model identifier passed to the agent. Examples: "gpt-5.4", "anthropic/claude-sonnet-4-20250514". If this key is not set, the value of the AGENT_MODEL environment variable is used; if that is also unset, the model defaults to "gpt-5.4".
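The fallback order described for agent_model can be sketched as follows. This is only an illustration: resolve_agent_model is a hypothetical helper name, and the real lookup inside auto-harness may be structured differently.

```python
import os

# Final fallback named in the agent_model description.
DEFAULT_AGENT_MODEL = "gpt-5.4"


def resolve_agent_model(config, environ=None):
    """Hypothetical helper: config key first, then AGENT_MODEL, then the default."""
    environ = os.environ if environ is None else environ
    return (
        config.get("agent_model")
        or environ.get("AGENT_MODEL")
        or DEFAULT_AGENT_MODEL
    )


# The environment is passed explicitly here so the example is deterministic.
from_config = resolve_agent_model({"agent_model": "anthropic/claude-sonnet-4-20250514"}, {})
from_env = resolve_agent_model({}, {"AGENT_MODEL": "my-model"})
from_default = resolve_agent_model({}, {})
```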
split
string
default:"train"
The benchmark split used for training runs (python benchmark.py). Normally "train". The train/test split is generated automatically by prepare.py on first run using a 70/30 stratified random split.
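For readers unfamiliar with the term, a 70/30 stratified random split can look roughly like the sketch below. The function name, the "category" field used as the stratum, the fixed seed, and the rounding rule are all assumptions made for illustration; the page does not say how prepare.py defines its strata.

```python
import random
from collections import defaultdict


def stratified_split(tasks, key, train_fraction=0.7, seed=0):
    """Split tasks so each stratum contributes about train_fraction to train."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    strata = defaultdict(list)
    for task in tasks:
        strata[key(task)].append(task)

    train, test = [], []
    for group in strata.values():
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)  # assumed rounding rule
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical task list: 10 "easy" and 10 "hard" tasks.
tasks = [{"id": i, "category": "easy" if i < 10 else "hard"} for i in range(20)]
train, test = stratified_split(tasks, key=lambda t: t["category"])
```

Splitting per stratum rather than over the whole pool keeps the proportion of each category the same in both splits.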
gate_split
string
default:"test"
The benchmark split used for the gate’s Step 2 full benchmark run (python gating.py). Normally "test". Test traces are never saved to disk, so the coding agent cannot read them.
max_concurrency
integer
Maximum number of tasks run in parallel. Defaults differ by benchmark: 50 for Terminal-Bench, 3 for tau-bench and BIRD-Interact. Set lower if you hit rate limits or sandbox quotas.
threshold
number
default:"0.8"
The minimum pass rate the regression suite must reach in gating Step 1. A value of 0.8 means at least 80% of the tasks in suite.json must pass. Lowering this makes the gate more permissive; raising it makes it stricter.
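The threshold comparison can be pictured as a simple pass-rate check. The sketch below is an assumption about the shape of that check, not the code in gating.py: the function name, the dict of task results, and the handling of an empty suite are all hypothetical.

```python
def passes_regression_gate(results, threshold=0.8):
    """Return True when the share of passing tasks meets the threshold.

    results maps a task id to True (passed) or False (failed).
    """
    if not results:
        # Assumption: an empty suite is treated as a failure here.
        # The real gate may handle this case differently.
        return False
    pass_rate = sum(1 for passed in results.values() if passed) / len(results)
    return pass_rate >= threshold


# Hypothetical suite of five tasks.
four_of_five = {"t1": True, "t2": True, "t3": True, "t4": True, "t5": False}
three_of_five = {"t1": True, "t2": True, "t3": True, "t4": False, "t5": False}

ok_at_default = passes_regression_gate(four_of_five)           # 0.8 >= 0.8
fails_at_default = passes_regression_gate(three_of_five)       # 0.6 < 0.8
ok_when_lowered = passes_regression_gate(three_of_five, 0.5)   # 0.6 >= 0.5
```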
reasoning_effort
string
Optional. Controls the model’s reasoning depth. Accepted values: "low", "medium", "high". When set, the value is written to the AGENT_REASONING_EFFORT environment variable before each run. Not supported by all models.
file_guard
boolean
default:"true"
When true (the default), gating.py and record.py reject iterations that touch any tracked file outside the allowlist (agent/agent.py, PROGRAM.md). Set to false to disable the guard, for example in a fresh repo with no git history.
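The file_guard behavior amounts to comparing the files an iteration changed against the allowlist. The sketch below takes the list of changed files as input; how gating.py and record.py actually collect that list (presumably from git) is not shown on this page, and both function names are hypothetical.

```python
# Allowlist taken from the file_guard description.
ALLOWLIST = frozenset({"agent/agent.py", "PROGRAM.md"})


def files_outside_allowlist(changed_files, allowlist=ALLOWLIST):
    """Return the changed files that the guard would object to, sorted."""
    return sorted(set(changed_files) - set(allowlist))


def guard_accepts(changed_files, file_guard=True):
    """Accept the iteration if the guard is off or nothing strays outside the allowlist."""
    if not file_guard:
        return True
    return not files_outside_allowlist(changed_files)


accepted = guard_accepts(["agent/agent.py", "PROGRAM.md"])
rejected = guard_accepts(["agent/agent.py", "benchmark.py"])
offending = files_outside_allowlist(["agent/agent.py", "benchmark.py"])
guard_disabled = guard_accepts(["benchmark.py"], file_guard=False)
```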

Terminal-Bench parameters

These keys apply only when benchmark: "terminal-bench".
env_provider
string
The sandbox provider used to execute tasks. Must be one of "e2b", "daytona", or "docker". "e2b" and "daytona" require their respective API keys. "docker" requires no key but needs Docker installed locally.
per_task_timeout
integer
default:"1200"
Seconds allowed per task before it is treated as a timeout. Tasks that exceed this limit score 0.0 in val_score. The value is converted to a Harbor timeout multiplier internally (Harbor’s default is 180 seconds).
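As a rough illustration of the conversion, the multiplier can be thought of as the ratio between per_task_timeout and Harbor's 180-second default. The exact formula used internally is not documented here, so the plain ratio below is an assumption, as is the function name.

```python
# Harbor's default timeout, as stated in the per_task_timeout description.
HARBOR_DEFAULT_TIMEOUT_S = 180


def timeout_multiplier(per_task_timeout):
    """Assumed conversion: seconds requested divided by Harbor's default."""
    if per_task_timeout <= 0:
        raise ValueError("per_task_timeout must be a positive number of seconds")
    return per_task_timeout / HARBOR_DEFAULT_TIMEOUT_S


default_multiplier = timeout_multiplier(1200)  # roughly 6.67
unchanged = timeout_multiplier(180)            # 1.0, i.e. Harbor's own default
```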
dataset
string
default:"terminal-bench@2.0"
The Harbor dataset identifier. The default "terminal-bench@2.0" runs the standard 89-task Terminal-Bench 2.0 suite. Override only if you are targeting a custom Harbor dataset.
Terminal-Bench requires the harbor CLI. Install it with uv tool install harbor. prepare.py checks for the binary at startup and exits with an error if it is not found.

BIRD-Interact parameters

These keys apply only when benchmark: "bird-interact".
mode
string
The interaction mode. "a-interact" runs the autonomous tool-using SQL agent. "c-interact" runs the clarification-first conversational agent. Defaults to "a-interact" when not set.
dataset
string
default:"lite"
Which dataset size to use. "lite" runs 300 tasks; "full" runs 600 tasks. The dataset is downloaded automatically on first run (requires git-lfs).
patience
integer
default:"3"
Maximum number of clarification turns allowed per task in c-interact mode. Passed directly to the BIRD-Interact-ADK orchestrator.
per_task_timeout
integer
default:"1800"
Seconds allowed per task. BIRD-Interact tasks involve multi-turn SQL dialogue, so the default is higher than the 1200 seconds used for Terminal-Bench. Tasks that time out score 0.0.
system_agent_port
integer
default:"6100"
Local port for the BIRD-Interact system agent service (started by BirdInteractRunner). Change this if port 6100 is already in use on your machine.
user_sim_port
integer
default:"6101"
Local port for the user simulator service.
db_env_port
integer
default:"6102"
Local port for the database environment service.
user_model
string
Optional model identifier for the BIRD-Interact user simulator. Defaults to agent_model when not set. Example: "anthropic/claude-haiku-4-5-20251001".

Advanced overrides

By default, prepare.py auto-provisions the BIRD-Interact-ADK repo, its virtualenv, and the dataset into ./bird_interact_adk/. Use the following keys only if you want to point at an existing installation instead.
bird_repo
string
Absolute path to an existing BIRD-Interact repo root or BIRD-Interact-ADK directory. When unset, prepare.py clones the repo into ./bird_interact_adk/ automatically.
bird_python_bin
string
Absolute path to a Python interpreter that has the BIRD-Interact-ADK dependencies installed. When unset, the runner searches for .venv-adk/bin/python, .venv/bin/python, and .conda-py310/bin/python inside the ADK directory.
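The interpreter lookup described for bird_python_bin can be sketched as below. The candidate paths come from the description above; the search order, the function name, and returning None when nothing is found are assumptions.

```python
import tempfile
from pathlib import Path

# Candidate interpreters, relative to the ADK directory.
CANDIDATES = (
    ".venv-adk/bin/python",
    ".venv/bin/python",
    ".conda-py310/bin/python",
)


def find_bird_python(adk_dir, bird_python_bin=None):
    """Prefer an explicit bird_python_bin, else the first candidate that exists."""
    if bird_python_bin:
        return Path(bird_python_bin)
    for relative in CANDIDATES:
        candidate = Path(adk_dir) / relative
        if candidate.is_file():
            return candidate
    return None  # assumption: the caller decides how to report a missing interpreter


# Demo against a throwaway directory that only contains .venv/bin/python.
with tempfile.TemporaryDirectory() as adk_dir:
    interpreter = Path(adk_dir) / ".venv" / "bin" / "python"
    interpreter.parent.mkdir(parents=True)
    interpreter.touch()
    found = find_bird_python(adk_dir)
    found_name = found.relative_to(adk_dir).as_posix()
    explicit = find_bird_python(adk_dir, "/opt/adk/bin/python")
```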
bird_data_path
string
Absolute path to the bird_interact_data.jsonl file. When unset, the runner resolves this from bird_repo and dataset automatically.

Postgres connection

BIRD-Interact uses a Dockerized Postgres container provisioned by prepare.py. Override the connection settings only if you are pointing at an existing Postgres instance.
pg_host
string
Postgres host. Defaults to 127.0.0.1.
pg_port
integer
Postgres port. Defaults to 5432.
pg_user
string
Postgres username. Defaults to root.
pg_password
string
Postgres password. Defaults to 123123.
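Overriding only some of the Postgres keys leaves the rest at their defaults. A minimal sketch of that merge, with a hypothetical helper name and a hypothetical host in the example:

```python
# Defaults listed for the four pg_* keys.
PG_DEFAULTS = {
    "pg_host": "127.0.0.1",
    "pg_port": 5432,
    "pg_user": "root",
    "pg_password": "123123",
}


def resolve_pg_settings(config):
    """Hypothetical helper: take each pg_* key from config, else its default."""
    return {key: config.get(key, default) for key, default in PG_DEFAULTS.items()}


# Pointing at an existing instance on another host and port.
settings = resolve_pg_settings({"pg_host": "db.internal", "pg_port": 5433})
```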
On first run with BIRD-Interact, prepare.py checks for ground-truth access. The public dataset ships without gold SQL answers. Email bird.bench25@gmail.com with subject [bird-interact-lite GT&Test Cases] to receive the ground truth, then merge it using the scripts/combine_public_with_gt.py script inside the ADK. prepare.py prints the exact command if the ground truth is missing.

tau-bench parameters

These keys apply only when benchmark: "tau-bench".
domain
string
required
The tau-bench domain to run. Must be one of "retail", "airline", or "telecom". This key is required: gating.py exits with an error if it is not set.
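A pre-flight check for the domain key might look like the sketch below. It is not the validation in gating.py; the function name and error messages are hypothetical, and only the allowed values come from this page.

```python
# Allowed values from the domain description.
VALID_DOMAINS = ("retail", "airline", "telecom")


def validate_tau_bench_config(config):
    """Return a list of problems with the tau-bench keys (empty if none)."""
    errors = []
    if config.get("benchmark") != "tau-bench":
        return errors  # domain only applies to tau-bench
    domain = config.get("domain")
    if domain is None:
        errors.append("domain is required when benchmark is tau-bench")
    elif domain not in VALID_DOMAINS:
        errors.append(f"domain must be one of {VALID_DOMAINS}, got {domain!r}")
    return errors


valid = validate_tau_bench_config({"benchmark": "tau-bench", "domain": "retail"})
missing = validate_tau_bench_config({"benchmark": "tau-bench"})
unknown = validate_tau_bench_config({"benchmark": "tau-bench", "domain": "banking"})
```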
user_model
string
Optional model identifier for the tau-bench user simulator. When not set, defaults to the value of agent_model.

Config examples

benchmark: "terminal-bench"
agent_model: "gpt-5.4"
split: "train"
gate_split: "test"
env_provider: "e2b"
max_concurrency: 50
threshold: 0.8
reasoning_effort: "medium"
per_task_timeout: 1200
