experiment_config.yaml: complete configuration reference

experiment_config.yaml is the single file that controls how auto-harness runs a benchmark. It sets the benchmark name, model, split strategy, concurrency, and all benchmark-specific options. prepare.py, benchmark.py, and gating.py all read this file at startup. Copy the template to get started:

cp experiment_config.yaml.template experiment_config.yaml

Shared parameters

These keys apply to all three benchmarks.

benchmark

string

required

The benchmark to run. Must be one of "tau-bench", "terminal-bench", or "bird-interact". This key determines which runner and agent template prepare.py installs.

agent_model

string

The model identifier passed to the agent. Examples: "gpt-5.4", "anthropic/claude-sonnet-4-20250514". If this key is not set, the value falls back to the AGENT_MODEL environment variable; if that variable is also unset, the default is "gpt-5.4".
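
The fallback order for agent_model can be sketched as follows. This is a minimal illustration rather than the harness's actual code: resolve_agent_model and its arguments are hypothetical names, and the config is assumed to be already parsed into a dict.

```python
import os

DEFAULT_AGENT_MODEL = "gpt-5.4"  # documented final fallback


def resolve_agent_model(config, env=None):
    """Return the model id: config key first, then AGENT_MODEL, then the default.

    `config` is the parsed experiment_config.yaml as a dict (assumption).
    `env` defaults to the process environment; passing a dict eases testing.
    """
    env = os.environ if env is None else env
    return config.get("agent_model") or env.get("AGENT_MODEL") or DEFAULT_AGENT_MODEL


if __name__ == "__main__":
    print(resolve_agent_model({"benchmark": "tau-bench"}))
```

Passing the environment as an argument keeps the lookup easy to exercise without mutating os.environ.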

split

string

default: "train"

The benchmark split used for training runs (python benchmark.py). Normally "train". The train/test split is generated automatically by prepare.py on first run using a 70/30 stratified random split.
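
A 70/30 stratified random split can be sketched as below. The stratification key, the seed, and the function name stratified_split are assumptions for illustration; the source does not say which task attribute prepare.py stratifies on or how it rounds.

```python
import random
from collections import defaultdict


def stratified_split(tasks, stratum_of, train_fraction=0.7, seed=0):
    """Split tasks into (train, test), sampling each stratum separately.

    `stratum_of` maps a task to its stratum label (hypothetical; for example
    a difficulty or category field).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    groups = defaultdict(list)
    for task in tasks:
        groups[stratum_of(task)].append(task)

    train, test = [], []
    for label in sorted(groups):  # sorted for a deterministic iteration order
        members = list(groups[label])
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


if __name__ == "__main__":
    demo = [{"id": i, "level": "easy" if i < 10 else "hard"} for i in range(20)]
    train, test = stratified_split(demo, lambda t: t["level"])
    print(len(train), len(test))
```

Splitting each stratum on its own keeps the train and test sets close to the same mix of strata as the full task list.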

gate_split

string

default: "test"

The benchmark split used for the gate’s Step 2 full benchmark run (python gating.py). Normally "test". Test traces are never saved to disk, so the coding agent cannot read them.

max_concurrency

integer

Maximum number of tasks run in parallel. Defaults differ by benchmark: 50 for Terminal-Bench, 3 for tau-bench and BIRD-Interact. Set lower if you hit rate limits or sandbox quotas.
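
The per-benchmark defaults can be expressed as a small lookup. This is a sketch under the assumption that an explicit max_concurrency always wins over the default; resolve_max_concurrency is a hypothetical helper name.

```python
# Documented defaults, keyed by the value of the `benchmark` config key.
DEFAULT_MAX_CONCURRENCY = {
    "terminal-bench": 50,
    "tau-bench": 3,
    "bird-interact": 3,
}


def resolve_max_concurrency(config):
    """Return the explicit max_concurrency, else the benchmark's default."""
    explicit = config.get("max_concurrency")
    if explicit is not None:
        return int(explicit)
    return DEFAULT_MAX_CONCURRENCY[config["benchmark"]]


if __name__ == "__main__":
    print(resolve_max_concurrency({"benchmark": "terminal-bench"}))
```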

threshold

number

default: 0.8

The regression suite pass rate threshold for gating Step 1. A value of 0.8 means 80% of tasks in suite.json must pass. Lowering this makes the gate more permissive; raising it makes it stricter.
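
The threshold check amounts to comparing a pass rate against the configured value. The sketch below assumes the comparison is inclusive (exactly 80% passes at 0.8) and that an empty suite does not block the gate; neither detail is confirmed by the source, and passes_regression_gate is a hypothetical name.

```python
def passes_regression_gate(results, threshold=0.8):
    """Return True when the share of passing tasks meets the threshold.

    `results` maps a task id to True (passed) or False (failed).
    """
    if not results:
        return True  # assumption: nothing to regress against yet
    pass_rate = sum(1 for passed in results.values() if passed) / len(results)
    return pass_rate >= threshold  # assumption: inclusive comparison


if __name__ == "__main__":
    suite = {"t1": True, "t2": True, "t3": True, "t4": True, "t5": False}
    print(passes_regression_gate(suite, threshold=0.8))
```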

reasoning_effort

string

Optional. Controls the model’s reasoning depth. Accepted values: "low", "medium", "high". When set, the value is written to the AGENT_REASONING_EFFORT environment variable before each run. Not supported by all models.
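
Exporting the setting to the environment might look like the sketch below. The validation step and the function name export_reasoning_effort are assumptions; the source only states that the value is written to AGENT_REASONING_EFFORT before each run.

```python
import os

ALLOWED_EFFORTS = ("low", "medium", "high")  # documented accepted values


def export_reasoning_effort(config, env=None):
    """Write reasoning_effort to AGENT_REASONING_EFFORT when it is set.

    Returns the exported value, or None when the key is absent.
    """
    env = os.environ if env is None else env
    effort = config.get("reasoning_effort")
    if effort is None:
        return None  # optional key: leave the environment untouched
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"reasoning_effort must be one of {ALLOWED_EFFORTS}, got {effort!r}")
    env["AGENT_REASONING_EFFORT"] = effort
    return effort


if __name__ == "__main__":
    scratch_env = {}
    export_reasoning_effort({"reasoning_effort": "medium"}, scratch_env)
    print(scratch_env)
```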

file_guard

boolean

default: true

When true (the default), gating.py and record.py reject iterations that touch any tracked file outside the allowlist (agent/agent.py, PROGRAM.md). Set to false to disable the guard, for example in a fresh repo with no git history.
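
The core of the allowlist check can be sketched as a filter over changed paths. How the harness collects those paths is not described in the source; the sketch assumes they arrive as repo-relative strings, and files_outside_allowlist is a hypothetical name.

```python
ALLOWLIST = frozenset({"agent/agent.py", "PROGRAM.md"})  # documented allowlist


def files_outside_allowlist(changed_paths, allowlist=ALLOWLIST):
    """Return the changed tracked files that the guard would reject."""
    return sorted(path for path in changed_paths if path not in allowlist)


if __name__ == "__main__":
    offending = files_outside_allowlist(["agent/agent.py", "gating.py"])
    if offending:
        print("rejected, touched:", ", ".join(offending))
```

An empty result means the iteration only touched allowlisted files.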

Terminal-Bench parameters

These keys apply only when benchmark: "terminal-bench".

env_provider

string

The sandbox provider used to execute tasks. Must be one of "e2b", "daytona", or "docker". "e2b" and "daytona" require their respective API keys. "docker" requires no key but needs Docker installed locally.

per_task_timeout

integer

default: 1200

Seconds allowed per task before it is treated as a timeout. Tasks that exceed this limit score 0.0 in val_score. The value is converted to a Harbor timeout multiplier internally (Harbor’s default is 180 seconds).
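
As a rough illustration of the conversion, assume the multiplier is simply the configured timeout divided by Harbor's 180-second default. The exact formula is not given in the source, and timeout_multiplier is a hypothetical name.

```python
HARBOR_DEFAULT_TIMEOUT_S = 180  # Harbor's default per-task timeout, per the text above


def timeout_multiplier(per_task_timeout, base=HARBOR_DEFAULT_TIMEOUT_S):
    """Convert a timeout in seconds to a multiplier of the base timeout.

    Assumption: a plain ratio, with no rounding or clamping.
    """
    if per_task_timeout <= 0:
        raise ValueError("per_task_timeout must be positive")
    return per_task_timeout / base


if __name__ == "__main__":
    # With the default of 1200 seconds the ratio is 1200 / 180, roughly 6.67.
    print(round(timeout_multiplier(1200), 2))
```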

dataset

string

default: "terminal-bench@2.0"

The Harbor dataset identifier. The default "terminal-bench@2.0" runs the standard 89-task Terminal-Bench 2.0 suite. Override only if you are targeting a custom Harbor dataset.

Terminal-Bench requires the harbor CLI. Install it with uv tool install harbor. prepare.py checks for the binary at startup and exits with an error if it is not found.

BIRD-Interact parameters

These keys apply only when benchmark: "bird-interact".

mode

string

The interaction mode. "a-interact" runs the autonomous tool-using SQL agent. "c-interact" runs the clarification-first conversational agent. Defaults to "a-interact" when not set.

dataset

string

default: "lite"

Which dataset size to use. "lite" runs 300 tasks; "full" runs 600 tasks. The dataset is downloaded automatically on first run (requires git-lfs).

patience

integer

default: 3

Maximum number of clarification turns allowed per task in c-interact mode. Passed directly to the BIRD-Interact-ADK orchestrator.

per_task_timeout

integer

default: 1800

Seconds allowed per task. BIRD-Interact tasks involve multi-turn SQL dialogue, so the default is higher than the Terminal-Bench default of 1200. Tasks that time out score 0.0.

system_agent_port

integer

default: 6100

Local port for the BIRD-Interact system agent service (started by BirdInteractRunner). Change this if port 6100 is already in use on your machine.

user_sim_port

integer

default: 6101

Local port for the user simulator service.

db_env_port

integer

default: 6102

Local port for the database environment service.
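
Before changing system_agent_port, user_sim_port, or db_env_port, it can help to check which ports are already taken. The sketch below is a standalone check and not part of the harness; port_is_free is a hypothetical helper that tries to bind the port on the loopback interface.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # typically "address already in use"
    return True


if __name__ == "__main__":
    for key, port in [("system_agent_port", 6100), ("user_sim_port", 6101), ("db_env_port", 6102)]:
        print(key, port, "free" if port_is_free(port) else "in use")
```

The result is only a snapshot: another process can still claim the port between this check and the moment the service starts.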

user_model

string

Optional model identifier for the BIRD-Interact user simulator. Defaults to agent_model when not set. Example: "anthropic/claude-haiku-4-5-20251001".

Advanced overrides

By default, prepare.py auto-provisions the BIRD-Interact-ADK repo, its virtualenv, and the dataset into ./bird_interact_adk/. Use the following keys only if you want to point at an existing installation instead.

bird_repo

string

Absolute path to an existing BIRD-Interact repo root or BIRD-Interact-ADK directory. When unset, prepare.py clones the repo into ./bird_interact_adk/ automatically.

bird_python_bin

string

Absolute path to a Python interpreter that has the BIRD-Interact-ADK dependencies installed. When unset, the runner searches for .venv-adk/bin/python, .venv/bin/python, and .conda-py310/bin/python inside the ADK directory.
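
The interpreter lookup can be sketched as a first-match search over the three documented candidates. The function name find_bird_python and the choice to return None when nothing is found are assumptions; the source does not say what the runner does in that case.

```python
from pathlib import Path

# Documented search order, relative to the ADK directory.
CANDIDATE_INTERPRETERS = (
    ".venv-adk/bin/python",
    ".venv/bin/python",
    ".conda-py310/bin/python",
)


def find_bird_python(adk_dir, explicit=None):
    """Return bird_python_bin if given, else the first existing candidate."""
    if explicit:
        return Path(explicit)
    for relative in CANDIDATE_INTERPRETERS:
        candidate = Path(adk_dir) / relative
        if candidate.is_file():
            return candidate
    return None  # assumption: the caller decides how to report a miss


if __name__ == "__main__":
    print(find_bird_python("./bird_interact_adk"))
```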

bird_data_path

string

Absolute path to the bird_interact_data.jsonl file. When unset, the runner resolves this from bird_repo and dataset automatically.

Postgres connection

BIRD-Interact uses a Dockerized Postgres container provisioned by prepare.py. Override the connection settings only if you are pointing at an existing Postgres instance.

pg_host

string

Postgres host. Defaults to 127.0.0.1.

pg_port

integer

Postgres port. Defaults to 5432.

pg_user

string

Postgres username. Defaults to root.

pg_password

string

Postgres password. Defaults to 123123.
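
Merging the four Postgres keys with their documented defaults can be sketched as below. resolve_pg_settings is a hypothetical helper, and the config is assumed to be already parsed into a dict.

```python
# Documented defaults for the Postgres connection keys.
PG_DEFAULTS = {
    "pg_host": "127.0.0.1",
    "pg_port": 5432,
    "pg_user": "root",
    "pg_password": "123123",
}


def resolve_pg_settings(config):
    """Return the connection settings, falling back to defaults per key."""
    return {key: config.get(key, default) for key, default in PG_DEFAULTS.items()}


if __name__ == "__main__":
    # Override only the port; the other keys keep their defaults.
    print(resolve_pg_settings({"pg_port": 5433}))
```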

On first run with BIRD-Interact, prepare.py checks for ground-truth access. The public dataset ships without gold SQL answers. Email bird.bench25@gmail.com with subject [bird-interact-lite GT&Test Cases] to receive the ground truth, then merge it using the scripts/combine_public_with_gt.py script inside the ADK. prepare.py prints the exact command if the ground truth is missing.

tau-bench parameters

These keys apply only when benchmark: "tau-bench".

domain

string

required

The tau-bench domain to run. Must be one of "retail", "airline", or "telecom". gating.py exits with an error if this key is not set.
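
A check of this kind might look like the sketch below. It raises ValueError rather than exiting, and both the function name require_domain and the error messages are hypothetical; the actual behavior of gating.py beyond "exits with an error" is not described in the source.

```python
VALID_DOMAINS = ("retail", "airline", "telecom")  # documented domains


def require_domain(config):
    """Return the configured tau-bench domain or raise if missing or invalid."""
    domain = config.get("domain")
    if domain is None:
        raise ValueError("domain is required when benchmark is tau-bench")
    if domain not in VALID_DOMAINS:
        raise ValueError(f"domain must be one of {VALID_DOMAINS}, got {domain!r}")
    return domain


if __name__ == "__main__":
    print(require_domain({"benchmark": "tau-bench", "domain": "retail"}))
```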

user_model

string

Optional model identifier for the tau-bench user simulator. When not set, defaults to the value of agent_model.

Config examples

Terminal-Bench

benchmark: "terminal-bench"
agent_model: "gpt-5.4"
split: "train"
gate_split: "test"
env_provider: "e2b"
max_concurrency: 50
threshold: 0.8
reasoning_effort: "medium"
per_task_timeout: 1200

BIRD-Interact

benchmark: "bird-interact"
mode: "a-interact"
dataset: "lite"
split: "train"
gate_split: "test"
max_concurrency: 3
threshold: 0.8
patience: 3
per_task_timeout: 1800
system_agent_port: 6100
user_sim_port: 6101
db_env_port: 6102
agent_model: "anthropic/claude-sonnet-4-20250514"
user_model: "anthropic/claude-haiku-4-5-20251001"

# Advanced overrides: omit these on a fresh install
# bird_repo: "/abs/path/to/BIRD-Interact"
# bird_python_bin: "/abs/path/to/python"
# bird_data_path: "/abs/path/to/bird_interact_data.jsonl"
# pg_host: "127.0.0.1"
# pg_port: 5432
# pg_user: "root"
# pg_password: "123123"

tau-bench

benchmark: "tau-bench"
agent_model: "gpt-5.4"
domain: "retail"
split: "train"
gate_split: "test"
max_concurrency: 3
threshold: 0.8
reasoning_effort: "medium"
