BenchmarkRunner, TauBenchRunner, TerminalBenchRunner

benchmark.py is the benchmark execution layer for auto-harness. It defines an abstract BenchmarkRunner base class and three concrete implementations — TauBenchRunner, TerminalBenchRunner, and BirdInteractRunner — each wrapping a different evaluation backend. Both gating.py and the coding agent call this module directly to measure agent performance.

BenchmarkRunner

BenchmarkRunner is the abstract base class all runners inherit from. Subclass it and implement run to plug in a custom benchmark.

from benchmark import BenchmarkRunner

class MyRunner(BenchmarkRunner):
    def run(self, task_ids=None):
        # your benchmark logic here
        return {"task_1": 1.0, "task_2": 0.0}

Methods

`run`

@abstractmethod
def run(self, task_ids: list[str] | None = None) -> dict[str, float | None]

Run the benchmark on the given tasks. Must be implemented by all subclasses.

task_ids

list[str] | None

default:"None"

Specific task IDs to run. Pass None to run the full benchmark.

Returns: A mapping of task_id → reward. Reward is a float in [0.0, 1.0], or None if the task timed out or produced no verifier result. None counts as 0.0 in val_score.

`val_score`

def val_score(self, results: dict[str, float | None]) -> float

Compute the mean reward across all results. None rewards are counted as 0.0.

results

dict[str, float | None]

required

The results dict returned by run.

Returns: Mean reward as a float. Returns 0.0 if results is empty.
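The scoring rule above is small enough to sketch. The function below illustrates the documented behaviour only; it is not the code in benchmark.py, and the name mean_reward is hypothetical.

```python
from __future__ import annotations


def mean_reward(results: dict[str, float | None]) -> float:
    """Mean reward with None counted as 0.0; 0.0 for an empty dict."""
    if not results:
        return 0.0
    # A None reward (timeout or no verifier result) is scored as a failure.
    total = sum(r if r is not None else 0.0 for r in results.values())
    return total / len(results)


print(mean_reward({"task_1": 1.0, "task_2": None}))  # 0.5
```

Because None is averaged in as 0.0 rather than dropped, a run with many timeouts lowers the score instead of shrinking the denominator.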

TauBenchRunner

TauBenchRunner runs the tau-bench benchmark using the tau2 Python API directly — no subprocess.

from benchmark import TauBenchRunner

runner = TauBenchRunner(domain="retail", split="test")
results = runner.run()                            # full benchmark
results = runner.run(task_ids=["0", "1", "42"])  # specific tasks
val = runner.val_score(results)

Constructor

TauBenchRunner(
    domain: str,
    agent_model: str | None = None,
    split: str = "test",
    max_concurrency: int = 3,
    seed: int = 300,
    reasoning_effort: str | None = None,
    user_model: str | None = None,
)

domain

str

required

The tau-bench domain to evaluate. Valid values include "retail", "airline", and "telecom".

agent_model

str | None

default:"None"

Model identifier for the agent LLM. Defaults to the AGENT_MODEL environment variable, or "gpt-5.4" if that is unset.

split

str

default:"\"test\""

Dataset split to use for evaluation.

max_concurrency

int

default:"3"

Maximum number of simultaneous task simulations.

seed

int

default:"300"

Random seed passed to tau2 for reproducible simulation ordering.

reasoning_effort

str | None

default:"None"

Sets the AGENT_REASONING_EFFORT environment variable before running. Accepted values depend on the model provider.

user_model

str | None

default:"None"

Model for the user simulator. Defaults to agent_model when not set.

TAU2_DATA_DIR is set automatically to ./tau2_data/ if not already in the environment. Run prepare.py to ensure the data directory is populated before calling run.
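The fallbacks described for agent_model, user_model, and TAU2_DATA_DIR could look roughly like the sketch below. The helper names resolve_models and ensure_data_dir are hypothetical, and two details are assumptions: that an empty AGENT_MODEL is treated like an unset one, and that user_model falls back to the already resolved agent model.

```python
import os


def resolve_models(agent_model=None, user_model=None, env=None):
    """Return (agent_model, user_model) after applying the documented fallbacks."""
    env = os.environ if env is None else env
    # Explicit argument first, then AGENT_MODEL, then the documented default.
    agent = agent_model or env.get("AGENT_MODEL") or "gpt-5.4"
    # Assumption: the user simulator reuses the resolved agent model.
    user = user_model or agent
    return agent, user


def ensure_data_dir(env=None):
    """Set TAU2_DATA_DIR only when it is not already present."""
    env = os.environ if env is None else env
    env.setdefault("TAU2_DATA_DIR", "./tau2_data/")
    return env["TAU2_DATA_DIR"]
```

Passing a plain dict as env keeps the sketch testable without touching the real process environment.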

TerminalBenchRunner

TerminalBenchRunner runs Terminal-Bench 2.0 via the Harbor framework, invoking harbor run as a subprocess and parsing per-task result.json files from the output directory.

from benchmark import TerminalBenchRunner

runner = TerminalBenchRunner(split="train")
results = runner.run()                                      # full split
results = runner.run(task_ids=["cobol-modernization"])      # specific tasks
val = runner.val_score(results)

Class variables

| Variable | Value |
| --- | --- |
| `SPLIT_FILE` | `"tbench_data/task_split.json"` |

Constructor

TerminalBenchRunner(
    agent_model: str | None = None,
    split: str | None = "train",
    env_provider: str = "e2b",
    n_concurrent: int = 50,
    dataset: str = "terminal-bench@2.0",
    agent_import_path: str = "agent.agent:HarnessAgent",
    per_task_timeout: int = 1200,
    jobs_dir: str = "workspace/tbench_jobs",
    reasoning_effort: str | None = None,
)

agent_model

str | None

default:"None"

Model identifier for the agent. Defaults to the AGENT_MODEL environment variable, or "gpt-5.4" if that is unset.

split

str | None

default:"\"train\""

Split name to look up in SPLIT_FILE. Pass None to run all tasks in the dataset, bypassing the split file entirely. For any named split, the split file must already exist; run prepare.py to create it.

env_provider

str

default:"\"e2b\""

Sandbox provider passed to Harbor. Options: "e2b", "daytona", "docker".

n_concurrent

int

default:"50"

Number of tasks to run concurrently inside Harbor.

dataset

str

default:"\"terminal-bench@2.0\""

Harbor dataset identifier.

agent_import_path

str

default:"\"agent.agent:HarnessAgent\""

Python import path for the agent class, passed to Harbor as --agent-import-path.

per_task_timeout

int

default:"1200"

Per-task timeout in seconds. Used to compute the overall subprocess timeout for the Harbor invocation.

jobs_dir

str

default:"\"workspace/tbench_jobs\""

Directory where Harbor writes per-job output subdirectories. Old job directories from previous runs are pruned automatically after each run.

reasoning_effort

str | None

default:"None"

Sets AGENT_REASONING_EFFORT in the subprocess environment before running Harbor.

Trace saving is disabled (HARNESS_SAVE_TRACE=0) for all non-train splits, which prevents the coding agent from reading test-split traces. Only runs with split="train" copy traces into workspace/traces/latest/ and workspace/traces/baseline/.
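One way to picture the trace gating is as a function that builds the environment handed to the Harbor subprocess. This is a sketch under assumptions, not the runner's actual code: the name harbor_env is hypothetical, and split=None is treated as a non-train split.

```python
import os


def harbor_env(split, reasoning_effort=None, base_env=None):
    """Build a subprocess environment with the documented overrides."""
    env = dict(os.environ if base_env is None else base_env)
    # Anything other than the train split must not leave traces behind.
    if split != "train":
        env["HARNESS_SAVE_TRACE"] = "0"
    if reasoning_effort is not None:
        env["AGENT_REASONING_EFFORT"] = reasoning_effort
    return env
```

Copying the environment rather than mutating os.environ keeps the override local to the one subprocess call.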

BirdInteractRunner

BirdInteractRunner runs the BIRD-Interact benchmark via the external BIRD-Interact-ADK repository. It starts three microservices (system agent, user simulator, database environment) and then invokes orchestrator.runner as a subprocess.

from benchmark import BirdInteractRunner

runner = BirdInteractRunner(
    split="train",
    mode="a-interact",
    dataset="lite",
    agent_model="gpt-5.4",
)
results = runner.run()
val = runner.val_score(results)

Class variables

| Variable | Value |
| --- | --- |
| `SPLIT_FILE` | `"bird_data/task_split.json"` |

Constructor

BirdInteractRunner(
    bird_repo: str | None = None,
    bird_python_bin: str | None = None,
    split: str | None = "train",
    mode: str = "a-interact",
    dataset: str = "lite",
    data_path: str | None = None,
    agent_model: str | None = None,
    user_model: str | None = None,
    patience: int = 3,
    n_concurrent: int = 3,
    per_task_timeout: int = 1800,
    jobs_dir: str = "workspace/bird_runs",
    system_agent_port: int = 6100,
    user_sim_port: int = 6101,
    db_env_port: int = 6102,
    pg_host: str | None = None,
    pg_port: int | None = None,
    pg_user: str | None = None,
    pg_password: str | None = None,
)

bird_repo

str | None

default:"None"

Path to the BIRD-Interact or BIRD-Interact-ADK directory, passed to resolve_bird_adk_dir. If unset, resolution falls back to the BIRD_REPO environment variable and then to the auto-provisioned locations.

bird_python_bin

str | None

default:"None"

Path to a Python interpreter that has the BIRD-Interact-ADK dependencies installed, passed to resolve_bird_python_bin. If unset, the resolver falls back to other candidates, including .venv-adk/, .venv/, and .conda-py310/ inside the ADK directory.

split

str | None

default:"\"train\""

Split name to look up in SPLIT_FILE. Pass None to run all tasks in the dataset.

mode

str

default:"\"a-interact\""

Orchestrator run mode passed to orchestrator.runner --mode.

dataset

str

default:"\"lite\""

Dataset variant. Used to construct the default data path (bird-interact-<dataset>/bird_interact_data.jsonl).

data_path

str | None

default:"None"

Explicit path to bird_interact_data.jsonl. Overrides the default derived from dataset. Passed to resolve_bird_data_path.

agent_model

str | None

default:"None"

Sets SYSTEM_AGENT_MODEL in the subprocess environment.

user_model

str | None

default:"None"

Sets USER_SIM_MODEL in the subprocess environment.

patience

int

default:"3"

Number of retry attempts per task. Sets the PATIENCE environment variable.

n_concurrent

int

default:"3"

Number of tasks to run in parallel via orchestrator.runner --concurrency.

per_task_timeout

int

default:"1800"

Per-task timeout in seconds. Used to compute the overall subprocess timeout.

jobs_dir

str

default:"\"workspace/bird_runs\""

Directory for service logs and temporary input/output files. Stale temporary files from previous runs are pruned automatically.

system_agent_port

int

default:"6100"

Local port for the system agent microservice.

user_sim_port

int

default:"6101"

Local port for the user simulator microservice.

db_env_port

int

default:"6102"

Local port for the database environment microservice.

pg_host

str | None

default:"None"

PostgreSQL host. Sets PG_HOST in the subprocess environment.

pg_port

int | None

default:"None"

PostgreSQL port. Sets PG_PORT in the subprocess environment.

pg_user

str | None

default:"None"

PostgreSQL username. Sets PG_USER in the subprocess environment.

pg_password

str | None

default:"None"

PostgreSQL password. Sets PG_PASSWORD in the subprocess environment.

The constructor calls resolve_bird_adk_dir and resolve_bird_python_bin immediately and raises FileNotFoundError if either fails. Run prepare.py to auto-provision the ADK into ./bird_interact_adk/ before constructing this runner.
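Several constructor parameters above are documented as setting environment variables for the subprocess. The sketch below shows one plausible mapping. The function name bird_subprocess_env is hypothetical, and skipping parameters left at None is an assumption rather than confirmed behaviour.

```python
import os


def bird_subprocess_env(agent_model=None, user_model=None, patience=3,
                        pg_host=None, pg_port=None, pg_user=None,
                        pg_password=None, base_env=None):
    """Map constructor arguments onto the documented environment variables."""
    env = dict(os.environ if base_env is None else base_env)
    overrides = {
        "SYSTEM_AGENT_MODEL": agent_model,
        "USER_SIM_MODEL": user_model,
        "PATIENCE": patience,
        "PG_HOST": pg_host,
        "PG_PORT": pg_port,
        "PG_USER": pg_user,
        "PG_PASSWORD": pg_password,
    }
    for name, value in overrides.items():
        # Assumption: None means "leave whatever the parent environment has".
        if value is not None:
            env[name] = str(value)  # environment values must be strings
    return env
```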

Helper functions

These module-level functions are used internally by BirdInteractRunner but are also available for direct use.

`resolve_bird_adk_dir`

def resolve_bird_adk_dir(configured_path: str | None = None) -> str

Resolve the BIRD-Interact-ADK directory from a repo root or direct path. Searches in order: configured_path, BIRD_REPO env var, ./bird_interact_adk/BIRD-Interact-ADK/, ./bird_interact_adk/, ../BIRD-Interact/, ../BIRD-Interact/BIRD-Interact-ADK/, ./BIRD-Interact-ADK/.

configured_path

str | None

default:"None"

An explicit path to try first, before falling back to the default search candidates.

Returns: Absolute path to the BIRD-Interact-ADK directory (must contain orchestrator/runner.py).

Raises: FileNotFoundError if no valid ADK directory is found.
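The search order can be read as a first-match scan over candidate directories. The sketch below follows the documented order and validity check, but it is an illustration only: find_adk_dir and its env and cwd parameters are hypothetical, and the real function may handle repo roots differently.

```python
import os


def find_adk_dir(configured_path=None, env=None, cwd="."):
    """Return the first candidate that contains orchestrator/runner.py."""
    env = os.environ if env is None else env
    candidates = [
        configured_path,
        env.get("BIRD_REPO"),
        os.path.join(cwd, "bird_interact_adk", "BIRD-Interact-ADK"),
        os.path.join(cwd, "bird_interact_adk"),
        os.path.join(cwd, "..", "BIRD-Interact"),
        os.path.join(cwd, "..", "BIRD-Interact", "BIRD-Interact-ADK"),
        os.path.join(cwd, "BIRD-Interact-ADK"),
    ]
    for candidate in candidates:
        if not candidate:
            continue  # unset argument or environment variable
        path = os.path.abspath(os.path.expanduser(candidate))
        if os.path.isfile(os.path.join(path, "orchestrator", "runner.py")):
            return path
    raise FileNotFoundError("no BIRD-Interact-ADK directory found")
```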

`resolve_bird_python_bin`

def resolve_bird_python_bin(adk_dir: str, configured_python: str | None = None) -> str | None

Pick a Python interpreter that has the BIRD-Interact-ADK dependencies installed. Searches in order: configured_python, BIRD_PYTHON_BIN env var, <adk_dir>/.venv-adk/bin/python, <adk_dir>/.venv/bin/python, <adk_dir>/.conda-py310/bin/python, python3 on PATH, python on PATH.

adk_dir

str

required

Path to the BIRD-Interact-ADK directory, as returned by resolve_bird_adk_dir.

configured_python

str | None

default:"None"

An explicit interpreter path to try first.

Returns: Absolute path to the Python interpreter, or None if no valid interpreter is found.
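The interpreter lookup has the same first-match shape, with PATH lookups at the end. In the sketch below, find_python_bin is a hypothetical name, and "valid" is assumed to mean an existing, executable file; the real function may also check that the ADK dependencies import correctly.

```python
import os
import shutil


def find_python_bin(adk_dir, configured_python=None, env=None):
    """Return the first usable interpreter in the documented order, else None."""
    env = os.environ if env is None else env
    candidates = [
        configured_python,
        env.get("BIRD_PYTHON_BIN"),
        os.path.join(adk_dir, ".venv-adk", "bin", "python"),
        os.path.join(adk_dir, ".venv", "bin", "python"),
        os.path.join(adk_dir, ".conda-py310", "bin", "python"),
        shutil.which("python3"),  # None when not on PATH
        shutil.which("python"),
    ]
    for candidate in candidates:
        if not candidate:
            continue
        path = os.path.abspath(os.path.expanduser(candidate))
        # Assumption: an executable file counts as a valid interpreter.
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None
```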

`resolve_bird_data_path`

def resolve_bird_data_path(
    adk_dir: str,
    dataset: str = "lite",
    configured_data_path: str | None = None,
) -> str

Resolve the bird_interact_data.jsonl path for the given dataset variant.

adk_dir

str

required

Path to the BIRD-Interact-ADK directory.

dataset

str

default:"\"lite\""

Dataset variant name. The default path is <adk_dir>/bird-interact-<dataset>/bird_interact_data.jsonl.

configured_data_path

str | None

default:"None"

If provided, returns this path directly (after abspath + expanduser), ignoring adk_dir and dataset.

Returns: Absolute path to bird_interact_data.jsonl.
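Read together, the three parameters suggest logic along the following lines. This is a sketch from the description above rather than the module's source, and the name data_path is hypothetical.

```python
import os


def data_path(adk_dir, dataset="lite", configured_data_path=None):
    """Resolve bird_interact_data.jsonl, preferring an explicit path."""
    if configured_data_path:
        # An explicit path wins; adk_dir and dataset are ignored.
        return os.path.abspath(os.path.expanduser(configured_data_path))
    return os.path.abspath(
        os.path.join(adk_dir, f"bird-interact-{dataset}", "bird_interact_data.jsonl")
    )
```

Note that nothing in the description says the file is checked for existence, so the sketch does not check either.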

CLI usage

benchmark.py can also be invoked directly from the command line. It reads experiment_config.yaml to determine the benchmark type and constructs the appropriate runner automatically.

# Run the train split (default)
python benchmark.py

# Run the test split
python benchmark.py --split test

# Run specific tasks only
python benchmark.py --task-ids 0 1 42

# Override concurrency
python benchmark.py --split train --concurrency 10

# tau-bench: override domain on the command line
python benchmark.py --domain airline --split test

Results are printed to stdout and saved to workspace/train_results.json.
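The flags used in the commands above could be declared with argparse roughly as follows. This is a hedged sketch: the flag names come from the examples, but the defaults for --concurrency and --domain, the help strings, and the build_parser name are assumptions, and the parser in benchmark.py may define more options.

```python
import argparse


def build_parser():
    """Parser covering only the flags shown in the CLI examples."""
    parser = argparse.ArgumentParser(prog="benchmark.py")
    parser.add_argument("--split", default="train", help="dataset split to run")
    parser.add_argument("--task-ids", nargs="+", default=None,
                        help="run only these task IDs")
    parser.add_argument("--concurrency", type=int, default=None,
                        help="override the runner's concurrency")
    parser.add_argument("--domain", default=None,
                        help="tau-bench domain override")
    return parser


args = build_parser().parse_args(["--task-ids", "0", "1", "42"])
print(args.split, args.task_ids)  # train ['0', '1', '42']
```

Task IDs stay as strings here, which matches the list[str] type that run expects.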
