benchmark.py is the benchmark execution layer for auto-harness. It defines an abstract BenchmarkRunner base class and three concrete implementations (TauBenchRunner, TerminalBenchRunner, and BirdInteractRunner), each wrapping a different evaluation backend. Both gating.py and the coding agent call this module directly to measure agent performance.

BenchmarkRunner

BenchmarkRunner is the abstract base class all runners inherit from. Subclass it and implement run to plug in a custom benchmark.
from benchmark import BenchmarkRunner

class MyRunner(BenchmarkRunner):
    def run(self, task_ids=None):
        # your benchmark logic here
        return {"task_1": 1.0, "task_2": 0.0}

Methods

run

@abstractmethod
def run(self, task_ids: list[str] | None = None) -> dict[str, float | None]
Run the benchmark on the given tasks. Must be implemented by all subclasses.
task_ids
list[str] | None
default:"None"
Specific task IDs to run. Pass None to run the full benchmark.
Returns: A mapping of task_id → reward. Reward is a float in [0.0, 1.0], or None if the task timed out or produced no verifier result. None counts as 0.0 in val_score.

val_score

def val_score(self, results: dict[str, float | None]) -> float
Compute the mean reward across all results. None rewards are counted as 0.0.
results
dict[str, float | None]
required
The results dict returned by run.
Returns: Mean reward as a float. Returns 0.0 if results is empty.
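The scoring rule described above can be illustrated with a standalone function. This is a minimal sketch of the documented behavior, not the module's actual source; mean_reward is a hypothetical name chosen for the example.

```python
from __future__ import annotations


def mean_reward(results: dict[str, float | None]) -> float:
    """Mean reward, counting None as 0.0 and returning 0.0 for an empty dict."""
    if not results:
        return 0.0
    # A None reward (timeout or missing verifier result) contributes 0.0.
    total = sum(0.0 if reward is None else reward for reward in results.values())
    return total / len(results)


scores = {"task_1": 1.0, "task_2": None, "task_3": 0.5}
print(mean_reward(scores))  # 0.5
```

Because None is averaged in as 0.0 rather than dropped, a timed-out task lowers the score exactly as a failed task does.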

TauBenchRunner

TauBenchRunner runs the tau-bench benchmark by calling the tau2 Python API directly, without spawning a subprocess.
from benchmark import TauBenchRunner

runner = TauBenchRunner(domain="retail", split="test")
results = runner.run()                            # full benchmark
results = runner.run(task_ids=["0", "1", "42"])  # specific tasks
val = runner.val_score(results)

Constructor

TauBenchRunner(
    domain: str,
    agent_model: str | None = None,
    split: str = "test",
    max_concurrency: int = 3,
    seed: int = 300,
    reasoning_effort: str | None = None,
    user_model: str | None = None,
)
domain
str
required
The tau-bench domain to evaluate. Valid values include "retail", "airline", and "telecom".
agent_model
str | None
default:"None"
Model identifier for the agent LLM. Defaults to the AGENT_MODEL environment variable, or "gpt-5.4" if that is unset.
split
str
default:"test"
Dataset split to use for evaluation.
max_concurrency
int
default:"3"
Maximum number of simultaneous task simulations.
seed
int
default:"300"
Random seed passed to tau2 for reproducible simulation ordering.
reasoning_effort
str | None
default:"None"
Sets the AGENT_REASONING_EFFORT environment variable before running. Accepted values depend on the model provider.
user_model
str | None
default:"None"
Model for the user simulator. Defaults to agent_model when not set.
TAU2_DATA_DIR is set automatically to ./tau2_data/ if not already in the environment. Run prepare.py to ensure the data directory is populated before calling run.
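The fallback rules for agent_model, user_model, and TAU2_DATA_DIR can be sketched as follows. This is an illustration under assumptions, not the runner's real code: resolve_models and ensure_data_dir are hypothetical helpers, and treating an empty AGENT_MODEL as unset is a guess.

```python
import os


def resolve_models(agent_model=None, user_model=None, env=None):
    """Apply the documented fallbacks: argument, then AGENT_MODEL, then "gpt-5.4"."""
    env = os.environ if env is None else env
    # Assumption: an empty AGENT_MODEL value is treated the same as unset.
    agent = agent_model or env.get("AGENT_MODEL") or "gpt-5.4"
    # The user simulator falls back to the agent model when not set.
    user = user_model or agent
    return agent, user


def ensure_data_dir(env):
    """Set TAU2_DATA_DIR only if the environment does not already define it."""
    env.setdefault("TAU2_DATA_DIR", "./tau2_data/")
    return env["TAU2_DATA_DIR"]


print(resolve_models(env={}))                      # ('gpt-5.4', 'gpt-5.4')
print(resolve_models(env={"AGENT_MODEL": "m1"}))   # ('m1', 'm1')
print(ensure_data_dir({"TAU2_DATA_DIR": "/data"})) # /data
```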

TerminalBenchRunner

TerminalBenchRunner runs Terminal-Bench 2.0 via the Harbor framework, invoking harbor run as a subprocess and parsing per-task result.json files from the output directory.
from benchmark import TerminalBenchRunner

runner = TerminalBenchRunner(split="train")
results = runner.run()                                      # full split
results = runner.run(task_ids=["cobol-modernization"])      # specific tasks
val = runner.val_score(results)

Class variables

Variable      Value
SPLIT_FILE    "tbench_data/task_split.json"

Constructor

TerminalBenchRunner(
    agent_model: str | None = None,
    split: str | None = "train",
    env_provider: str = "e2b",
    n_concurrent: int = 50,
    dataset: str = "terminal-bench@2.0",
    agent_import_path: str = "agent.agent:HarnessAgent",
    per_task_timeout: int = 1200,
    jobs_dir: str = "workspace/tbench_jobs",
    reasoning_effort: str | None = None,
)
agent_model
str | None
default:"None"
Model identifier for the agent. Defaults to the AGENT_MODEL environment variable, or "gpt-5.4".
split
str | None
default:"train"
Split name to look up in SPLIT_FILE. Pass None to run all tasks in the dataset, bypassing the split file entirely. For any named split, the split file must already exist; run prepare.py to create it.
env_provider
str
default:"e2b"
Sandbox provider passed to Harbor. Options: "e2b", "daytona", "docker".
n_concurrent
int
default:"50"
Number of tasks to run concurrently inside Harbor.
dataset
str
default:"terminal-bench@2.0"
Harbor dataset identifier.
agent_import_path
str
default:"agent.agent:HarnessAgent"
Python import path for the agent class, passed to Harbor as --agent-import-path.
per_task_timeout
int
default:"1200"
Per-task timeout in seconds. Used to compute the overall subprocess timeout for the Harbor invocation.
jobs_dir
str
default:"workspace/tbench_jobs"
Directory where Harbor writes per-job output subdirectories. Old job directories from previous runs are pruned automatically after each run.
reasoning_effort
str | None
default:"None"
Sets AGENT_REASONING_EFFORT in the subprocess environment before running Harbor.
Trace saving is disabled (HARNESS_SAVE_TRACE=0) for all non-train splits. This prevents the coding agent from reading test-split traces. Only split="train" copies traces into workspace/traces/latest/ and workspace/traces/baseline/.
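The trace-saving rule and the reasoning_effort setting both act on the environment handed to the Harbor subprocess. A minimal sketch of that environment construction, assuming a hypothetical helper named harbor_env and assuming the variable is simply left untouched for the train split:

```python
import os


def harbor_env(split, reasoning_effort=None, base_env=None):
    """Build a subprocess environment following the documented rules."""
    env = dict(os.environ if base_env is None else base_env)
    # Any split other than "train" (including None) disables trace saving.
    if split != "train":
        env["HARNESS_SAVE_TRACE"] = "0"
    # Assumption: the variable is only set when a value was passed.
    if reasoning_effort is not None:
        env["AGENT_REASONING_EFFORT"] = reasoning_effort
    return env


print(harbor_env("test", base_env={}))           # {'HARNESS_SAVE_TRACE': '0'}
print(harbor_env("train", "high", base_env={}))  # {'AGENT_REASONING_EFFORT': 'high'}
```

Copying the environment instead of mutating os.environ keeps the setting scoped to the child process.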

BirdInteractRunner

BirdInteractRunner runs the BIRD-Interact benchmark via the external BIRD-Interact-ADK repository. It starts three microservices (system agent, user simulator, database environment) and then invokes orchestrator.runner as a subprocess.
from benchmark import BirdInteractRunner

runner = BirdInteractRunner(
    split="train",
    mode="a-interact",
    dataset="lite",
    agent_model="gpt-5.4",
)
results = runner.run()
val = runner.val_score(results)

Class variables

Variable      Value
SPLIT_FILE    "bird_data/task_split.json"

Constructor

BirdInteractRunner(
    bird_repo: str | None = None,
    bird_python_bin: str | None = None,
    split: str | None = "train",
    mode: str = "a-interact",
    dataset: str = "lite",
    data_path: str | None = None,
    agent_model: str | None = None,
    user_model: str | None = None,
    patience: int = 3,
    n_concurrent: int = 3,
    per_task_timeout: int = 1800,
    jobs_dir: str = "workspace/bird_runs",
    system_agent_port: int = 6100,
    user_sim_port: int = 6101,
    db_env_port: int = 6102,
    pg_host: str | None = None,
    pg_port: int | None = None,
    pg_user: str | None = None,
    pg_password: str | None = None,
)
bird_repo
str | None
default:"None"
Path to the BIRD-Interact or BIRD-Interact-ADK directory. Passed to resolve_bird_adk_dir. Falls back to the BIRD_REPO environment variable and auto-provisioned locations.
bird_python_bin
str | None
default:"None"
Path to a Python interpreter that has the BIRD-Interact-ADK dependencies installed. Passed to resolve_bird_python_bin. Checks .venv-adk/, .venv/, and .conda-py310/ inside the ADK directory automatically.
split
str | None
default:"train"
Split name to look up in SPLIT_FILE. Pass None to run all tasks in the dataset.
mode
str
default:"a-interact"
Orchestrator run mode passed to orchestrator.runner --mode.
dataset
str
default:"lite"
Dataset variant. Used to construct the default data path (bird-interact-<dataset>/bird_interact_data.jsonl).
data_path
str | None
default:"None"
Explicit path to bird_interact_data.jsonl. Overrides the default derived from dataset. Passed to resolve_bird_data_path.
agent_model
str | None
default:"None"
Sets SYSTEM_AGENT_MODEL in the subprocess environment.
user_model
str | None
default:"None"
Sets USER_SIM_MODEL in the subprocess environment.
patience
int
default:"3"
Number of retry attempts per task. Sets the PATIENCE environment variable.
n_concurrent
int
default:"3"
Number of tasks to run in parallel via orchestrator.runner --concurrency.
per_task_timeout
int
default:"1800"
Per-task timeout in seconds. Used to compute the overall subprocess timeout.
jobs_dir
str
default:"workspace/bird_runs"
Directory for service logs and temporary input/output files. Stale temporary files from previous runs are pruned automatically.
system_agent_port
int
default:"6100"
Local port for the system agent microservice.
user_sim_port
int
default:"6101"
Local port for the user simulator microservice.
db_env_port
int
default:"6102"
Local port for the database environment microservice.
pg_host
str | None
default:"None"
PostgreSQL host. Sets PG_HOST in the subprocess environment.
pg_port
int | None
default:"None"
PostgreSQL port. Sets PG_PORT in the subprocess environment.
pg_user
str | None
default:"None"
PostgreSQL username. Sets PG_USER in the subprocess environment.
pg_password
str | None
default:"None"
PostgreSQL password. Sets PG_PASSWORD in the subprocess environment.
The constructor calls resolve_bird_adk_dir and resolve_bird_python_bin immediately and raises FileNotFoundError if either fails. Run prepare.py to auto-provision the ADK into ./bird_interact_adk/ before constructing this runner.
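Several constructor parameters map one-to-one onto environment variables for the subprocess. The mapping can be sketched as below; bird_env is a hypothetical helper, and leaving a variable unset when its parameter is None is an assumption rather than confirmed behavior.

```python
def bird_env(agent_model=None, user_model=None, patience=3,
             pg_host=None, pg_port=None, pg_user=None, pg_password=None,
             base_env=None):
    """Translate constructor arguments into the documented variable names."""
    env = dict(base_env or {})
    optional = {
        "SYSTEM_AGENT_MODEL": agent_model,
        "USER_SIM_MODEL": user_model,
        "PG_HOST": pg_host,
        "PG_PORT": pg_port,
        "PG_USER": pg_user,
        "PG_PASSWORD": pg_password,
    }
    for name, value in optional.items():
        # Assumption: None means "do not override", so the variable is skipped.
        if value is not None:
            env[name] = str(value)  # environment values must be strings
    env["PATIENCE"] = str(patience)
    return env


print(bird_env(agent_model="gpt-5.4", pg_port=5432))
# {'SYSTEM_AGENT_MODEL': 'gpt-5.4', 'PG_PORT': '5432', 'PATIENCE': '3'}
```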

Helper functions

These module-level functions are used internally by BirdInteractRunner but are also available for direct use.

resolve_bird_adk_dir

def resolve_bird_adk_dir(configured_path: str | None = None) -> str
Resolve the BIRD-Interact-ADK directory from a repo root or direct path. Searches in order: configured_path, BIRD_REPO env var, ./bird_interact_adk/BIRD-Interact-ADK/, ./bird_interact_adk/, ../BIRD-Interact/, ../BIRD-Interact/BIRD-Interact-ADK/, ./BIRD-Interact-ADK/.
configured_path
str | None
default:"None"
An explicit path to try first, before falling back to the default search candidates.
Returns: Absolute path to the BIRD-Interact-ADK directory (must contain orchestrator/runner.py).
Raises: FileNotFoundError if no valid ADK directory is found.
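The search order above amounts to a first-match scan over candidate directories. The sketch below is a simplified stand-in, not the real function: find_adk_dir and its cwd parameter are hypothetical, and it checks each candidate directly without any extra handling a repo-root path might receive in the actual implementation.

```python
import os
import tempfile

# Default candidates in the documented order, relative to the working directory.
DEFAULT_CANDIDATES = [
    "bird_interact_adk/BIRD-Interact-ADK",
    "bird_interact_adk",
    "../BIRD-Interact",
    "../BIRD-Interact/BIRD-Interact-ADK",
    "BIRD-Interact-ADK",
]


def find_adk_dir(configured_path=None, env=None, cwd="."):
    """Return the first candidate that contains orchestrator/runner.py."""
    env = os.environ if env is None else env
    candidates = [configured_path, env.get("BIRD_REPO")]
    candidates += [os.path.join(cwd, rel) for rel in DEFAULT_CANDIDATES]
    for candidate in candidates:
        if not candidate:
            continue
        path = os.path.abspath(os.path.expanduser(candidate))
        if os.path.isfile(os.path.join(path, "orchestrator", "runner.py")):
            return path
    raise FileNotFoundError("no BIRD-Interact-ADK directory found")


# Demo: build a fake ADK layout in a temporary directory and resolve it.
with tempfile.TemporaryDirectory() as root:
    adk = os.path.join(root, "bird_interact_adk")
    os.makedirs(os.path.join(adk, "orchestrator"))
    open(os.path.join(adk, "orchestrator", "runner.py"), "w").close()
    found = find_adk_dir(env={}, cwd=root)
    expected = os.path.abspath(adk)
    print(found == expected)  # True
```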

resolve_bird_python_bin

def resolve_bird_python_bin(adk_dir: str, configured_python: str | None = None) -> str | None
Pick a Python interpreter that has the BIRD-Interact-ADK dependencies installed. Searches in order: configured_python, BIRD_PYTHON_BIN env var, <adk_dir>/.venv-adk/bin/python, <adk_dir>/.venv/bin/python, <adk_dir>/.conda-py310/bin/python, python3 on PATH, python on PATH.
adk_dir
str
required
Path to the BIRD-Interact-ADK directory, as returned by resolve_bird_adk_dir.
configured_python
str | None
default:"None"
An explicit interpreter path to try first.
Returns: Absolute path to the Python interpreter, or None if no valid interpreter is found.
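A sketch of the same lookup order follows. It is deliberately simpler than the description: find_python is a hypothetical name, and it only checks that a candidate exists and is executable, whereas the real function may also verify that the ADK dependencies are importable.

```python
import os
import shutil
import tempfile


def find_python(adk_dir, configured_python=None, env=None):
    """Return the first usable interpreter path, or None if nothing matches."""
    env = os.environ if env is None else env
    candidates = [configured_python, env.get("BIRD_PYTHON_BIN")]
    for venv in (".venv-adk", ".venv", ".conda-py310"):
        candidates.append(os.path.join(adk_dir, venv, "bin", "python"))
    for candidate in candidates:
        if not candidate:
            continue
        path = os.path.abspath(os.path.expanduser(candidate))
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    # Last resort: whatever python3 or python is first on PATH.
    for name in ("python3", "python"):
        on_path = shutil.which(name)
        if on_path:
            return os.path.abspath(on_path)
    return None


# Demo: a fake .venv interpreter inside a temporary ADK directory wins
# over anything on PATH.
with tempfile.TemporaryDirectory() as adk_dir:
    bin_dir = os.path.join(adk_dir, ".venv", "bin")
    os.makedirs(bin_dir)
    fake_python = os.path.join(bin_dir, "python")
    with open(fake_python, "w") as fh:
        fh.write("#!/bin/sh\n")
    os.chmod(fake_python, 0o755)
    resolved = find_python(adk_dir, env={})
    expected_python = os.path.abspath(fake_python)
    print(resolved == expected_python)  # True
```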

resolve_bird_data_path

def resolve_bird_data_path(
    adk_dir: str,
    dataset: str = "lite",
    configured_data_path: str | None = None,
) -> str
Resolve the bird_interact_data.jsonl path for the given dataset variant.
adk_dir
str
required
Path to the BIRD-Interact-ADK directory.
dataset
str
default:"lite"
Dataset variant name. The default path is <adk_dir>/bird-interact-<dataset>/bird_interact_data.jsonl.
configured_data_path
str | None
default:"None"
If provided, returns this path directly (after abspath + expanduser), ignoring adk_dir and dataset.
Returns: Absolute path to bird_interact_data.jsonl.
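The path rule is small enough to restate in a few lines. This is a sketch of the documented behavior under a hypothetical name, data_path, not the module's source.

```python
import os


def data_path(adk_dir, dataset="lite", configured_data_path=None):
    """Explicit path wins; otherwise derive the path from adk_dir and dataset."""
    if configured_data_path:
        return os.path.abspath(os.path.expanduser(configured_data_path))
    return os.path.abspath(
        os.path.join(adk_dir, f"bird-interact-{dataset}", "bird_interact_data.jsonl")
    )


print(data_path("/opt/adk"))
print(data_path("/opt/adk", dataset="full"))
print(data_path("/opt/adk", configured_data_path="~/data/custom.jsonl"))
```

The "full" variant name in the second call is only an example value; the source does not list the available dataset variants.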

CLI usage

benchmark.py can also be invoked directly from the command line. It reads experiment_config.yaml to determine the benchmark type and constructs the appropriate runner automatically.
# Run the train split (default)
python benchmark.py

# Run the test split
python benchmark.py --split test

# Run specific tasks only
python benchmark.py --task-ids 0 1 42

# Override concurrency
python benchmark.py --split train --concurrency 10

# tau-bench: override domain on the command line
python benchmark.py --domain airline --split test
Results are printed to stdout and saved to workspace/train_results.json.
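For readers who want to see how these flags might be parsed, here is an argparse sketch. The flag names come from the examples above and the train default is documented; the types, the remaining defaults, and the build_parser name are assumptions, and the real script's handling of experiment_config.yaml is not modeled.

```python
import argparse


def build_parser():
    """Parser covering only the flags shown in the CLI examples."""
    parser = argparse.ArgumentParser(prog="benchmark.py")
    parser.add_argument("--split", default="train")
    # Task IDs are kept as strings, matching run(task_ids: list[str] | None).
    parser.add_argument("--task-ids", nargs="+", default=None)
    # Assumption: unset flags fall back to the values in the config file.
    parser.add_argument("--concurrency", type=int, default=None)
    parser.add_argument("--domain", default=None)
    return parser


args = build_parser().parse_args(["--task-ids", "0", "1", "42"])
print(args.split, args.task_ids)  # train ['0', '1', '42']
```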
