benchmark.py is the benchmark execution layer for auto-harness. It defines an abstract BenchmarkRunner base class and three concrete implementations — TauBenchRunner, TerminalBenchRunner, and BirdInteractRunner — each wrapping a different evaluation backend. Both gating.py and the coding agent call this module directly to measure agent performance.
BenchmarkRunner
BenchmarkRunner is the abstract base class all runners inherit from. Subclass it and implement run to plug in a custom benchmark.
Methods
run
Specific task IDs to run. Pass None to run the full benchmark.
Returns: a dict mapping task_id → reward. Reward is a float in [0.0, 1.0], or None if the task timed out or produced no verifier result. None counts as 0.0 in val_score.
val_score
None rewards are counted as 0.0.
Takes the results dict returned by run.
Returns: a float, or 0.0 if results is empty.
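The subclassing contract described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the `BenchmarkRunner` here is a local stand-in rather than an import from benchmark.py, the `task_ids` parameter name and the `EchoRunner` class are hypothetical, and treating `val_score` as a mean over rewards is an assumption (the page only states that None counts as 0.0 and that an empty dict yields 0.0).

```python
from abc import ABC, abstractmethod
from typing import Optional


class BenchmarkRunner(ABC):
    """Local stand-in for the real base class (not imported from benchmark.py)."""

    @abstractmethod
    def run(self, task_ids: Optional[list[str]] = None) -> dict[str, Optional[float]]:
        """Return a mapping of task_id -> reward (None for tasks with no result)."""

    def val_score(self, results: dict[str, Optional[float]]) -> float:
        # Assumption: the score is the mean reward, with None counted as 0.0.
        if not results:
            return 0.0
        return sum(r or 0.0 for r in results.values()) / len(results)


class EchoRunner(BenchmarkRunner):
    """Hypothetical custom benchmark that returns hard-coded rewards."""

    REWARDS = {"task-a": 1.0, "task-b": 0.5, "task-c": None}

    def run(self, task_ids=None):
        # None means "run everything", mirroring the documented behaviour.
        ids = list(self.REWARDS) if task_ids is None else task_ids
        return {tid: self.REWARDS[tid] for tid in ids}


runner = EchoRunner()
results = runner.run()
score = runner.val_score(results)
```

Under these assumptions the timed-out task lowers the score instead of being excluded from it.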
TauBenchRunner
TauBenchRunner runs the tau-bench benchmark using the tau2 Python API directly — no subprocess.
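The environment fallbacks this runner documents (AGENT_MODEL defaulting to "gpt-5.4", and TAU2_DATA_DIR being filled in only when absent) might look roughly like the sketch below. The helper name `resolve_tau_defaults` is hypothetical, and the real runner may handle these values differently, for example by treating an empty string as unset.

```python
import os


def resolve_tau_defaults(env: dict) -> tuple[str, str]:
    """Hypothetical helper: apply the documented environment fallbacks."""
    # Agent model: AGENT_MODEL if present, otherwise "gpt-5.4".
    agent_model = env.get("AGENT_MODEL", "gpt-5.4")
    # TAU2_DATA_DIR is only filled in when the caller has not set it.
    data_dir = env.setdefault("TAU2_DATA_DIR", "./tau2_data/")
    return agent_model, data_dir


# Work on a copy so the example does not modify the real process environment.
env = dict(os.environ)
env.pop("AGENT_MODEL", None)
env.pop("TAU2_DATA_DIR", None)
defaults = resolve_tau_defaults(env)
```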
Constructor
The tau-bench domain to evaluate. Valid values include "retail", "airline", and "telecom".
Model identifier for the agent LLM. Defaults to the AGENT_MODEL environment variable, or "gpt-5.4" if that is unset.
Dataset split to use for evaluation.
Maximum number of simultaneous task simulations.
Random seed passed to tau2 for reproducible simulation ordering.
Sets the AGENT_REASONING_EFFORT environment variable before running. Accepted values depend on the model provider.
Model for the user simulator. Defaults to agent_model when not set.
TAU2_DATA_DIR is set automatically to ./tau2_data/ if not already in the environment. Run prepare.py to ensure the data directory is populated before calling run.
TerminalBenchRunner
TerminalBenchRunner runs Terminal-Bench 2.0 via the Harbor framework, invoking harbor run as a subprocess and parsing per-task result.json files from the output directory.
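The subprocess pattern can be illustrated with the standard library alone. The sketch below passes AGENT_REASONING_EFFORT (covered under Constructor) to a child process through a copied environment and enforces an overall timeout. The function name `run_with_effort`, the value "high", and the placeholder child command are all assumptions for illustration. The actual `harbor run` arguments and the way the runner derives its overall timeout are not shown on this page and are not reproduced here.

```python
import os
import subprocess
import sys


def run_with_effort(cmd: list[str], reasoning_effort: str, timeout_s: float) -> str:
    """Run cmd with AGENT_REASONING_EFFORT set only in the child's environment."""
    env = os.environ.copy()  # copy, so the parent process environment is untouched
    env["AGENT_REASONING_EFFORT"] = reasoning_effort
    proc = subprocess.run(
        cmd, env=env, capture_output=True, text=True, timeout=timeout_s, check=True
    )
    return proc.stdout.strip()


# Placeholder child process that echoes the variable back; the real runner
# would invoke Harbor here instead.
child = [sys.executable, "-c", "import os; print(os.environ['AGENT_REASONING_EFFORT'])"]
echoed = run_with_effort(child, "high", timeout_s=30)
```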
Class variables
| Variable | Value |
|---|---|
| SPLIT_FILE | "tbench_data/task_split.json" |
Constructor
Model identifier for the agent. Defaults to the AGENT_MODEL environment variable, or "gpt-5.4".
Split name to look up in SPLIT_FILE. Pass None to run all tasks in the dataset, bypassing the split file entirely. The split file must exist (via prepare.py) for any named split.
Sandbox provider passed to Harbor. Options: "e2b", "daytona", "docker".
Number of tasks to run concurrently inside Harbor.
Harbor dataset identifier.
Python import path for the agent class, passed to Harbor as --agent-import-path.
Per-task timeout in seconds. Used to compute the overall subprocess timeout for the Harbor invocation.
Directory where Harbor writes per-job output subdirectories. Old job directories from previous runs are pruned automatically after each run.
Sets AGENT_REASONING_EFFORT in the subprocess environment before running Harbor.
BirdInteractRunner
BirdInteractRunner runs the BIRD-Interact benchmark via the external BIRD-Interact-ADK repository. It starts three microservices (system agent, user simulator, database environment) and then invokes orchestrator.runner as a subprocess.
Class variables
| Variable | Value |
|---|---|
| SPLIT_FILE | "bird_data/task_split.json" |
Constructor
Path to the BIRD-Interact or BIRD-Interact-ADK directory. Passed to resolve_bird_adk_dir. Falls back to the BIRD_REPO environment variable and auto-provisioned locations.
Path to a Python interpreter that has the BIRD-Interact-ADK dependencies installed. Passed to resolve_bird_python_bin. Checks .venv-adk/, .venv/, and .conda-py310/ inside the ADK directory automatically.
Split name to look up in SPLIT_FILE. Pass None to run all tasks in the dataset.
Orchestrator run mode passed to orchestrator.runner --mode.
Dataset variant. Used to construct the default data path (bird-interact-<dataset>/bird_interact_data.jsonl).
Explicit path to bird_interact_data.jsonl. Overrides the default derived from dataset. Passed to resolve_bird_data_path.
Sets SYSTEM_AGENT_MODEL in the subprocess environment.
Sets USER_SIM_MODEL in the subprocess environment.
Number of retry attempts per task. Sets the PATIENCE environment variable.
Number of tasks to run in parallel via orchestrator.runner --concurrency.
Per-task timeout in seconds. Used to compute the overall subprocess timeout.
Directory for service logs and temporary input/output files. Stale temporary files from previous runs are pruned automatically.
Local port for the system agent microservice.
Local port for the user simulator microservice.
Local port for the database environment microservice.
PostgreSQL host. Sets PG_HOST in the subprocess environment.
PostgreSQL port. Sets PG_PORT in the subprocess environment.
PostgreSQL username. Sets PG_USER in the subprocess environment.
PostgreSQL password. Sets PG_PASSWORD in the subprocess environment.
The constructor calls resolve_bird_adk_dir and resolve_bird_python_bin immediately and raises FileNotFoundError if either fails. Run prepare.py to auto-provision the ADK into ./bird_interact_adk/ before constructing this runner.
Helper functions
These module-level functions are used internally by BirdInteractRunner but are also available for direct use.
resolve_bird_adk_dir
Resolves the BIRD-Interact-ADK directory. Candidates are tried in order: configured_path, the BIRD_REPO env var, ./bird_interact_adk/BIRD-Interact-ADK/, ./bird_interact_adk/, ../BIRD-Interact/, ../BIRD-Interact/BIRD-Interact-ADK/, ./BIRD-Interact-ADK/.
configured_path: an explicit path to try first, before falling back to the default search candidates.
Returns: the resolved ADK directory (the one containing orchestrator/runner.py).
Raises: FileNotFoundError if no valid ADK directory is found.
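The candidate search can be sketched as a first-match loop. This is an illustration under stated assumptions, not the module's implementation. The function name `find_adk_dir` and its `env` and `base` parameters are hypothetical (added so the example can run against a temporary directory), and the validity check (presence of orchestrator/runner.py) is inferred from the description above.

```python
import os
import tempfile
from typing import Optional


def find_adk_dir(configured_path: Optional[str], env: dict, base: str) -> str:
    """Return the first candidate that looks like an ADK checkout."""
    candidates = [
        configured_path,
        env.get("BIRD_REPO"),
        os.path.join(base, "bird_interact_adk", "BIRD-Interact-ADK"),
        os.path.join(base, "bird_interact_adk"),
        os.path.join(base, "..", "BIRD-Interact"),
        os.path.join(base, "..", "BIRD-Interact", "BIRD-Interact-ADK"),
        os.path.join(base, "BIRD-Interact-ADK"),
    ]
    for cand in candidates:
        if not cand:
            continue  # skip an unset configured_path or BIRD_REPO
        path = os.path.abspath(os.path.expanduser(cand))
        # Assumed validity check: the orchestrator entry point exists.
        if os.path.isfile(os.path.join(path, "orchestrator", "runner.py")):
            return path
    raise FileNotFoundError("no valid BIRD-Interact-ADK directory found")


# Demo against a throwaway directory tree.
with tempfile.TemporaryDirectory() as base:
    adk = os.path.join(base, "bird_interact_adk")
    os.makedirs(os.path.join(adk, "orchestrator"))
    open(os.path.join(adk, "orchestrator", "runner.py"), "w").close()
    found = find_adk_dir(None, {}, base)
    found_ok = os.path.samefile(found, adk)
```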
resolve_bird_python_bin
Resolves the Python interpreter used to run the ADK. Candidates are tried in order: configured_python, the BIRD_PYTHON_BIN env var, <adk_dir>/.venv-adk/bin/python, <adk_dir>/.venv/bin/python, <adk_dir>/.conda-py310/bin/python, python3 on PATH, python on PATH.
adk_dir: path to the BIRD-Interact-ADK directory, as returned by resolve_bird_adk_dir.
configured_python: an explicit interpreter path to try first.
Returns: the interpreter path, or None if no valid interpreter is found.
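A comparable sketch for the interpreter lookup follows. The function name `find_python_bin` and its `env` parameter are hypothetical, and the notion of a "valid" interpreter used here (a regular file with execute permission) is an assumption; the real helper may apply a stricter check, such as verifying that the ADK dependencies import.

```python
import os
import shutil
import sys
from typing import Optional


def find_python_bin(adk_dir: str, configured_python: Optional[str], env: dict) -> Optional[str]:
    """Return the first usable interpreter from the documented candidates, else None."""
    candidates = [
        configured_python,
        env.get("BIRD_PYTHON_BIN"),
        os.path.join(adk_dir, ".venv-adk", "bin", "python"),
        os.path.join(adk_dir, ".venv", "bin", "python"),
        os.path.join(adk_dir, ".conda-py310", "bin", "python"),
        shutil.which("python3"),  # None when not on PATH
        shutil.which("python"),
    ]
    for cand in candidates:
        if not cand:
            continue
        path = os.path.abspath(os.path.expanduser(cand))
        # Assumed validity check: a regular file with execute permission.
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


# An explicit interpreter wins over everything else in the list.
chosen = find_python_bin("/no/such/adk", sys.executable, {})
```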
resolve_bird_data_path
Returns the bird_interact_data.jsonl path for the given dataset variant.
adk_dir: path to the BIRD-Interact-ADK directory.
dataset: dataset variant name. The default path is <adk_dir>/bird-interact-<dataset>/bird_interact_data.jsonl.
If an explicit path is provided, it is returned directly (after abspath + expanduser), ignoring adk_dir and dataset.
Returns: the path to bird_interact_data.jsonl.
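The path rule above is small enough to sketch directly. The function name `data_path`, the `configured_path` parameter name, and the "lite" dataset value are placeholders for illustration; only the path template and the override behaviour come from the description.

```python
import os
from typing import Optional


def data_path(adk_dir: str, dataset: str, configured_path: Optional[str] = None) -> str:
    """Build the default data path, or normalise an explicit override."""
    if configured_path:
        # Explicit override: expand "~", make absolute, ignore adk_dir and dataset.
        return os.path.abspath(os.path.expanduser(configured_path))
    return os.path.join(adk_dir, f"bird-interact-{dataset}", "bird_interact_data.jsonl")


default = data_path("/opt/adk", "lite")
override = data_path("/opt/adk", "lite", "/data/custom.jsonl")
```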
CLI usage
benchmark.py can also be invoked directly from the command line. It reads experiment_config.yaml to determine the benchmark type and constructs the appropriate runner automatically.
Results are written to workspace/train_results.json.