Terminal-Bench 2.0 is a benchmark of 89 real-world terminal tasks that require an agent to solve practical problems by executing bash commands in an isolated Linux container. Tasks span three categories — software development and build tooling, system administration, and security challenges — making it a rigorous test of whether an agent can operate effectively as an autonomous shell user. auto-harness runs it via the harbor CLI and the TerminalBenchRunner class, which handles task selection, environment setup, result parsing, and trace management.

Agent interface

The agent receives a task description and interacts exclusively through a single bash tool that executes commands in a Harbor-managed container. There are no structured tool schemas beyond this one call, so the agent must plan and verify its work through shell output alone. The starting template at agent/templates/terminal_bench.py defines the initial system prompt, the bash tool schema, and the HarnessAgent.run() loop. The optimization loop edits agent/agent.py (copied from that template by prepare.py) to improve performance.

Environment providers

Harbor supports three sandbox providers. Set env_provider in experiment_config.yaml:
| Provider | Description | Required credential |
| --- | --- | --- |
| e2b | Hosted cloud sandboxes via E2B | E2B_API_KEY |
| daytona | Hosted sandboxes via Daytona | DAYTONA_API_KEY |
| docker | Local Docker containers | None |
e2b is the default and recommended option for fast parallel runs. Use docker if you need fully local execution or want to avoid API costs.
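As a small illustration of the provider table, the sketch below checks whether the credential for a chosen provider is set before a run starts. The helper name missing_credential is hypothetical and not part of auto-harness or harbor; only the provider names and environment variable names come from the table.

```python
import os

# Provider -> required credential, taken from the table above.
REQUIRED_CREDENTIAL = {
    "e2b": "E2B_API_KEY",
    "daytona": "DAYTONA_API_KEY",
    "docker": None,  # local Docker needs no credential
}


def missing_credential(env_provider, environ=None):
    """Return the name of the unset credential for env_provider, or None.

    Hypothetical helper for illustration; not part of auto-harness.
    """
    environ = os.environ if environ is None else environ
    if env_provider not in REQUIRED_CREDENTIAL:
        raise ValueError(f"unknown env_provider: {env_provider!r}")
    key = REQUIRED_CREDENTIAL[env_provider]
    if key is not None and not environ.get(key):
        return key
    return None
```

Calling missing_credential("e2b") before prepare.py would surface an unset E2B_API_KEY early instead of partway through a run.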

TerminalBenchRunner

TerminalBenchRunner in benchmark.py is the concrete BenchmarkRunner subclass for Terminal-Bench 2.0. It invokes harbor run as a subprocess, waits for results, and parses per-task result.json files.

Constructor

TerminalBenchRunner(
    agent_model: str | None = None,       # default: env AGENT_MODEL or "gpt-5.4"
    split: str | None = "train",          # "train", "test", or None (all tasks)
    env_provider: str = "e2b",            # "e2b", "daytona", or "docker"
    n_concurrent: int = 50,               # tasks run in parallel
    dataset: str = "terminal-bench@2.0",  # harbor dataset identifier
    agent_import_path: str = "agent.agent:HarnessAgent",
    per_task_timeout: int = 1200,         # seconds; tasks that exceed this are recorded as None (0.0 in val_score)
    jobs_dir: str = "workspace/tbench_jobs",
    reasoning_effort: str | None = None,  # passed as AGENT_REASONING_EFFORT
)

Split file

The train/test split is stored at tbench_data/task_split.json (the SPLIT_FILE class constant). This file is created by prepare.py during the baseline run and is never overwritten by subsequent runs.
TerminalBenchRunner.SPLIT_FILE = "tbench_data/task_split.json"

Running specific tasks

To run a subset from the command line, pass the task IDs to --task-ids:
python benchmark.py --task-ids cobol-modernization regex-log
From Python, pass a list of task ID strings to run():
runner = TerminalBenchRunner(split="train")
results = runner.run(task_ids=["cobol-modernization", "regex-log"])

Result schema

Harbor writes a result.json file for each completed task. TerminalBenchRunner expects this exact schema:
{
  "task_name": "<id>",
  "verifier_result": {
    "rewards": {
      "reward": 0.85
    }
  }
}
If verifier_result is absent (the verifier did not run — usually an infrastructure error), the runner records None for that task, which counts as 0.0 in val_score.
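The parsing rule above can be sketched as follows. This is a minimal illustration, not the runner's actual code: the function names are hypothetical, and computing val_score as a plain mean over tasks is an assumption, since the text only states that None counts as 0.0.

```python
import json


def read_reward(result_text):
    """Return the reward from one result.json payload.

    Returns None when verifier_result is absent, i.e. the verifier did not run.
    """
    data = json.loads(result_text)
    verifier = data.get("verifier_result")
    if not verifier:
        return None
    return (verifier.get("rewards") or {}).get("reward")


def val_score(rewards):
    """Aggregate per-task rewards, counting None as 0.0.

    Assumption: the aggregate is a simple mean over all tasks.
    """
    if not rewards:
        return 0.0
    return sum(0.0 if r is None else r for r in rewards) / len(rewards)
```

Under these assumptions, a run with one fully solved task and one task whose verifier never ran would score 0.5.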

Trace management

After each train-split run, the runner copies traces from the Harbor output directory into the workspace:
| Directory | Contents | Overwritten? |
| --- | --- | --- |
| workspace/traces/latest/ | Most recent run per task | Yes, every run |
| workspace/traces/baseline/ | First-run traces | No, written once |
Each task gets a subdirectory with two files:
workspace/traces/latest/<task_name>/
├── trace.json    # full agent conversation (messages, tool calls, outputs)
└── result.json   # reward, duration, verifier output
Only train-split traces are saved. The runner sets HARNESS_SAVE_TRACE=0 for any run where split != "train", preventing the coding agent from reading test data.
The coding agent should only read workspace/traces/latest/. The raw Harbor job output in workspace/tbench_jobs/ contains both train and test data and must not be read directly.
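To show how the trace layout might be used when analyzing failures, the sketch below lists tasks in workspace/traces/latest/ whose reward is missing or below a threshold. It assumes each copied result.json follows the Harbor schema shown under "Result schema"; the function name and the pass_reward threshold of 1.0 are hypothetical.

```python
import json
from pathlib import Path


def failed_tasks(traces_dir="workspace/traces/latest", pass_reward=1.0):
    """Return sorted task names whose reward is missing or below pass_reward.

    Assumes one subdirectory per task, each containing a result.json with
    the verifier_result.rewards.reward field.
    """
    failed = []
    for result_path in sorted(Path(traces_dir).glob("*/result.json")):
        data = json.loads(result_path.read_text())
        verifier = data.get("verifier_result") or {}
        reward = (verifier.get("rewards") or {}).get("reward")
        if reward is None or reward < pass_reward:
            # The subdirectory name is the task name.
            failed.append(result_path.parent.name)
    return failed
```

The matching trace.json in each returned task directory is then the natural place to look for the failing conversation.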

Configuration

Uncomment and edit the terminal-bench block in experiment_config.yaml:
benchmark: "terminal-bench"
agent_model: "gpt-5.4"
split: "train"
gate_split: "test"
env_provider: "e2b"            # "e2b", "daytona", or "docker"
max_concurrency: 50            # tasks run in parallel
threshold: 0.8                 # regression suite pass rate threshold
reasoning_effort: "medium"     # optional
per_task_timeout: 1200         # seconds; tasks that exceed this score 0.0 in val_score
Required environment variables:
  • OPENAI_API_KEY (or ANTHROPIC_API_KEY for Claude models)
  • E2B_API_KEY (if using the e2b provider) or DAYTONA_API_KEY (if using daytona)

Quick start

1. Install harbor

uv tool install harbor

2. Set environment variables

cp .env.example .env
# Set OPENAI_API_KEY and E2B_API_KEY in .env

3. Configure the experiment

cp experiment_config.yaml.template experiment_config.yaml
# Uncomment the terminal-bench section and set your model

4. Run prepare.py

python prepare.py
This runs all 89 tasks, generates the train/test split at tbench_data/task_split.json, and records the baseline score as iteration 0.

5. Start the optimization loop

Point your coding agent at the repo and prompt:
Read PROGRAM.md and start the optimization loop.
The baseline is already recorded. Start from step 2 (analyze failures).

Known techniques that improve scores

The program_templates/terminal_bench.md file documents techniques the coding agent can apply to agent/agent.py:
  • Environment bootstrapping — gather OS info, installed tools, and file listing before starting (+5–10%)
  • Enforced TODO planning — make the model create and maintain a step-by-step plan (+10–20%, largest single gain)
  • Non-interactive mode — never ask clarifying questions, always act (+3–5%)
  • Double-confirmation — verify task completion before declaring done (+3–5%)
  • Forced reasoning in tool schema — add analysis and plan fields to the bash tool definition
To see all changes the coding agent has made relative to the starting template, run diff agent/templates/terminal_bench.py agent/agent.py.
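As an illustration of the last technique in the list, here is one way a bash tool definition with analysis and plan fields could look. This is a sketch only: it uses the OpenAI Chat Completions function-tool format, and the field descriptions and the command parameter name are assumptions; the actual schema in agent/templates/terminal_bench.py may differ.

```python
# Hypothetical bash tool schema with forced-reasoning fields.
# "analysis" and "plan" are listed before "command" and marked required,
# so the model has to state its reasoning in every tool call.
BASH_TOOL = {
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Run a bash command in the task container.",
        "parameters": {
            "type": "object",
            "properties": {
                "analysis": {
                    "type": "string",
                    "description": "What the previous output shows.",
                },
                "plan": {
                    "type": "string",
                    "description": "The next step and why it is needed.",
                },
                "command": {
                    "type": "string",
                    "description": "The bash command to execute.",
                },
            },
            "required": ["analysis", "plan", "command"],
        },
    },
}
```

Only the command value would be executed; the other two fields exist to make the model reason before acting and can be logged in the trace.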
