Terminal-Bench 2.0: bash agent benchmark in Harbor

Terminal-Bench 2.0 is a benchmark of 89 real-world terminal tasks that require an agent to solve practical problems by executing bash commands in an isolated Linux container. Tasks span three categories — software development and build tooling, system administration, and security challenges — making it a rigorous test of whether an agent can operate effectively as an autonomous shell user. auto-harness runs it via the harbor CLI and the TerminalBenchRunner class, which handles task selection, environment setup, result parsing, and trace management.

Agent interface

The agent receives a task description and interacts exclusively through a single bash tool that executes commands in a Harbor-managed container. There are no structured tool schemas beyond this one call, so the agent must plan and verify its work through shell output alone. The starting template at agent/templates/terminal_bench.py defines the initial system prompt, the bash tool schema, and the HarnessAgent.run() loop. The optimization loop edits agent/agent.py (copied from that template by prepare.py) to improve performance.

Environment providers

Harbor supports three sandbox providers. Set env_provider in experiment_config.yaml:

Provider	Description	Required credential
`e2b`	Hosted cloud sandboxes via E2B	`E2B_API_KEY`
`daytona`	Hosted sandboxes via Daytona	`DAYTONA_API_KEY`
`docker`	Local Docker containers	None

e2b is the default and recommended option for fast parallel runs. Use docker if you need fully local execution or want to avoid API costs.
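Before launching a run, it can help to confirm that the credential for the chosen provider is actually set. The following is a minimal sketch built only from the table above; the helper name missing_credential and the REQUIRED_CREDENTIAL mapping are illustrative and are not part of auto-harness or Harbor.

```python
import os

# Credential each provider needs, taken from the provider table above.
REQUIRED_CREDENTIAL = {
    "e2b": "E2B_API_KEY",
    "daytona": "DAYTONA_API_KEY",
    "docker": None,  # local Docker containers need no API key
}


def missing_credential(env_provider, environ=None):
    """Return the name of the unset credential for env_provider, or None.

    Hypothetical helper for illustration; not part of the harness.
    """
    environ = os.environ if environ is None else environ
    if env_provider not in REQUIRED_CREDENTIAL:
        raise ValueError(f"unknown env_provider: {env_provider!r}")
    var = REQUIRED_CREDENTIAL[env_provider]
    if var is not None and not environ.get(var):
        return var
    return None
```

Calling missing_credential("e2b") with no E2B_API_KEY in the environment returns the variable name, so a wrapper script can fail early instead of waiting for the sandbox provider to reject the request.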

TerminalBenchRunner

TerminalBenchRunner in benchmark.py is the concrete BenchmarkRunner subclass for Terminal-Bench 2.0. It invokes harbor run as a subprocess, waits for results, and parses per-task result.json files.

Constructor

TerminalBenchRunner(
    agent_model: str | None = None,       # default: env AGENT_MODEL or "gpt-5.4"
    split: str | None = "train",          # "train", "test", or None (all tasks)
    env_provider: str = "e2b",            # "e2b", "daytona", or "docker"
    n_concurrent: int = 50,               # tasks run in parallel
    dataset: str = "terminal-bench@2.0",  # harbor dataset identifier
    agent_import_path: str = "agent.agent:HarnessAgent",
    per_task_timeout: int = 1200,         # seconds; tasks that exceed this are recorded as None (0.0 in val_score)
    jobs_dir: str = "workspace/tbench_jobs",
    reasoning_effort: str | None = None,  # passed as AGENT_REASONING_EFFORT
)

Split file

The train/test split is stored at tbench_data/task_split.json (the SPLIT_FILE class constant). This file is created by prepare.py during the baseline run and is never overwritten by subsequent runs.

TerminalBenchRunner.SPLIT_FILE = "tbench_data/task_split.json"

Running specific tasks

To execute a subset, pass task IDs with --task-ids on the command line, or pass a list of task ID strings to run():

python benchmark.py --task-ids cobol-modernization regex-log

runner = TerminalBenchRunner(split="train")
results = runner.run(task_ids=["cobol-modernization", "regex-log"])

Result schema

Harbor writes a result.json file for each completed task. TerminalBenchRunner expects this exact schema:

{
  "task_name": "<id>",
  "verifier_result": {
    "rewards": {
      "reward": 0.85
    }
  }
}

If verifier_result is absent, the verifier did not run, usually because of an infrastructure error. The runner records None for that task, and None counts as 0.0 in val_score.
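The parsing rule above can be sketched in a few lines. This is not the runner's actual code: read_reward and val_score are hypothetical names, and treating val_score as the plain mean of per-task rewards is an assumption. The source only states that None counts as 0.0.

```python
import json
from pathlib import Path


def read_reward(result_path):
    """Return the reward from a result.json file, or None if the verifier did not run."""
    data = json.loads(Path(result_path).read_text())
    verifier_result = data.get("verifier_result")
    if not verifier_result:
        # Verifier output is missing, usually an infrastructure error.
        return None
    return verifier_result.get("rewards", {}).get("reward")


def val_score(rewards):
    """Mean reward over all tasks, counting None as 0.0.

    Assumption: the score is an unweighted mean over a {task_name: reward} dict.
    """
    if not rewards:
        return 0.0
    total = sum(r if r is not None else 0.0 for r in rewards.values())
    return total / len(rewards)
```

Keeping None distinct from 0.0 until the final aggregation preserves the difference between a task the agent failed and a task that never got verified, which is useful when deciding whether to rerun.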

Trace management

After each train-split run, the runner copies traces from the Harbor output directory into the workspace:

Directory	Contents	Overwritten?
`workspace/traces/latest/`	Most recent run per task	Yes, every run
`workspace/traces/baseline/`	First-run traces	No — written once

Each task gets a subdirectory with two files:

workspace/traces/latest/<task_name>/
├── trace.json    # full agent conversation (messages, tool calls, outputs)
└── result.json   # reward, duration, verifier output

Only train-split traces are saved. The runner sets HARNESS_SAVE_TRACE=0 for any run where split != "train", preventing the coding agent from reading test data.

The coding agent should only read workspace/traces/latest/. The raw Harbor job output in workspace/tbench_jobs/ contains both train and test data and must not be read directly.
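As a sketch of how the trace layout above might be consumed, the function below walks workspace/traces/latest/ and lists the tasks worth inspecting first. It assumes the copied result.json keeps the verifier_result.rewards.reward shape shown under Result schema, and that a reward of 1.0 means a full pass; both are assumptions, and failing_tasks is a hypothetical helper name.

```python
import json
from pathlib import Path


def failing_tasks(traces_dir="workspace/traces/latest", pass_reward=1.0):
    """Return sorted names of tasks whose saved reward is missing or below pass_reward."""
    failures = []
    root = Path(traces_dir)
    if not root.is_dir():
        return failures
    for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        result_file = task_dir / "result.json"
        reward = None
        if result_file.is_file():
            data = json.loads(result_file.read_text())
            # Assumed schema: {"verifier_result": {"rewards": {"reward": <float>}}}
            rewards = (data.get("verifier_result") or {}).get("rewards") or {}
            reward = rewards.get("reward")
        if reward is None or reward < pass_reward:
            failures.append(task_dir.name)
    return failures
```

The function reads only from the traces directory it is given, so pointing it at workspace/traces/latest/ stays within the train-split data the coding agent is allowed to see.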

Configuration

Uncomment and edit the terminal-bench block in experiment_config.yaml:

benchmark: "terminal-bench"
agent_model: "gpt-5.4"
split: "train"
gate_split: "test"
env_provider: "e2b"            # "e2b", "daytona", or "docker"
max_concurrency: 50            # tasks run in parallel
threshold: 0.8                 # regression suite pass rate threshold
reasoning_effort: "medium"     # optional
per_task_timeout: 1200         # seconds; tasks that exceed this score 0.0 in val_score

Required environment variables:

OPENAI_API_KEY (or ANTHROPIC_API_KEY for Claude models)
E2B_API_KEY (if using the e2b provider) or DAYTONA_API_KEY (if using daytona)

Quick start

Install harbor

uv tool install harbor

Set environment variables

cp .env.example .env
# Set OPENAI_API_KEY and E2B_API_KEY in .env

Configure the experiment

cp experiment_config.yaml.template experiment_config.yaml
# Uncomment the terminal-bench section and set your model

Run prepare.py

python prepare.py

This runs all 89 tasks, generates the train/test split at tbench_data/task_split.json, and records the baseline score as iteration 0.

Start the optimization loop

Point your coding agent at the repo and prompt:

Read PROGRAM.md and start the optimization loop.
The baseline is already recorded. Start from step 2 (analyze failures).

Known techniques that improve scores

The program_templates/terminal_bench.md file documents techniques the coding agent can apply to agent/agent.py:

Environment bootstrapping — gather OS info, installed tools, and file listing before starting (+5–10%)
Enforced TODO planning — make the model create and maintain a step-by-step plan (+10–20%, largest single gain)
Non-interactive mode — never ask clarifying questions, always act (+3–5%)
Double-confirmation — verify task completion before declaring done (+3–5%)
Forced reasoning in tool schema — add analysis and plan fields to the bash tool definition
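The last technique in the list can be illustrated with a tool definition in the OpenAI function-calling format. This is a sketch only: the actual schema in agent/templates/terminal_bench.py is not shown here, so the field descriptions and the name BASH_TOOL are assumptions.

```python
# Hypothetical bash tool definition with forced-reasoning fields.
# The real template's schema may differ; only the idea of adding
# "analysis" and "plan" fields comes from the technique list above.
BASH_TOOL = {
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Run a bash command in the task container.",
        "parameters": {
            "type": "object",
            "properties": {
                # Listed before "command" so the reasoning is written first.
                "analysis": {
                    "type": "string",
                    "description": "What the previous output shows and what is still unknown.",
                },
                "plan": {
                    "type": "string",
                    "description": "The next steps, and why this command comes first.",
                },
                "command": {
                    "type": "string",
                    "description": "The bash command to execute.",
                },
            },
            # Making all three required means a call without reasoning is invalid.
            "required": ["analysis", "plan", "command"],
        },
    },
}
```

Marking analysis and plan as required is the design choice that makes the reasoning "forced": the model cannot emit a command without also stating what it observed and what it intends to do next.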

To see all changes the coding agent has made relative to the starting template, run diff agent/templates/terminal_bench.py agent/agent.py.
