Terminal-Bench 2.0 is a benchmark of 89 real-world terminal tasks that require an agent to solve practical problems by executing bash commands in an isolated Linux container. Tasks span three categories (software development and build tooling, system administration, and security challenges), making it a rigorous test of whether an agent can operate effectively as an autonomous shell user. auto-harness runs it via the harbor CLI and the TerminalBenchRunner class, which handles task selection, environment setup, result parsing, and trace management.
Agent interface
The agent receives a task description and interacts exclusively through a single bash tool that executes commands in a Harbor-managed container. There are no structured tool schemas beyond this one call, so the agent must plan and verify its work through shell output alone.
The starting template at agent/templates/terminal_bench.py defines the initial system prompt, the bash tool schema, and the HarnessAgent.run() loop. The optimization loop edits agent/agent.py (copied from that template by prepare.py) to improve performance.
Environment providers
Harbor supports three sandbox providers. Set env_provider in experiment_config.yaml:
| Provider | Description | Required credential |
|---|---|---|
| e2b | Hosted cloud sandboxes via E2B | E2B_API_KEY |
| daytona | Hosted sandboxes via Daytona | DAYTONA_API_KEY |
| docker | Local Docker containers | None |
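The provider table can be turned into a small pre-flight check. The following is a minimal sketch, not part of auto-harness: the helper name missing_credential and the REQUIRED_CREDENTIAL mapping are hypothetical, and only the provider names and credential names come from the table above.

```python
import os

# Credential each provider needs, copied from the table above.
# None means the provider needs no credential.
REQUIRED_CREDENTIAL = {
    "e2b": "E2B_API_KEY",
    "daytona": "DAYTONA_API_KEY",
    "docker": None,
}


def missing_credential(env_provider, environ=None):
    """Return the name of the unset credential for env_provider, or None.

    Hypothetical helper for illustration; auto-harness may validate
    credentials differently or not at all.
    """
    environ = os.environ if environ is None else environ
    if env_provider not in REQUIRED_CREDENTIAL:
        raise ValueError(f"unknown env_provider: {env_provider!r}")
    key = REQUIRED_CREDENTIAL[env_provider]
    # An empty string is treated the same as an unset variable.
    if key is not None and not environ.get(key):
        return key
    return None
```

Calling missing_credential("e2b") before a run gives an early, readable failure instead of an error from inside the sandbox provider.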
TerminalBenchRunner
TerminalBenchRunner in benchmark.py is the concrete BenchmarkRunner subclass for Terminal-Bench 2.0. It invokes harbor run as a subprocess, waits for results, and parses per-task result.json files.
Constructor
Split file
The train/test split is stored at tbench_data/task_split.json (the SPLIT_FILE class constant). This file is created by prepare.py during the baseline run and is never overwritten by subsequent runs.
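The write-once behaviour can be sketched as a load-or-create function. Everything here except the file path is an assumption made for illustration: the JSON layout with "train" and "test" keys, the train_fraction and seed parameters, and the function name are hypothetical and may not match what prepare.py actually writes.

```python
import json
import random
from pathlib import Path

SPLIT_FILE = Path("tbench_data/task_split.json")


def load_or_create_split(task_ids, split_file=SPLIT_FILE, train_fraction=0.5, seed=0):
    """Return the stored split, creating it only if the file does not exist yet."""
    split_file = Path(split_file)
    if split_file.exists():
        # Never overwrite: later runs must see the same train/test partition.
        return json.loads(split_file.read_text())

    # Sort before shuffling so the result depends only on the seed.
    ids = sorted(task_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)
    # Hypothetical layout: two lists of task ID strings.
    split = {"train": ids[:cut], "test": ids[cut:]}

    split_file.parent.mkdir(parents=True, exist_ok=True)
    split_file.write_text(json.dumps(split, indent=2))
    return split
```

Keeping the split fixed after the baseline run means scores from different iterations are measured on the same tasks and stay comparable.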
Running specific tasks
Pass a list of task ID strings to run() to execute a subset:
Result schema
Harbor writes a result.json file for each completed task. TerminalBenchRunner expects this exact schema:
If verifier_result is absent (the verifier did not run, usually because of an infrastructure error), the runner records None for that task, which counts as 0.0 in val_score.
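The handling of a missing verifier_result can be sketched as follows. This is an illustration under stated assumptions, not the runner's code: the "reward" field inside verifier_result is a hypothetical name, and treating val_score as the mean over tasks is an assumption. Only the rule that a missing verifier_result becomes None and counts as 0.0 comes from the text above.

```python
def task_reward(result):
    """Return the reward from one parsed result.json, or None if the verifier did not run."""
    verifier = result.get("verifier_result")
    if verifier is None:
        return None
    # "reward" is a hypothetical field name used only for this sketch.
    return float(verifier["reward"])


def val_score(rewards):
    """Average reward over tasks, counting None as 0.0.

    rewards maps task ID to a float or None. Using the mean is an
    assumption; the real aggregation may differ.
    """
    if not rewards:
        return 0.0
    total = sum(0.0 if r is None else r for r in rewards.values())
    return total / len(rewards)
```

Counting None as 0.0 rather than dropping the task keeps the denominator fixed, so infrastructure failures lower the score instead of silently shrinking the evaluation set.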
Trace management
After each train-split run, the runner copies traces from the Harbor output directory into the workspace:
| Directory | Contents | Overwritten? |
|---|---|---|
| workspace/traces/latest/ | Most recent run per task | Yes, every run |
| workspace/traces/baseline/ | First-run traces | No, written once |
Read traces from workspace/traces/latest/. The raw Harbor job output in workspace/tbench_jobs/ contains both train and test data and must not be read directly.
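The overwrite rules for the two trace directories can be sketched as below. The function name sync_traces and the assumption that the source directory holds one subdirectory per task are hypothetical; the real runner also restricts the copy to train-split tasks, which this sketch leaves to the caller.

```python
import shutil
from pathlib import Path


def sync_traces(run_dir, workspace="workspace"):
    """Copy per-task traces into latest/ (every run) and baseline/ (first run only).

    Assumes run_dir contains one subdirectory per task, already filtered
    to train-split tasks by the caller.
    """
    run_dir = Path(run_dir)
    traces = Path(workspace) / "traces"
    latest = traces / "latest"
    baseline = traces / "baseline"
    first_run = not baseline.exists()

    for task_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        # latest/ always holds the most recent trace for each task.
        target = latest / task_dir.name
        if target.exists():
            shutil.rmtree(target)
        shutil.copytree(task_dir, target)

        # baseline/ is filled on the first run and never touched again.
        if first_run:
            shutil.copytree(task_dir, baseline / task_dir.name)
```

Replacing latest/ per task rather than wholesale means a run over a subset of tasks leaves the traces of the other tasks in place.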
Configuration
Uncomment and edit the terminal-bench block in experiment_config.yaml:
Required credentials:
- OPENAI_API_KEY (or ANTHROPIC_API_KEY for Claude models)
- E2B_API_KEY (if using the e2b provider) or DAYTONA_API_KEY (if using daytona)
Quick start
Run prepare.py
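This copies the starting template to agent/agent.py, runs the baseline, writes the train/test split to tbench_data/task_split.json, and records the baseline score as iteration 0.
Known techniques that improve scores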
The program_templates/terminal_bench.md file documents techniques the coding agent can apply to agent/agent.py:
- Environment bootstrapping: gather OS info, installed tools, and file listing before starting (+5–10%)
- Enforced TODO planning: make the model create and maintain a step-by-step plan (+10–20%, largest single gain)
- Non-interactive mode: never ask clarifying questions, always act (+3–5%)
- Double-confirmation: verify task completion before declaring done (+3–5%)
- Forced reasoning in tool schema: add analysis and plan fields to the bash tool definition
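The last technique, forced reasoning in the tool schema, might look like the sketch below. It uses the OpenAI-style function tool format as an assumption; the actual bash tool definition in agent/templates/terminal_bench.py may be shaped differently, and the command field name and all descriptions are hypothetical. Only the analysis and plan field names come from the list above.

```python
# Illustrative only: a bash tool definition in OpenAI-style function format.
# The reasoning fields come first so the model writes them before the command.
BASH_TOOL = {
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Run a shell command in the task container.",
        "parameters": {
            "type": "object",
            "properties": {
                "analysis": {
                    "type": "string",
                    "description": "What the previous output shows and what is still unknown.",
                },
                "plan": {
                    "type": "string",
                    "description": "The next step and why this command advances it.",
                },
                # "command" is a hypothetical name for the field holding the shell command.
                "command": {
                    "type": "string",
                    "description": "The exact command to execute.",
                },
            },
            # Marking all three as required forces reasoning on every call.
            "required": ["analysis", "plan", "command"],
        },
    },
}
```

Because the agent has only this one tool, making analysis and plan required means every action is preceded by an explicit statement of what was observed and what comes next.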
To see all changes the coding agent has made relative to the starting template, run diff agent/templates/terminal_bench.py agent/agent.py.