Get started with Terminal-Bench 2.0 and auto-harness

Terminal-Bench 2.0 is a suite of 89 real-world terminal tasks covering coding, sysadmin, and security scenarios. auto-harness wraps it in a continuous optimization loop: python prepare.py runs all 89 tasks, generates a stratified 70/30 train/test split, and records the baseline score. From there, your coding agent reads failure traces, edits agent/agent.py, gates every change, and iterates. This page walks through the complete setup from clone to first loop iteration.

Requirements

harbor CLI — runs benchmark tasks inside containers
OPENAI_API_KEY (or ANTHROPIC_API_KEY / GEMINI_API_KEY depending on your chosen model)
E2B_API_KEY or DAYTONA_API_KEY — sandbox environment provider (see note below)
A coding agent — Claude Code, Codex CLI, or any agent that can read files and run shell commands

If you use env_provider: "docker" in your config, no sandbox provider key is needed. Docker runs the environment locally instead of in a remote sandbox.

Setup

Clone the repository

git clone https://github.com/neosigmaai/auto-harness
cd auto-harness

Install the harbor CLI

uv tool install harbor

Verify the install by running harbor --version. If uv is not installed, follow the uv installation guide.

Set up environment variables

cp .env.example .env

Open .env and fill in your keys:

# LLM API keys — set whichever your agent_model needs
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=      # set if using a claude model
GEMINI_API_KEY=         # set if using a gemini model

# Terminal-Bench sandbox provider (set one)
E2B_API_KEY=e2b-...
DAYTONA_API_KEY=        # alternative to E2B
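
As a rough illustration of which keys a given setup needs, the sketch below derives the required variables from the agent_model and env_provider values in experiment_config.yaml. The prefix-to-key mapping and the function names are assumptions made for this example; they are not taken from prepare.py, which may validate keys differently.

```python
import os

# Hypothetical mapping from model-name prefix to the API key it needs.
# This is an assumption for illustration, not auto-harness's actual logic.
MODEL_KEY_BY_PREFIX = {
    "gpt": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}

# Sandbox provider -> key, as listed on this page; docker needs none.
PROVIDER_KEY = {
    "e2b": "E2B_API_KEY",
    "daytona": "DAYTONA_API_KEY",
    "docker": None,
}


def required_keys(agent_model: str, env_provider: str) -> list[str]:
    """Return the environment variable names this setup appears to need."""
    keys = []
    for prefix, key in MODEL_KEY_BY_PREFIX.items():
        if agent_model.lower().startswith(prefix):
            keys.append(key)
    provider_key = PROVIDER_KEY[env_provider]
    if provider_key is not None:
        keys.append(provider_key)
    return keys


def missing_keys(agent_model: str, env_provider: str, env=None) -> list[str]:
    """List required keys that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [k for k in required_keys(agent_model, env_provider) if not env.get(k)]
```

Calling missing_keys("gpt-5.4", "e2b") before running prepare.py would list any keys still left blank in your shell environment.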

Configure the experiment

cp experiment_config.yaml.template experiment_config.yaml

Open experiment_config.yaml, uncomment the Terminal-Bench section, and edit it to match your setup:

benchmark: "terminal-bench"
agent_model: "gpt-5.4"
split: "train"
gate_split: "test"
env_provider: "e2b"            # "e2b", "daytona", or "docker"
max_concurrency: 50            # tasks run in parallel
threshold: 0.8                 # regression suite pass rate threshold
reasoning_effort: "medium"     # optional
per_task_timeout: 1200         # seconds; tasks exceeding this score 0.0

Set max_concurrency to match your sandbox provider’s concurrency limit. E2B supports up to 50 parallel sandboxes on most plans. For docker, lower this to match your local CPU count.
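
One way to pick a starting value is sketched below. The helper name and the default cap of 50 (taken from the example config above) are assumptions; check your provider plan for its real limit.

```python
import os

# Cap taken from the example config on this page; real plan limits may differ.
CLOUD_DEFAULT_CAP = 50


def suggest_max_concurrency(env_provider, cpu_count=None, cloud_cap=CLOUD_DEFAULT_CAP):
    """Pick a conservative max_concurrency for the given provider."""
    if env_provider == "docker":
        # Local containers compete for local cores, so cap at the CPU count.
        cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
        return max(1, min(cpus, cloud_cap))
    # Cloud providers: use the plan's concurrency limit.
    return cloud_cap
```

On an 8-core machine with env_provider: "docker", this suggests 8 rather than 50.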

Initialize the workspace and run the baseline

python prepare.py

prepare.py does the following in order:

Validates all required environment variables and confirms harbor is on PATH
Creates workspace/ and initializes suite.json, learnings.md, results.tsv, and train_results.json
Copies agent/templates/terminal_bench.py into agent/agent.py as the starting point
Composes PROGRAM.md from program_templates/base.md + program_templates/terminal_bench.md
Runs all 89 tasks (no split exists yet), then generates a stratified 70/30 train/test split and writes it to tbench_data/task_split.json
Records the baseline score as iteration 0 in workspace/results.tsv
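
The stratified split in the steps above can be sketched as follows. This is a minimal illustration that assumes the strata are the baseline pass/fail outcomes; the source does not say what prepare.py actually stratifies on, and the function name and seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(outcomes, train_fraction=0.7, seed=0):
    """Split task IDs into train/test, keeping the pass/fail mix similar in both.

    outcomes maps task_id -> bool (passed at baseline). Stratifying on the
    baseline outcome is an assumption, not confirmed by the source.
    """
    strata = defaultdict(list)
    for task_id, passed in outcomes.items():
        strata[passed].append(task_id)

    rng = random.Random(seed)
    train, test = [], []
    for key in sorted(strata):  # fixed order keeps the split reproducible
        ids = sorted(strata[key])
        rng.shuffle(ids)
        cut = round(len(ids) * train_fraction)
        train.extend(ids[:cut])
        test.extend(ids[cut:])
    return sorted(train), sorted(test)
```

Splitting each stratum separately keeps the share of passing tasks roughly equal in the train and test sets, so a score measured on one is comparable to the other.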

The baseline run executes all 89 tasks, up to max_concurrency at a time, and can take 20–40 minutes depending on your sandbox provider and per_task_timeout. Tasks that time out are excluded from the train/test split and logged as warnings.

Once complete, you will see output like:

[prepare] task split created: 62 train, 27 test
[prepare] baseline val_score=0.4074 (11/27 passed) — recorded as iteration 0

[prepare] done. Ready to start the optimization loop.
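
In the sample output, val_score is the pass fraction on the 27 test tasks: 11 / 27 ≈ 0.4074. The sketch below parses such a line and checks that arithmetic. The regular expression is written against the sample line shown here only; the real log format may differ.

```python
import re

# Pattern matched against the sample line on this page (an assumption about
# the log format, not a documented contract).
BASELINE_RE = re.compile(
    r"val_score=(?P<score>\d+\.\d+) \((?P<passed>\d+)/(?P<total>\d+) passed\)"
)


def parse_baseline(line):
    """Extract (val_score, passed, total) from a baseline log line."""
    match = BASELINE_RE.search(line)
    if match is None:
        raise ValueError(f"no baseline score found in: {line!r}")
    return float(match["score"]), int(match["passed"]), int(match["total"])


line = "[prepare] baseline val_score=0.4074 (11/27 passed)"
score, passed, total = parse_baseline(line)
# The reported score is the pass fraction rounded to four decimals.
assert round(passed / total, 4) == score
```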

Start the optimization loop

Point your coding agent at the repository and use the following prompt:

Read PROGRAM.md and start the optimization loop.
The baseline is already recorded. Start from step 2 (analyze failures).

The agent will:

Read workspace/train_results.json to identify failing tasks
Read train-split traces from workspace/traces/ to diagnose root causes
Edit agent/agent.py with one focused improvement
Run python gating.py to gate the change
If the gate passes: commit, run python record.py, update workspace/learnings.md
If the gate fails: revert with git checkout agent/agent.py and try a different approach
Repeat
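
The gate-then-keep-or-revert part of the loop above could be scripted roughly as below. This is a sketch under one key assumption: that gating.py signals a passing gate with exit code 0, which the source does not confirm. The function names are hypothetical, and updating workspace/learnings.md is left to the agent.

```python
import subprocess


def run(cmd):
    """Run a command and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


def gate_and_record(message, runner=run):
    """Gate the current edit to agent/agent.py, then keep or revert it.

    Assumes gating.py signals a passing gate with exit code 0.
    """
    if runner(["python", "gating.py"]) == 0:
        # Gate passed: commit the change, then record the iteration.
        runner(["git", "add", "agent/agent.py"])
        runner(["git", "commit", "-m", message])
        runner(["python", "record.py"])
        return True
    # Gate failed: discard the uncommitted edit.
    runner(["git", "checkout", "agent/agent.py"])
    return False
```

Passing the runner as a parameter makes the control flow easy to dry-run with a stub before pointing it at real commands.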

Running individual tasks

To test a specific task interactively during development:

python benchmark.py --task-ids <task-id> <task-id>

This runs only those tasks and prints per-task pass/fail, without writing to train_results.json.
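
If you script these runs, the command can be assembled as an argument list, with each task ID passed as its own argument after --task-ids. The helper below is a hypothetical convenience, and the task IDs in the usage note are placeholders, not real Terminal-Bench task names.

```python
def benchmark_command(task_ids):
    """Build the argument list for running specific tasks via benchmark.py."""
    if not task_ids:
        raise ValueError("at least one task ID is required")
    # Each ID is a separate argument, matching the space-separated CLI form.
    return ["python", "benchmark.py", "--task-ids", *task_ids]
```

For example, benchmark_command(["task-a", "task-b"]) produces a list that can be handed to subprocess.run from the repository root.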

Sandbox provider options

Provider	Key required	Notes
`e2b`	`E2B_API_KEY`	Default. Cloud sandboxes; high concurrency supported
`daytona`	`DAYTONA_API_KEY`	Alternative cloud sandbox provider
`docker`	None	Runs containers locally; reduce `max_concurrency` accordingly

Tracking progress

After each successful gate pass, check:

workspace/results.tsv — iteration history with val_score per iteration
workspace/learnings.md — what the agent tried, what worked, and what it needs from you
workspace/suite.json — the growing set of tasks the agent must always pass
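
To summarize the iteration history programmatically, something like the sketch below may help. It assumes results.tsv is tab-separated with a header row containing iteration and val_score columns; the actual column names are not given on this page, so adjust them to match your file. The sample rows are made up for illustration and are not real results.

```python
import csv
import io


def best_iteration(tsv_text):
    """Return (iteration, val_score) for the highest-scoring row.

    On ties, the earliest iteration with that score is returned.
    """
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    scored = [(int(r["iteration"]), float(r["val_score"])) for r in rows]
    return max(scored, key=lambda pair: pair[1])


# Made-up sample data for illustration only; not real results.
sample = "iteration\tval_score\n0\t0.4074\n1\t0.4444\n2\t0.4444\n"
best = best_iteration(sample)
```

With a real file, pass the contents of workspace/results.tsv read as text instead of the sample string.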

To see everything the agent has changed relative to the starting template (cumulative across all iterations, not only the last one):

diff agent/templates/terminal_bench.py agent/agent.py
