Terminal-Bench 2.0 is a suite of 89 real-world terminal tasks covering coding, sysadmin, and security scenarios. auto-harness wraps it in a continuous optimization loop.
`python prepare.py` runs all 89 tasks, generates a stratified 70/30 train/test split, and records the baseline score. From there, your coding agent reads failure traces, edits `agent/agent.py`, gates every change, and iterates. This page walks through the complete setup from clone to first loop iteration.
Requirements
- `harbor` CLI: runs benchmark tasks inside containers
- `OPENAI_API_KEY` (or `ANTHROPIC_API_KEY` / `GEMINI_API_KEY`, depending on your chosen model)
- `E2B_API_KEY` or `DAYTONA_API_KEY`: sandbox environment provider (see note below)
- A coding agent: Claude Code, Codex CLI, or any agent that can read files and run shell commands
If you use `env_provider: "docker"` in your config, no sandbox provider key is needed. Docker runs the environment locally instead of in a remote sandbox.

Setup
Install the harbor CLI
Verify the installation with `harbor --version`. If `uv` is not installed, follow the uv installation guide.

Configure the experiment
Open `experiment_config.yaml` and uncomment the Terminal-Bench section, then edit it to match your setup.

Initialize the workspace and run the baseline
`prepare.py` does the following in order:
- Validates all required environment variables and confirms `harbor` is on `PATH`
- Creates `workspace/` and initializes `suite.json`, `learnings.md`, `results.tsv`, and `train_results.json`
- Copies `agent/templates/terminal_bench.py` into `agent/agent.py` as the starting point
- Composes `PROGRAM.md` from `program_templates/base.md` + `program_templates/terminal_bench.md`
- Runs all 89 tasks (no split yet), then generates a stratified 70/30 train/test split at `tbench_data/task_split.json`
- Records the baseline score as iteration 0 in `workspace/results.tsv`
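The stratified 70/30 split from the steps above could be produced along the following lines. This is a minimal sketch, not the code in `prepare.py`: it assumes tasks are stratified by their baseline outcome (pass or fail), and the function name `stratified_split`, the `results` mapping, the fixed seed, and the sample data are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(results, train_frac=0.7, seed=0):
    """Split task names into train/test while preserving the mix of strata.

    `results` maps task name -> stratum label. The label is ASSUMED here to
    be the baseline outcome ("pass" or "fail"); the real criterion may differ.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_stratum = defaultdict(list)
    for task, label in sorted(results.items()):
        by_stratum[label].append(task)

    train, test = [], []
    for label in sorted(by_stratum):
        tasks = by_stratum[label]
        rng.shuffle(tasks)
        cut = round(len(tasks) * train_frac)  # per-stratum 70% cut
        train.extend(tasks[:cut])
        test.extend(tasks[cut:])
    return {"train": sorted(train), "test": sorted(test)}


# Hypothetical baseline: 89 tasks, of which 30 pass and 59 fail.
baseline = {f"task-{i:02d}": ("pass" if i < 30 else "fail") for i in range(89)}
split = stratified_split(baseline)
print(len(split["train"]), len(split["test"]))
```

Cutting each stratum separately keeps the proportion of passing and failing tasks roughly equal in both halves, so the test score is not skewed by an unusually easy or hard test set.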
Start the optimization loop
Point your coding agent at the repository and use the following prompt:

The agent will:
- Read `workspace/train_results.json` to identify failing tasks
- Read train-split traces from `workspace/traces/` to diagnose root causes
- Edit `agent/agent.py` with one focused improvement
- Run `python gating.py` to gate the change
- If the gate passes: commit, run `python record.py`, update `workspace/learnings.md`
- If the gate fails: revert with `git checkout agent/agent.py` and try a different approach
- Repeat
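The gate-then-keep-or-revert part of the loop above can be sketched as a small function. This is an illustration under assumptions, not part of auto-harness: it assumes `python gating.py` signals a pass with exit code 0, and the names `run_command` and `gate_once`, the injected `run` parameter, and the commit message are hypothetical. Updating `workspace/learnings.md` is left to the agent.

```python
import subprocess


def run_command(cmd):
    """Run a command and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


def gate_once(message, run=run_command):
    """Gate the current edit to agent/agent.py, then keep it or revert it.

    ASSUMPTION: `python gating.py` exits with 0 when the gate passes.
    `run` is injectable so the flow can be exercised without the repo.
    """
    if run(["python", "gating.py"]) == 0:
        # Gate passed: commit only the agent file, then record the iteration.
        run(["git", "commit", "-m", message, "--", "agent/agent.py"])
        run(["python", "record.py"])
        return True
    # Gate failed: discard the edit so the next attempt starts clean.
    run(["git", "checkout", "agent/agent.py"])
    return False


# Dry run with a fake runner that makes the gate fail.
calls = []


def fake_fail(cmd):
    calls.append(cmd)
    return 1 if cmd[-1] == "gating.py" else 0


print(gate_once("try: retry on timeout", run=fake_fail), calls)
```

Passing the runner in as a parameter makes it easy to check the control flow before pointing the function at a real checkout.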
Running individual tasks
To test a specific task interactively during development:

… `train_results.json`.
Sandbox provider options
| Provider | Key required | Notes |
|---|---|---|
| `e2b` | `E2B_API_KEY` | Default. Cloud sandboxes; high concurrency supported |
| `daytona` | `DAYTONA_API_KEY` | Alternative cloud sandbox provider |
| `docker` | None | Runs containers locally; reduce `max_concurrency` accordingly |
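The provider table above translates directly into a preflight check. The sketch below is illustrative and is not the validation that `prepare.py` performs: `PROVIDER_KEYS` and `missing_requirements` are hypothetical names, and the model API key is deliberately not checked because it depends on the chosen model.

```python
import os
import shutil

# Key required by each env_provider value, taken from the table above.
PROVIDER_KEYS = {
    "e2b": "E2B_API_KEY",
    "daytona": "DAYTONA_API_KEY",
    "docker": None,  # local containers need no sandbox key
}


def missing_requirements(env_provider, env=None, which=shutil.which):
    """Return a list of problems; an empty list means the check passed."""
    env = os.environ if env is None else env
    if env_provider not in PROVIDER_KEYS:
        return [f"unknown env_provider: {env_provider!r}"]

    problems = []
    key = PROVIDER_KEYS[env_provider]
    if key and not env.get(key):
        problems.append(f"{key} is not set")
    if which("harbor") is None:
        problems.append("harbor is not on PATH")
    return problems


# Example with an empty environment and a stubbed PATH lookup.
print(missing_requirements("docker", env={}, which=lambda name: "/usr/bin/harbor"))
```

In normal use the function would be called with only the provider name, for example `missing_requirements("e2b")`, so that it reads the real environment and `PATH`.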
Tracking progress
After each successful gate pass, check:
- `workspace/results.tsv`: iteration history with `val_score` per iteration
- `workspace/learnings.md`: what the agent tried, what worked, and what it needs from you
- `workspace/suite.json`: the growing set of tasks the agent must always pass
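For a quick look at the score history, the results file can be read with the standard `csv` module. This is a sketch under assumptions: it presumes a tab-separated file with a header row containing `iteration` and `val_score` columns (the `iteration` column name is a guess), the helper name `val_scores` is hypothetical, and the sample rows are made up for illustration.

```python
import csv
import io


def val_scores(tsv_file):
    """Return (iteration, val_score) pairs from a results.tsv-style file.

    ASSUMPTION: the header row has `iteration` and `val_score` columns.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [(int(row["iteration"]), float(row["val_score"])) for row in reader]


# Made-up rows for illustration only; real runs read workspace/results.tsv.
sample = io.StringIO("iteration\tval_score\n0\t0.25\n1\t0.5\n")
for iteration, score in val_scores(sample):
    print(f"iteration {iteration}: val_score={score:.2f}")
```

Against a real workspace, the same helper would be called as `val_scores(open("workspace/results.tsv", newline=""))`, after adjusting the column names to match the actual header.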