Terminal-Bench 2.0 is a suite of 89 real-world terminal tasks covering coding, sysadmin, and security scenarios. auto-harness wraps it in a continuous optimization loop: python prepare.py runs all 89 tasks, generates a stratified 70/30 train/test split, and records the baseline score. From there, your coding agent reads failure traces, edits agent/agent.py, gates every change, and iterates. This page walks through the complete setup from clone to first loop iteration.

Requirements

  • harbor CLI — runs benchmark tasks inside containers
  • OPENAI_API_KEY (or ANTHROPIC_API_KEY / GEMINI_API_KEY depending on your chosen model)
  • E2B_API_KEY or DAYTONA_API_KEY — sandbox environment provider (see note below)
  • A coding agent — Claude Code, Codex CLI, or any agent that can read files and run shell commands
Note: If you use env_provider: "docker" in your config, no sandbox provider key is needed. Docker runs the environment locally instead of in a remote sandbox.

Setup

1. Clone the repository

git clone https://github.com/neosigmaai/auto-harness
cd auto-harness
2. Install the harbor CLI

uv tool install harbor
Verify the install by running harbor --version. If uv is not installed, follow the uv installation guide.
3. Set up environment variables

cp .env.example .env
Open .env and fill in your keys:
# LLM API keys — set whichever your agent_model needs
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=      # set if using a claude model
GEMINI_API_KEY=         # set if using a gemini model

# Terminal-Bench sandbox provider (set one)
E2B_API_KEY=e2b-...
DAYTONA_API_KEY=        # alternative to E2B
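
Which keys you need depends on agent_model and env_provider. The sketch below is one way to check this before a run. The prefix-to-key mapping and the helper names are assumptions inferred from the comments in .env, not the checks prepare.py actually performs, and the sketch reads variables that are already exported rather than parsing .env itself.

```python
import os

# Assumed mapping from model-name prefix to API key variable; the prefixes
# are a guess based on the comments in .env, not the project's actual logic.
MODEL_KEY_BY_PREFIX = {
    "gpt": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}

# Sandbox key per env_provider; "docker" runs locally and needs none.
SANDBOX_KEY_BY_PROVIDER = {
    "e2b": "E2B_API_KEY",
    "daytona": "DAYTONA_API_KEY",
    "docker": None,
}


def required_keys(agent_model: str, env_provider: str) -> list[str]:
    """Return the environment variable names this configuration needs."""
    keys = [
        key
        for prefix, key in MODEL_KEY_BY_PREFIX.items()
        if agent_model.lower().startswith(prefix)
    ]
    sandbox_key = SANDBOX_KEY_BY_PROVIDER[env_provider]
    if sandbox_key is not None:
        keys.append(sandbox_key)
    return keys


def missing_keys(agent_model: str, env_provider: str, env=None) -> list[str]:
    """Return the required variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [k for k in required_keys(agent_model, env_provider) if not env.get(k)]
```

For example, missing_keys("gpt-5.4", "e2b") returns an empty list once both OPENAI_API_KEY and E2B_API_KEY are set in the environment.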
4. Configure the experiment

cp experiment_config.yaml.template experiment_config.yaml
Open experiment_config.yaml and uncomment the Terminal-Bench section, then edit to match your setup:
benchmark: "terminal-bench"
agent_model: "gpt-5.4"
split: "train"
gate_split: "test"
env_provider: "e2b"            # "e2b", "daytona", or "docker"
max_concurrency: 50            # tasks run in parallel
threshold: 0.8                 # regression suite pass rate threshold
reasoning_effort: "medium"     # optional
per_task_timeout: 1200         # seconds; tasks exceeding this score 0.0
Set max_concurrency to match your sandbox provider’s concurrency limit. E2B supports up to 50 parallel sandboxes on most plans. For docker, lower this to match your local CPU count.
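
As a rough illustration of that advice, the sketch below picks a max_concurrency value from the provider name. The function name and the provider_limit default are hypothetical; use the limit that applies to your own plan.

```python
import os


def suggested_max_concurrency(env_provider: str, provider_limit: int = 50) -> int:
    """Pick a max_concurrency value for the given env_provider.

    provider_limit is an assumed cap for remote sandboxes; check your plan.
    """
    if env_provider == "docker":
        # Local containers compete for CPU, so cap at the core count.
        return max(1, os.cpu_count() or 1)
    return provider_limit
```

With env_provider: "docker" this returns the local core count. For e2b or daytona it returns whatever limit you pass in.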
5. Initialize the workspace and run the baseline

python prepare.py
prepare.py does the following in order:
  1. Validates all required environment variables and confirms harbor is on PATH
  2. Creates workspace/ and initializes suite.json, learnings.md, results.tsv, and train_results.json
  3. Copies agent/templates/terminal_bench.py into agent/agent.py as the starting point
  4. Composes PROGRAM.md from program_templates/base.md + program_templates/terminal_bench.md
  5. Runs all 89 tasks (no split yet), generates a stratified 70/30 train/test split at tbench_data/task_split.json
  6. Records the baseline score as iteration 0 in workspace/results.tsv
The baseline run executes all 89 tasks, up to max_concurrency at a time, and can take 20–40 minutes depending on your sandbox provider and per_task_timeout. Tasks that time out are excluded from the train/test split and logged as warnings.
Once complete, you will see output like:
[prepare] task split created: 62 train, 27 test
[prepare] baseline val_score=0.4074 (11/27 passed) — recorded as iteration 0

[prepare] done. Ready to start the optimization loop.
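
To make the idea of a stratified 70/30 split concrete, here is a minimal sketch. It assumes the strata are the pass/fail outcomes of the baseline run, which this page does not confirm, and the function name is hypothetical. The logic in prepare.py may differ.

```python
import random
from collections import defaultdict


def stratified_split(outcomes: dict[str, bool], train_frac: float = 0.7, seed: int = 0):
    """Split task ids into train/test, keeping the pass rate similar in both.

    outcomes maps task id -> whether the task passed the baseline run.
    Stratifying on pass/fail is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for task_id, passed in sorted(outcomes.items()):
        strata[passed].append(task_id)

    train, test = [], []
    for passed in sorted(strata):
        ids = strata[passed]
        rng.shuffle(ids)
        # Cut each stratum separately so both splits see a similar mix.
        cut = round(len(ids) * train_frac)
        train.extend(ids[:cut])
        test.extend(ids[cut:])
    return sorted(train), sorted(test)
```

The val_score in the sample output is the pass fraction on the test split: 11 of 27 tasks passed, and 11/27 ≈ 0.4074.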
6. Start the optimization loop

Point your coding agent at the repository and use the following prompt:
Read PROGRAM.md and start the optimization loop.
The baseline is already recorded. Start from step 2 (analyze failures).
The agent will:
  1. Read workspace/train_results.json to identify failing tasks
  2. Read train-split traces from workspace/traces/ to diagnose root causes
  3. Edit agent/agent.py with one focused improvement
  4. Run python gating.py to gate the change
  5. If the gate passes: commit, run python record.py, update workspace/learnings.md
  6. If the gate fails: revert with git checkout agent/agent.py and try a different approach
  7. Repeat
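
Steps 4 to 6 above amount to a keep-or-revert decision. The sketch below shows that control flow, assuming gating.py exits with status 0 when the gate passes. The exit-code convention, the commit message, and the helper names are assumptions rather than documented behavior, and updating workspace/learnings.md is left to the agent.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def gate_and_record(runner: Runner = run) -> bool:
    """Gate the current edit to agent/agent.py, then keep or revert it.

    Assumes gating.py exits with status 0 when the gate passes.
    """
    if runner(["python", "gating.py"]) == 0:
        # Gate passed: commit the change and record the iteration.
        runner(["git", "commit", "-am", "agent: gated improvement"])
        runner(["python", "record.py"])
        return True
    # Gate failed: discard the edit so the next attempt starts clean.
    runner(["git", "checkout", "agent/agent.py"])
    return False
```

Passing the runner as a parameter keeps the decision logic testable without invoking git or the benchmark.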

Running individual tasks

To test a specific task interactively during development:
python benchmark.py --task-ids <task-id> <task-id>
This runs only those tasks and prints per-task pass/fail, without writing to train_results.json.

Sandbox provider options

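| Provider | Key required    | Notes                                                       |
|----------|-----------------|-------------------------------------------------------------|
| e2b      | E2B_API_KEY     | Default. Cloud sandboxes; high concurrency supported        |
| daytona  | DAYTONA_API_KEY | Alternative cloud sandbox provider                          |
| docker   | None            | Runs containers locally; reduce max_concurrency accordingly |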

Tracking progress

After each successful gate pass, check:
  • workspace/results.tsv — iteration history with val_score per iteration
  • workspace/learnings.md — what the agent tried, what worked, and what it needs from you
  • workspace/suite.json — the growing set of tasks the agent must always pass
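
If you want to inspect the iteration history programmatically, something like the following may help. It assumes results.tsv has a header row with columns named iteration and val_score. Those column names are a guess, so adjust them to the actual file.

```python
import csv
import io


def best_iteration(tsv_text: str) -> tuple[int, float]:
    """Return (iteration, val_score) for the highest-scoring row.

    Assumes a header row with 'iteration' and 'val_score' columns.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(int(r["iteration"]), float(r["val_score"])) for r in reader]
    return max(rows, key=lambda row: row[1])
```

To use it on the real file, read the file contents first, for example with pathlib.Path("workspace/results.tsv").read_text().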
To see everything the agent has changed relative to the starting template:
diff agent/templates/terminal_bench.py agent/agent.py
