auto-harness is an open-source framework by Neosigma that turns any coding agent into a self-improving system. You give it a benchmark and an agent file. It runs a tight optimization loop: benchmark your agent, analyze failures from traces, improve agent/agent.py, gate the change against a self-maintained eval suite, commit only what passes, and repeat. On tau-bench, this loop raised the agent score from 0.56 to 0.78, a relative improvement of roughly 40%, through automated failure mining and harness optimization, with no manual curation of the eval suite.

The optimization loop

Every iteration follows the same cycle, defined in PROGRAM.md and executed autonomously by your coding agent:
run benchmark → analyze failures → improve agent/agent.py → gate → record → update learnings → repeat
Step               What happens
Run benchmark      python benchmark.py runs all train-split tasks and saves per-task pass/fail to workspace/train_results.json
Analyze failures   The coding agent reads train traces, diagnoses root causes, appends patterns to workspace/learnings.md
Improve agent      The coding agent edits agent/agent.py — one focused change per iteration
Gate               python gating.py runs three sequential checks: regression suite, full test score, suite promotion
Record             On gate pass, the agent commits and runs python record.py to append the result to workspace/results.tsv
Repeat             The loop continues until score plateaus for 5 consecutive iterations

The coding agent only ever edits agent/agent.py and appends to workspace/learnings.md. The gating step enforces this with a git diff file guard — modifying any infrastructure file fails the gate immediately.
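
The file guard described above can be sketched as an allow-list check over the paths reported by git. This is a minimal illustration, not the code in gating.py: the names ALLOWED_PATHS, find_guard_violations, and changed_files_from_git are hypothetical, and comparing against HEAD is an assumption about which diff the real guard inspects.

```python
import subprocess
from typing import Iterable, List

# Assumption: the only two paths the coding agent may touch, per the text above.
ALLOWED_PATHS = {"agent/agent.py", "workspace/learnings.md"}


def find_guard_violations(changed_paths: Iterable[str]) -> List[str]:
    """Return the changed paths that fall outside the allow-list, sorted."""
    normalized = {
        path.strip().replace("\\", "/")
        for path in changed_paths
        if path.strip()
    }
    return sorted(normalized - ALLOWED_PATHS)


def changed_files_from_git(base: str = "HEAD") -> List[str]:
    """List files that differ from `base`, using `git diff --name-only`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

In a gate script, a non-empty result from find_guard_violations(changed_files_from_git()) would fail the gate before any benchmark run starts.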

Key design principles

Program the loop, not the agent directly. You steer behavior through PROGRAM.md. The coding agent edits agent/agent.py. This separation lets you adjust the optimization strategy without touching the agent under test.

Benchmark-agnostic loop. The same gating, recording, and workspace format works for any benchmark that returns per-task rewards between 0.0 and 1.0. To add a new benchmark, subclass BenchmarkRunner and register it in two places.

Self-maintained evals. The coding agent decides which tasks belong in the regression suite (workspace/suite.json) — no manual curation needed. After each successful gate, newly-passing tasks are automatically promoted into the suite.

Gate everything. No change is committed without passing both the eval suite regression check and the full test score gate. The test score must be ≥ the best score seen so far.

Structural anti-cheating. Test traces are never saved to disk. The coding agent can only read train-split traces, so every improvement must generalize.

Learnings close the feedback loop. After each iteration the agent writes workspace/learnings.md: what it tried, what worked, what it needs from the human. This log is the agent’s memory across sessions.
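
The three sequential gate checks (regression suite, full test score, suite promotion) can be sketched as one decision function. This is an illustration under stated assumptions, not the logic in gating.py: GateResult and evaluate_gate are hypothetical names, the suite threshold is assumed to be the minimum fraction of suite tasks that must pass, and a task is assumed to pass when its reward reaches 1.0.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class GateResult:
    passed: bool
    reason: str
    promoted: List[str]  # task IDs to add to the suite on success


def evaluate_gate(
    suite_task_ids: Iterable[str],
    threshold: float,
    rewards: Dict[str, float],
    test_score: float,
    best_score: float,
    pass_reward: float = 1.0,  # assumption: reward needed for a task to count as passing
) -> GateResult:
    suite = set(suite_task_ids)
    passing = {task for task, reward in rewards.items() if reward >= pass_reward}

    # Check 1: the regression suite must stay at or above its threshold.
    if suite:
        pass_rate = len(suite & passing) / len(suite)
        if pass_rate < threshold:
            return GateResult(False, "regression suite below threshold", [])

    # Check 2: the full test score must be >= the best score seen so far.
    if test_score < best_score:
        return GateResult(False, "test score below best", [])

    # Check 3: promote newly passing tasks that are not yet in the suite.
    return GateResult(True, "ok", sorted(passing - suite))
```

Checking in this order means a regression in the suite rejects the change before the test score is compared, and nothing is promoted unless both earlier checks pass.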

Supported benchmarks

Benchmark            Domain                                                    Tasks                                    Agent interface
Terminal-Bench 2.0   Real-world terminal tasks (coding, sysadmin, security)    89                                       Bash commands via Harbor containers
tau-bench            Customer service (retail, airline, telecom)               retail: 114, airline: 50, telecom: 114   Structured tool calls via tau2
BIRD-Interact        Interactive text-to-SQL (multi-turn CRUD over Postgres)   lite: 300, full: 600                     Google ADK agent against a 3-service environment

Get started

Terminal-Bench 2.0 quickstart

Set up the optimization loop on 89 real-world terminal tasks. Requires the harbor CLI and an E2B or Daytona API key.

tau-bench quickstart

Run the loop on customer service tasks across retail, airline, and telecom domains. Requires Docker.

BIRD-Interact quickstart

Run the loop on interactive text-to-SQL tasks over a live Postgres database. Requires Docker and Python 3.12+.

The full blog post

Read how the 0.56 → 0.78 result was achieved on tau-bench using this exact loop.

Project structure

agent/
  agent.py                  the agent under optimization — only file the coding agent edits
  templates/                read-only starting points for each benchmark
  helpers/
    bird_interact/          FastAPI wrapper and ADK runtime adapter for BIRD-Interact
benchmark.py                benchmark execution layer (abstract + tau-bench + terminal-bench + bird-interact)
gating.py                   three-step gate (regression suite → full test → suite promotion)
prepare.py                  workspace setup, template copying, baseline run
record.py                   appends iteration result to results.tsv
PROGRAM.md                  loop instructions for the coding agent (generated by prepare.py)
program_templates/          benchmark-specific PROGRAM.md templates
experiment_config.yaml      your experiment configuration (copy from .template)
workspace/
  suite.json                regression eval suite (task IDs + threshold)
  learnings.md              per-run log: patterns, what worked, requests to human
  results.tsv               iteration history (val_score, commit, evals, timestamp)
  traces/                   agent conversation traces for failure analysis
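
The iteration history in workspace/results.tsv is the natural input for the stopping rule (a score plateau over 5 consecutive iterations). The sketch below is illustrative only: load_scores and has_plateaued are hypothetical helpers, the file is assumed to be tab-separated with a header row naming the val_score column, and "plateau" is assumed to mean that none of the last 5 scores exceeds the best score recorded before them.

```python
import csv
import io
from typing import List


def load_scores(tsv_text: str) -> List[float]:
    """Read the val_score column from TSV text (assumes a header row)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [float(row["val_score"]) for row in reader]


def has_plateaued(scores: List[float], patience: int = 5) -> bool:
    """True if the last `patience` scores never beat the earlier best."""
    if len(scores) <= patience:
        return False  # not enough history to judge a plateau
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) <= best_before


# Made-up rows for illustration, not real results.
example = (
    "val_score\tcommit\tevals\ttimestamp\n"
    "0.56\tabc123\t10\t2025-01-01T00:00:00\n"
    "0.60\tdef456\t12\t2025-01-02T00:00:00\n"
)
example_scores = load_scores(example)
```

With only two recorded iterations, has_plateaued(example_scores) is False because there is less history than the patience window.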
