auto-harness is an open-source framework by Neosigma that turns any coding agent into a self-improving system. You give it a benchmark and an agent file. It runs a tight optimization loop: benchmark your agent, analyze failures from traces, improve agent/agent.py, gate the change against a self-maintained eval suite, commit only what passes, and repeat. On tau-bench, this loop pushed the agent's score from 0.56 to 0.78 (a relative gain of roughly 40%) through automated failure mining and harness optimization, with no manual curation of the eval suite.
The optimization loop
Every iteration follows the same cycle, defined in PROGRAM.md and executed autonomously by your coding agent:
| Step | What happens |
|---|---|
| Run benchmark | python benchmark.py runs all train-split tasks and saves per-task pass/fail to workspace/train_results.json |
| Analyze failures | The coding agent reads train traces, diagnoses root causes, appends patterns to workspace/learnings.md |
| Improve agent | The coding agent edits agent/agent.py — one focused change per iteration |
| Gate | python gating.py runs three sequential checks: regression suite, full test score, suite promotion |
| Record | On gate pass, the agent commits and runs python record.py to append the result to workspace/results.tsv |
| Repeat | The loop continues until score plateaus for 5 consecutive iterations |
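The stopping rule in the last row can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `has_plateaued` and the reading of "plateau" as "no new best score within the last 5 iterations" are assumptions, and the score history is made up for the example.

```python
def has_plateaued(scores: list[float], patience: int = 5) -> bool:
    """Return True when the last `patience` scores failed to beat the earlier best.

    Assumption: "plateau" means no iteration in the trailing window improved
    on the best score recorded before that window.
    """
    if len(scores) <= patience:
        return False  # not enough history to judge a plateau
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) <= best_before


# Illustrative history only; these are not measured results.
history = [0.50, 0.60, 0.70, 0.70, 0.69, 0.70, 0.68, 0.70]
print(has_plateaued(history))  # True
```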
The coding agent only ever edits agent/agent.py and appends to workspace/learnings.md. The gating step enforces this with a git diff file guard: modifying any infrastructure file fails the gate immediately.
Key design principles
Program the loop, not the agent directly. You steer behavior through PROGRAM.md. The coding agent edits agent/agent.py. This separation lets you adjust the optimization strategy without touching the agent under test.
Benchmark-agnostic loop. The same gating, recording, and workspace format works for any benchmark that returns per-task rewards between 0.0 and 1.0. To add a new benchmark, subclass BenchmarkRunner and register it in two places.
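As a rough illustration of that contract, the sketch below defines a stand-in base class and a toy subclass. Everything in it is hypothetical: the real `BenchmarkRunner` interface and the two registration points are not described on this page, so `task_ids`, `run_task`, `run_all`, and `EchoBenchmark` are invented for the example. The only property taken from the text is that each task yields a reward between 0.0 and 1.0.

```python
from abc import ABC, abstractmethod


class BenchmarkRunner(ABC):
    """Stand-in for illustration only; the real base class may look different."""

    @abstractmethod
    def task_ids(self) -> list[str]:
        """Return the identifiers of the tasks to run (hypothetical method)."""

    @abstractmethod
    def run_task(self, task_id: str) -> float:
        """Run one task and return its reward (hypothetical method)."""

    def run_all(self) -> dict[str, float]:
        """Run every task and check that each reward lies in [0.0, 1.0]."""
        results = {}
        for task_id in self.task_ids():
            reward = float(self.run_task(task_id))
            if not 0.0 <= reward <= 1.0:
                raise ValueError(f"{task_id}: reward {reward} outside [0.0, 1.0]")
            results[task_id] = reward
        return results


class EchoBenchmark(BenchmarkRunner):
    """Toy benchmark: a task passes when the agent echoes its input."""

    def __init__(self, agent):
        self.agent = agent  # any callable mapping str -> str

    def task_ids(self) -> list[str]:
        return ["hello", "world"]

    def run_task(self, task_id: str) -> float:
        return 1.0 if self.agent(task_id) == task_id else 0.0


results = EchoBenchmark(agent=lambda text: text).run_all()
print(results)  # {'hello': 1.0, 'world': 1.0}
```

Keeping the range check in the base class means a misbehaving benchmark fails loudly instead of feeding out-of-range rewards into the gate.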
Self-maintained evals. The coding agent decides which tasks belong in the regression suite (workspace/suite.json) — no manual curation needed. After each successful gate, newly-passing tasks are automatically promoted into the suite.
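The promotion step might look roughly like the sketch below. It works on in-memory values because the on-disk layout of workspace/suite.json is not described here; the function name `promote` and the rule that a task "passes" when its reward reaches 1.0 are assumptions made for illustration.

```python
def promote(suite: set[str], results: dict[str, float], pass_reward: float = 1.0) -> set[str]:
    """Return the suite extended with tasks that pass now but were not yet in it.

    Assumption: a task counts as passing when its reward is >= pass_reward.
    """
    newly_passing = {
        task_id
        for task_id, reward in results.items()
        if reward >= pass_reward and task_id not in suite
    }
    return suite | newly_passing


# Hypothetical task ids and rewards.
suite = {"task_1"}
results = {"task_1": 1.0, "task_2": 1.0, "task_3": 0.0}
print(sorted(promote(suite, results)))  # ['task_1', 'task_2']
```

Because tasks are only ever added in this sketch, the suite grows monotonically and each later gate protects everything the agent has already learned to solve.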
Gate everything. No change is committed without passing both the eval suite regression check and the full test score gate. The test score must be ≥ the best score seen so far.
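A minimal sketch of the gate decision, under stated assumptions, is shown below. It is not the code in gating.py: the names `gate`, `GateResult`, and `ALLOWED_FILES` are hypothetical, the list of changed files is assumed to come from something like `git diff --name-only`, and the regression check is assumed to require that every suite task still passes (the real threshold may differ).

```python
from dataclasses import dataclass

# Files the coding agent may touch, per the text above.
ALLOWED_FILES = {"agent/agent.py", "workspace/learnings.md"}


@dataclass
class GateResult:
    passed: bool
    reason: str


def gate(
    changed_files: list[str],
    suite_results: dict[str, float],
    test_score: float,
    best_score: float,
) -> GateResult:
    """Apply the file guard, the regression check, then the test-score check."""
    # File guard: any change outside the allowed files fails immediately.
    illegal = sorted(set(changed_files) - ALLOWED_FILES)
    if illegal:
        return GateResult(False, f"file guard: unexpected changes in {illegal}")

    # Regression check (assumption: every suite task must still score 1.0).
    regressed = sorted(t for t, reward in suite_results.items() if reward < 1.0)
    if regressed:
        return GateResult(False, f"regression: {regressed} no longer pass")

    # Test-score check: the new score must be >= the best score seen so far.
    if test_score < best_score:
        return GateResult(False, f"test score: {test_score} < best {best_score}")

    return GateResult(True, "ok")


# Hypothetical inputs for illustration.
print(gate(["agent/agent.py"], {"task_1": 1.0}, test_score=0.80, best_score=0.78).passed)  # True
print(gate(["gating.py"], {"task_1": 1.0}, test_score=0.80, best_score=0.78).passed)  # False
```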
Structural anti-cheating. Test traces are never saved to disk. The coding agent can only read train-split traces, so every improvement must generalize.
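One way to picture this separation is a writer that persists traces only for the train split. The sketch below is an assumption about how such a rule could be enforced, not a description of the framework's internals: `save_outcome`, the file naming, and the decision to keep a per-task reward for non-train splits are all hypothetical.

```python
import json
from pathlib import Path


def save_outcome(out_dir: Path, split: str, task_id: str, reward: float, trace: list[str]) -> Path:
    """Persist one task outcome; the trace is written only for the train split."""
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {"task_id": task_id, "reward": reward}
    if split == "train":
        record["trace"] = trace  # train traces are what the coding agent may read
    # For any other split only the reward is stored, so no test trace reaches disk.
    path = out_dir / f"{split}_{task_id}.json"
    path.write_text(json.dumps(record))
    return path
```

With this shape, the coding agent cannot tune against test behavior even by accident, because the information is never written anywhere it can read.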
Learnings close the feedback loop. After each iteration the agent writes workspace/learnings.md: what it tried, what worked, what it needs from the human. This log is the agent’s memory across sessions.
Supported benchmarks
| Benchmark | Domain | Tasks | Agent interface |
|---|---|---|---|
| Terminal-Bench 2.0 | Real-world terminal tasks (coding, sysadmin, security) | 89 | Bash commands via Harbor containers |
| tau-bench | Customer service (retail, airline, telecom) | retail: 114, airline: 50, telecom: 114 | Structured tool calls via tau2 |
| BIRD-Interact | Interactive text-to-SQL (multi-turn CRUD over Postgres) | lite: 300, full: 600 | Google ADK agent against a 3-service environment |
Get started
Terminal-Bench 2.0 quickstart
Set up the optimization loop on 89 real-world terminal tasks. Requires the harbor CLI and an E2B or Daytona API key.
tau-bench quickstart
Run the loop on customer service tasks across retail, airline, and telecom domains. Requires Docker.
BIRD-Interact quickstart
Run the loop on interactive text-to-SQL tasks over a live Postgres database. Requires Docker and Python 3.12+.
The full blog post
Read how the 0.56 → 0.78 result was achieved on tau-bench using this exact loop.