auto-harness is an open-source framework by Neosigma for building self-improving agentic systems. You point it at a benchmark and a coding agent, and it runs a continuous loop: benchmark your agent, analyze failures, improve agent/agent.py, gate the change against a self-maintained eval suite, and repeat, overnight and unattended.
On tau-bench, this loop raised the agent's score from 0.56 to 0.78 (a relative improvement of about 40%) through automated failure mining and harness optimization.
Quickstart: Terminal-Bench
Get the optimization loop running on real-world terminal tasks in minutes
Quickstart: tau-bench
Run the loop on customer service benchmark tasks with Docker
The Optimization Loop
Understand how benchmark → analyze → improve → gate → repeat works
Extend with Your Benchmark
Plug in any benchmark by subclassing BenchmarkRunner
How it works
auto-harness separates the loop infrastructure from the agent under optimization. You write (or start from a template) an agent/agent.py that implements HarnessAgent. The harness handles everything else: running the benchmark, analyzing failure traces, gating changes, and recording results.
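As a rough illustration of this split, the sketch below models the benchmark, analyze, improve, gate cycle in plain Python. Everything in it is hypothetical: `run_benchmark`, `failures`, `propose_change`, `optimize`, and the toy task table are stand-ins invented for this example and are not auto-harness's actual API. In the real project the "improve" step is carried out by a coding agent editing agent/agent.py, not by a function that can see the expected answers.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for illustration; none of these names come from auto-harness.
Agent = Callable[[str], str]      # maps a task id to an answer
Rewards = Dict[str, float]        # per-task reward, keyed by task id

# Toy benchmark: task id -> expected answer.
TASKS: Dict[str, str] = {"t1": "A", "t2": "B", "t3": "C"}


def run_benchmark(agent: Agent) -> Rewards:
    """Score the agent on every task: 1.0 if the answer matches, else 0.0."""
    return {tid: float(agent(tid) == expected) for tid, expected in TASKS.items()}


def mean_score(rewards: Rewards) -> float:
    """Average the per-task rewards into a single score."""
    return sum(rewards.values()) / len(rewards)


def failures(rewards: Rewards) -> List[str]:
    """Analyze step: return the ids of tasks the agent failed, in sorted order."""
    return sorted(tid for tid, reward in rewards.items() if reward < 1.0)


def propose_change(agent: Agent, failed: List[str]) -> Agent:
    """Improve step (toy): wrap the agent so it answers the first failed task correctly."""
    target = failed[0]

    def patched(task_id: str) -> str:
        return TASKS[target] if task_id == target else agent(task_id)

    return patched


def optimize(agent: Agent, max_iters: int = 10) -> Tuple[Agent, List[float]]:
    """Run benchmark -> analyze -> improve -> gate until no failures remain."""
    history = [mean_score(run_benchmark(agent))]   # baseline score
    for _ in range(max_iters):
        failed = failures(run_benchmark(agent))
        if not failed:
            break
        candidate = propose_change(agent, failed)
        candidate_score = mean_score(run_benchmark(candidate))
        # Gate step: keep the candidate only if the score does not regress.
        if candidate_score >= history[-1]:
            agent = candidate
            history.append(candidate_score)
    return agent, history


def baseline(task_id: str) -> str:
    """A deliberately weak starting agent that always answers "A"."""
    return "A"


final_agent, scores = optimize(baseline)
```

Here `scores` records the baseline followed by each accepted improvement, which loosely mirrors the role the results log plays in the real loop.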
Initialize the workspace
Run python prepare.py to set up workspace files, copy the agent template for your benchmark, and record a baseline score.
Start the optimization loop
Point your coding agent (Claude Code, Codex CLI, or similar) at the repo and prompt it to read PROGRAM.md and start the loop.
The agent iterates automatically
The coding agent reads failure traces, edits agent/agent.py, runs python gating.py to gate the change, commits if it passes, and records results to workspace/results.tsv.
Supported benchmarks
Terminal-Bench 2.0
89 real-world terminal tasks covering coding, sysadmin, and security
tau-bench
Customer service tasks across retail, airline, and telecom domains
BIRD-Interact
Interactive text-to-SQL with multi-turn CRUD over Postgres
Key design principles
- Program the loop, not the agent directly. You steer through PROGRAM.md; the coding agent edits agent/agent.py.
- Benchmark-agnostic. The same gating, recording, and workspace format work for any benchmark that returns per-task rewards.
- Self-maintained evals. The coding agent decides which tasks belong in the regression suite — no manual curation needed.
- Gate everything. No change is committed without passing both the eval suite and the full test score gate.
- Anti-cheating by design. Test traces are never saved to disk; the coding agent can only read train traces.
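To make the "gate everything" principle concrete, here is a minimal sketch of a two-part gate followed by a tab-separated results record. It is an assumption-laden illustration, not the logic of gating.py: the function names `passes_gate` and `record_result`, the pass threshold, the "no regression against the best score so far" rule, and the column layout are all hypothetical choices made for this example.

```python
import csv
import io
from typing import Dict, TextIO


def passes_gate(
    suite_rewards: Dict[str, float],
    test_score: float,
    best_test_score: float,
    pass_threshold: float = 1.0,
) -> bool:
    """Return True only if both hypothetical gates pass.

    Gate 1: every task in the regression suite meets the pass threshold.
    Gate 2: the full test score is not below the best score recorded so far.
    """
    suite_ok = all(reward >= pass_threshold for reward in suite_rewards.values())
    score_ok = test_score >= best_test_score
    return suite_ok and score_ok


def record_result(out: TextIO, iteration: int, test_score: float, committed: bool) -> None:
    """Append one tab-separated row: iteration, score, committed flag (assumed columns)."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([iteration, f"{test_score:.2f}", int(committed)])


# Usage: gate a candidate, then log the outcome to an in-memory buffer.
log = io.StringIO()
ok = passes_gate({"task_a": 1.0, "task_b": 1.0}, test_score=0.60, best_test_score=0.56)
record_result(log, iteration=1, test_score=0.60, committed=ok)
```

Note that only an aggregate test score enters the gate in this sketch; no per-task test traces are passed around, which is in the spirit of the anti-cheating principle above. An empty regression suite passes the first gate vacuously, so a real implementation might want to treat that case explicitly.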