auto-harness is an open-source framework by Neosigma for building self-improving agentic systems. You point it at a benchmark and a coding agent, and it runs a continuous loop: benchmark your agent, analyze failures, improve agent/agent.py, gate the change against a self-maintained eval suite, and repeat. The loop is designed to run overnight, unattended. On tau-bench, it raised the agent's score from 0.56 to 0.78 (a relative improvement of roughly 40%) through automated failure mining and harness optimization.

Quickstart: Terminal-Bench

Get the optimization loop running on real-world terminal tasks in minutes

Quickstart: tau-bench

Run the loop on customer service benchmark tasks with Docker

The Optimization Loop

Understand how benchmark → analyze → improve → gate → repeat works

Extend with Your Benchmark

Plug in any benchmark by subclassing BenchmarkRunner

How it works

auto-harness separates the loop infrastructure from the agent under optimization. You write (or start from a template) an agent/agent.py that implements HarnessAgent. The harness handles everything else: running the benchmark, analyzing failure traces, gating changes, and recording results.
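
The split between agent code and harness code can be pictured with a small sketch. This is not the auto-harness API: the `HarnessAgent` class below is a local stand-in, and the `solve` method, the `EchoAgent` class, the `run_benchmark` function, and the task format are all hypothetical names chosen for illustration. The only details taken from this page are that the agent lives in agent/agent.py, implements `HarnessAgent`, and that a benchmark returns per-task rewards.

```python
from abc import ABC, abstractmethod


class HarnessAgent(ABC):
    """Hypothetical stand-in for the real base class (interface assumed)."""

    @abstractmethod
    def solve(self, task: str) -> str:
        """Return the agent's answer for one task (assumed method name)."""


class EchoAgent(HarnessAgent):
    """Toy agent: the kind of code the coding agent would edit in agent/agent.py."""

    def solve(self, task: str) -> str:
        return task.upper()


def run_benchmark(
    agent: HarnessAgent, tasks: dict[str, tuple[str, str]]
) -> dict[str, float]:
    """Harness side: run every task and return a reward per task id."""
    rewards = {}
    for task_id, (prompt, expected) in tasks.items():
        rewards[task_id] = 1.0 if agent.solve(prompt) == expected else 0.0
    return rewards


# Two made-up tasks: the toy agent solves the first and fails the second.
tasks = {"t1": ("hello", "HELLO"), "t2": ("bye", "goodbye")}
rewards = run_benchmark(EchoAgent(), tasks)
print(rewards)  # {'t1': 1.0, 't2': 0.0}
```

In this picture only `EchoAgent` changes between iterations; the scoring code stays fixed, which is what lets the same loop be reused across benchmarks.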
1. Initialize the workspace

   Run python prepare.py to set up workspace files, copy the agent template for your benchmark, and record a baseline score.

2. Start the optimization loop

   Point your coding agent (Claude Code, Codex CLI, or similar) at the repo and prompt it to read PROGRAM.md and start the loop.

3. The agent iterates automatically

   The coding agent reads failure traces, edits agent/agent.py, runs python gating.py to gate the change, commits if it passes, and records results to workspace/results.tsv.

4. Review and steer

   Check workspace/learnings.md after each session. The agent logs what it tried, what worked, and what it needs from you.
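
The control flow of one iteration (edit, gate, commit or revert, record) can be sketched as follows. This is an illustration under assumptions, not the project's code: `run_iteration`, its callbacks, and the four-column row layout are hypothetical, and an in-memory buffer stands in for workspace/results.tsv. The scores in the demo are canned values, not benchmark results.

```python
import csv
import io
from typing import Callable


def run_iteration(
    iteration: int,
    propose_change: Callable[[], str],
    gate: Callable[[], tuple[bool, float]],
    results_file,
) -> bool:
    """Run one improve -> gate -> record step; return True if the change is kept."""
    description = propose_change()  # stands in for the coding agent editing agent/agent.py
    passed, score = gate()          # stands in for running `python gating.py`
    status = "committed" if passed else "reverted"
    # Assumed column layout: iteration, score, status, description.
    writer = csv.writer(results_file, delimiter="\t", lineterminator="\n")
    writer.writerow([iteration, f"{score:.2f}", status, description])
    return passed


# Demo with canned gate outcomes instead of a real benchmark run.
log = io.StringIO()  # in-memory stand-in for workspace/results.tsv
outcomes = iter([(False, 0.54), (True, 0.61)])
kept = [
    run_iteration(i, lambda: f"attempt {i}", lambda: next(outcomes), log)
    for i in (1, 2)
]
print(kept)  # [False, True]
print(log.getvalue())
```

Recording every attempt, including reverted ones, keeps a history that can be reviewed alongside workspace/learnings.md.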

Supported benchmarks

Terminal-Bench 2.0

89 real-world terminal tasks covering coding, sysadmin, and security

tau-bench

Customer service tasks across retail, airline, and telecom domains

BIRD-Interact

Interactive text-to-SQL with multi-turn CRUD over Postgres

Key design principles

  • Program the loop, not the agent directly. You steer through PROGRAM.md; the coding agent edits agent/agent.py.
  • Benchmark-agnostic. The same gating, recording, and workspace format works for any benchmark that returns per-task rewards.
  • Self-maintained evals. The coding agent decides which tasks belong in the regression suite, so no manual curation is needed.
  • Gate everything. No change is committed without passing both the eval suite and the full test score gate.
  • Anti-cheating by design. Test traces are never saved to disk; the coding agent can only read train traces.
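
The "gate everything" principle amounts to a two-part check, which can be sketched as below. The exact rules in gating.py are not described on this page, so the thresholds here are assumptions: every task in the eval suite must score 1.0, and the new test score must not fall below the previous best. `GateResult` and `gate_change` are hypothetical names. The function receives only an aggregate test score, mirroring the rule that test traces are never exposed to the coding agent.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reason: str


def gate_change(
    eval_rewards: dict[str, float],
    new_test_score: float,
    best_test_score: float,
) -> GateResult:
    """Accept a change only if both the eval suite and the test score gate pass."""
    # Gate 1 (assumed rule): no task in the regression suite may score below 1.0.
    regressions = sorted(t for t, r in eval_rewards.items() if r < 1.0)
    if regressions:
        return GateResult(False, f"eval suite regressed: {', '.join(regressions)}")
    # Gate 2 (assumed rule): the aggregate test score must not drop.
    if new_test_score < best_test_score:
        return GateResult(False, "test score dropped below the previous best")
    return GateResult(True, "passed both gates")


print(gate_change({"task_a": 1.0, "task_b": 1.0}, 0.60, 0.58))
print(gate_change({"task_a": 1.0, "task_b": 0.0}, 0.70, 0.58))
```

The second call fails even though the test score went up, because a previously solved eval task regressed; requiring both conditions is what prevents a change from trading old fixes for new ones.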
