auto-harness: Build self-improving agentic systems

auto-harness is an open-source framework by Neosigma for building self-improving agentic systems. You point it at a benchmark and a coding agent, and it runs a continuous loop: benchmark your agent, analyze failures, improve agent/agent.py, gate the change against a self-maintained eval suite, and repeat, overnight and unattended. On tau-bench, this loop raised the agent's score from 0.56 to 0.78 (a relative improvement of about 40%) through automated failure mining and harness optimization.

Quickstart: Terminal-Bench

Get the optimization loop running on real-world terminal tasks in minutes

Quickstart: tau-bench

Run the loop on customer service benchmark tasks with Docker

The Optimization Loop

Understand how benchmark → analyze → improve → gate → repeat works

Extend with Your Benchmark

Plug in any benchmark by subclassing BenchmarkRunner

How it works

auto-harness separates the loop infrastructure from the agent under optimization. You write (or start from a template) an agent/agent.py that implements HarnessAgent. The harness handles everything else: running the benchmark, analyzing failure traces, gating changes, and recording results.
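The split between agent and harness can be pictured with a small sketch. This is only an illustration under stated assumptions: the real HarnessAgent interface is not shown on this page, so the `AgentLike` protocol, its `solve` method, `EchoAgent`, and `run_benchmark` are all hypothetical names chosen for the example.

```python
from typing import Protocol


class AgentLike(Protocol):
    """Hypothetical stand-in for HarnessAgent; the real interface may differ."""

    def solve(self, task: str) -> str: ...


class EchoAgent:
    """Toy agent under optimization: answers with the task text upper-cased."""

    def solve(self, task: str) -> str:
        return task.upper()


def run_benchmark(agent: AgentLike, tasks: dict[str, str]) -> dict[str, float]:
    """Harness side: reward 1.0 when the answer matches the expected one, else 0.0."""
    return {
        task: 1.0 if agent.solve(task) == expected else 0.0
        for task, expected in tasks.items()
    }


# Toy tasks mapping a prompt to its expected answer (made up for illustration).
rewards = run_benchmark(EchoAgent(), {"ls": "LS", "pwd": "cwd"})
mean_score = sum(rewards.values()) / len(rewards)
```

The point of the sketch is that only the agent class changes between iterations, while the scoring code stays fixed and produces per-task rewards.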

Initialize the workspace

Run python prepare.py to set up workspace files, copy the agent template for your benchmark, and record a baseline score.

Start the optimization loop

Point your coding agent (Claude Code, Codex CLI, or similar) at the repo and prompt it to read PROGRAM.md and start the loop.

The agent iterates automatically

The coding agent reads failure traces, edits agent/agent.py, runs python gating.py to gate the change, commits if it passes, and records results to workspace/results.tsv.
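As a rough mental model of this step, the iteration can be sketched as a keep-or-revert loop that logs one tab-separated row per attempt. The column layout, the `commit`/`revert` labels, the function names, and the pass rule (a candidate must not score below the current best) are assumptions made for the example, and the scores are invented rather than real results.

```python
import csv
import io


def record_result(out, iteration: int, score: float, committed: bool) -> None:
    """Append one tab-separated row; the column layout here is an assumption."""
    csv.writer(out, delimiter="\t", lineterminator="\n").writerow(
        [iteration, f"{score:.2f}", "commit" if committed else "revert"]
    )


def run_iterations(candidate_scores: list[float], baseline: float, out) -> float:
    """Keep a candidate only if it does not score below the current best."""
    best = baseline
    for i, score in enumerate(candidate_scores, start=1):
        passed = score >= best
        if passed:
            best = score  # the change is kept and becomes the new bar
        record_result(out, i, score, passed)
    return best


# Invented scores for three candidate edits, logged to an in-memory buffer.
log = io.StringIO()
final = run_iterations([0.55, 0.52, 0.60], baseline=0.50, out=log)
```

In this toy run the second candidate is rejected because it scores below the first accepted one, so the log shows a commit, a revert, and a commit.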

Review and steer

Check workspace/learnings.md after each session. The agent logs what it tried, what worked, and what it needs from you.

Supported benchmarks

Terminal-Bench 2.0

89 real-world terminal tasks covering coding, sysadmin, and security

tau-bench

Customer service tasks across retail, airline, and telecom domains

BIRD-Interact

Interactive text-to-SQL with multi-turn CRUD over Postgres

Key design principles

Program the loop, not the agent directly. You steer through PROGRAM.md; the coding agent edits agent/agent.py.
Benchmark-agnostic. The same gating, recording, and workspace format works for any benchmark that returns per-task rewards.
Self-maintained evals. The coding agent decides which tasks belong in the regression suite — no manual curation needed.
Gate everything. No change is committed without passing both the eval suite and the full test score gate.
Anti-cheating by design. Test traces are never saved to disk; the coding agent can only read train traces.
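The "gate everything" principle above can be sketched as a two-part check over per-task rewards. This is a minimal illustration, not the logic of gating.py: the function name `passes_gate`, the pass threshold of 1.0 for regression tasks, and the rule that the mean score must not fall below the baseline are all assumptions.

```python
def passes_gate(
    rewards: dict[str, float],
    regression_suite: set[str],
    baseline_score: float,
    pass_threshold: float = 1.0,
) -> bool:
    """Return True only if both gates pass (thresholds are assumptions)."""
    # Gate 1: every task in the regression suite must still pass.
    suite_ok = all(
        rewards.get(task, 0.0) >= pass_threshold for task in regression_suite
    )
    # Gate 2: the mean reward over all tasks must not drop below the baseline.
    full_score = sum(rewards.values()) / len(rewards)
    return suite_ok and full_score >= baseline_score


# Invented per-task rewards for illustration.
rewards = {"a": 1.0, "b": 1.0, "c": 0.0}
accepted = passes_gate(rewards, {"a", "b"}, baseline_score=0.5)
rejected_by_suite = passes_gate(rewards, {"a", "c"}, baseline_score=0.5)
rejected_by_score = passes_gate(rewards, {"a", "b"}, baseline_score=0.9)
```

Because the function returns only a boolean, a caller learns whether the change passed without seeing any per-task trace, which is in the spirit of the anti-cheating principle.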
