auto-harness: self-improving agent optimization loop

auto-harness is an open-source framework by Neosigma that turns any coding agent into a self-improving system. You give it a benchmark and an agent file. It runs a tight optimization loop: benchmark your agent, analyze failures from traces, improve agent/agent.py, gate the change against a self-maintained eval suite, commit only what passes, and repeat. On tau-bench, this loop raised the agent score from 0.56 to 0.78, an absolute gain of 0.22 and a relative improvement of roughly 39%, through automated failure mining and harness optimization, with no manual curation of the eval suite.

The optimization loop

Every iteration follows the same cycle, defined in PROGRAM.md and executed autonomously by your coding agent:

run benchmark → analyze failures → improve agent/agent.py → gate → record → update learnings → repeat

Step	What happens
Run benchmark	`python benchmark.py` runs all train-split tasks and saves per-task pass/fail to `workspace/train_results.json`
Analyze failures	The coding agent reads train traces, diagnoses root causes, appends patterns to `workspace/learnings.md`
Improve agent	The coding agent edits `agent/agent.py` — one focused change per iteration
Gate	`python gating.py` runs three sequential checks: regression suite, full test score, suite promotion
Record	On gate pass, the agent commits and runs `python record.py` to append the result to `workspace/results.tsv`
Repeat	The loop continues until score plateaus for 5 consecutive iterations
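The stopping rule in the last row can be sketched as a small helper. This is only an illustration: the function name `has_plateaued`, the `patience` parameter, and the definition of a plateau as "no score in the last 5 iterations beats the best score seen before them" are assumptions, not the framework's actual implementation.

```python
def has_plateaued(scores, patience=5):
    """Return True if none of the last `patience` scores beat the earlier best.

    Assumption: `scores` holds one score per iteration, oldest first.
    """
    if len(scores) <= patience:
        # Not enough history to compare a full window against a prior best.
        return False
    best_before = max(scores[:-patience])
    return all(score <= best_before for score in scores[-patience:])
```

With this definition, a run must have at least one iteration before the 5-iteration window, so the loop never stops before the sixth recorded score.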

The coding agent only ever edits agent/agent.py and appends to workspace/learnings.md. The gating step enforces this with a git diff file guard — modifying any infrastructure file fails the gate immediately.
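A file guard of this kind might look like the sketch below. It is not the code in gating.py: the helper names `disallowed_changes` and `changed_files_in_worktree` are hypothetical, and comparing against `HEAD` is an assumption. The only external call used is `git diff --name-only`, which lists the paths that differ from a given commit.

```python
import subprocess

# Assumption: these are the only two paths the coding agent may modify,
# taken from the paragraph above.
ALLOWED_PATHS = {"agent/agent.py", "workspace/learnings.md"}


def disallowed_changes(changed_files, allowed=ALLOWED_PATHS):
    """Return the changed paths that fall outside the allowed set, sorted."""
    return sorted(set(changed_files) - set(allowed))


def changed_files_in_worktree(base="HEAD"):
    """List paths that differ from `base`, using `git diff --name-only`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

A gate built on this would fail as soon as `disallowed_changes(changed_files_in_worktree())` returns a non-empty list, before any benchmark task is run.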

Key design principles

Program the loop, not the agent directly. You steer behavior through PROGRAM.md. The coding agent edits agent/agent.py. This separation lets you adjust the optimization strategy without touching the agent under test.

Benchmark-agnostic loop. The same gating, recording, and workspace format works for any benchmark that returns per-task rewards between 0.0 and 1.0. To add a new benchmark, subclass BenchmarkRunner and register it in two places.

Self-maintained evals. The coding agent decides which tasks belong in the regression suite (workspace/suite.json); no manual curation needed. After each successful gate, newly-passing tasks are automatically promoted into the suite.

Gate everything. No change is committed without passing both the eval suite regression check and the full test score gate. The test score must be ≥ the best score seen so far.

Structural anti-cheating. Test traces are never saved to disk. The coding agent can only read train-split traces, so every improvement must generalize.

Learnings close the feedback loop. After each iteration the agent writes workspace/learnings.md: what it tried, what worked, what it needs from the human. This log is the agent’s memory across sessions.
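The gating and promotion rules described in these principles can be sketched as a single function. This is a simplified illustration, not the logic in gating.py. The names `GateResult` and `run_gate` are hypothetical, and two details are assumptions: that a task counts as passing at reward 1.0, and that the suite threshold is the minimum fraction of suite tasks that must still pass.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    reason: str
    suite: list = field(default_factory=list)


def run_gate(suite, threshold, rewards, test_score, best_score):
    """Apply the three checks in order and stop at the first failure.

    suite      -- task IDs currently in the regression suite
    threshold  -- assumed: minimum fraction of suite tasks that must pass
    rewards    -- task ID -> reward in [0.0, 1.0] for the candidate agent
    """
    # Step 1: regression suite. Assumption: a task passes at reward 1.0.
    if suite:
        passing = sum(1 for task in suite if rewards.get(task, 0.0) >= 1.0)
        if passing / len(suite) < threshold:
            return GateResult(False, "regression", list(suite))

    # Step 2: the full test score must be at least the best seen so far.
    if test_score < best_score:
        return GateResult(False, "test_score", list(suite))

    # Step 3: promote newly passing tasks into the suite.
    promoted = sorted(
        task for task, reward in rewards.items()
        if reward >= 1.0 and task not in suite
    )
    return GateResult(True, "ok", list(suite) + promoted)
```

Because promotion only runs after both checks pass, the suite can grow but a failed iteration never changes it, which is what lets the suite act as a ratchet on previously solved tasks.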

Supported benchmarks

Benchmark	Domain	Tasks	Agent interface
Terminal-Bench 2.0	Real-world terminal tasks (coding, sysadmin, security)	89	Bash commands via Harbor containers
tau-bench	Customer service (retail, airline, telecom)	retail: 114, airline: 50, telecom: 114	Structured tool calls via tau2
BIRD-Interact	Interactive text-to-SQL (multi-turn CRUD over Postgres)	lite: 300, full: 600	Google ADK agent against a 3-service environment

Get started

Terminal-Bench 2.0 quickstart

Set up the optimization loop on 89 real-world terminal tasks. Requires the harbor CLI and an E2B or Daytona API key.

tau-bench quickstart

Run the loop on customer service tasks across retail, airline, and telecom domains. Requires Docker.

BIRD-Interact quickstart

Run the loop on interactive text-to-SQL tasks over a live Postgres database. Requires Docker and Python 3.12+.

The full blog post

Read how the 0.56 → 0.78 result was achieved on tau-bench using this exact loop.

Project structure

agent/
  agent.py                  the agent under optimization — only file the coding agent edits
  templates/                read-only starting points for each benchmark
  helpers/
    bird_interact/          FastAPI wrapper and ADK runtime adapter for BIRD-Interact
benchmark.py                benchmark execution layer (abstract + tau-bench + terminal-bench + bird-interact)
gating.py                   three-step gate (regression suite → full test → suite promotion)
prepare.py                  workspace setup, template copying, baseline run
record.py                   appends iteration result to results.tsv
PROGRAM.md                  loop instructions for the coding agent (generated by prepare.py)
program_templates/          benchmark-specific PROGRAM.md templates
experiment_config.yaml      your experiment configuration (copy from .template)
workspace/
  suite.json                regression eval suite (task IDs + threshold)
  learnings.md              per-run log: patterns, what worked, requests to human
  results.tsv               iteration history (val_score, commit, evals, timestamp)
  traces/                   agent conversation traces for failure analysis
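
To inspect iteration history outside the loop, a short script along these lines could pick out the best recorded iteration. It assumes results.tsv starts with a header row naming the four columns listed above (val_score, commit, evals, timestamp); the actual file layout may differ, and the function name `best_iteration` is hypothetical.

```python
import csv
import io


def best_iteration(tsv_text):
    """Return the row with the highest val_score, or None if there are no rows.

    Assumption: the first line is a tab-separated header that includes
    a `val_score` column.
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return None
    return max(rows, key=lambda row: float(row["val_score"]))
```

Passing the contents of workspace/results.tsv as `tsv_text` returns a dictionary keyed by column name, so the matching commit is available as `row["commit"]`.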
