The auto-harness optimization loop: how it works

The core of auto-harness is a tight, automated feedback loop: a coding agent reads benchmark failures, edits agent/agent.py to address them, gates every change against a live eval suite and a test-split score floor, records the result, and repeats — all driven by the instructions in PROGRAM.md. No human needs to intervene between iterations. The loop is designed to run overnight and accumulate improvements across many cycles.

Overview

run benchmark → analyze → improve agent/agent.py → gate → record → update learnings → repeat

Each step has a clear owner. The coding agent owns the analysis and edits. The harness infrastructure owns the benchmarking, gating, and recording. That separation keeps the loop reproducible: the agent cannot accidentally break the measurement machinery.

Step-by-step walkthrough

Run benchmark

The coding agent runs python benchmark.py, which executes the full train split and writes per-task rewards to workspace/train_results.json. Its standard output lists which tasks passed and which failed, giving the agent its raw signal.

python benchmark.py
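To pick out the failures programmatically, a minimal sketch might look like the following. It assumes workspace/train_results.json is a JSON object mapping task IDs to numeric rewards and that a reward below 1.0 counts as a failure. Neither the schema nor the threshold is specified here, and failing_tasks is a hypothetical helper, so adjust all three to the real file.

```python
import json
from pathlib import Path


def failing_tasks(results_path, pass_threshold=1.0):
    """Return the sorted IDs of tasks whose reward is below pass_threshold.

    Assumed (hypothetical) schema: {"task_id": reward, ...}.
    """
    rewards = json.loads(Path(results_path).read_text(encoding="utf-8"))
    return sorted(task for task, reward in rewards.items() if reward < pass_threshold)
```

Calling failing_tasks("workspace/train_results.json") would then give the list of task IDs whose traces are worth reading first.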

Analyze failures

The agent reads traces from workspace/traces/latest/ for the failing train tasks. It looks for patterns — prompt misunderstandings, missing tool calls, format errors — and appends findings to workspace/learnings.md. Test traces are never available at this stage; only train traces are written to disk, enforcing a strict analysis boundary.

Improve agent/agent.py

Based on the failure analysis, the agent edits agent/agent.py. This is the only tracked file the agent may modify; its notes in workspace/learnings.md live in the gitignored workspace/ directory. The benchmark runner imports HarnessAgent directly from this file, so any change here takes effect on the next run. The recommended practice is one focused hypothesis per iteration: smaller changes are easier to gate and easier to revert if they fail.

Gate the change

The agent runs python gating.py. Four steps execute in sequence: a file guard, a regression suite pass rate check, a full test-split score check, and — if all pass — suite promotion. Exit code 0 means the change is safe to commit. Exit code 1 means the change regressed something; the agent reverts and tries a different approach.

python gating.py
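The gate's control flow can be sketched as a short function that fails fast at the first broken check. This is an illustration, not the actual gating.py: the function name, its inputs, and the min_pass_rate and score_floor parameters are all assumptions, and the promotion step is represented only by a message.

```python
# Tracked files the coding agent is allowed to modify.
ALLOWED_FILES = {"agent/agent.py"}


def gate(changed_files, suite_pass_rate, test_score,
         min_pass_rate=1.0, score_floor=0.0):
    """Return (exit_code, reason). Exit code 0 means the change may be committed.

    min_pass_rate and score_floor are hypothetical stand-ins for whatever
    thresholds the real gating.py applies.
    """
    # 1. File guard: reject changes to any tracked file outside the allow-list.
    unexpected = sorted(set(changed_files) - ALLOWED_FILES)
    if unexpected:
        return 1, f"file guard: unexpected changes to {unexpected}"

    # 2. Regression suite: the pass rate must not fall below the minimum.
    if suite_pass_rate < min_pass_rate:
        return 1, f"regression suite: pass rate {suite_pass_rate:.2f} < {min_pass_rate:.2f}"

    # 3. Test split: the score must not drop below the floor.
    if test_score < score_floor:
        return 1, f"test split: score {test_score:.2f} < floor {score_floor:.2f}"

    # 4. Promotion: only reached when every check above passed.
    return 0, "all checks passed; suite promotion would run here"
```

Because each check returns as soon as it fails, a non-zero exit code always comes with a single reason, which makes the revert decision straightforward.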

Record the result

After a gate pass, the agent commits agent/agent.py and calls record.py to append the iteration’s val_score, suite pass counts, commit hash, and timestamp to workspace/results.tsv.

git add agent/agent.py
git commit -m "improve: <what changed and why>"
python record.py --val-score <X> --evals-passed <N> --evals-total <M>
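As a rough picture of what the recording step produces, the sketch below appends one tab-separated row per iteration. The column names, their order, and the append_result helper are assumptions made for illustration; the real record.py may lay out workspace/results.tsv differently.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column layout; the real results.tsv may differ.
COLUMNS = ["timestamp", "commit", "val_score", "evals_passed", "evals_total"]


def append_result(tsv_path, commit, val_score, evals_passed, evals_total, now=None):
    """Append one iteration row to a TSV file, writing a header if the file is new."""
    path = Path(tsv_path)
    is_new = not path.exists() or path.stat().st_size == 0
    now = now or datetime.now(timezone.utc)
    row = [now.isoformat(timespec="seconds"), commit,
           f"{val_score:.4f}", evals_passed, evals_total]
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        if is_new:
            writer.writerow(COLUMNS)
        writer.writerow(row)
```

Opening the file in append mode keeps earlier rows intact, so the file grows into a chronological history of gated iterations.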

Update learnings

Whether the gate passed or failed, the agent appends a structured entry to workspace/learnings.md. This entry records what changed, what patterns were confirmed, what worked or didn’t, and any unresolved questions for the human operator.

Repeat

The agent returns to step 1 and begins the next iteration, carrying forward all prior context from learnings.md.
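The ordering of the steps above, including the branch on the gate's exit code, can be summarized in a small sketch. Each step is passed in as a callable so the control flow stands alone; run_iteration and its parameter names are hypothetical, and in practice the coding agent carries out these steps by following PROGRAM.md rather than by calling a driver function.

```python
from typing import Callable


def run_iteration(
    run_benchmark: Callable[[], dict],
    analyze_and_edit: Callable[[dict], None],
    run_gate: Callable[[], int],
    record_result: Callable[[], None],
    revert_edit: Callable[[], None],
    append_learnings: Callable[[bool], None],
) -> bool:
    """Run one loop iteration and return True if the gate passed."""
    results = run_benchmark()      # train split, per-task rewards
    analyze_and_edit(results)      # read failures, edit agent/agent.py
    passed = run_gate() == 0       # exit code 0 means safe to commit
    if passed:
        record_result()            # commit and append to results.tsv
    else:
        revert_edit()              # discard the change
    append_learnings(passed)       # always logged, pass or fail
    return passed
```

Note that the learnings update sits outside the pass/fail branch, which mirrors the rule that an entry is written for every iteration.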

PROGRAM.md: the agent’s instruction set

PROGRAM.md is the document the coding agent reads at the start of every session. It specifies the exact loop steps, the commands to run at each step, the rules the agent must follow, and the file formats it will encounter. It is generated by prepare.py from a benchmark-specific template in program_templates/ and is committed to the repo so the coding agent always has a stable reference. Operators steer the agent’s behavior by editing PROGRAM.md: adding domain-specific guidance, adjusting constraints, or flagging patterns the agent should focus on. The agent, in turn, edits only agent/agent.py. This division means you can change the agent’s strategy without touching any benchmark code.

What the coding agent does vs what the harness does

| Responsibility | Coding agent | Harness infrastructure |
| --- | --- | --- |
| Edits | `agent/agent.py`, `workspace/learnings.md` | All other files |
| Analysis | Reads train traces, identifies patterns | Writes traces to `workspace/traces/latest/` |
| Gating | Runs `gating.py`, interprets exit code | Executes benchmark, checks pass rates and score |
| Recording | Runs `record.py` with correct args | Appends row to `results.tsv`, validates commit |
| Memory | Appends to `learnings.md` each iteration | Maintains `suite.json`, `results.tsv`, `train_results.json` |

Train/test split and anti-cheating design

Every benchmark is split into a train portion and a test portion at setup time by prepare.py. The coding agent can run the train split freely and read its traces. The test split is reserved for gating: gating.py always runs the full test benchmark as its test-split score check, but the resulting traces are never written to disk. This is a deliberate structural constraint. Because test traces are never saved, the coding agent has no way to read test failures and tune agent/agent.py specifically against them. Improvements must generalize from the train split to the test split, which is what the val_score gate measures.

The only traces available in workspace/traces/ are from the train split. Test-split execution happens inside gating.py and its output is discarded after the score is extracted.
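One way to picture the "score only, no traces" rule is a scoring function that never returns or stores per-task detail. The sketch below is an assumption-laden illustration: score_test_split and run_task are hypothetical names, and treating val_score as the mean reward over the test split is a guess about the aggregation rather than something stated here.

```python
def score_test_split(test_tasks, run_task):
    """Run every test task and return only the aggregate score.

    run_task is a hypothetical callable returning (reward, trace). Each trace
    is dropped immediately, so nothing about individual test tasks is
    persisted or handed back to the caller.
    """
    if not test_tasks:
        raise ValueError("test split is empty")
    total = 0.0
    for task in test_tasks:
        reward, _trace = run_task(task)  # trace intentionally discarded
        total += reward
    return total / len(test_tasks)
```

The caller sees a single number, so the only feedback that reaches the coding agent from the test split is whether the score cleared the gate.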

learnings.md as cross-iteration memory

workspace/learnings.md is the agent’s persistent log across all iterations. After every iteration — regardless of whether the gate passed or failed — the agent appends a structured entry covering what it tried, what patterns it confirmed, what worked, and what it needs from the human operator. Because workspace/ is gitignored, learnings.md is not subject to the file guard in gating.py. The agent can edit it freely at any point without affecting the gate outcome. Over many iterations the file accumulates a detailed record of the optimization trajectory, which informs future hypotheses and surfaces persistent failure modes that need human attention.

## Iteration N — val_score: X.XX → Y.YY ✓/✗

**What changed:** <one sentence>

**Pattern confirmed:** <failure mode>

**What worked / didn't work:** <specifics>

**Needs from human:** <or "none">
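Filling in this template by hand is easy to get wrong across many iterations, so a small formatter can help. The sketch below renders the template shown above; format_learning_entry and its parameter names are hypothetical and not part of the harness.

```python
def format_learning_entry(iteration, score_before, score_after, gate_passed,
                          what_changed, pattern, outcome, needs_from_human="none"):
    """Render one learnings.md entry following the template above."""
    mark = "\u2713" if gate_passed else "\u2717"  # check mark or ballot X
    header = (f"## Iteration {iteration} \u2014 val_score: "
              f"{score_before:.2f} \u2192 {score_after:.2f} {mark}")
    sections = [
        header,
        f"**What changed:** {what_changed}",
        f"**Pattern confirmed:** {pattern}",
        f"**What worked / didn't work:** {outcome}",
        f"**Needs from human:** {needs_from_human}",
    ]
    # Blank lines between sections match the template's spacing.
    return "\n\n".join(sections) + "\n"
```

The returned string can be appended to workspace/learnings.md with an ordinary file write in append mode, which leaves earlier entries untouched.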
