The core of auto-harness is a tight, automated feedback loop: a coding agent reads benchmark failures, edits agent/agent.py to address them, gates every change against a live eval suite and a test-split score floor, records the result, and repeats — all driven by the instructions in PROGRAM.md. No human needs to intervene between iterations. The loop is designed to run overnight and accumulate improvements across many cycles.

Overview

run benchmark → analyze → improve agent/agent.py → gate → record → update learnings → repeat
Each step has a clear owner. The coding agent owns the analysis and edits. The harness infrastructure owns the benchmarking, gating, and recording. That separation keeps the loop reproducible: the agent cannot accidentally break the measurement machinery.

Step-by-step walkthrough

1. Run benchmark

The coding agent runs python benchmark.py, which executes the full train split and writes per-task rewards to workspace/train_results.json. The stdout output lists which tasks passed and which failed, giving the agent its raw signal.
python benchmark.py
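
As a rough illustration of how the per-task rewards could be turned into a list of failures, here is a minimal sketch. The schema of workspace/train_results.json is not specified on this page, so the sketch assumes a flat JSON object mapping task id to a numeric reward, and the `pass_threshold` parameter is hypothetical.

```python
import json


def failing_tasks(results_json: str, pass_threshold: float = 1.0) -> list[str]:
    """Return the ids of tasks whose reward is below pass_threshold.

    Assumption: the file is a flat {task_id: reward} object. The real
    schema of workspace/train_results.json may differ.
    """
    rewards = json.loads(results_json)
    return sorted(task for task, reward in rewards.items() if reward < pass_threshold)


# Illustrative data only, not real benchmark output.
sample = json.dumps({"task_a": 1.0, "task_b": 0.0, "task_c": 0.5})
print(failing_tasks(sample))
```

In practice the JSON text would be read from workspace/train_results.json rather than built inline.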
2. Analyze failures

The agent reads traces from workspace/traces/latest/ for the failing train tasks. It looks for patterns — prompt misunderstandings, missing tool calls, format errors — and appends findings to workspace/learnings.md. Test traces are never available at this stage; only train traces are written to disk, enforcing a strict analysis boundary.
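
One way to picture the pattern-finding step is a simple tally of failure labels. This is only a sketch: the labels are the example categories named above, and the idea that the agent assigns exactly one label per failing task is an assumption made for illustration.

```python
from collections import Counter
from typing import Optional


def most_common_pattern(labels: dict[str, str]) -> Optional[tuple[str, int]]:
    """Return (pattern, count) for the most frequent failure pattern.

    `labels` maps a failing train task id to a hypothetical pattern label.
    Returns None when there are no failures to summarize.
    """
    if not labels:
        return None
    return Counter(labels.values()).most_common(1)[0]


# Illustrative labels only.
observed = {
    "task_b": "format error",
    "task_c": "missing tool call",
    "task_d": "format error",
}
print(most_common_pattern(observed))
```

The most frequent pattern is a natural candidate for the single hypothesis tested in the next step.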
3. Improve agent/agent.py

Based on the failure analysis, the agent edits agent/agent.py. This is the only source file under the agent’s direct control; its notes go in workspace/learnings.md. The benchmark runner imports HarnessAgent directly from this file, so any change here takes effect on the next run. The recommended practice is one focused hypothesis per iteration: smaller changes are easier to gate and easier to revert if they fail.
4. Gate the change

The agent runs python gating.py. Four steps execute in sequence: a file guard, a regression suite pass rate check, a full test-split score check, and — if all pass — suite promotion. Exit code 0 means the change is safe to commit. Exit code 1 means the change regressed something; the agent reverts and tries a different approach.
python gating.py
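
The gate's control flow can be pictured as a short-circuiting pipeline. The following is a minimal sketch and not the actual gating.py: the function names, the `ALLOWED_FILES` set, and the `min_pass_rate` and `score_floor` parameters are hypothetical, and the `>=` comparisons are an assumption. It only illustrates the ordering of the checks and the 0/1 exit-code convention.

```python
from typing import Callable

# Hypothetical: the tracked files the coding agent is allowed to change.
ALLOWED_FILES = {"agent/agent.py"}


def file_guard(changed_files: list[str]) -> bool:
    """Pass only if every changed file is in the allowed set."""
    return all(path in ALLOWED_FILES for path in changed_files)


def run_gate(checks: list[tuple[str, Callable[[], bool]]]) -> int:
    """Run checks in order; return 0 if all pass, 1 at the first failure."""
    for name, check in checks:
        if not check():
            print(f"gate failed at: {name}")
            return 1
    # Suite promotion, the fourth step, would run here once all checks pass.
    print("gate passed")
    return 0


def make_checks(changed, suite_pass_rate, val_score, min_pass_rate, score_floor):
    """Build the three ordered checks from already-measured values."""
    return [
        ("file guard", lambda: file_guard(changed)),
        ("regression suite", lambda: suite_pass_rate >= min_pass_rate),
        ("test-split score", lambda: val_score >= score_floor),
    ]


# Illustrative numbers only.
exit_code = run_gate(make_checks(["agent/agent.py"], 0.9, 0.62, 0.8, 0.60))
```

Stopping at the first failing check means a change that touches a forbidden file never reaches the more expensive benchmark runs.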
5. Record the result

After a gate pass, the agent commits agent/agent.py and calls record.py to append the iteration’s val_score, suite pass counts, commit hash, and timestamp to workspace/results.tsv.
git add agent/agent.py
git commit -m "improve: <what changed and why>"
python record.py --val-score <X> --evals-passed <N> --evals-total <M>
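
To make the shape of a results.tsv row concrete, here is a hedged sketch of how such a row could be formatted. The column order, number formatting, and timestamp format are assumptions, and `format_result_row` is a hypothetical helper, not part of record.py.

```python
import csv
import io
from datetime import datetime, timezone


def format_result_row(val_score, evals_passed, evals_total, commit, when=None) -> str:
    """Format one tab-separated row; the column order here is an assumption."""
    when = when or datetime.now(timezone.utc)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(
        [f"{val_score:.4f}", evals_passed, evals_total, commit, when.isoformat()]
    )
    return buffer.getvalue()


# "abc1234" is a placeholder, not a real commit hash.
row = format_result_row(
    0.62, 18, 20, "abc1234", datetime(2025, 1, 1, tzinfo=timezone.utc)
)
print(row, end="")
```

Appending one row per accepted iteration keeps the file easy to load into a spreadsheet or plot later.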
6. Update learnings

Whether the gate passed or failed, the agent appends a structured entry to workspace/learnings.md. This entry records what changed, what patterns were confirmed, what worked or didn’t, and any unresolved questions for the human operator.
7. Repeat

The agent returns to step 1 and begins the next iteration, carrying forward all prior context from learnings.md.
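
The seven steps above can be summarized as a small driver. This is a sketch only: in auto-harness the loop is carried out by the coding agent following PROGRAM.md, not by a Python driver, and `LoopHooks` and `run_loop` are hypothetical names introduced here to show the control flow.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopHooks:
    """Hypothetical callables standing in for each step of the loop."""
    run_benchmark: Callable[[], list[str]]          # returns failing task ids
    analyze: Callable[[list[str]], str]             # returns a hypothesis
    improve: Callable[[str], None]                  # edits agent/agent.py
    gate: Callable[[], int]                         # 0 = pass, 1 = fail
    record: Callable[[], None]                      # commit, then record.py
    revert: Callable[[], None]                      # undo the edit
    update_learnings: Callable[[str, bool], None]   # append to learnings.md


def run_loop(hooks: LoopHooks, iterations: int) -> int:
    """Run a fixed number of iterations; return how many passed the gate."""
    passes = 0
    for _ in range(iterations):
        failures = hooks.run_benchmark()
        hypothesis = hooks.analyze(failures)
        hooks.improve(hypothesis)
        passed = hooks.gate() == 0
        if passed:
            hooks.record()
            passes += 1
        else:
            hooks.revert()
        # Learnings are updated whether or not the gate passed.
        hooks.update_learnings(hypothesis, passed)
    return passes
```

Note that the learnings update sits outside the pass/fail branch, which mirrors step 6: failed attempts are recorded too.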

PROGRAM.md: the agent’s instruction set

PROGRAM.md is the document the coding agent reads at the start of every session. It specifies the exact loop steps, the commands to run at each step, the rules the agent must follow, and the file formats it will encounter. It is generated by prepare.py from a benchmark-specific template in program_templates/ and is committed to the repo so the coding agent always has a stable reference. Operators steer the agent’s behavior by editing PROGRAM.md: adding domain-specific guidance, adjusting constraints, or flagging patterns the agent should focus on. The agent, in turn, edits agent/agent.py. This division means you can change the agent’s strategy without touching any benchmark code.

What the coding agent does vs what the harness does

| Responsibility | Coding agent | Harness infrastructure |
| --- | --- | --- |
| Edits | agent/agent.py, workspace/learnings.md | All other files |
| Analysis | Reads train traces, identifies patterns | Writes traces to workspace/traces/latest/ |
| Gating | Runs gating.py, interprets exit code | Executes benchmark, checks pass rates and score |
| Recording | Runs record.py with correct args | Appends row to results.tsv, validates commit |
| Memory | Appends to learnings.md each iteration | Maintains suite.json, results.tsv, train_results.json |

Train/test split and anti-cheating design

Every benchmark is split into a train portion and a test portion at setup time by prepare.py. The coding agent can run the train split freely and read its traces. The test split is reserved for gating: gating.py always runs the full test benchmark as Step 2, but the resulting traces are never written to disk. This is a deliberate structural constraint. Because test traces are never saved, the coding agent has no way to read test failures and tune agent/agent.py specifically against them. Improvements must generalize from the train split to the test split, which is what the val_score gate measures.
Note: The only traces available in workspace/traces/ are from the train split. Test-split execution happens inside gating.py and its output is discarded after the score is extracted.
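
The two ideas in this section, a fixed split and a test score computed without persisting traces, can be sketched as follows. This is an illustration under stated assumptions, not the logic of prepare.py or gating.py: the seeded shuffle, the `test_fraction` parameter, the mean-reward score, and both function names are hypothetical.

```python
import random


def split_tasks(task_ids, test_fraction=0.5, seed=0):
    """Deterministically split task ids into (train, test) lists."""
    ordered = sorted(task_ids)
    random.Random(seed).shuffle(ordered)
    n_test = int(len(ordered) * test_fraction)
    return ordered[n_test:], ordered[:n_test]


def score_without_traces(run_task, test_ids):
    """Return the mean reward over test tasks.

    Only the numeric rewards are kept; nothing about the individual
    runs is written to disk, so there is nothing for the agent to read.
    """
    rewards = [run_task(task_id) for task_id in test_ids]
    return sum(rewards) / len(rewards) if rewards else 0.0


ids = [f"t{i}" for i in range(10)]
train, test = split_tasks(ids, test_fraction=0.5, seed=1)
```

Fixing the seed means the same tasks stay in the test split across iterations, so scores remain comparable from one gate run to the next.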

learnings.md as cross-iteration memory

workspace/learnings.md is the agent’s persistent log across all iterations. After every iteration — regardless of whether the gate passed or failed — the agent appends a structured entry covering what it tried, what patterns it confirmed, what worked, and what it needs from the human operator. Because workspace/ is gitignored, learnings.md is not subject to the file guard in gating.py. The agent can edit it freely at any point without affecting the gate outcome. Over many iterations the file accumulates a detailed record of the optimization trajectory, which informs future hypotheses and surfaces persistent failure modes that need human attention.
## Iteration N — val_score: X.XX → Y.YY ✓/✗

**What changed:** <one sentence>

**Pattern confirmed:** <failure mode>

**What worked / didn't work:** <specifics>

**Needs from human:** <or "none">
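
A helper that renders an entry in the template above might look like the sketch below. `format_learnings_entry` is a hypothetical function, and formatting the scores to two decimals is an assumption based on the X.XX placeholders.

```python
def format_learnings_entry(
    iteration, before, after, passed, changed, pattern, outcome, needs="none"
):
    """Render one learnings.md entry following the template above."""
    mark = "\u2713" if passed else "\u2717"  # check mark or cross
    return (
        f"## Iteration {iteration} \u2014 val_score: "
        f"{before:.2f} \u2192 {after:.2f} {mark}\n\n"
        f"**What changed:** {changed}\n\n"
        f"**Pattern confirmed:** {pattern}\n\n"
        f"**What worked / didn't work:** {outcome}\n\n"
        f"**Needs from human:** {needs}\n"
    )


# Illustrative values only.
entry = format_learnings_entry(
    3, 0.58, 0.62, True,
    changed="tightened the output format instructions",
    pattern="format errors",
    outcome="format failures dropped on the train split",
)
print(entry)
```

The rendered string would then be appended to workspace/learnings.md, for example by opening the file in append mode with UTF-8 encoding.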
