The core of auto-harness is a tight, automated feedback loop: a coding agent reads benchmark failures, edits agent/agent.py to address them, gates every change against a live eval suite and a test-split score floor, records the result, and repeats, all driven by the instructions in PROGRAM.md. No human needs to intervene between iterations. The loop is designed to run overnight and accumulate improvements across many cycles.
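The control flow of one cycle can be sketched in a few lines. This is only an illustration of the ordering described above, not code from the repository: run_iteration and every callable passed to it are hypothetical names, and the real loop is carried out by the coding agent following PROGRAM.md rather than by a Python driver.

```python
from typing import Callable

Step = Callable[[], None]


def run_iteration(
    run_benchmark: Step,
    analyze_and_edit: Step,
    run_gate: Callable[[], int],
    commit_and_record: Step,
    revert: Step,
    update_learnings: Callable[[bool], None],
) -> bool:
    """Run one cycle of the loop and return True if the change was kept.

    All parameters are hypothetical stand-ins for the steps the coding
    agent performs by hand (running scripts, editing agent/agent.py).
    """
    run_benchmark()                 # train split: results and traces
    analyze_and_edit()              # read failures, edit agent/agent.py
    passed = run_gate() == 0        # exit code 0 means safe to commit
    if passed:
        commit_and_record()         # commit, then log the iteration
    else:
        revert()                    # undo the edit, try another idea
    update_learnings(passed)        # logged whether or not the gate passed
    return passed
```

Passing the steps in as callables keeps the ordering visible and lets the sketch be exercised with stubs.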
Overview
Step-by-step walkthrough
Run benchmark
The coding agent runs python benchmark.py, which executes the full train split and writes per-task rewards to workspace/train_results.json. Its stdout lists which tasks passed and which failed, giving the agent its raw signal.
Analyze failures
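A natural first move in this step is to pull the failing task IDs out of workspace/train_results.json so the matching traces can be opened. The sketch below assumes the file is a flat JSON object mapping task ID to a numeric reward, and that a reward of 1.0 counts as a pass. Neither detail is confirmed by this page, and load_results and failing_tasks are hypothetical helper names.

```python
import json
from pathlib import Path


def load_results(path: Path) -> dict[str, float]:
    """Read per-task rewards. ASSUMPTION: a flat {task_id: reward} JSON object."""
    return json.loads(path.read_text(encoding="utf-8"))


def failing_tasks(results: dict[str, float], pass_threshold: float = 1.0) -> list[str]:
    """Return task IDs whose reward is below the (assumed) pass threshold."""
    return sorted(task for task, reward in results.items() if reward < pass_threshold)


if __name__ == "__main__":
    results_path = Path("workspace/train_results.json")
    if results_path.exists():
        for task_id in failing_tasks(load_results(results_path)):
            print(task_id)
```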
The agent reads traces from workspace/traces/latest/ for the failing train tasks. It looks for patterns (prompt misunderstandings, missing tool calls, format errors) and appends findings to workspace/learnings.md. Test traces are never available at this stage; only train traces are written to disk, enforcing a strict analysis boundary.
Improve agent/agent.py
Based on the failure analysis, the agent edits agent/agent.py. This is the only code file under the agent’s direct control. The benchmark runner imports HarnessAgent directly from this file, so any change here takes effect on the next run. The recommended practice is one focused hypothesis per iteration: smaller changes are easier to gate and easier to revert if they fail.
Gate the change
The agent runs python gating.py. Four steps execute in sequence: a file guard, a regression-suite pass-rate check, a full test-split score check, and, if all pass, suite promotion. Exit code 0 means the change is safe to commit. Exit code 1 means the change regressed something; the agent reverts and tries a different approach.
Record the result
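To make the shape of a workspace/results.tsv row concrete, here is a minimal sketch of formatting one from the four recorded values. The column names and order, the four-decimal score format, and the format_row helper are all assumptions for illustration; record.py defines the actual layout.

```python
import csv
import io

# ASSUMPTION: column names and order are illustrative, not taken from record.py.
FIELDS = ["timestamp", "commit", "val_score", "suite_passed", "suite_total"]


def format_row(
    val_score: float,
    suite_passed: int,
    suite_total: int,
    commit: str,
    timestamp: str,
) -> str:
    """Return one tab-separated line, newline-terminated."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([timestamp, commit, f"{val_score:.4f}", suite_passed, suite_total])
    return buffer.getvalue()


def append_row(path: str, row: str) -> None:
    """Append a formatted row; earlier rows are never rewritten."""
    with open(path, "a", encoding="utf-8", newline="") as handle:
        handle.write(row)
```

Appending rather than rewriting keeps the file a simple chronological log of iterations.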
After a gate pass, the agent commits agent/agent.py and calls record.py to append the iteration’s val_score, suite pass counts, commit hash, and timestamp to workspace/results.tsv.
Update learnings
Whether the gate passed or failed, the agent appends a structured entry to workspace/learnings.md. This entry records what changed, what patterns were confirmed, what worked or didn’t, and any unresolved questions for the human operator.
PROGRAM.md: the agent’s instruction set
PROGRAM.md is the document the coding agent reads at the start of every session. It specifies the exact loop steps, the commands to run at each step, the rules the agent must follow, and the file formats it will encounter. It is generated by prepare.py from a benchmark-specific template in program_templates/ and is committed to the repo so the coding agent always has a stable reference.
Operators steer the agent’s behavior by editing PROGRAM.md: adding domain-specific guidance, adjusting constraints, or flagging patterns the agent should focus on. The agent, in turn, edits agent/agent.py. This division means you can change the agent’s strategy without touching any benchmark code.
What the coding agent does vs what the harness does
| Responsibility | Coding agent | Harness infrastructure |
|---|---|---|
| Edits | agent/agent.py, workspace/learnings.md | All other files |
| Analysis | Reads train traces, identifies patterns | Writes traces to workspace/traces/latest/ |
| Gating | Runs gating.py, interprets exit code | Executes benchmark, checks pass rates and score |
| Recording | Runs record.py with correct args | Appends row to results.tsv, validates commit |
| Memory | Appends to learnings.md each iteration | Maintains suite.json, results.tsv, train_results.json |
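The ownership split in the table is what a file guard would enforce. The following is a minimal sketch of such a check, assuming the guard receives a list of changed repository paths and ignores everything under workspace/; the constant names and guard_violations are hypothetical, and the actual guard in gating.py may work differently.

```python
# ASSUMPTION: these sets mirror the table above; gating.py may define them differently.
AGENT_EDITABLE = {"agent/agent.py"}
IGNORED_PREFIXES = ("workspace/",)  # gitignored, so outside the guard's scope


def guard_violations(changed_paths: list[str]) -> list[str]:
    """Return changed paths the coding agent was not allowed to touch."""
    violations = []
    for path in changed_paths:
        if path in AGENT_EDITABLE or path.startswith(IGNORED_PREFIXES):
            continue
        violations.append(path)
    return violations


if __name__ == "__main__":
    changed = ["agent/agent.py", "workspace/learnings.md", "gating.py"]
    print(guard_violations(changed))
```

An empty list means the change stays within the agent's column of the table; any entry would be grounds for failing the gate before a benchmark is run.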
Train/test split and anti-cheating design
Every benchmark is split into a train portion and a test portion at setup time by prepare.py. The coding agent can run the train split freely and read its traces. The test split is reserved for gating: gating.py always runs the full test benchmark as Step 2, but the resulting traces are never written to disk.
This is a deliberate structural constraint. Because test traces are never saved, the coding agent has no way to read test failures and tune agent/agent.py specifically against them. Improvements must generalize from the train split to the test split, which is what the val_score gate measures.
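The gate decision itself needs only aggregate numbers, which is why no test traces have to leave gating.py. A minimal sketch of that decision follows, assuming the regression suite must pass in full and the test-split score must meet a floor. The thresholds, GateResult, evaluate_gate, and exit_code are hypothetical and are not taken from gating.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    reason: str


def evaluate_gate(
    suite_passed: int,
    suite_total: int,
    val_score: float,
    score_floor: float,
    min_suite_pass_rate: float = 1.0,  # ASSUMPTION: full pass required
) -> GateResult:
    """Decide pass/fail from aggregates only; no per-task test data is needed."""
    # ASSUMPTION: an empty regression suite passes vacuously.
    pass_rate = suite_passed / suite_total if suite_total else 1.0
    if pass_rate < min_suite_pass_rate:
        return GateResult(False, "regression suite pass rate below threshold")
    if val_score < score_floor:
        return GateResult(False, "test-split score below floor")
    return GateResult(True, "ok")


def exit_code(result: GateResult) -> int:
    """Map the decision onto the 0/1 exit codes the agent interprets."""
    return 0 if result.passed else 1
```

Because the function sees only counts and a single score, there is nothing test-specific for the coding agent to tune against.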
The only traces available in workspace/traces/ are from the train split. Test-split execution happens inside gating.py and its output is discarded after the score is extracted.
learnings.md as cross-iteration memory
workspace/learnings.md is the agent’s persistent log across all iterations. After every iteration — regardless of whether the gate passed or failed — the agent appends a structured entry covering what it tried, what patterns it confirmed, what worked, and what it needs from the human operator.
Because workspace/ is gitignored, learnings.md is not subject to the file guard in gating.py. The agent can edit it freely at any point without affecting the gate outcome. Over many iterations the file accumulates a detailed record of the optimization trajectory, which informs future hypotheses and surfaces persistent failure modes that need human attention.
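As an illustration of what a structured entry could look like, the sketch below formats the four items named above and appends them to the log. The entry layout, the field labels, and the format_entry and append_entry helpers are assumptions; PROGRAM.md specifies the format the agent actually uses.

```python
from pathlib import Path


def format_entry(
    iteration: int,
    gate_passed: bool,
    changed: str,
    patterns: list[str],
    outcome: str,
    open_questions: list[str],
) -> str:
    """Build one entry. ASSUMPTION: this layout is illustrative only."""
    status = "gate passed" if gate_passed else "gate failed"
    lines = [
        f"## Iteration {iteration} ({status})",
        f"Changed: {changed}",
        "Patterns confirmed:",
        *[f"- {pattern}" for pattern in patterns],
        f"Outcome: {outcome}",
        "Open questions for the operator:",
        *[f"- {question}" for question in open_questions],
    ]
    return "\n".join(lines) + "\n\n"


def append_entry(path: Path, entry: str) -> None:
    """Append to the log so earlier iterations are preserved."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(entry)
```

Opening the file in append mode matches the description of learnings.md as an accumulating record: failed iterations stay in the log alongside successful ones.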