PROGRAM.md is the instruction set that drives the coding agent’s optimization loop. It tells the coding agent (Claude Code, Codex CLI, or any similar tool) exactly what to do in each iteration: which commands to run, which files to read and edit, how to interpret results, and when to stop. prepare.py generates PROGRAM.md automatically from two source files: a shared base (program_templates/base.md) and a benchmark-specific supplement.
How prepare.py composes PROGRAM.md
When you run python prepare.py, it performs the following composition:
The composed file is written to PROGRAM.md at the repo root. The coding agent reads this file at the start of each session. Do not edit PROGRAM.md directly: it is regenerated on every prepare.py run. To change the base loop, edit program_templates/base.md. To change benchmark-specific guidance, edit the corresponding supplement.
The three built-in supplements are:
| Supplement file | Benchmark |
|---|---|
| program_templates/terminal_bench.md | terminal-bench |
| program_templates/tau_bench.md | tau-bench |
| program_templates/bird_interact.md | bird-interact |
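The composition described above can be sketched in a few lines of Python. This is an illustration only, not the actual prepare.py code: the function name compose_program, the SUPPLEMENTS mapping, and the blank-line separator between the two parts are assumptions made for the example.

```python
from pathlib import Path

# Hypothetical lookup built from the supplement table above.
SUPPLEMENTS = {
    "terminal-bench": "terminal_bench.md",
    "tau-bench": "tau_bench.md",
    "bird-interact": "bird_interact.md",
}


def compose_program(benchmark: str, templates_dir: Path, out_path: Path) -> str:
    """Append the benchmark supplement to the base template and write the result."""
    base = (templates_dir / "base.md").read_text(encoding="utf-8")
    supplement = (templates_dir / SUPPLEMENTS[benchmark]).read_text(encoding="utf-8")
    # Assumption: the two parts are separated by exactly one blank line.
    composed = base.rstrip("\n") + "\n\n" + supplement.rstrip("\n") + "\n"
    out_path.write_text(composed, encoding="utf-8")
    return composed
```

Because the output is rebuilt from the two inputs on each call, any manual edit to the output file would be lost on the next run, which is why changes belong in the templates.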
The 7-step loop
The base template defines a repeating loop. The coding agent works through these steps sequentially, then returns to step 1.
Step 1 — Run benchmark
benchmark.py writes per-task results to workspace/train_results.json. The coding agent reads stdout to identify which tasks failed.
Step 2 — Analyze failures
The coding agent reads train-split traces for failing tasks to determine the root cause. It never reads test data; only train traces are available. Key questions to answer:
- What commands did the agent run?
- Did it understand the task correctly?
- Did it explore the environment before acting?
- Was there a prompt issue, a tool issue, or a logic issue?
The coding agent records its findings in workspace/learnings.md.
Step 3 — Improve agent
The coding agent edits agent/agent.py. It owns the entire file. Changes are kept focused: one hypothesis per iteration, small and reversible.
Step 4 — Gate
| Step | What it checks | Failure action |
|---|---|---|
| Step 0 — File guard | No tracked files outside agent/agent.py and PROGRAM.md were modified | Exit 1 immediately |
| Step 1 — Regression suite | Tasks in suite.json pass at rate ≥ threshold | Exit 1 |
| Step 2 — Full test | val_score on the test split ≥ best recorded in results.tsv | Exit 1 |
| Step 3 — Suite promotion | Re-runs previously-failing train tasks, promotes newly-passing ones into suite.json | (runs only if Steps 1+2 pass) |
If the gate exits 1, revert agent/agent.py and try a different approach.
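The pass/fail logic in the gate table can be sketched as a pure function. This is a simplified model, not the real gating.py: the GateInputs fields and the function name are hypothetical, and treating an empty regression suite as passing is an assumption.

```python
from dataclasses import dataclass

# Files the file guard allows to change, per the Step 0 row of the table.
ALLOWED_FILES = {"agent/agent.py", "PROGRAM.md"}


@dataclass
class GateInputs:
    modified_files: list   # tracked files changed in this iteration
    suite_passed: int      # regression-suite tasks that passed
    suite_total: int       # regression-suite tasks that ran
    threshold: float       # required pass rate from suite.json
    val_score: float       # score on the test split
    best_val_score: float  # best value recorded in results.tsv


def gate_exit_code(g: GateInputs) -> int:
    """Return 0 if gate Steps 0 to 2 pass, else 1."""
    # Step 0: file guard.
    if any(path not in ALLOWED_FILES for path in g.modified_files):
        return 1
    # Step 1: regression suite (assumption: an empty suite passes trivially).
    if g.suite_total > 0 and g.suite_passed / g.suite_total < g.threshold:
        return 1
    # Step 2: full test must match or beat the best recorded val_score.
    if g.val_score < g.best_val_score:
        return 1
    return 0
```

Suite promotion (Step 3) is left out here because it only runs after this function would return 0.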
Step 5 — Record
After the gate exits 0, commit and record the result. The evals-passed and evals-total values refer to the regression suite results from gate Step 1.
Step 6 — Update learnings
After every iteration, whether the gate passed or failed, the coding agent appends an entry to workspace/learnings.md.
Step 7 — Repeat
Go to step 1. The agent stops when val_score has not improved for 5 consecutive iterations, at which point it writes a summary in learnings.md and surfaces its top findings.
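One way to express the stopping condition is a patience check over the val_score history. The sketch below assumes one val_score per iteration in chronological order, with the first entry as the baseline; the function name should_stop and the patience parameter are hypothetical.

```python
def should_stop(val_scores: list, patience: int = 5) -> bool:
    """True when the last `patience` scores never beat the earlier best."""
    if len(val_scores) <= patience:
        # Not enough history yet to observe `patience` non-improving iterations.
        return False
    best_before = max(val_scores[:-patience])
    return all(score <= best_before for score in val_scores[-patience:])
```

Under this reading, a tie with the previous best does not count as an improvement.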
Rules enforced by the base template
The base template defines 7 rules that the coding agent is instructed to follow:
- Only edit agent/agent.py and workspace/learnings.md. The file guard enforces this at gate time; modifying any other tracked file fails immediately.
- Never skip the gate. Every committed change must pass all three gate steps.
- One hypothesis per iteration. Keep changes small and reversible.
- Always update learnings.md. Even on failure; the log is the agent’s memory.
- Never use test data to guide changes. Only train failures inform improvements.
- Per-task timeouts count as failures. Any task that does not produce a verifier result within per_task_timeout scores 0.0. Consistent timeouts are a signal to simplify the prompt, not to ignore the missing reward.
- Stop when val_score has not improved for 5 consecutive iterations. Write a summary and surface top findings to the human.
File formats
The coding agent reads and writes several workspace files during the loop. These formats are fixed by the infrastructure scripts.
workspace/suite.json
Managed automatically by gating.py. Do not edit.
The tasks array grows as iterations fix previously-failing train tasks and both gate steps pass. The threshold value is set from experiment_config.yaml when prepare.py creates the file.
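As an illustration of how the tasks array might grow, the sketch below models suite promotion on an in-memory dictionary. It is not the gating.py implementation: only the tasks and threshold keys come from the description above, while the promote function, the shape of rerun_rewards, and the rule that a reward of 1.0 counts as passing are assumptions.

```python
def promote(suite: dict, rerun_rewards: dict, pass_reward: float = 1.0) -> dict:
    """Return a copy of the suite with newly-passing task IDs appended."""
    tasks = list(suite["tasks"])
    for task_id, reward in rerun_rewards.items():
        # A None reward models a timed-out task, which is never promoted.
        passed = reward is not None and reward >= pass_reward
        if passed and task_id not in tasks:
            tasks.append(task_id)
    # Every other key, including threshold, is carried over unchanged.
    return {**suite, "tasks": tasks}
```

Returning a new dictionary keeps the input untouched, which makes the step easy to test in isolation.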
workspace/train_results.json
Written by benchmark.py. Do not edit.
Each task's reward is a value in [0.0, 1.0]. A null value means the task timed out and the verifier did not run.
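A short sketch of how a reader of this file could list failing tasks follows. The exact layout of train_results.json is not shown here, so the flat mapping from task ID to reward, the sample task IDs, and the pass_reward cutoff are all assumptions; only the null-means-timeout rule and the 0.0 score for timeouts come from the text.

```python
import json


def failing_tasks(results: dict, pass_reward: float = 1.0) -> list:
    """Return sorted task IDs whose reward is below pass_reward."""
    failed = []
    for task_id, reward in results.items():
        # null (None) means the task timed out; it is scored as 0.0.
        score = 0.0 if reward is None else float(reward)
        if score < pass_reward:
            failed.append(task_id)
    return sorted(failed)


# Hypothetical file contents, for illustration only.
sample = json.loads('{"task_a": 1.0, "task_b": 0.0, "task_c": null}')
print(failing_tasks(sample))  # prints ['task_b', 'task_c']
```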
workspace/results.tsv
Tab-separated. Written by record.py. The coding agent reads this to determine the best val_score seen so far.
The file is created by prepare.py.
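Reading the best val_score out of a tab-separated file can be done with the standard csv module. The sketch below assumes the file has a header row containing a val_score column; the other column in the usage example is hypothetical, and best_val_score is not a function from the repository.

```python
import csv
import io
from typing import Optional


def best_val_score(tsv_text: str) -> Optional[float]:
    """Return the highest val_score in the TSV text, or None if there are no rows."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Skip rows where the val_score cell is missing or empty.
    scores = [float(row["val_score"]) for row in reader if row.get("val_score")]
    return max(scores) if scores else None
```

For example, best_val_score("iteration\tval_score\n0\t0.40\n1\t0.55\n") returns 0.55.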
Writing a benchmark-specific supplement
If you add a custom benchmark by subclassing BenchmarkRunner, you can provide a benchmark-specific supplement that prepare.py will append to the base template.
Create program_templates/<your_benchmark>.md. The file is appended verbatim after the base content. A minimal supplement should cover:
- Task ID format: how to reference tasks when running python benchmark.py --task-ids ...
- Trace location: where the coding agent should read failure traces
- What to edit in agent/agent.py: which classes, methods, or constants are the primary optimization targets
- Benchmark-specific constraints: anything the coding agent must not do for this benchmark
Analyzing Failures (Step 2)
Read train traces here:
Editing agent/agent.py (Step 3)
Focus changes on:
- AGENT_INSTRUCTION: the system prompt
- HarnessAgent.run(): the agent loop