PROGRAM.md: loop instructions and benchmark supplements

PROGRAM.md is the instruction set that drives the coding agent’s optimization loop. It tells the coding agent (Claude Code, Codex CLI, or any similar tool) exactly what to do in each iteration: which commands to run, which files to read and edit, how to interpret results, and when to stop. prepare.py generates PROGRAM.md automatically from two source files: a shared base (program_templates/base.md) and a benchmark-specific supplement.

How prepare.py composes PROGRAM.md

When you run python prepare.py, it performs the following composition:

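from pathlib import Path

# `benchmark` is the supplement's file stem, e.g. "terminal_bench"
base = Path("program_templates/base.md").read_text()
section = Path(f"program_templates/{benchmark}.md").read_text()
Path("PROGRAM.md").write_text(base.rstrip("\n") + "\n\n" + section)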

The result is a single PROGRAM.md at the repo root. The coding agent reads this file at the start of each session. Do not edit PROGRAM.md directly — it is regenerated on every prepare.py run. To change the base loop, edit program_templates/base.md. To change benchmark-specific guidance, edit the corresponding supplement. The three built-in supplements are:

Supplement file	Benchmark
`program_templates/terminal_bench.md`	`terminal-bench`
`program_templates/tau_bench.md`	`tau-bench`
`program_templates/bird_interact.md`	`bird-interact`

The 7-step loop

The base template defines a repeating loop. The coding agent works through these steps sequentially, then returns to step 1.

Step 1 — Run benchmark

python benchmark.py

Runs the full train split and prints per-task pass/fail results. Results are also saved to workspace/train_results.json. The coding agent reads stdout to identify which tasks failed.

Step 2 — Analyze failures

The coding agent reads train-split traces for failing tasks to determine root cause. It never reads test data — only train traces are available. Key questions to answer:

What commands did the agent run?
Did it understand the task correctly?
Did it explore the environment before acting?
Was there a prompt issue, a tool issue, or a logic issue?

Findings are appended to workspace/learnings.md.

Step 3 — Improve agent

The coding agent edits agent/agent.py. It owns the entire file. Changes are kept focused — one hypothesis per iteration, small and reversible.

The coding agent must not modify benchmark.py, gating.py, record.py, prepare.py, experiment_config.yaml, or any file in agent/templates/. The file guard in gating.py enforces this via git diff and exits with code 1 if any tracked file outside the allowlist is touched.
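The allowlist check behind the file guard can be pictured roughly as follows. This is a minimal sketch, not the code in gating.py: the function name `files_outside_allowlist` is hypothetical, and it assumes the guard is handed a list of changed paths, one per line, such as `git diff --name-only` prints. The allowlist entries are the two files named for gate Step 0.

```python
# Hypothetical sketch of an allowlist check; gating.py's real code may differ.
ALLOWLIST = {"agent/agent.py", "PROGRAM.md"}  # files named for gate Step 0


def files_outside_allowlist(diff_output: str, allowlist: set = ALLOWLIST) -> list:
    """Return changed paths (one per line of input) that are not allowlisted."""
    changed = [line.strip() for line in diff_output.splitlines() if line.strip()]
    return sorted(path for path in changed if path not in allowlist)


# Example: one permitted edit and one forbidden edit.
violations = files_outside_allowlist("agent/agent.py\nbenchmark.py\n")
print(violations)  # ['benchmark.py']
```

A non-empty result corresponds to the guard exiting with code 1.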

Step 4 — Gate

python gating.py

Runs four steps in sequence:

Step	What it checks	Failure action
Step 0 — File guard	No tracked files outside `agent/agent.py` and `PROGRAM.md` were modified	Exit 1 immediately
Step 1 — Regression suite	Tasks in `suite.json` pass at rate ≥ `threshold`	Exit 1
Step 2 — Full test	`val_score` on the test split ≥ best recorded in `results.tsv`	Exit 1
Step 3 — Suite promotion	Re-runs previously-failing train tasks, promotes newly-passing ones into `suite.json`	(runs only if Steps 1+2 pass)

Exit code 0 means all steps passed. Exit code 1 means at least one step failed — revert agent/agent.py and try a different approach:

git checkout agent/agent.py

If the same hypothesis fails three times in a row, abandon it.
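The pass/fail decisions in gate Steps 1 and 2 can be sketched as two small predicates. This is an illustration under stated assumptions, not gating.py itself: the function names are hypothetical, a task is assumed to pass only with a reward of exactly 1.0, and an empty suite is assumed to pass trivially.

```python
# Hypothetical sketch of the gate's Step 1 and Step 2 checks.

def regression_suite_passes(last_results: dict, threshold: float) -> bool:
    """Step 1: the fraction of suite tasks that pass must be >= threshold."""
    if not last_results:
        return True  # assumption: an empty suite cannot regress
    passed = sum(1 for reward in last_results.values() if reward == 1.0)
    return passed / len(last_results) >= threshold


def full_test_passes(val_score: float, best_recorded: float) -> bool:
    """Step 2: the new val_score must be at least the best recorded score."""
    return val_score >= best_recorded


def gate_exit_code(last_results: dict, threshold: float,
                   val_score: float, best_recorded: float) -> int:
    """Return 0 when both checks pass, 1 otherwise."""
    ok = (regression_suite_passes(last_results, threshold)
          and full_test_passes(val_score, best_recorded))
    return 0 if ok else 1
```

Note that both comparisons are inclusive, matching the "≥" in the gate table: a score equal to the best recorded value still passes.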

Step 5 — Record

After the gate exits 0, commit and record:

git add agent/agent.py
git commit -m "improve: <what changed and why>"
python record.py --val-score <val_score from Step 2 output> --evals-passed <n> --evals-total <m>

--val-score takes the val_score reported by gate Step 2 (the full test), not loop Step 2. --evals-passed and --evals-total are the regression suite counts from gate Step 1.

Step 6 — Update learnings

After every iteration — whether the gate passed or failed — the coding agent appends an entry to workspace/learnings.md:

## Iteration N — val_score: X.XX → Y.YY ✓/✗

**What changed:** <one sentence>

**Pattern confirmed:** <failure mode>

**What worked / didn't work:** <specifics>

**Needs from human:** <or "none">

This log is the coding agent’s persistent memory across sessions. It also surfaces requests to the human for things the coding agent cannot fix autonomously.
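An entry in this layout could be produced by a small formatter such as the one below. It is only a sketch: `format_learning_entry` and `append_learning` are hypothetical helpers, not part of the infrastructure scripts, and the two-decimal score formatting is an assumption based on the `X.XX` placeholder.

```python
from pathlib import Path

DASH = "\u2014"  # the dash character used in the entry heading


def format_learning_entry(iteration: int, old_score: float, new_score: float,
                          gate_passed: bool, what_changed: str, pattern: str,
                          outcome: str, needs: str = "none") -> str:
    """Render one learnings.md entry in the documented layout."""
    mark = "✓" if gate_passed else "✗"
    return (
        f"## Iteration {iteration} {DASH} val_score: "
        f"{old_score:.2f} → {new_score:.2f} {mark}\n\n"
        f"**What changed:** {what_changed}\n\n"
        f"**Pattern confirmed:** {pattern}\n\n"
        f"**What worked / didn't work:** {outcome}\n\n"
        f"**Needs from human:** {needs}\n"
    )


def append_learning(entry: str, path: str = "workspace/learnings.md") -> None:
    """Append the entry, separated from earlier content by a blank line."""
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write("\n" + entry)
```

Keeping the formatter separate from the file write makes the entry easy to check before it is appended.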

Step 7 — Repeat

Go to step 1. The coding agent stops when val_score has not improved for 5 consecutive iterations; it then writes a summary in workspace/learnings.md and surfaces its top findings to the human.
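The stopping rule can be read as a patience check over the history of scores. The sketch below assumes one val_score per iteration (baseline first) and treats "improved" as strictly beating the best score seen before the last five iterations; both are interpretations, and `should_stop` is a hypothetical name.

```python
def should_stop(val_scores: list, patience: int = 5) -> bool:
    """True when none of the last `patience` scores beats the earlier best."""
    if len(val_scores) <= patience:
        return False  # not enough history to judge
    best_before = max(val_scores[:-patience])
    return all(score <= best_before for score in val_scores[-patience:])


history = [0.42, 0.48, 0.48, 0.47, 0.48, 0.40, 0.45]
print(should_stop(history))  # True: the last five never exceed 0.48
```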

Rules enforced by the base template

The base template defines 7 rules that the coding agent is instructed to follow:

Only edit agent/agent.py and workspace/learnings.md. The file guard enforces this at gate time — modifying any other tracked file fails immediately.
Never skip the gate. Every committed change must pass all three gate checks: the file guard, the regression suite, and the full test (gate Steps 0 to 2).
One hypothesis per iteration. Keep changes small and reversible.
Always update learnings.md. Even on failure; the log is the agent’s memory.
Never use test data to guide changes. Only train failures inform improvements.
Per-task timeouts count as failures. Any task that does not produce a verifier result within per_task_timeout scores 0.0. Consistent timeouts are a signal to simplify the prompt, not to ignore the missing reward.
Stop when val_score has not improved for 5 consecutive iterations. Write a summary and surface top findings to the human.

File formats

The coding agent reads and writes several workspace files during the loop. These formats are fixed by the infrastructure scripts.

`workspace/suite.json`

Managed automatically by gating.py. Do not edit.

{
  "tasks": ["<task-id>", "<task-id>"],
  "threshold": 0.8,
  "last_results": {
    "<task-id>": 1.0,
    "<task-id>": 1.0
  }
}

The tasks array grows as iterations fix previously-failing train tasks and gate Steps 1 and 2 both pass, which triggers suite promotion (gate Step 3). The threshold value is set from experiment_config.yaml when prepare.py creates the file.
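Suite promotion amounts to adding newly passing train tasks to this structure. The following is a rough sketch only: `promote` is a hypothetical function, and it assumes a task counts as passing when its reward is exactly 1.0. gating.py manages the real file, so this is for understanding rather than for editing suite.json by hand.

```python
def promote(suite: dict, rerun_results: dict) -> dict:
    """Return a copy of the suite with newly passing tasks added."""
    tasks = list(suite["tasks"])
    last_results = dict(suite["last_results"])
    for task_id, reward in rerun_results.items():
        if reward == 1.0 and task_id not in tasks:
            tasks.append(task_id)
            last_results[task_id] = reward
    return {
        "tasks": tasks,
        "threshold": suite["threshold"],
        "last_results": last_results,
    }


suite = {"tasks": ["t1"], "threshold": 0.8, "last_results": {"t1": 1.0}}
print(promote(suite, {"t2": 1.0, "t3": 0.0})["tasks"])  # ['t1', 't2']
```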

`workspace/train_results.json`

Written by benchmark.py. Do not edit.

{
  "split": "train",
  "timestamp": "<ISO 8601 timestamp>",
  "results": {
    "<task-id>": 1.0,
    "<task-id>": 0.0
  }
}

Reward values are floats in [0.0, 1.0]. A null value means the task timed out and the verifier did not run; under the base-template rules, such a task counts as a failure and scores 0.0.
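As an illustration of reading this file, the sketch below lists the tasks that need attention. `failing_tasks` is a hypothetical helper, and it assumes that any reward below 1.0, or a null reward, should be treated as a failure.

```python
import json


def failing_tasks(train_results: dict) -> list:
    """Task IDs whose reward is null (timed out) or below 1.0."""
    return sorted(
        task_id
        for task_id, reward in train_results["results"].items()
        if reward is None or reward < 1.0
    )


sample = json.loads(
    '{"split": "train", "timestamp": "2025-01-01T00:00:00+00:00", '
    '"results": {"task-a": 1.0, "task-b": 0.0, "task-c": null}}'
)
print(failing_tasks(sample))  # ['task-b', 'task-c']
```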

`workspace/results.tsv`

Tab-separated. Written by record.py. The coding agent reads this to determine the best val_score seen so far.

iteration	val_score	commit	evals_passed	evals_total	timestamp
0	0.4200	baseline	0	0	2025-01-01T00:00:00+00:00
1	0.4800	abc1234	4	5	2025-01-01T01:00:00+00:00

Iteration 0 is the baseline recorded by prepare.py.
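Finding the best val_score so far is a matter of parsing the tab-separated rows. A minimal sketch using the standard csv module follows; `best_val_score` is a hypothetical helper that takes the file's text rather than a path, so it can be tried without the workspace in place.

```python
import csv
import io


def best_val_score(tsv_text: str) -> float:
    """Return the highest val_score among all recorded iterations."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return max(float(row["val_score"]) for row in reader)


sample = (
    "iteration\tval_score\tcommit\tevals_passed\tevals_total\ttimestamp\n"
    "0\t0.4200\tbaseline\t0\t0\t2025-01-01T00:00:00+00:00\n"
    "1\t0.4800\tabc1234\t4\t5\t2025-01-01T01:00:00+00:00\n"
)
print(best_val_score(sample))  # 0.48
```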

Writing a benchmark-specific supplement

If you add a custom benchmark by subclassing BenchmarkRunner, you can provide a benchmark-specific supplement that prepare.py will append to the base template. Create program_templates/<your_benchmark>.md. The file is appended verbatim after the base content. A minimal supplement should cover:

Task ID format — how to reference tasks when running python benchmark.py --task-ids ...
Trace location — where the coding agent should read failure traces
What to edit in agent/agent.py — which classes, methods, or constants are the primary optimization targets
Benchmark-specific constraints — anything the coding agent must not do for this benchmark

````markdown
---

## <Your Benchmark>: Benchmark-specific Guidance

### Task IDs

Task IDs are <format>. Run a subset with:

```bash
python benchmark.py --task-ids <id1> <id2>
```

### Analyzing Failures (Step 2)

Read train traces here:

```
workspace/traces/latest/<task_id>/trace.json
workspace/traces/latest/<task_id>/result.json
```

### Editing agent/agent.py (Step 3)

Focus changes on:

- `AGENT_INSTRUCTION`: the system prompt
- `HarnessAgent.run()`: the agent loop
````

Register the supplement in `prepare.py` by adding an entry to the `templates` dict in `copy_program_template()`.

<Info>
  The built-in supplements for Terminal-Bench, tau-bench, and BIRD-Interact live in `program_templates/` and serve as complete worked examples. Read them before writing your own.
</Info>
