Eval Suite: self-maintained regression test collection

The eval suite is a self-growing collection of benchmark tasks that the coding agent must continue to pass on every future iteration. It lives in workspace/suite.json and is managed entirely by gating.py — no manual curation is needed. When the agent fixes a previously-failing task and the change passes both the regression check and the full test gate, that task is automatically promoted into the suite. From that point on, any change that causes the task to fail again is rejected.

suite.json format

{
  "tasks": ["<task-id>", "<task-id>"],
  "threshold": 0.8,
  "last_results": {
    "<task-id>": 1.0,
    "<task-id>": 1.0
  }
}

Field	Type	Description
`tasks`	`string[]`	Task IDs the agent must pass on every gate run
`threshold`	`float`	Minimum pass rate required (default `0.8`, i.e. 80%)
`last_results`	`object`	Reward values from the most recent Step 1 run, keyed by task ID

gating.py writes last_results after every Step 1 run so the gate output is queryable after the fact.
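As a rough illustration of how last_results could be refreshed, the sketch below loads a file in the format above, overwrites last_results, and writes it back. The helper name record_last_results and the choice to store rewards only for tasks currently in the suite are assumptions made for this example; gating.py may do this differently.

```python
import json
import tempfile
from pathlib import Path


def record_last_results(suite_path: Path, results: dict) -> dict:
    """Hypothetical helper: store the latest Step 1 rewards in suite.json."""
    suite = json.loads(suite_path.read_text())
    # Assumption: only rewards for tasks already in the suite are kept.
    suite["last_results"] = {tid: results.get(tid) for tid in suite["tasks"]}
    suite_path.write_text(json.dumps(suite, indent=2) + "\n")
    return suite


# Demo against a throwaway file with made-up task IDs and rewards.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "suite.json"
    demo_path.write_text(json.dumps(
        {"tasks": ["task_001", "task_002"], "threshold": 0.8, "last_results": {}}
    ))
    record_last_results(demo_path, {"task_001": 1.0, "task_002": 0.0, "task_999": 1.0})
    reloaded = json.loads(demo_path.read_text())

print(reloaded["last_results"])
```

Because the rewards are persisted in the same file as the task list, a later script can inspect the most recent Step 1 outcome without re-running anything.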

How tasks are promoted

Promotion happens at the end of a successful gate run, in Step 3. The gate identifies tasks from workspace/train_results.json that were previously failing (reward < 0.5 or None) and are not yet in suite.json. It re-runs those tasks on the train split:

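# Candidates: train tasks that failed (reward below 0.5 or missing) and are not yet in the suite.
failing_non_suite = [tid for tid, r in train_results.items()
                     if (r is None or r < 0.5) and tid not in suite_set]
# Re-run only those candidates on the train split.
recheck = train_runner.run(task_ids=failing_non_suite)
# Keep tasks that now return a real verifier reward of at least 0.5; None never qualifies.
newly_fixed = sorted(tid for tid, r in recheck.items() if r is not None and r >= 0.5)
if newly_fixed:
    # Union with the existing suite, so promotion only ever adds tasks.
    suite["tasks"] = sorted(suite_set | set(newly_fixed))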

A task is only promoted if it returns a real verifier result of >= 0.5. A None reward — which indicates a timeout or a run that produced no verifier output — cannot be promoted. This requirement prevents tasks from entering the suite based on a failed run that happened to produce no result.

Suite promotion only runs if both Step 1 (regression suite) and Step 2 (full test) passed. If either gate step fails, the suite is not updated and the iteration is reverted.
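The promotion rules above (both gate steps must pass, None is never promotable, the suite only grows) can be combined into one self-contained sketch. The function promote, the rerun callback, and the demo task IDs are hypothetical stand-ins; in particular, rerun replaces train_runner.run so the example runs without the real benchmark.

```python
from typing import Callable, Dict, List, Optional

Rewards = Dict[str, Optional[float]]


def promote(suite: dict, train_results: Rewards,
            rerun: Callable[[List[str]], Rewards],
            step1_passed: bool, step2_passed: bool) -> List[str]:
    """Return the newly promoted task IDs, updating suite['tasks'] in place."""
    # Promotion is skipped entirely unless both gate steps passed.
    if not (step1_passed and step2_passed):
        return []
    suite_set = set(suite["tasks"])
    candidates = [tid for tid, r in train_results.items()
                  if (r is None or r < 0.5) and tid not in suite_set]
    if not candidates:
        return []
    recheck = rerun(candidates)
    # A None reward (timeout or no verifier output) never counts as fixed.
    newly_fixed = sorted(tid for tid, r in recheck.items()
                         if r is not None and r >= 0.5)
    if newly_fixed:
        suite["tasks"] = sorted(suite_set | set(newly_fixed))
    return newly_fixed


def fake_rerun(task_ids: List[str]) -> Rewards:
    # Made-up outcomes: one fix, one timeout, one task still failing.
    outcomes = {"task_b": 1.0, "task_c": None, "task_d": 0.2}
    return {tid: outcomes[tid] for tid in task_ids}


demo_suite = {"tasks": ["task_a"], "threshold": 0.8, "last_results": {}}
demo_train = {"task_a": 1.0, "task_b": 0.0, "task_c": None, "task_d": 0.4}

blocked = promote(demo_suite, demo_train, fake_rerun, True, False)
tasks_after_blocked = list(demo_suite["tasks"])
promoted = promote(demo_suite, demo_train, fake_rerun, True, True)
print(blocked, promoted, demo_suite["tasks"])
```

In this demo only task_b enters the suite: task_c returns None on the recheck and task_d stays below 0.5.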

How the suite grows over iterations

At the start of an experiment, suite.json contains an empty tasks list. As the agent improves agent/agent.py across iterations and more train tasks start passing, the suite accumulates them:

Iteration	New tasks promoted	Total suite size
0 (baseline)	0	0
1	3	3
2	2	5
3	0	5
4	5	10

The suite never shrinks. Once a task is promoted, it stays in the suite permanently. This means the regression check in Step 1 becomes progressively stricter as the agent improves, anchoring all future changes against the accumulated wins.
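The totals in the table are a running sum of the per-iteration promotions, and the never-shrinks rule amounts to a superset check between consecutive versions of the tasks list. The short sketch below reproduces both; is_valid_update is a hypothetical helper, not part of gating.py.

```python
from itertools import accumulate

# Per-iteration promotions taken from the table above (iterations 0 through 4).
promoted_per_iteration = [0, 3, 2, 0, 5]
suite_sizes = list(accumulate(promoted_per_iteration))


def is_valid_update(old_tasks, new_tasks) -> bool:
    """True if every previously promoted task is still present."""
    return set(old_tasks) <= set(new_tasks)


print(suite_sizes)
print(is_valid_update(["task_001"], ["task_001", "task_002"]))
print(is_valid_update(["task_001", "task_002"], ["task_002"]))
```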

The threshold parameter

The threshold field controls what fraction of suite tasks must pass in Step 1 for the gate to proceed to recording. It defaults to 0.8, meaning 80% of suite tasks must pass. The threshold is stored inside suite.json so it can be adjusted per experiment. To raise the bar to 90%:

{
  "tasks": ["task_001", "task_002"],
  "threshold": 0.9,
  "last_results": {}
}
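To make the effect of the threshold concrete, here is a minimal sketch of a Step 1 pass-rate check. It assumes a suite task counts as passing when its reward is a real value of at least 0.5 (borrowed from the promotion rule) and that an empty suite passes trivially; neither detail is confirmed here, and step1_passes is a hypothetical name.

```python
def step1_passes(suite: dict, results: dict, pass_reward: float = 0.5) -> bool:
    """Compare the suite pass rate with the threshold stored in the suite."""
    tasks = suite["tasks"]
    if not tasks:
        # Assumption: with nothing to regress against, the check passes.
        return True
    passed = sum(
        1 for tid in tasks
        if results.get(tid) is not None and results[tid] >= pass_reward
    )
    return passed / len(tasks) >= suite.get("threshold", 0.8)


# Ten made-up suite tasks, eight of which pass.
task_ids = [f"task_{i:03d}" for i in range(1, 11)]
rewards = {tid: (1.0 if i < 8 else 0.0) for i, tid in enumerate(task_ids)}

passes_at_80 = step1_passes({"tasks": task_ids, "threshold": 0.8}, rewards)
passes_at_90 = step1_passes({"tasks": task_ids, "threshold": 0.9}, rewards)
print(passes_at_80, passes_at_90)
```

With eight of ten tasks passing, the same results clear the default 0.8 threshold but fail once it is raised to 0.9.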

suite.json is read-only from the coding agent’s perspective. gating.py is the only writer. If the agent modifies this file directly, the file guard in Step 0 will reject the change because suite.json is a tracked file outside ALLOWED_AGENT_FILES.
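A file guard of this kind can be pictured as an allowlist check over the paths an iteration touched. The sketch below is only an illustration: the contents given to ALLOWED_AGENT_FILES, the function guard_violations, and the way changed paths are obtained are all assumptions, not the actual Step 0 implementation.

```python
# Assumed allowlist; the real ALLOWED_AGENT_FILES may contain other paths.
ALLOWED_AGENT_FILES = {"agent/agent.py"}


def guard_violations(changed_files):
    """Return the changed paths that fall outside the allowlist."""
    return sorted(f for f in changed_files if f not in ALLOWED_AGENT_FILES)


clean = guard_violations(["agent/agent.py"])
tampered = guard_violations(["agent/agent.py", "workspace/suite.json"])
print(clean, tampered)
```

An empty list means the change may proceed; any entry, such as workspace/suite.json here, would cause the change to be rejected.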

Regression suite vs full benchmark

The eval suite and the full benchmark serve different purposes and run against different splits:

Regression suite (Step 1 — train split)

Runs only the tasks in suite.json against the train split. Fast, targeted, and designed to catch regressions in tasks the agent has already fixed. Pass rate must meet the threshold. The coding agent can inspect train traces for these tasks during failure analysis.

Full benchmark (Step 2 — test split)

Runs all tasks in the benchmark against the test split. Measures generalization: does the agent’s change improve the overall score, or does it just overfit to the train tasks already in the suite? val_score must be greater than or equal to the best score recorded in results.tsv. Test traces are never saved to disk.

The two checks are complementary. The regression suite protects previously-fixed tasks from breaking. The full test gate ensures that no committed change lowers the best recorded score on unseen data, since val_score must at least match it.
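The Step 2 comparison can be sketched as reading the best recorded score and checking the new val_score against it. The layout of results.tsv is not specified here, so the val_score column name, the sample rows, and the rule that an empty history passes are assumptions made for illustration.

```python
import csv
import io


def best_recorded_score(tsv_text: str, column: str = "val_score"):
    """Return the highest score in the given column, or None if there is none."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    scores = [float(row[column]) for row in reader if row.get(column)]
    return max(scores) if scores else None


def step2_passes(val_score: float, best) -> bool:
    # Assumption: with no recorded score yet, any result is accepted.
    return best is None or val_score >= best


# Made-up history in an assumed two-column layout.
sample_tsv = "iteration\tval_score\n0\t0.40\n1\t0.45\n"
best = best_recorded_score(sample_tsv)

print(best, step2_passes(0.45, best), step2_passes(0.44, best))
```

Matching the best score is enough to pass, which is why a change that leaves val_score unchanged can still be accepted.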

Why the agent manages the suite

The eval suite is self-maintained by design. Membership in suite.json is determined by the coding agent's own results rather than by a human operator, because the agent is the one discovering which tasks it can reliably pass; gating.py performs the actual writes. Human curation would require the operator to run the benchmark, inspect results, and manually select tasks after each iteration, which defeats the purpose of an overnight autonomous loop. The structural guarantee that prevents the agent from gaming this is Step 2: even if the agent somehow passed the regression suite with a narrow or cherry-picked set of tasks, the full test-split val_score still has to meet or beat the best score on record.
