The eval suite is a self-growing collection of benchmark tasks that the coding agent must continue to pass on every future iteration. It lives in workspace/suite.json and is managed entirely by gating.py, so no manual curation is needed. When the agent fixes a previously failing task and the change passes both the regression check and the full test gate, that task is automatically promoted into the suite. From that point on, any change that causes the task to fail again is rejected.
suite.json format
| Field | Type | Description |
|---|---|---|
| tasks | string[] | Task IDs the agent must pass on every gate run |
| threshold | float | Minimum pass rate required (default 0.8, i.e. 80%) |
| last_results | object | Reward values from the most recent Step 1 run, keyed by task ID |
gating.py writes last_results after every Step 1 run so the gate output is queryable after the fact.
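To make the three fields concrete, here is a minimal Python sketch of what a suite.json payload could look like, with a small shape check. The task IDs and reward values are made up, and `validate_suite` is a hypothetical helper for illustration, not part of gating.py.

```python
import json

# Hypothetical contents: the task IDs and rewards below are illustrative only.
suite = {
    "tasks": ["task-001", "task-002"],
    "threshold": 0.8,
    "last_results": {"task-001": 1.0, "task-002": 0.0},
}


def validate_suite(data: dict) -> None:
    """Raise ValueError if data does not match the three documented fields."""
    tasks = data.get("tasks")
    if not isinstance(tasks, list) or not all(isinstance(t, str) for t in tasks):
        raise ValueError("tasks must be a list of task ID strings")

    # The documented default is 0.8 when the field is absent.
    threshold = data.get("threshold", 0.8)
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        raise ValueError("threshold must be a number")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be a fraction between 0 and 1")

    if not isinstance(data.get("last_results", {}), dict):
        raise ValueError("last_results must be an object keyed by task ID")


# Round-trip through JSON text, as the file on disk would be read back.
serialized = json.dumps(suite, indent=2)
validate_suite(json.loads(serialized))
```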
How tasks are promoted
Promotion happens at the end of a successful gate run, in Step 3. The gate identifies tasks from workspace/train_results.json that were previously failing (reward < 0.5 or None) and are not yet in suite.json. It re-runs those tasks on the train split and promotes a task only if its re-run reward is >= 0.5. A None reward, which indicates a timeout or a run that produced no verifier output, cannot be promoted. This requirement prevents tasks from entering the suite based on a failed run that happened to produce no result.
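The promotion rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual gating.py code: it assumes train_results.json maps task IDs to rewards, and `promote` and the `rerun` callback are hypothetical names standing in for whatever re-executes a task on the train split.

```python
from typing import Callable, Dict, List, Optional

PASS_REWARD = 0.5  # from the text: a reward >= 0.5 counts as passing


def is_failing(reward: Optional[float]) -> bool:
    """A task is failing if it has no reward (None) or a reward below 0.5."""
    return reward is None or reward < PASS_REWARD


def promote(
    suite_tasks: List[str],
    train_results: Dict[str, Optional[float]],
    rerun: Callable[[str], Optional[float]],
) -> List[str]:
    """Return the suite extended with previously failing tasks that now pass."""
    promoted = []
    for task_id, old_reward in train_results.items():
        # Only previously failing tasks that are not yet in the suite qualify.
        if task_id in suite_tasks or not is_failing(old_reward):
            continue
        new_reward = rerun(task_id)
        # A None re-run reward (timeout, no verifier output) is never promoted.
        if new_reward is not None and new_reward >= PASS_REWARD:
            promoted.append(task_id)
    return suite_tasks + promoted


# Illustrative data: "c" is already in the suite, "b" times out on the re-run.
previous = {"a": 0.0, "b": None, "c": 1.0, "d": 0.2}
rerun_rewards = {"a": 1.0, "b": None, "d": 0.5}
new_suite = promote(["c"], previous, rerun_rewards.get)
print(new_suite)  # ['c', 'a', 'd']
```

Note that "d" is promoted with a re-run reward of exactly 0.5, because the stated cutoff is inclusive.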
How the suite grows over iterations
At the start of an experiment, suite.json contains an empty tasks list. As the agent improves agent/agent.py across iterations and more train tasks start passing, the suite accumulates them:
| Iteration | New tasks promoted | Total suite size |
|---|---|---|
| 0 (baseline) | 0 | 0 |
| 1 | 3 | 3 |
| 2 | 2 | 5 |
| 3 | 0 | 5 |
| 4 | 5 | 10 |
The threshold parameter
The threshold field controls what fraction of suite tasks must pass in Step 1 for the gate to proceed to recording. It defaults to 0.8, meaning 80% of suite tasks must pass.
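As a rough illustration of the arithmetic, the sketch below computes a suite pass rate and compares it to the threshold. It assumes a task counts as passing in Step 1 when its reward is >= 0.5 (the cutoff used for promotion) and that an empty suite passes vacuously; both are assumptions, and `suite_pass_rate` and `step1_passes` are hypothetical names.

```python
from typing import Dict, List, Optional

PASS_REWARD = 0.5  # assumption: same cutoff the text uses for promotion


def suite_pass_rate(suite_tasks: List[str], rewards: Dict[str, Optional[float]]) -> float:
    """Fraction of suite tasks whose reward is present and at least 0.5."""
    if not suite_tasks:
        return 1.0  # assumption: an empty suite cannot block the gate
    passed = 0
    for task_id in suite_tasks:
        reward = rewards.get(task_id)
        if reward is not None and reward >= PASS_REWARD:
            passed += 1
    return passed / len(suite_tasks)


def step1_passes(
    suite_tasks: List[str],
    rewards: Dict[str, Optional[float]],
    threshold: float = 0.8,
) -> bool:
    """Step 1 succeeds when the pass rate meets the threshold."""
    return suite_pass_rate(suite_tasks, rewards) >= threshold


# Illustrative run: 4 of 5 suite tasks pass, so the pass rate is 0.8.
rewards = {"t1": 1.0, "t2": 1.0, "t3": 0.0, "t4": 1.0, "t5": 1.0}
tasks = list(rewards)
print(step1_passes(tasks, rewards))       # True: 0.8 meets the default 0.8
print(step1_passes(tasks, rewards, 0.9))  # False: 0.8 is below 0.9
```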
The threshold is stored inside suite.json so it can be adjusted per experiment. To raise the bar to 90%, set threshold to 0.9.
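Editing the file by hand works, but the change can also be scripted. The following is a minimal sketch, assuming suite.json is plain JSON with the fields described earlier; `set_threshold` is a hypothetical helper, and the demo writes to a temporary directory rather than the real workspace.

```python
import json
import tempfile
from pathlib import Path


def set_threshold(suite_path: Path, threshold: float) -> dict:
    """Rewrite the threshold field in a suite file, leaving other fields as-is."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be a fraction between 0 and 1")
    data = json.loads(suite_path.read_text())
    data["threshold"] = threshold
    suite_path.write_text(json.dumps(data, indent=2) + "\n")
    return data


# Demo against a throwaway file so the real suite is not touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "suite.json"
    demo_path.write_text(json.dumps({"tasks": [], "threshold": 0.8, "last_results": {}}))
    updated = set_threshold(demo_path, 0.9)
    reloaded = json.loads(demo_path.read_text())

# In a real experiment the call would be:
# set_threshold(Path("workspace/suite.json"), 0.9)
```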
Regression suite vs full benchmark
The eval suite and the full benchmark serve different purposes and run against different splits:
Regression suite (Step 1 — train split)
Runs only the tasks in suite.json against the train split. Fast, targeted, and designed to catch regressions in tasks the agent has already fixed. Pass rate must meet the threshold. The coding agent can inspect train traces for these tasks during failure analysis.
Full benchmark (Step 2 — test split)
Runs all tasks in the benchmark against the test split. Measures generalization: does the agent’s change improve the overall score, or does it just overfit to the train tasks already in the suite?
val_score must be greater than or equal to the best score recorded in results.tsv. Test traces are never saved to disk.
Why the agent manages the suite
The eval suite is self-maintained by design. The coding agent, not the human operator, decides which tasks belong in suite.json, because the agent is the one discovering which tasks it can reliably pass. Human curation would require the operator to run the benchmark, inspect results, and manually select tasks after each iteration, which defeats the purpose of an overnight autonomous loop.
The structural guarantee that prevents the agent from gaming this is Step 2: even if the agent somehow passed the regression suite with a narrow or cherry-picked set of tasks, the full test-split val_score still has to meet or beat the best score on record.
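The Step 2 guard described above can be sketched as a pure function. This is an illustration, not the gating.py implementation: it takes the previously recorded scores as a plain list rather than parsing results.tsv (whose column layout is not described here), and it assumes the first run passes when no score is on record. The names `step2_passes` and `gate_accepts` and the scores are hypothetical.

```python
from typing import Sequence


def step2_passes(val_score: float, recorded_scores: Sequence[float]) -> bool:
    """Accept only if val_score meets or beats the best score on record."""
    if not recorded_scores:
        return True  # assumption: a baseline run has nothing to beat
    return val_score >= max(recorded_scores)


def gate_accepts(step1_ok: bool, val_score: float, recorded_scores: Sequence[float]) -> bool:
    """A change is kept only if both the regression suite and Step 2 pass."""
    return step1_ok and step2_passes(val_score, recorded_scores)


# Illustrative history of test-split scores from earlier iterations.
recorded = [0.41, 0.47, 0.45]

print(gate_accepts(True, 0.46, recorded))   # False: suite passed, but 0.46 < 0.47
print(gate_accepts(True, 0.47, recorded))   # True: ties with the best score
print(gate_accepts(False, 0.60, recorded))  # False: regression suite failed
```

The first call shows the point of the guarantee: passing the regression suite alone is not enough when the full test-split score falls below the best recorded value.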