Gating: Four-Step Validation for Every Agent Change

Every change the coding agent makes to agent/agent.py must pass a four-step gate before it is committed and recorded. The gate is implemented in gating.py and runs as a single command. It exits 0 only when all steps clear; any failure returns exit code 1, which signals the agent to revert and try a different approach. This design means the optimization loop can run unsupervised — no change can land unless it improves or at minimum does not regress the benchmark.

Running the gate

python gating.py

The gate reads experiment_config.yaml to determine which benchmark runners to use, then runs the steps in order: Step 0 first, then Steps 1 and 2, and Step 3 only if both of those pass.

Step 0 — File guard

Before running any benchmark, the gate checks that the agent has only touched files it is allowed to modify. This is a fast, deterministic check — no network calls, no benchmark runs. The allowlist is defined as:

ALLOWED_AGENT_FILES = frozenset({"agent/agent.py", "PROGRAM.md"})

The check uses two git commands:

git diff-index --name-only HEAD — files in the working tree that differ from HEAD
git ls-files --others --exclude-standard — new untracked files not covered by .gitignore

If any file outside ALLOWED_AGENT_FILES appears in either list, whether tracked or untracked, the gate prints the violations and returns exit code 1 immediately. No benchmark is run.

[gate] Step 0: file guard
[gate] FAILED — file guard: 1 file(s) outside the allowlist
[gate]          allowed: PROGRAM.md, agent/agent.py  (workspace/ is gitignored — edit there freely)
[gate]            - benchmark.py
[gate]          revert with `git checkout -- <file>` (tracked) or `rm <file>` (untracked) and re-run.
[gate]          bypass: set `file_guard: false` in experiment_config.yaml.

Files under workspace/ are gitignored, so they appear in neither git listing and are not checked by the file guard. The agent can edit workspace/learnings.md freely without triggering Step 0.
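The Step 0 check described above can be sketched roughly as follows. This is a minimal illustration, not the code in gating.py: the helper names git_lines, find_violations, and file_guard_violations are hypothetical, and only the allowlist and the two git commands come from the text.

```python
import subprocess

ALLOWED_AGENT_FILES = frozenset({"agent/agent.py", "PROGRAM.md"})


def git_lines(*args: str) -> list[str]:
    """Run a git command and return its non-empty output lines."""
    out = subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def find_violations(changed, untracked, allowed=ALLOWED_AGENT_FILES):
    """Return the sorted paths from either list that are not on the allowlist."""
    return sorted((set(changed) | set(untracked)) - set(allowed))


def file_guard_violations() -> list[str]:
    """Collect both git listings and filter them (hypothetical wrapper)."""
    changed = git_lines("diff-index", "--name-only", "HEAD")
    untracked = git_lines("ls-files", "--others", "--exclude-standard")
    return find_violations(changed, untracked)
```

Keeping the filtering in a pure function makes the allowlist logic easy to test without a git repository. An empty list from file_guard_violations() would correspond to Step 0 passing.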

Disabling the file guard

The file_guard_enabled() function reads experiment_config.yaml. Set file_guard: false to bypass Step 0 for non-git environments or custom setups:

file_guard: false

The function is conservative by design: a typo, empty value, or missing key all leave the guard enabled. Only explicit falsy values (false, no, off, 0, or the empty string) disable it.
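One way to read the rule above is sketched below. It is an assumption-laden illustration rather than the actual file_guard_enabled(): it takes an already-parsed config dict instead of reading experiment_config.yaml, and it assumes that a missing key or a YAML null keeps the guard on, while an explicitly written falsy literal (including a quoted empty string) turns it off.

```python
# Assumed set of string spellings that disable the guard (taken from the text).
FALSY_STRINGS = frozenset({"false", "no", "off", "0", ""})


def file_guard_enabled(config: dict) -> bool:
    """Return True unless the config explicitly disables the file guard."""
    if "file_guard" not in config:
        return True  # missing key: stay conservative
    value = config["file_guard"]
    if value is None:
        return True  # assumption: a bare `file_guard:` (YAML null) stays enabled
    if isinstance(value, bool):
        return value  # YAML parsers typically map false/no/off to a bool
    # Anything else (an int 0, quoted strings, typos) is compared as text.
    return str(value).strip().lower() not in FALSY_STRINGS
```

Under this reading a typo such as `file_guard: flase` arrives as an unrecognized string and leaves the guard enabled.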

Step 1 — Regression suite

Step 1 re-runs the subset of train tasks listed in workspace/suite.json and checks that the pass rate meets the threshold.

passed = sum(1 for tid in task_ids if (r := results.get(tid)) is not None and r >= 0.5)
pass_rate = passed / denominator  # denominator = len(task_ids)
suite_passed = pass_rate >= threshold

The default threshold is 0.8 (80%). Tasks dropped silently by the runner count as failures — the denominator is always the number of tasks in suite.json, not the number of results returned.

[gate] Step 1: eval suite (12 tasks, threshold=80%)
       10/12 passed (83%)  PASS ✓

If the suite is empty on the first iteration, Step 1 is skipped and treated as a pass.
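The fixed denominator is easiest to see with numbers. The sketch below wraps the pass-rate logic in a hypothetical suite_pass_rate helper and reproduces the 10/12 case from the sample output. Reporting a pass rate of 1.0 for an empty suite is an assumption made here to model the "skipped and treated as a pass" rule.

```python
def suite_pass_rate(task_ids, results, threshold=0.8):
    """Return (passed, pass_rate, suite_passed) for a regression suite."""
    if not task_ids:
        # Empty suite: nothing to regress against, so treat it as a pass.
        return 0, 1.0, True
    passed = sum(
        1 for tid in task_ids
        if (r := results.get(tid)) is not None and r >= 0.5
    )
    # The denominator is the suite size, so missing results count as failures.
    pass_rate = passed / len(task_ids)
    return passed, pass_rate, pass_rate >= threshold


task_ids = [f"task_{i:03d}" for i in range(12)]
results = {tid: 1.0 for tid in task_ids[:10]}  # ten tasks pass
results[task_ids[10]] = 0.0                    # one task fails outright
# task_ids[11] is dropped by the runner and never appears in results

passed, rate, ok = suite_pass_rate(task_ids, results)
print(f"{passed}/{len(task_ids)} passed ({rate:.0%})", "PASS" if ok else "FAIL")
# prints: 10/12 passed (83%) PASS
```

Had the denominator been len(results), the dropped task would have raised the rate to 10/11 instead of counting against it.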

Steps 1 and 2 both run regardless of each other’s outcome. A failure in Step 1 does not short-circuit Step 2; both results are logged before the gate returns exit code 1.

Step 2 — Full test benchmark

Step 2 always runs the full benchmark on the test split. val_score is the mean reward across all test tasks, where None counts as 0.0.

val = gate_runner.val_score(all_results)
best = best_val_score()  # reads workspace/results.tsv

test_passed = best is None or val >= best

best_val_score() scans workspace/results.tsv and returns the highest val_score recorded so far. On the first iteration, best is None and Step 2 always passes.

[gate] Step 2: full benchmark (test split)
       val_score=0.7823  PASS ✓  (prev best: 0.7701)

The test split is run inside Step 2 but its traces are never written to disk. This is the structural anti-cheating guarantee: the coding agent cannot read test failures and overfit to them.

Step 3 — Suite promotion

Step 3 only runs if both Step 1 and Step 2 passed. It identifies train tasks that were previously failing and are not yet in suite.json, re-runs them, and promotes any that now pass.

failing_non_suite = [tid for tid, r in train_results.items()
                     if (r is None or r < 0.5) and tid not in suite_set]
# ...
newly_fixed = sorted(tid for tid, r in recheck.items() if r is not None and r >= 0.5)
if newly_fixed:
    suite["tasks"] = sorted(suite_set | set(newly_fixed))
    save_suite(suite)

A task requires a real verifier pass (r >= 0.5, not None) to be promoted. Once promoted, a task stays in the suite permanently and must continue to pass in every future Step 1 check.

[gate] Step 3: suite promotion
       re-running 7 previously-failing train tasks
       promoted 2 task(s) into regression suite: ['task_042', 'task_071']
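The promotion logic can be exercised end to end with a stand-in for the benchmark runner. In the sketch below, promote_fixed_tasks and the rerun callable are hypothetical names; the real gate re-runs the tasks through its configured runner and persists the suite with save_suite.

```python
def promote_fixed_tasks(train_results, suite_tasks, rerun):
    """Return (updated_suite, newly_fixed) after rechecking failing tasks."""
    suite_set = set(suite_tasks)
    # Candidates: train tasks that failed (or returned None) and are not in the suite.
    failing_non_suite = [
        tid for tid, r in train_results.items()
        if (r is None or r < 0.5) and tid not in suite_set
    ]
    # `rerun` maps a list of task ids to {task_id: reward or None}.
    recheck = rerun(failing_non_suite) if failing_non_suite else {}
    # Only a real verifier pass promotes a task; None never does.
    newly_fixed = sorted(
        tid for tid, r in recheck.items() if r is not None and r >= 0.5
    )
    return sorted(suite_set | set(newly_fixed)), newly_fixed


train_results = {"task_001": 1.0, "task_042": 0.0, "task_071": None, "task_090": 0.2}
suite, promoted = promote_fixed_tasks(
    train_results,
    suite_tasks=["task_001"],
    rerun=lambda tids: {"task_042": 1.0, "task_071": 0.5, "task_090": None},
)
print(promoted)  # ['task_042', 'task_071']
```

Here task_090 stays out of the suite because its recheck returned None, which is not treated as a pass.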

Exit codes

| Exit code | Meaning | Agent action |
| --- | --- | --- |
| `0` | All steps passed | Commit `agent/agent.py`, run `record.py` |
| `1` | One or more steps failed | Revert with `git checkout agent/agent.py`, try a different approach |
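How the steps combine into a single exit code can be sketched as below. This is an assumed outline of the control flow described in the preceding sections, not the code in gating.py: run_gate and its callable parameters are hypothetical, and Step 3 is assumed not to affect the exit code once Steps 1 and 2 have passed.

```python
def run_gate(guard_ok, run_suite, run_benchmark, promote):
    """Return the gate's exit code given the outcome of each step."""
    if not guard_ok:
        return 1  # Step 0 failed: exit immediately, no benchmark is run
    suite_passed = run_suite()     # Step 1
    test_passed = run_benchmark()  # Step 2 runs even if Step 1 failed
    if not (suite_passed and test_passed):
        return 1
    promote()                      # Step 3 runs only after both pass
    return 0


calls = []


def step(name, outcome):
    """Build a stub step that records its call and returns a fixed outcome."""
    def run():
        calls.append(name)
        return outcome
    return run


code = run_gate(True, step("suite", False), step("benchmark", True), step("promote", None))
print(code, calls)  # 1 ['suite', 'benchmark']
```

The stubbed run shows the two properties the text relies on: a Step 1 failure does not stop Step 2, and promotion is skipped whenever either step fails.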

Full gate output example

[gate] Step 1: eval suite (12 tasks, threshold=80%)
       10/12 passed (83%)  PASS ✓

[gate] Step 2: full benchmark (test split)
       val_score=0.7823  PASS ✓  (prev best: 0.7701)

[gate] Step 3: suite promotion
       re-running 7 previously-failing train tasks
       promoted 2 task(s) into regression suite: ['task_042', 'task_071']

[gate] PASSED ✓  All steps clear. (val_score=0.7823)

File guard in record.py

The file guard also runs inside record.py with check_last_commit=True. This additional check inspects the diff between HEAD and HEAD~1, catching cases where an agent commits forbidden files before invoking record.py. If violations are found, record.py prints a failure message prefixed with [record] and exits with code 1 without writing to results.tsv.
