gating.py API: run_gate, file_guard, suite functions

gating.py implements the multi-step quality gate that must pass before each agent iteration is recorded as a success. It enforces a file-edit guard, re-runs an eval suite on the training split, runs the full benchmark on the test split, and promotes newly-passing tasks into the suite. The gate is typically invoked automatically by the coding agent, but can also be run directly from the command line.

Gate pipeline overview

When run_gate is called, it executes up to four steps in order:

Step	Name	What it checks
0	File guard	No tracked files outside `ALLOWED_AGENT_FILES` were modified
1	Eval suite	Re-runs `workspace/suite.json` tasks; pass rate ≥ threshold
2	Full benchmark	Runs the test split; `val_score` ≥ best value in `results.tsv`
3	Suite promotion	Newly-passing train tasks are added to `workspace/suite.json`

run_gate returns exit code 0 only if every step passes. Any failure returns 1.
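The ordering and exit-code rule above can be sketched as a short loop. This is a minimal illustration rather than the actual gating.py code: `run_steps` and the stand-in checks are hypothetical names, and stopping at the first failing step is an assumption (the source only states that any failure returns 1).

```python
from typing import Callable, Sequence

# Hypothetical sketch: each step is a (name, check) pair, where check()
# returns True on success. Stopping at the first failure is an assumption.
Step = tuple[str, Callable[[], bool]]


def run_steps(steps: Sequence[Step]) -> int:
    """Run steps in order and return a Unix-style exit code."""
    for index, (name, check) in enumerate(steps):
        if not check():
            print(f"Step {index} ({name}) failed")
            return 1
    return 0


# Stand-in checks; a real gate would call the file guard, the runners, etc.
example_steps: list[Step] = [
    ("File guard", lambda: True),
    ("Eval suite", lambda: True),
    ("Full benchmark", lambda: False),
    ("Suite promotion", lambda: True),
]
exit_code = run_steps(example_steps)  # Step 2 fails here, so the result is 1
```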

Constants

ALLOWED_AGENT_FILES = frozenset({"agent/agent.py", "PROGRAM.md"})

SUITE_FILE         = "workspace/suite.json"
RESULTS_FILE       = "workspace/results.tsv"
TRAIN_RESULTS_FILE = "workspace/train_results.json"
CONFIG_FILE        = "experiment_config.yaml"

Everything under workspace/ is gitignored and therefore invisible to git. The file guard only inspects files that git can see, so edits to workspace/learnings.md and similar files are always permitted.

`run_gate`

def run_gate(train_runner: BenchmarkRunner, gate_runner: BenchmarkRunner) -> int

Execute all gate steps using the provided runners and return a Unix exit code.

train_runner (BenchmarkRunner, required)
Runner used for the eval suite (Step 1) and suite promotion (Step 3). Should be configured for the training split.

gate_runner (BenchmarkRunner, required)
Runner used for the full benchmark (Step 2). Should be configured for the test split.

Returns: 0 on success (all steps passed), 1 on any failure.

Example

from benchmark import TauBenchRunner
from gating import run_gate

train_runner = TauBenchRunner(domain="retail", split="train")
gate_runner  = TauBenchRunner(domain="retail", split="test")

exit_code = run_gate(train_runner, gate_runner)
# 0 → all steps passed, 1 → gate failed

`file_guard_violations`

def file_guard_violations(*, check_last_commit: bool = False) -> list[str]

Return a sorted list of paths, visible to git, that the agent has touched outside ALLOWED_AGENT_FILES. It always inspects the output of two commands:

git diff-index --name-only HEAD — files in the working tree that differ from HEAD.
git ls-files --others --exclude-standard — untracked files not covered by .gitignore.

With check_last_commit=True, it also inspects the diff of HEAD vs HEAD~1. record.py uses this to catch agents that commit forbidden files before invoking record.py.

check_last_commit (bool, default: False)
When True, additionally checks whether the most recent commit (HEAD vs HEAD~1) touched any files outside the allowlist. Silently skipped when there is no parent commit.

Returns: Sorted list of violating paths. Returns [] if there are no violations, or if git is unavailable (a one-time warning is printed to stderr in that case).

from gating import file_guard_violations

violations = file_guard_violations()
if violations:
    print("Files modified outside allowlist:", violations)

# Also check last commit (used by record.py)
violations = file_guard_violations(check_last_commit=True)
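
How the default check might be assembled is sketched below, using only the two git commands named above. This is not the gating.py implementation: `filter_violations`, `git_lines`, and `sketch_violations` are hypothetical helpers, and the sketch omits the one-time stderr warning and the check_last_commit branch.

```python
import subprocess

ALLOWED_AGENT_FILES = frozenset({"agent/agent.py", "PROGRAM.md"})


def filter_violations(paths, allowed=ALLOWED_AGENT_FILES):
    """Return the sorted, de-duplicated paths that are not on the allowlist."""
    return sorted({path for path in paths if path and path not in allowed})


def git_lines(*args):
    """Run a git command and return its non-empty output lines ([] on failure)."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # git missing or not a repository: treated as "no paths" in this sketch
        return []
    return [line for line in result.stdout.splitlines() if line]


def sketch_violations():
    """Hypothetical stand-in for file_guard_violations() with default arguments."""
    changed = git_lines("diff-index", "--name-only", "HEAD")
    untracked = git_lines("ls-files", "--others", "--exclude-standard")
    return filter_violations(changed + untracked)
```

Keeping the allowlist filter separate from the git calls makes the filtering logic easy to test without a repository.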

`file_guard_enabled`

def file_guard_enabled() -> bool

Report whether the file guard is active. The guard is on by default and is disabled only by an explicit opt-out in experiment_config.yaml:

file_guard: false

Accepted falsy values (case-insensitive): false, no, off, 0, and the empty string "". Any other value, including a missing key, null, or an unrecognized string, leaves the guard on. This conservative default means a typo cannot silently disable a safety check.

Returns: True if the file guard is active, False if explicitly disabled.

Disabling the file guard allows agent iterations to modify arbitrary tracked files. Only disable it if you are certain the agent should have unrestricted write access to the repository.
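
The opt-out rule can be sketched as a small predicate over an already-parsed config dict. `guard_enabled_from` is a hypothetical helper, not part of gating.py, and comparing the stringified value against the listed spellings is one assumed way to get the documented behavior.

```python
# Falsy spellings listed above; the comparison is case-insensitive.
FALSY_VALUES = frozenset({"false", "no", "off", "0", ""})


def guard_enabled_from(config: dict) -> bool:
    """Hypothetical helper: decide whether the guard is on for a parsed config."""
    value = config.get("file_guard")
    if value is None:
        # A missing key or an explicit null keeps the guard on.
        return True
    # Booleans and numbers are stringified, so False -> "false" and 0 -> "0".
    return str(value).lower() not in FALSY_VALUES
```

With this rule a misspelling such as `flase` is treated as unrecognized, so the guard stays enabled.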

`load_suite`

def load_suite() -> dict

Load the eval suite from workspace/suite.json. Returns: A dict with the structure below. If the file does not exist, a default empty suite is returned.

{
  "tasks": [],
  "threshold": 0.8,
  "last_results": {}
}

tasks (list[str])
Task IDs currently in the regression suite.

threshold (float)
Minimum pass rate required to pass Step 1. Default 0.8 (80%).

last_results (dict[str, float | None])
Per-task rewards from the most recent Step 1 run. A missing reward is None in Python (null in the JSON file).
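
To illustrate how these fields relate to the Step 1 check, here is a hedged sketch of a pass-rate computation. The source does not say how a reward maps to a pass, so the sketch makes three assumptions: a reward of at least 1.0 counts as a pass, a None or missing reward counts as a failure, and an empty suite passes trivially. `suite_pass_rate` and `step1_passes` are hypothetical names.

```python
def suite_pass_rate(suite: dict, pass_reward: float = 1.0) -> float:
    """Fraction of suite tasks whose last reward reached pass_reward (assumed rule)."""
    tasks = suite["tasks"]
    if not tasks:
        return 1.0  # assumption: nothing to regress on, so the check passes
    results = suite["last_results"]
    passed = sum(
        1
        for task in tasks
        if results.get(task) is not None and results[task] >= pass_reward
    )
    return passed / len(tasks)


def step1_passes(suite: dict) -> bool:
    """Compare the pass rate against the suite's own threshold."""
    return suite_pass_rate(suite) >= suite["threshold"]


example_suite = {
    "tasks": ["t1", "t2", "t3", "t4", "t5"],
    "threshold": 0.8,
    "last_results": {"t1": 1.0, "t2": 1.0, "t3": 1.0, "t4": 1.0, "t5": None},
}
rate = suite_pass_rate(example_suite)  # 4 of 5 tasks pass
```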

`save_suite`

def save_suite(suite: dict) -> None

Write the eval suite back to workspace/suite.json.

suite (dict, required)
The suite dict, as returned by load_suite (and typically modified in-place by run_gate).
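
The load, modify, and save cycle can be sketched as follows. The helpers take an explicit path so the example runs in a temporary directory. All names here (`load_suite_from`, `save_suite_to`, `promote`) are hypothetical, and the shape of the train results (a task ID to reward mapping) and the pass rule (reward of at least 1.0) are assumptions, since the source does not specify them.

```python
import copy
import json
import tempfile
from pathlib import Path

DEFAULT_SUITE = {"tasks": [], "threshold": 0.8, "last_results": {}}


def load_suite_from(path: Path) -> dict:
    """Read the suite JSON, or return a fresh default suite if the file is missing."""
    if not path.exists():
        return copy.deepcopy(DEFAULT_SUITE)
    return json.loads(path.read_text())


def save_suite_to(suite: dict, path: Path) -> None:
    """Write the suite as JSON, creating the parent directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(suite, indent=2) + "\n")


def promote(suite: dict, train_rewards: dict, pass_reward: float = 1.0) -> list:
    """Add newly-passing task IDs to suite["tasks"] in place; return the additions."""
    known = set(suite["tasks"])
    added = sorted(
        task
        for task, reward in train_rewards.items()
        if task not in known and reward is not None and reward >= pass_reward
    )
    suite["tasks"].extend(added)
    return added


with tempfile.TemporaryDirectory() as tmp:
    suite_path = Path(tmp) / "workspace" / "suite.json"
    suite = load_suite_from(suite_path)  # file is missing, so defaults apply
    added = promote(suite, {"task_2": 1.0, "task_1": 1.0, "task_3": 0.0})
    save_suite_to(suite, suite_path)
    reloaded = load_suite_from(suite_path)
```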

`best_val_score`

def best_val_score() -> float | None

Return the highest val_score recorded in workspace/results.tsv. Returns: The maximum val_score as a float, or None if results.tsv does not exist or contains no data rows.

from gating import best_val_score

best = best_val_score()
if best is not None:
    print(f"Best val_score so far: {best:.4f}")
else:
    print("No iterations recorded yet.")
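
For intuition, the lookup and the Step 2 comparison might look like the sketch below. It assumes results.tsv is tab-separated with a header row that includes a `val_score` column; the `iteration` column, the helper names, and the rule that a run with no recorded best always passes are all assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path
from typing import Optional


def best_val_score_from(path: Path) -> Optional[float]:
    """Return the maximum val_score in a TSV file, or None if there is none."""
    if not path.exists():
        return None
    with path.open(newline="") as handle:
        rows = csv.DictReader(handle, delimiter="\t")
        scores = [float(row["val_score"]) for row in rows if row.get("val_score")]
    return max(scores) if scores else None


def step2_passes(val_score: float, best: Optional[float]) -> bool:
    """Assumed rule: pass when there is no prior best or the new score is >= best."""
    return best is None or val_score >= best


with tempfile.TemporaryDirectory() as tmp:
    results_path = Path(tmp) / "results.tsv"
    missing = best_val_score_from(results_path)  # no file yet
    results_path.write_text("iteration\tval_score\n1\t0.42\n2\t0.55\n3\t0.50\n")
    best = best_val_score_from(results_path)
    results_path.write_text("iteration\tval_score\n")
    header_only = best_val_score_from(results_path)  # header but no data rows
```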

`load_config`

def load_config() -> dict

Load experiment_config.yaml from the current working directory. Returns: A dict of the parsed YAML contents, or {} if the file does not exist.

CLI usage

Running gating.py directly reads experiment_config.yaml, constructs the appropriate train and gate runners for the configured benchmark, and runs all gate steps:

python gating.py

The exit code mirrors run_gate: 0 if all steps pass, 1 on any failure.

The coding agent typically invokes gating.py automatically as part of the optimization loop defined in PROGRAM.md. You can run it manually to verify that your current agent/agent.py passes the gate before calling record.py.
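
If the manual check is scripted, the exit code can be read from a subprocess. The sketch below is one way to do that; `gate_passes` and `verify_before_record` are hypothetical helpers, and it assumes gating.py sits in the current working directory.

```python
import subprocess
import sys


def gate_passes(command: list[str]) -> bool:
    """Run a gate command and report whether it exited with code 0."""
    return subprocess.run(command).returncode == 0


def verify_before_record() -> bool:
    """Run gating.py with the current interpreter (assumes it is in the cwd)."""
    ok = gate_passes([sys.executable, "gating.py"])
    print("Gate passed." if ok else "Gate failed; fix agent/agent.py first.")
    return ok
```

Calling `verify_before_record()` from the repository root returns True only when the gate exits with 0.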
