

gating.py implements the multi-step quality gate that must pass before each agent iteration is recorded as a success. It enforces a file-edit guard, re-runs an eval suite on the training split, runs the full benchmark on the test split, and promotes newly-passing tasks into the suite. The gate is typically invoked automatically by the coding agent, but can also be run directly from the command line.

Gate pipeline overview

When run_gate is called, it executes up to four steps in order:
| Step | Name | What it checks |
| --- | --- | --- |
| 0 | File guard | No tracked files outside ALLOWED_AGENT_FILES were modified |
| 1 | Eval suite | Re-runs workspace/suite.json tasks; pass rate ≥ threshold |
| 2 | Full benchmark | Runs the test split; val_score ≥ best value in results.tsv |
| 3 | Suite promotion | Newly-passing train tasks are added to workspace/suite.json |
Exit 0 is returned only if all steps pass. Any failure returns 1.
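
The ordering and exit-code contract above can be sketched as a small runner. This is a minimal illustration, not the actual body of run_gate: the `Step` type, `run_steps`, and `make_step` are hypothetical names, and stopping at the first failure is an assumption inferred from the phrase "up to four steps".

```python
from typing import Callable, Sequence

# Hypothetical step type: a label plus a zero-argument check returning True on pass.
Step = tuple[str, Callable[[], bool]]


def run_steps(steps: Sequence[Step]) -> int:
    """Run steps in order; return 0 if every step passes, 1 at the first failure."""
    for name, check in steps:
        if not check():
            print(f"FAIL: {name}")
            return 1  # assumption: later steps are skipped after a failure
        print(f"PASS: {name}")
    return 0


executed: list[str] = []  # records which steps actually ran


def make_step(name: str, ok: bool) -> Step:
    """Build a stand-in step that logs its execution and returns a fixed result."""
    def check() -> bool:
        executed.append(name)
        return ok
    return (name, check)


exit_code = run_steps([
    make_step("file guard", True),
    make_step("eval suite", False),      # simulated failure in Step 1
    make_step("full benchmark", True),
    make_step("suite promotion", True),
])
```

In this sketch the simulated failure in the eval suite step means the benchmark and promotion steps never run, and the exit code is 1.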

Constants

ALLOWED_AGENT_FILES = frozenset({"agent/agent.py", "PROGRAM.md"})

SUITE_FILE         = "workspace/suite.json"
RESULTS_FILE       = "workspace/results.tsv"
TRAIN_RESULTS_FILE = "workspace/train_results.json"
CONFIG_FILE        = "experiment_config.yaml"
Everything under workspace/ is gitignored and therefore invisible to git. The file guard only inspects files that git can see, so edits to workspace/learnings.md and similar files are always permitted.
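
As a rough illustration of the allowlist check, the sketch below filters a list of git-visible paths against ALLOWED_AGENT_FILES. The helper name `outside_allowlist` and the sample paths are hypothetical. Paths under workspace/ never appear in the input because git does not report ignored files.

```python
ALLOWED_AGENT_FILES = frozenset({"agent/agent.py", "PROGRAM.md"})


def outside_allowlist(paths: list[str]) -> list[str]:
    """Return the sorted, de-duplicated paths that are not on the allowlist."""
    return sorted(set(paths) - ALLOWED_AGENT_FILES)


# Hypothetical list of paths that git reports as changed.
visible_to_git = ["agent/agent.py", "gating.py", "PROGRAM.md", "benchmark.py"]
print(outside_allowlist(visible_to_git))  # ['benchmark.py', 'gating.py']
```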

run_gate

def run_gate(train_runner: BenchmarkRunner, gate_runner: BenchmarkRunner) -> int
Execute all gate steps using the provided runners and return a Unix exit code.
train_runner (BenchmarkRunner, required): Runner used for the eval suite (Step 1) and suite promotion (Step 3). Should be configured for the training split.
gate_runner (BenchmarkRunner, required): Runner used for the full benchmark (Step 2). Should be configured for the test split.
Returns: 0 on success (all steps passed), 1 on any failure.

Example

from benchmark import TauBenchRunner
from gating import run_gate

train_runner = TauBenchRunner(domain="retail", split="train")
gate_runner  = TauBenchRunner(domain="retail", split="test")

exit_code = run_gate(train_runner, gate_runner)
# 0 → all steps passed, 1 → gate failed

file_guard_violations

def file_guard_violations(*, check_last_commit: bool = False) -> list[str]
Return a sorted list of paths the agent has touched outside ALLOWED_AGENT_FILES. This covers both tracked files and untracked files that are not gitignored. Always inspects:
  • git diff-index --name-only HEAD: files in the working tree that differ from HEAD.
  • git ls-files --others --exclude-standard: untracked files not covered by .gitignore.
With check_last_commit=True, also inspects the diff of HEAD vs HEAD~1. This is used by record.py to catch agents that commit forbidden files before invoking record.
check_last_commit (bool, default: False): When True, additionally checks whether the most recent commit (HEAD vs HEAD~1) touched any files outside the allowlist. Silently skipped when there is no parent commit.
Returns: Sorted list of violating paths. Returns [] if there are no violations, or if git is unavailable (a one-time warning is printed to stderr in that case).
from gating import file_guard_violations

violations = file_guard_violations()
if violations:
    print("Files modified outside allowlist:", violations)

# Also check last commit (used by record.py)
violations = file_guard_violations(check_last_commit=True)

file_guard_enabled

def file_guard_enabled() -> bool
Returns True by default. The file guard is disabled only by explicit opt-out in experiment_config.yaml. To disable the guard, add this to experiment_config.yaml:
file_guard: false
Accepted falsy values: false, no, off, 0, "" (case-insensitive). Any other value, including a missing key, null, or an unrecognized string, leaves the guard on. This conservative default means a typo will not silently disable a safety check.
Returns: True if the file guard is active, False if explicitly disabled.
Disabling the file guard allows agent iterations to modify arbitrary tracked files. Only disable it if you are certain the agent should have unrestricted write access to the repository.
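
The opt-out rule can be sketched as a pure function over the parsed config. This is an illustration under stated assumptions: `guard_enabled` is a hypothetical name, it takes the config dict as an argument rather than reading experiment_config.yaml, and it normalizes values with `str(value).lower()`, which may differ from the real parsing.

```python
# Falsy spellings listed in the documentation, compared case-insensitively.
_FALSY = frozenset({"false", "no", "off", "0", ""})


def guard_enabled(config: dict) -> bool:
    """Return False only when file_guard is set to a recognized falsy value."""
    value = config.get("file_guard")
    if value is None:
        # Missing key or explicit null: keep the guard on.
        return True
    # Booleans and integers from YAML are normalized through str(), e.g. False -> "false".
    return str(value).lower() not in _FALSY


print(guard_enabled({}))                      # True
print(guard_enabled({"file_guard": False}))   # False
print(guard_enabled({"file_guard": "flase"})) # True: a typo leaves the guard on
```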

load_suite

def load_suite() -> dict
Load the eval suite from workspace/suite.json.
Returns: A dict with the structure below. If the file does not exist, a default empty suite is returned.
{
  "tasks": [],
  "threshold": 0.8,
  "last_results": {}
}
tasks (list[str]): Task IDs currently in the regression suite.
threshold (float): Minimum pass rate required to pass Step 1. Default 0.8 (80%).
last_results (dict[str, float | null]): Per-task rewards from the most recent Step 1 run.
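
The sketch below shows how a suite with this structure might be loaded and checked against its threshold. The function names are hypothetical, and several details are assumptions rather than documented behavior: a task counts as passing when its reward is at least 1.0, a null reward counts as a failure, and an empty suite passes trivially.

```python
import json
from pathlib import Path


def load_suite_sketch(path: str = "workspace/suite.json") -> dict:
    """Load the suite file, falling back to an empty default when it is missing."""
    p = Path(path)
    if not p.exists():
        return {"tasks": [], "threshold": 0.8, "last_results": {}}
    return json.loads(p.read_text())


def pass_rate(suite: dict, pass_reward: float = 1.0) -> float:
    """Fraction of suite tasks whose last reward meets pass_reward (assumed cutoff)."""
    tasks = suite["tasks"]
    if not tasks:
        return 1.0  # assumption: an empty suite passes trivially
    results = suite["last_results"]
    # A missing or null reward is treated as 0.0, i.e. a failure (assumption).
    passed = sum(1 for t in tasks if (results.get(t) or 0.0) >= pass_reward)
    return passed / len(tasks)


def passes_step_one(suite: dict) -> bool:
    """Compare the pass rate against the suite's own threshold."""
    return pass_rate(suite) >= suite["threshold"]
```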

save_suite

def save_suite(suite: dict) -> None
Write the eval suite back to workspace/suite.json.
suite (dict, required): The suite dict, as returned by load_suite (and typically modified in-place by run_gate).
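
To illustrate the in-place modification mentioned above, here is a hedged sketch of suite promotion followed by a save. The names `promote` and `save_suite_sketch` are hypothetical, the explicit `path` argument is added for illustration, and treating a reward of at least 1.0 as "passing" is an assumption.

```python
import json
from pathlib import Path


def promote(suite: dict, train_rewards: dict, pass_reward: float = 1.0) -> list[str]:
    """Append newly passing train tasks to suite['tasks'] in place; return them."""
    known = set(suite["tasks"])
    promoted = sorted(
        task
        for task, reward in train_rewards.items()
        if reward is not None and reward >= pass_reward and task not in known
    )
    suite["tasks"].extend(promoted)
    return promoted


def save_suite_sketch(suite: dict, path: str) -> None:
    """Write the suite as JSON, creating the parent directory if needed."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(json.dumps(suite, indent=2))
```

Tasks already in the suite are skipped, so running the promotion twice with the same rewards does not create duplicates.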

best_val_score

def best_val_score() -> float | None
Return the highest val_score recorded in workspace/results.tsv.
Returns: The maximum val_score as a float, or None if results.tsv does not exist or contains no data rows.
from gating import best_val_score

best = best_val_score()
if best is not None:
    print(f"Best val_score so far: {best:.4f}")
else:
    print("No iterations recorded yet.")
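
For readers curious how such a lookup might work, the following sketch reads a tab-separated file and returns the maximum of its val_score column. It assumes the file has a header row containing a val_score column. The function name and the explicit `path` parameter are hypothetical, and the real implementation may parse the file differently.

```python
import csv
from pathlib import Path
from typing import Optional


def best_val_score_sketch(path: str = "workspace/results.tsv") -> Optional[float]:
    """Return the largest val_score in a TSV file, or None if there is no data."""
    p = Path(path)
    if not p.exists():
        return None
    with p.open(newline="") as fh:
        # DictReader uses the first row as column names (assumed header row).
        reader = csv.DictReader(fh, delimiter="\t")
        scores = [float(row["val_score"]) for row in reader if row.get("val_score")]
    return max(scores) if scores else None
```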

load_config

def load_config() -> dict
Load experiment_config.yaml from the current working directory.
Returns: A dict of the parsed YAML contents, or {} if the file does not exist.

CLI usage

Running gating.py directly reads experiment_config.yaml, constructs the appropriate train and gate runners for the configured benchmark, and runs all gate steps:
python gating.py
Exit code mirrors run_gate: 0 for all steps passed, 1 for any failure.
The coding agent typically invokes gating.py automatically as part of the optimization loop defined in PROGRAM.md. You can run it manually to verify that your current agent/agent.py passes the gate before calling record.py.
