Every change the coding agent makes to agent/agent.py must pass a four-step gate before it is committed and recorded. The gate is implemented in gating.py and runs as a single command. It exits 0 only when all steps clear; any failure returns exit code 1, which signals the agent to revert and try a different approach. This design means the optimization loop can run unsupervised: no change can land unless it improves, or at minimum does not regress, the benchmark.
Running the gate
The gate reads experiment_config.yaml to determine which benchmark runners to use, then executes all four steps in sequence.
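The sequencing described above can be sketched as a small driver. This is a minimal illustration, not the code in gating.py: the `run_gate` name and the step-as-callable shape are hypothetical, and it assumes the gate stops at the first failing step (the source confirms this for Step 0 and for Step 3 depending on Steps 1 and 2, but not for every pair of steps).

```python
from typing import Callable, Sequence, Tuple

# Hypothetical shape: each step is a callable returning True when it clears.
Step = Callable[[], bool]


def run_gate(steps: Sequence[Tuple[str, Step]]) -> int:
    """Run the steps in order and return a process-style exit code.

    Returns 0 only if every step clears, and 1 as soon as one fails.
    Assumption: later steps are skipped after a failure.
    """
    for _name, step in steps:
        if not step():
            return 1
    return 0
```

A caller would pass the four steps in order (file guard, regression suite, full test benchmark, suite promotion) and hand the returned value to `sys.exit`.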
Step 0 — File guard
Before running any benchmark, the gate checks that the agent has only touched files it is allowed to modify. This is a fast, deterministic check: no network calls, no benchmark runs. The allowlist is defined as ALLOWED_AGENT_FILES, and the gate compares it against the output of two git commands:

- git diff-index --name-only HEAD lists files in the working tree that differ from HEAD
- git ls-files --others --exclude-standard lists new untracked files not covered by .gitignore

If any file outside ALLOWED_AGENT_FILES appears in either list, the gate prints the violations and returns exit code 1 immediately. No benchmark is run.
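One way to picture the check is the sketch below. It is an illustration under assumptions rather than the actual gating.py code: the helper names are hypothetical, and the contents of the allowlist are not shown in this page, so the usage note passes a single illustrative entry.

```python
import subprocess
from typing import Iterable, List, Set


def git_lines(*args: str) -> List[str]:
    """Run a git subcommand and return its non-empty output lines."""
    out = subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout
    return [line for line in out.splitlines() if line.strip()]


def changed_files() -> Set[str]:
    """Union of tracked files that differ from HEAD and new untracked files."""
    tracked = git_lines("diff-index", "--name-only", "HEAD")
    untracked = git_lines("ls-files", "--others", "--exclude-standard")
    return set(tracked) | set(untracked)


def find_violations(changed: Iterable[str], allowed: Iterable[str]) -> List[str]:
    """Return the changed files that are not on the allowlist, sorted."""
    allowed_set = set(allowed)
    return sorted(f for f in changed if f not in allowed_set)
```

Calling `find_violations(changed_files(), ["agent/agent.py"])` inside a git repository would return an empty list when only agent/agent.py was touched; a non-empty result corresponds to the exit-code-1 case.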
Files under workspace/ are gitignored and therefore invisible to git. They are not checked by the file guard. The agent edits workspace/learnings.md freely without triggering Step 0.
Disabling the file guard
The file_guard_enabled() function reads experiment_config.yaml. Set file_guard: false to bypass Step 0 for non-git environments or custom setups.
Any falsy value (false, no, off, 0, or the empty string) disables it.
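The falsy-value rule can be sketched as follows. This is not the real file_guard_enabled(): the `guard_enabled` name is hypothetical, it takes an already-parsed config mapping instead of reading the YAML file, and two behaviors are assumptions not stated in the source (a missing key leaves the guard enabled, and a YAML null is treated like the empty string).

```python
from typing import Any, Mapping

# Values listed in the documentation as disabling the guard.
FALSY = {"false", "no", "off", "0", ""}


def guard_enabled(config: Mapping[str, Any]) -> bool:
    """Return False only when file_guard holds one of the falsy values."""
    if "file_guard" not in config:
        return True  # assumption: the guard is on when the key is absent
    value = config["file_guard"]
    # Assumption: a null value is treated like the empty string.
    text = "" if value is None else str(value)
    return text.strip().lower() not in FALSY
```

Normalizing through `str(...).lower()` lets the same check cover both parsed booleans and integers (`False`, `0`) and raw strings (`"off"`, `"No"`).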
Step 1 — Regression suite
Step 1 re-runs the subset of train tasks listed in workspace/suite.json and checks that the pass rate meets the threshold.
The threshold is 0.8 (80%). Tasks dropped silently by the runner count as failures: the denominator is always the number of tasks in suite.json, not the number of results returned.
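The denominator rule is easy to get wrong, so here is a small sketch of it. The function name and the rewards mapping are hypothetical. The per-task pass mark of 0.5 is borrowed from the Step 3 promotion rule and is an assumption for Step 1, as is the handling of an empty suite.

```python
from typing import Mapping, Optional, Sequence


def suite_pass_rate(
    suite: Sequence[str],
    rewards: Mapping[str, Optional[float]],
    pass_mark: float = 0.5,  # assumption: same mark as the promotion rule
) -> float:
    """Pass rate over the suite; tasks missing from rewards count as failures."""
    if not suite:
        return 1.0  # assumption: an empty suite trivially passes
    passed = 0
    for task in suite:
        reward = rewards.get(task)  # None when the runner dropped the task
        if reward is not None and reward >= pass_mark:
            passed += 1
    # Divide by the suite size, never by the number of results returned.
    return passed / len(suite)
```

With five tasks in the suite and results for only four of them, all passing, the rate is 4/5 = 0.8, not 4/4 = 1.0.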
Step 2 — Full test benchmark
Step 2 always runs the full benchmark on the test split. val_score is the mean reward across all test tasks, where None counts as 0.0.
best_val_score() scans workspace/results.tsv and returns the highest val_score recorded so far. On the first iteration, best is None and Step 2 always passes.
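The scoring and comparison described above can be sketched as below. The function names are hypothetical, and the `>=` comparison is an assumption inferred from the statement that a change may land if it does not regress the benchmark; the source does not show the exact operator.

```python
from typing import Optional, Sequence


def val_score(rewards: Sequence[Optional[float]]) -> float:
    """Mean reward over all test tasks, counting None as 0.0."""
    if not rewards:
        return 0.0  # assumption: no tasks yields a score of zero
    total = sum(0.0 if r is None else r for r in rewards)
    return total / len(rewards)


def step2_passes(score: float, best: Optional[float]) -> bool:
    """Pass on the first iteration (best is None) or when score >= best."""
    return best is None or score >= best
```

For example, rewards of `[1.0, None, 0.5, 0.5]` average to 0.5, because the None entry still counts toward the denominator.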
The test split is run inside Step 2 but its traces are never written to disk. This is the structural anti-cheating guarantee: the coding agent cannot read test failures and overfit to them.
Step 3 — Suite promotion
Step 3 only runs if both Step 1 and Step 2 passed. It identifies train tasks that were previously failing and are not yet in suite.json, re-runs them, and promotes any that now pass.
A task must pass (r >= 0.5, not None) to be promoted. Once promoted, a task stays in the suite permanently and must continue to pass in every future Step 1 check.
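A minimal sketch of the promotion rule follows. The `promote` function and the `rerun` callback are hypothetical stand-ins for whatever gating.py uses to re-run a task and obtain its reward; only the r >= 0.5 and not-None condition comes from the text above.

```python
from typing import Callable, Iterable, List, Optional


def promote(
    suite: List[str],
    previously_failing: Iterable[str],
    rerun: Callable[[str], Optional[float]],
) -> List[str]:
    """Return the suite extended with previously failing tasks that now pass."""
    in_suite = set(suite)
    promoted: List[str] = []
    for task in previously_failing:
        if task in in_suite:
            continue  # already guarded by the regression suite
        reward = rerun(task)
        if reward is not None and reward >= 0.5:
            promoted.append(task)
            in_suite.add(task)
    # Existing entries are never removed: promotion is permanent.
    return suite + promoted
```

Because the function only appends, the suite grows monotonically, which matches the rule that a promoted task must keep passing in every later Step 1 run.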
Exit codes
| Exit code | Meaning | Agent action |
|---|---|---|
| 0 | All steps passed | Commit agent/agent.py, run record.py |
| 1 | One or more steps failed | Revert with git checkout agent/agent.py, try a different approach |
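
On the agent side, the table above maps to a simple branch on the gate's exit code. The sketch below is illustrative only: the `next_commands` helper is hypothetical, the commit message is a placeholder, and invoking record.py as `python record.py` with no arguments is an assumption, since this page does not show its command line.

```python
from typing import List


def next_commands(exit_code: int) -> List[List[str]]:
    """Map the gate's exit code to the commands the agent would run next."""
    if exit_code == 0:
        return [
            # Placeholder commit message; commits only agent/agent.py.
            ["git", "commit", "-m", "agent update", "--", "agent/agent.py"],
            # Assumption: record.py is run without arguments.
            ["python", "record.py"],
        ]
    # Any failure: discard the working-tree change and try again.
    return [["git", "checkout", "agent/agent.py"]]
```

Each returned command list can be passed to `subprocess.run(cmd, check=True)` in order.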
Full gate output example
File guard in record.py
The file guard also runs inside record.py with check_last_commit=True. This additional check inspects the diff between HEAD and HEAD~1, catching cases where an agent commits forbidden files before invoking record.py. If violations are found, record.py prints a [record]-prefixed failure message and exits 1 without writing to results.tsv.
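The last-commit variant of the check can be sketched like this. The helper names are hypothetical and the message wording after the [record] prefix is invented for illustration; only the prefix, the HEAD~1 to HEAD comparison, and the exit code come from the description above.

```python
import subprocess
from typing import Iterable, List


def last_commit_files() -> List[str]:
    """Files changed between HEAD~1 and HEAD."""
    out = subprocess.run(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line.strip()]


def record_guard(files: Iterable[str], allowed: Iterable[str]) -> int:
    """Return 1 and print a failure if any committed file is off the allowlist."""
    allowed_set = set(allowed)
    violations = sorted(f for f in files if f not in allowed_set)
    if violations:
        # Hypothetical wording; only the [record] prefix is documented.
        print("[record] file guard failed: " + ", ".join(violations))
        return 1
    return 0
```

A caller would run `record_guard(last_commit_files(), allowlist)` before appending to results.tsv and stop if it returns 1.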