record.py appends one iteration result to workspace/results.tsv after a change has passed the gate and been committed. It enforces the file guard a second time, including an inspection of the most recent commit, so an agent that committed forbidden files before invoking record.py cannot slip through undetected. The coding agent typically calls the script as the final step of each optimization iteration.

CLI usage

python record.py --val-score 0.82 --evals-passed 8 --evals-total 10
All three arguments are required. The script exits 0 on success and 1 if the file guard rejects the call.
Argument        Type   Description
--val-score     float  Mean reward on the full test set from the most recent gate run
--evals-passed  int    Number of eval suite tasks that passed
--evals-total   int    Total number of eval suite tasks
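
The argument table above could be expressed with argparse roughly as follows. This is a minimal sketch, not the script's actual source: the helper name parse_args and the description string are assumptions made for illustration, and only the three flag names and their types come from the table.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical parser mirroring the three required arguments."""
    parser = argparse.ArgumentParser(description="Record one iteration result")
    parser.add_argument("--val-score", type=float, required=True)
    parser.add_argument("--evals-passed", type=int, required=True)
    parser.add_argument("--evals-total", type=int, required=True)
    return parser.parse_args(argv)


# argparse converts dashes in flag names to underscores in attribute names.
args = parse_args(["--val-score", "0.82", "--evals-passed", "8", "--evals-total", "10"])
print(args.val_score, args.evals_passed, args.evals_total)
```

If the real script is built on argparse, a missing argument would be reported as a usage error by argparse itself (exit status 2), which is distinct from the exit status 1 used for file guard rejections.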

Output format

Each call appends one tab-separated row to workspace/results.tsv. The file is created by prepare.py with the following header:
iteration	val_score	commit	evals_passed	evals_total	timestamp
A recorded row looks like:
1	0.8200	a3f91bc	8	10	2024-11-05T14:23:01+00:00
iteration (int): Auto-incremented iteration number, starting at 1. Iteration 0 is the baseline recorded by prepare.py.
val_score (float): Mean reward on the full test set, formatted to 4 decimal places.
commit (str): Git short hash of HEAD at the time record was called, or "unknown" if git is unavailable.
evals_passed (int): Number of eval suite tasks that passed the most recent gate run.
evals_total (int): Total number of eval suite tasks evaluated.
timestamp (str): ISO 8601 UTC timestamp with seconds precision (e.g. 2024-11-05T14:23:01+00:00).
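
To make the column formats concrete, the sketch below builds one row in the documented layout and reads it back with the standard csv module. The helper name format_row and its now parameter are hypothetical; only the column order, the 4-decimal score, and the seconds-precision UTC timestamp come from the description above.

```python
import csv
import io
from datetime import datetime, timezone

HEADER = ["iteration", "val_score", "commit", "evals_passed", "evals_total", "timestamp"]


def format_row(iteration, val_score, commit, evals_passed, evals_total, now=None):
    """Hypothetical helper: build one tab-separated row in the documented format."""
    now = now or datetime.now(timezone.utc)
    timestamp = now.isoformat(timespec="seconds")  # e.g. 2024-11-05T14:23:01+00:00
    fields = [str(iteration), f"{val_score:.4f}", commit,
              str(evals_passed), str(evals_total), timestamp]
    return "\t".join(fields)


# A fixed clock reproduces the example row from this page.
fixed = datetime(2024, 11, 5, 14, 23, 1, tzinfo=timezone.utc)
row = format_row(1, 0.82, "a3f91bc", 8, 10, now=fixed)

# Reading the file back: DictReader with a tab delimiter keys each value by column name.
text = "\t".join(HEADER) + "\n" + row + "\n"
rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
print(rows[0]["val_score"], rows[0]["commit"])
```

All values come back as strings when read this way, so a consumer would convert val_score with float() and the counts with int().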

record

def record(val_score: float, evals_passed: int, evals_total: int) -> int
Append one iteration row to workspace/results.tsv. Runs the file guard before writing.
val_score (float, required): Mean reward on the full test set.
evals_passed (int, required): Number of eval suite tasks that passed.
evals_total (int, required): Total number of eval suite tasks.
Returns: 0 on success, 1 if the file guard rejects the call (violations are printed to stdout).
from record import record

exit_code = record(val_score=0.82, evals_passed=8, evals_total=10)
# Prints: [record] iteration 1: val_score=0.8200, evals=8/10, commit=a3f91bc

File guard behavior

record calls file_guard_violations(check_last_commit=True). This means it inspects:
  1. Files in the working tree that differ from HEAD.
  2. Untracked files not covered by .gitignore.
  3. Files changed in the most recent commit (HEAD vs HEAD~1).
Any path outside ALLOWED_AGENT_FILES = {"agent/agent.py", "PROGRAM.md"} causes the function to print a detailed error message and return 1 without writing to results.tsv.
Because check_last_commit=True is passed, a forbidden file that was committed before calling record.py is still caught. Committing changes first does not bypass the file guard.
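
The core of the check can be sketched as a set difference over the three path sources listed above. This is an illustration under stated assumptions, not the actual file_guard_violations implementation: the function name find_violations and its parameters are hypothetical, and the git commands named in the comments are one plausible way to gather each list.

```python
ALLOWED_AGENT_FILES = {"agent/agent.py", "PROGRAM.md"}


def find_violations(working_tree, untracked, last_commit):
    """Hypothetical helper: return sorted paths outside the allow-list.

    Each argument is an iterable of repo-relative paths. Assumed sources:
      working_tree: git diff --name-only HEAD
      untracked:    git ls-files --others --exclude-standard
      last_commit:  git diff --name-only HEAD~1 HEAD
    """
    candidates = set(working_tree) | set(untracked) | set(last_commit)
    return sorted(candidates - ALLOWED_AGENT_FILES)


# Only allowed files were touched: no violations.
clean = find_violations(["agent/agent.py"], [], ["agent/agent.py"])

# A forbidden file committed just before recording is still reported,
# because the last commit is part of the candidate set.
sneaky = find_violations([], [], ["gating.py", "agent/agent.py"])
print(clean, sneaky)
```

Including the last commit in the candidate set is what closes the commit-first loophole: a clean working tree alone is not enough to pass.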

current_commit

def current_commit() -> str
Return the short git hash of HEAD.
Returns: The output of git rev-parse --short HEAD as a string, or "unknown" if git is unavailable or the command fails.
from record import current_commit

sha = current_commit()  # e.g. "a3f91bc"
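
A function with this behavior could look like the following sketch. It is an assumption about the approach, not the module's source; the name current_commit_sketch is used to keep it distinct from the real function.

```python
import subprocess


def current_commit_sketch() -> str:
    """Return the short hash of HEAD, or "unknown" on any failure."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError on a non-zero exit
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError covers a missing git binary; CalledProcessError covers
        # cases such as running outside a repository.
        return "unknown"
    return result.stdout.strip() or "unknown"


sha = current_commit_sketch()
print(sha)
```

Falling back to "unknown" rather than raising keeps recording possible in environments without git, at the cost of a row that cannot be traced to a commit.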

next_iteration

def next_iteration() -> int
Determine the next iteration number by counting data rows in workspace/results.tsv. The header line (starting with "iteration") is excluded. Because the baseline row written by prepare.py is iteration 0, the count of existing data rows equals the next iteration number to assign.
Returns: An int equal to the number of existing data rows. Returns 1 if results.tsv does not exist.
from record import next_iteration

n = next_iteration()  # 1 after prepare.py runs, 2 after the first successful iteration, etc.
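
The counting rule can be sketched as below. This is a minimal illustration assuming the behavior described above; the name next_iteration_sketch and its path parameter are hypothetical, and the baseline row values are made-up placeholders.

```python
import tempfile
from pathlib import Path


def next_iteration_sketch(path: Path) -> int:
    """Count data rows, skipping the header and blank lines."""
    if not path.exists():
        return 1
    lines = path.read_text().splitlines()
    return sum(1 for line in lines
               if line.strip() and not line.startswith("iteration"))


with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "results.tsv"
    missing = next_iteration_sketch(results)  # file does not exist yet

    header = "iteration\tval_score\tcommit\tevals_passed\tevals_total\ttimestamp\n"
    # Placeholder baseline row standing in for what prepare.py would write.
    baseline = "0\t0.7500\tabc1234\t6\t10\t2024-11-05T14:00:00+00:00\n"
    results.write_text(header + baseline)
    after_prepare = next_iteration_sketch(results)

    with results.open("a") as f:
        f.write("1\t0.8200\ta3f91bc\t8\t10\t2024-11-05T14:23:01+00:00\n")
    after_first = next_iteration_sketch(results)

print(missing, after_prepare, after_first)
```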

Example workflow

The standard usage pattern within the optimization loop:
# 1. Modify agent/agent.py
# 2. Run the gate
python gating.py   # must exit 0

# 3. Commit the change
git add agent/agent.py
git commit -m "iteration 1: improve tool selection"

# 4. Record the result (val_score and evals come from the gate output)
python record.py --val-score 0.82 --evals-passed 8 --evals-total 10
Take val_score and the eval counts directly from the [gate] Step 1 and [gate] Step 2 output of the most recent gating.py run. Do not re-run the benchmark separately before calling record.py.