record.py: append iteration results to results.tsv

record.py appends one iteration result to workspace/results.tsv after a change has passed the gate and been committed. It runs the file guard a second time, this time also inspecting the most recent commit, so an agent that committed forbidden files before invoking record.py is still caught. The coding agent typically calls the script as the final step of each optimization iteration.

CLI usage

python record.py --val-score 0.82 --evals-passed 8 --evals-total 10

All three arguments are required. The script exits 0 on success and 1 if the file guard rejects the call.

Argument	Type	Description
`--val-score`	`float`	Mean reward on the full test set from the most recent gate run
`--evals-passed`	`int`	Number of eval suite tasks that passed
`--evals-total`	`int`	Total number of eval suite tasks
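
The argument handling described above could be sketched with argparse as follows. This is a minimal illustration, not the actual parser in record.py: the function name build_parser and the help strings are assumptions, and only the three flag names and their types come from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the three required flags documented above."""
    parser = argparse.ArgumentParser(prog="record.py")
    parser.add_argument("--val-score", type=float, required=True,
                        help="Mean reward on the full test set")
    parser.add_argument("--evals-passed", type=int, required=True,
                        help="Number of eval suite tasks that passed")
    parser.add_argument("--evals-total", type=int, required=True,
                        help="Total number of eval suite tasks")
    return parser


# Parse the same arguments used in the CLI example.
args = build_parser().parse_args(
    ["--val-score", "0.82", "--evals-passed", "8", "--evals-total", "10"]
)
print(args.val_score, args.evals_passed, args.evals_total)
```

argparse converts the dashes in flag names to underscores, so `--val-score` is read back as `args.val_score`.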

Output format

Each call appends one tab-separated row to workspace/results.tsv. The file is created by prepare.py with the following header:

iteration	val_score	commit	evals_passed	evals_total	timestamp

A recorded row looks like:

1	0.8200	a3f91bc	8	10	2024-11-05T14:23:01+00:00

Column	Type	Description
`iteration`	`int`	Auto-incremented iteration number, starting at 1. Iteration 0 is the baseline recorded by prepare.py.
`val_score`	`float`	Mean reward on the full test set, formatted to 4 decimal places.
`commit`	`str`	Git short hash of HEAD at the time record was called, or "unknown" if git is unavailable.
`evals_passed`	`int`	Number of eval suite tasks that passed the most recent gate run.
`evals_total`	`int`	Total number of eval suite tasks evaluated.
`timestamp`	`str`	ISO 8601 UTC timestamp with seconds precision (e.g. 2024-11-05T14:23:01+00:00).
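
To make the column formatting concrete, here is one way a row in this layout could be assembled. It is a sketch under assumptions: format_row and its `now` parameter are hypothetical names introduced for illustration, and the real script may build the row differently.

```python
from datetime import datetime, timezone
from typing import Optional


def format_row(iteration: int, val_score: float, commit: str,
               evals_passed: int, evals_total: int,
               now: Optional[datetime] = None) -> str:
    """Build one tab-separated row in the documented column order (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    fields = [
        str(iteration),
        f"{val_score:.4f}",                 # 4 decimal places
        commit,
        str(evals_passed),
        str(evals_total),
        now.isoformat(timespec="seconds"),  # ISO 8601, seconds precision
    ]
    return "\t".join(fields)


# Reproduce the example row with a fixed timestamp.
row = format_row(1, 0.82, "a3f91bc", 8, 10,
                 now=datetime(2024, 11, 5, 14, 23, 1, tzinfo=timezone.utc))
print(row)
```

Passing a fixed `now` keeps the output deterministic; with the default, the timestamp is the current UTC time.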

`record`

def record(val_score: float, evals_passed: int, evals_total: int) -> int

Append one iteration row to workspace/results.tsv. Runs the file guard before writing.

Parameter	Type	Required	Description
`val_score`	`float`	yes	Mean reward on the full test set.
`evals_passed`	`int`	yes	Number of eval suite tasks that passed.
`evals_total`	`int`	yes	Total number of eval suite tasks.

Returns: 0 on success, 1 if the file guard rejects the call (violations are printed to stdout).

from record import record

exit_code = record(val_score=0.82, evals_passed=8, evals_total=10)
# Prints: [record] iteration 1: val_score=0.8200, evals=8/10, commit=a3f91bc

File guard behavior

record calls file_guard_violations(check_last_commit=True). This means it inspects:

Files in the working tree that differ from HEAD.
Untracked files not covered by .gitignore.
Files changed in the most recent commit (HEAD vs HEAD~1).

Any path outside ALLOWED_AGENT_FILES = {"agent/agent.py", "PROGRAM.md"} causes the function to print a detailed error message and return 1 without writing to results.tsv.

Because check_last_commit=True adds the HEAD vs HEAD~1 comparison, committing a forbidden file and then calling record.py is still caught; committing first does not bypass the guard.
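
The three checks can be approximated with standard git commands. The sketch below is an assumption about how such a guard might be built, not the actual file_guard_violations implementation: the helper names _git_lines, changed_paths, and violations are hypothetical, and only ALLOWED_AGENT_FILES and the three comparisons come from the description above.

```python
import subprocess

ALLOWED_AGENT_FILES = {"agent/agent.py", "PROGRAM.md"}


def _git_lines(*args: str) -> list[str]:
    """Run a git command and return its non-empty output lines ([] on failure)."""
    try:
        out = subprocess.run(["git", *args], capture_output=True,
                             text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    return [line.strip() for line in out.splitlines() if line.strip()]


def changed_paths(check_last_commit: bool = False) -> set[str]:
    """Collect candidate paths from the three sources listed above (hypothetical)."""
    paths = set(_git_lines("diff", "--name-only", "HEAD"))  # working tree vs HEAD
    paths |= set(_git_lines("ls-files", "--others", "--exclude-standard"))  # untracked, not ignored
    if check_last_commit:
        paths |= set(_git_lines("diff", "--name-only", "HEAD~1", "HEAD"))  # last commit
    return paths


def violations(paths, allowed=ALLOWED_AGENT_FILES) -> list[str]:
    """Return every path that is not on the allow-list, sorted for stable output."""
    return sorted(p for p in paths if p not in allowed)


# Pure check on a hand-written path set, so no repository is needed.
bad = violations({"agent/agent.py", "gating.py", "PROGRAM.md"})
print(bad)
```

Separating path collection from the allow-list check keeps the rejection logic testable without a git repository.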

`current_commit`

def current_commit() -> str

Return the short git hash of HEAD.

Returns: The output of git rev-parse --short HEAD as a string, or "unknown" if git is unavailable or the command fails.

from record import current_commit

sha = current_commit()  # e.g. "a3f91bc"
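
A function with this behavior could be written as below. This is a sketch that assumes a subprocess call with a broad fallback; the name current_commit_sketch is used to make clear it is not the shipped implementation.

```python
import subprocess


def current_commit_sketch() -> str:
    """Return `git rev-parse --short HEAD`, or "unknown" on any failure."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError covers a missing git binary; CalledProcessError covers
        # a non-zero exit, e.g. when run outside a repository.
        return "unknown"
    return result.stdout.strip() or "unknown"


sha = current_commit_sketch()
print(sha)
```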

`next_iteration`

def next_iteration() -> int

Determine the next iteration number by counting data rows in workspace/results.tsv. The header line (starting with "iteration") is excluded. Iteration 0 is the baseline row written by prepare.py.

Returns: An int equal to the number of existing data rows (i.e., the next iteration number to assign). Returns 1 if results.tsv does not exist.

from record import next_iteration

n = next_iteration()  # 1 after prepare.py runs, 2 after the first successful iteration, etc.
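
The row-counting rule can be illustrated with a small stand-alone version. This is a sketch assuming the behavior described above; next_iteration_sketch takes the file path as a parameter (an assumption made so the demo can use a temporary file), and the baseline row values are invented for the demo.

```python
import tempfile
from pathlib import Path

HEADER = "iteration\tval_score\tcommit\tevals_passed\tevals_total\ttimestamp\n"


def next_iteration_sketch(results_path: Path) -> int:
    """Count data rows, skipping the header; return 1 if the file is missing."""
    if not results_path.exists():
        return 1
    lines = results_path.read_text(encoding="utf-8").splitlines()
    data_rows = [ln for ln in lines
                 if ln.strip() and not ln.startswith("iteration")]
    return len(data_rows)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results.tsv"
    missing = next_iteration_sketch(path)  # file not created yet

    # Header plus an illustrative baseline row (iteration 0).
    path.write_text(
        HEADER + "0\t0.7500\tabc1234\t7\t10\t2024-11-05T14:00:00+00:00\n",
        encoding="utf-8",
    )
    after_baseline = next_iteration_sketch(path)

    # Append the row for iteration 1.
    with path.open("a", encoding="utf-8") as f:
        f.write("1\t0.8200\ta3f91bc\t8\t10\t2024-11-05T14:23:01+00:00\n")
    after_first = next_iteration_sketch(path)

print(missing, after_baseline, after_first)
```

Because the baseline occupies iteration 0, the number of data rows already present equals the next iteration number to assign.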

Example workflow

The standard usage pattern within the optimization loop:

# 1. Modify agent/agent.py
# 2. Run the gate
python gating.py   # must exit 0

# 3. Commit the change
git add agent/agent.py
git commit -m "iteration 1: improve tool selection"

# 4. Record the result (val_score and evals come from the gate output)
python record.py --val-score 0.82 --evals-passed 8 --evals-total 10

val_score and the evals numbers should come directly from the [gate] Step 1 and [gate] Step 2 output of the most recent gating.py run. Do not re-run the benchmark separately before calling record.py.
