The workspace/ directory is the shared state layer between the coding agent and the harness infrastructure. It holds the iteration history, benchmark results, agent traces, and the regression suite. All files under workspace/ are gitignored, which means they are invisible to git diff, git ls-files, and the file guard in gating.py. The agent can read and write workspace files without triggering a gate violation.

File overview

suite.json

Regression eval suite. Managed by gating.py. Contains the task IDs the agent must pass on every gate run, the pass rate threshold, and the last Step 1 results.

learnings.md

Per-iteration log. The agent appends after every iteration — pass or fail. Accumulates patterns, hypotheses, and requests to the human operator across the full experiment.

results.tsv

Iteration history. Written by record.py after each successful gate. Contains val_score, commit hash, suite pass counts, and timestamp for every recorded iteration.

train_results.json

Last full train benchmark results. Written by benchmark.py. Used by Step 3 of gating.py to identify which train tasks are still failing and eligible for suite promotion.

File-by-file reference

workspace/suite.json

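Managed automatically by gating.py. The coding agent must not edit this file directly. Because workspace/ is gitignored, the file guard will not catch such an edit, so this is a rule the agent is expected to follow rather than one the gate enforces.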
{
  "tasks": ["<task-id>", "<task-id>"],
  "threshold": 0.8,
  "last_results": {
    "<task-id>": 1.0,
    "<task-id>": 1.0
  }
}
The tasks list grows as iterations fix previously failing train tasks and both gate steps pass. The suite starts empty and is never trimmed.

Written by: gating.py (Steps 1 and 3)
Read by: gating.py at the start of every gate run
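To make the relationship between tasks, threshold, and last_results concrete, here is a minimal sketch of a pass-rate check over a suite with the shape shown above. It is not the code in gating.py. The function names are hypothetical, and two behaviors are assumptions: a task counts as passing only when its last reward is exactly 1.0, and an empty suite passes trivially.

```python
import json
from pathlib import Path


def load_suite(path: str = "workspace/suite.json") -> dict:
    """Read the suite file from disk (hypothetical helper, not part of the harness)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def suite_pass_rate(suite: dict) -> float:
    """Fraction of suite tasks whose last recorded reward is 1.0.

    Assumption: missing or null rewards count as failures.
    """
    tasks = suite.get("tasks", [])
    if not tasks:
        return 1.0  # assumption: an empty suite passes trivially
    results = suite.get("last_results", {})
    passed = sum(1 for task_id in tasks if results.get(task_id) == 1.0)
    return passed / len(tasks)


def suite_gate_ok(suite: dict) -> bool:
    """Compare the pass rate against the threshold stored in the suite itself."""
    return suite_pass_rate(suite) >= suite["threshold"]
```

With five suite tasks and a threshold of 0.8, four passing tasks are enough under these assumptions, while three are not.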

workspace/learnings.md

The agent’s persistent memory across all iterations. The recommended entry format is:
## Iteration N — val_score: X.XX → Y.YY ✓/✗

**What changed:** <one sentence>

**Pattern confirmed:** <failure mode>

**What worked / didn't work:** <specifics>

**Needs from human:** <or "none">
Because workspace/ is gitignored, learnings.md is freely editable. It does not appear in git diff and is never checked by the file guard. The agent appends after every iteration, including failed gate runs, so the log captures the full optimization trajectory and not just the successful commits.

Written by: Coding agent (appended after every iteration)
Read by: Coding agent at the start of each iteration for context
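The recommended entry format can also be produced programmatically. The sketch below builds one entry and appends it to the log. Both function names and all parameter names are hypothetical, chosen only for illustration; the harness does not prescribe a helper for this.

```python
from pathlib import Path

EM_DASH = "\u2014"  # separator used in the recommended heading line


def format_entry(iteration: int, old_score: float, new_score: float,
                 gate_passed: bool, changed: str, pattern: str,
                 outcome: str, needs: str = "none") -> str:
    """Render one learnings.md entry in the recommended format."""
    mark = "✓" if gate_passed else "✗"
    return (
        f"## Iteration {iteration} {EM_DASH} val_score: "
        f"{old_score:.2f} → {new_score:.2f} {mark}\n\n"
        f"**What changed:** {changed}\n\n"
        f"**Pattern confirmed:** {pattern}\n\n"
        f"**What worked / didn't work:** {outcome}\n\n"
        f"**Needs from human:** {needs}\n"
    )


def append_entry(entry: str, path: str = "workspace/learnings.md") -> None:
    """Append to the log without touching earlier entries."""
    log = Path(path)
    log.parent.mkdir(parents=True, exist_ok=True)
    with log.open("a", encoding="utf-8") as fh:
        fh.write("\n" + entry)  # blank line between consecutive entries
```

Opening the file in append mode keeps earlier iterations intact, which matches the append-only use of the log described above.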

workspace/results.tsv

Tab-separated iteration history. record.py appends one row per successful gate. The header row is written on first use.
iteration	val_score	commit	evals_passed	evals_total	timestamp
0	0.XXXX	baseline	0	0	<timestamp>
1	0.XXXX	abc1234	4	5	<timestamp>
| Column | Description |
| --- | --- |
| iteration | Auto-incremented integer, 0-indexed from baseline |
| val_score | Mean reward on the full test split, formatted to 4 decimal places |
| commit | Short git commit hash (git rev-parse --short HEAD) |
| evals_passed | Number of regression suite tasks that passed in Step 1 |
| evals_total | Total tasks in the regression suite at time of recording |
| timestamp | UTC ISO 8601 timestamp with second precision |
gating.py reads results.tsv via best_val_score() in Step 2 to determine the score floor for the current gate run.

Written by: record.py
Read by: gating.py (to find the best val_score on record)
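As an illustration of the row layout and of how a score floor could be derived from it, the sketch below formats one row and computes the maximum val_score from TSV text. It is not the implementation in record.py or gating.py: format_row and best_val_score_from_tsv are hypothetical stand-ins, and the exact timestamp rendering (an explicit +00:00 offset) is an assumption.

```python
import csv
import io
from datetime import datetime, timezone

COLUMNS = ["iteration", "val_score", "commit",
           "evals_passed", "evals_total", "timestamp"]


def format_row(iteration, val_score, commit, evals_passed, evals_total, now=None):
    """Render one tab-separated row; val_score is fixed at 4 decimal places."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Assumption: UTC with second precision, rendered via isoformat().
    stamp = now.astimezone(timezone.utc).isoformat(timespec="seconds")
    fields = [str(iteration), f"{val_score:.4f}", commit,
              str(evals_passed), str(evals_total), stamp]
    return "\t".join(fields)


def best_val_score_from_tsv(tsv_text: str):
    """Return the highest val_score in the history, or None if there are no rows."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    scores = [float(row["val_score"]) for row in reader]
    return max(scores) if scores else None
```

Taking the maximum over all rows, rather than the last row, means a later iteration that scored lower does not lower the floor.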

workspace/train_results.json

Written by benchmark.py after every full train run. Stores the reward for each task in the train split.
{
  "split": "train",
  "timestamp": "<ISO 8601 timestamp>",
  "results": {
    "<task-id>": 1.0,
    "<task-id>": 0.0,
    "<task-id>": null
  }
}
A null value means the task timed out or produced no verifier output. gating.py reads this file in Step 3 to find tasks that are still failing and could be candidates for suite promotion.

Written by: benchmark.py
Read by: gating.py (Step 3, suite promotion), coding agent (for failure analysis)
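For failure analysis, it can help to bucket the per-task rewards before opening any traces. The sketch below groups task IDs from a train_results.json payload into passed, failed, and no-output sets. The function names are hypothetical, and treating any non-null reward other than 1.0 as a failure is an assumption; the harness may draw that line differently.

```python
import json
from pathlib import Path


def load_train_results(path: str = "workspace/train_results.json") -> dict:
    """Read the last train benchmark results (hypothetical helper)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_train_results(data: dict) -> dict:
    """Group task IDs by outcome.

    Assumptions: reward 1.0 means passed; null (None) means a timeout or
    missing verifier output; any other reward is treated as failed.
    """
    results = data.get("results", {})
    return {
        "passed": sorted(t for t, r in results.items() if r == 1.0),
        "failed": sorted(t for t, r in results.items()
                         if r is not None and r != 1.0),
        "no_output": sorted(t for t, r in results.items() if r is None),
    }
```

Keeping timeouts separate from ordinary failures is a deliberate choice here: a null reward may point at an infrastructure or time-limit problem rather than at the agent's reasoning.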

workspace/traces/latest/

Contains the conversation traces from the most recent train-split benchmark run. The coding agent reads these traces in the “Analyze failures” step to diagnose why tasks failed.
Only train-split traces are saved here. Test-split execution runs inside gating.py (Step 2) and its traces are never written to disk. This is the structural anti-cheating guarantee: the agent cannot read test failures and tune against them.
Written by: benchmark.py (train runs only)
Read by: Coding agent during failure analysis

workspace/traces/baseline/

Contains the traces from the very first benchmark run, captured by prepare.py during workspace initialization. These are never overwritten by subsequent runs. Baseline traces give the agent a stable reference point: what the original agent did before any optimization. They are useful for comparing the starting behavior against the current agent/agent.py.

Written by: prepare.py (once, at setup time)
Read by: Coding agent (optional reference)

Edit permissions summary

| File | Agent can edit | Who writes it |
| --- | --- | --- |
| workspace/learnings.md | Yes, freely | Coding agent |
| workspace/suite.json | No (read-only) | gating.py |
| workspace/results.tsv | No (read-only) | record.py |
| workspace/train_results.json | No (read-only) | benchmark.py |
| workspace/traces/latest/ | No | benchmark.py |
| workspace/traces/baseline/ | No | prepare.py |

Why workspace/ is gitignored

Keeping workspace/ out of git serves two purposes. First, it lets the agent write learnings.md freely at any point during the loop without affecting the file guard check in gating.py. If learnings.md were tracked, the agent would need to commit it (or stash it) before every gate run to avoid a Step 0 violation. Second, benchmark traces, result files, and the suite state are experiment-local artifacts that should not be versioned alongside the agent code. The git history stays clean: each commit represents exactly one change to agent/agent.py with a meaningful message, making it easy to diff or bisect the optimization trajectory.
