Fork sandboxes for parallel coding agent evaluations

The coding agent recipes cover the SWE-bench evaluation pattern: N agents working on N isolated repository checkouts in parallel, each running git clone, pip install, and pytest in its own microVM, with no shared filesystem, no shared process state, and no interference between workers. The recipes/coding-agent/ directory provides the parent snapshot, and recipes/coding-agent-fork/ demonstrates the BRANCH variant, in which a binary blob too large to fit in a prompt is distributed byte-identically across four sandboxes (the source and its three grandchildren).

Two patterns

`coding-agent/` — fork-per-task evaluation harness

Each child inherits a fully warmed dev environment from the parent snapshot: Python 3.12, git, gh (GitHub CLI), build-essential, make, ruff, black, mypy, pytest, and requests. Fork a child per task, clone the repo, install dependencies, run the test suite — all in an isolated KVM microVM.

sudo forkd exec --child forkd-child-1 -- \
    bash -c "git clone https://github.com/psf/requests /tmp/r && cd /tmp/r && pytest -q"

This is the right shape for SWE-bench Lite-style evaluation runs: hundreds of parallel workers, each checking out a different repository state, with no risk of one worker’s filesystem mutations affecting another.
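The fan-out described above can be sketched in Python. This is a minimal sketch, not part of the recipe: it assumes children are named forkd-child-1 through forkd-child-N (only forkd-child-1 appears above) and that forkd exec propagates the exit status of the command it runs. The helper names build_exec_cmd, run_task, and run_all are hypothetical. It uses only the forkd exec --child ... -- form shown above.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_exec_cmd(child_index: int, repo_url: str) -> list[str]:
    """Build the `forkd exec` argv for one child.

    The forkd-child-<N> naming scheme is an assumption extrapolated
    from the single example name in the recipe.
    """
    workdir = "/tmp/r"  # writable workspace lives in /tmp, as in the recipe
    script = (
        f"git clone {shlex.quote(repo_url)} {workdir} "
        f"&& cd {workdir} && pytest -q"
    )
    return [
        "sudo", "forkd", "exec",
        "--child", f"forkd-child-{child_index}",
        "--", "bash", "-c", script,
    ]


def run_task(child_index: int, repo_url: str) -> int:
    """Run one task in its own child and return the process exit code."""
    return subprocess.run(build_exec_cmd(child_index, repo_url)).returncode


def run_all(repo_urls: list[str], max_workers: int = 8) -> list[int]:
    """Dispatch one repository per child in parallel, preserving input order."""
    indices = range(1, len(repo_urls) + 1)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_task, indices, repo_urls))
```

Calling run_all(["https://github.com/psf/requests"]) would issue the same command as the shell example above against forkd-child-1. Threads are enough here because each worker only waits on a subprocess.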

`coding-agent-fork/` — BRANCH distributes large binary state

This recipe answers the “but couldn’t you just parallel-prompt the LLM?” objection directly. A source agent builds a Python package, runs a failing test suite (populating __pycache__/), and writes a 50 MiB synthetic binary (vendored.bin) representing real agent-accumulated state: pip caches, downloaded weights, compiled extensions. The source is BRANCHed; three grandchildren each apply a different fix strategy. The key properties the demo verifies:

The 50 MiB vendored.bin is byte-identical across all four sandboxes (md5 verified)
The __pycache__/ directory is byte-identical at branch time; each child’s __pycache__/ diverges only after the child re-imports its modified source
Each child’s mathy/__init__.py differs once the child has applied its strategy
Test outcomes diverge: two strategies fix the bug; one backfires
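The byte-identity check in the first property can be sketched as follows. This is an illustration rather than the demo's own verification code: it assumes each sandbox's copy of vendored.bin has already been copied to a host path, and the names md5_of and all_identical are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 so a large blob is never fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def all_identical(paths: list[Path]) -> bool:
    """Return True when every file hashes to the same MD5 digest."""
    return len({md5_of(p) for p in paths}) == 1
```

Passing the four copies (source plus three grandchildren) to all_identical should return True if the blob was inherited unchanged. MD5 serves here as an equality check, not as a security guarantee.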

Build the snapshot

Build the parent rootfs

cd recipes/coding-agent
sudo bash build.sh

Builds a python:3.12 image with the full dev toolchain. Rootfs: ~1.8 GB. Allow ~5 minutes the first time.

Set up host networking

sudo bash scripts/host-tap.sh
sudo bash scripts/netns-setup.sh 50

Snapshot the parent

sudo forkd snapshot --tag swe \
    --kernel ./vmlinux-6.1.141 \
    --rootfs recipes/coding-agent/parent.ext4 \
    --tap forkd-tap0

Fork parallel workspaces

sudo -E forkd fork --tag swe -n 50 --per-child-netns --memory-limit-mib 512

What’s in the snapshot

Every fork inherits the following dev tools, already installed and on PATH:

Tool	Version	Use
`python3` + `pip`	3.12	Runtime and package installer
`git`	system	Repository checkout
`gh`	latest	GitHub CLI for API access
`build-essential` + `make`	system	C extension compilation
`ruff`	pinned	Fast Python linter
`black`	pinned	Code formatter
`mypy`	pinned	Static type checker
`pytest`	pinned	Test runner
`requests`	pinned	HTTP client

The BRANCH pattern from `coding-agent-fork/`

The three fix strategies applied by grandchildren after the BRANCH:

Strategy	What it does	Test outcome
`minimal`	One-line `sed` to flip `a - b` → `a + b`	✅ Tests pass
`rewrite`	Full function rewrite with type-checks	✅ Tests pass
`skip`	Decorate tests with `@unittest.expectedFailure`	❌ Fails: `test_add_zero` (`0 - 0 == 0`) passes unexpectedly, which unittest reports as a failure

The pedagogical bonus: the lazy skip strategy backfires because test_add_zero happens to pass despite the bug — unittest flags this as an “unexpected success” and fails the suite. This is the kind of emergent behavior that branch-and-compare reveals and a simple parallel API call cannot.

Results from the real run (2026-05-19)

branch pause:     3.3 s  (SATA SSD, 513 MiB memory image)
grandchildren:    3
strategies:       minimal (✅), rewrite (✅), skip (❌)
vendored.bin md5: identical across all 4 agents
__pycache__/:     byte-identical at branch; diverges post-fix per child

The 3.3 s pause is a SATA SSD measurement. On tmpfs-backed snapshot storage the same operation records ~163 ms. See bench/pause-window/RESULTS-v0.3.md for the full curve and the v0.4 live-BRANCH path that cuts source pause to 56 ms p50.

Run the BRANCH demo

export FORKD_URL=http://127.0.0.1:8889
export FORKD_TOKEN=$(cat /etc/forkd/token)
bash recipes/coding-agent-fork/demo.sh

Artifacts land in recipes/coding-agent-fork/results/<unix-ts>/:

File	Contents
`summary.md`	Per-agent state evidence + divergent code side-by-side
`summary.json`	Machine-readable version of the summary
`branch.json`	Daemon’s BRANCH response including `pause_ms`
`state-evidence.txt`	Raw md5 hashes proving byte-identity
`{source,minimal,rewrite,skip}-init-py.txt`	Each agent’s `mathy/__init__.py` after their strategy
`{source,minimal,rewrite,skip}-agent.log`	Full per-agent shell log including unittest output
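To pull the pause measurement out of the most recent run, something like the following may work. It is a sketch under two assumptions: run directories are named with a purely numeric unix timestamp, and pause_ms is a top-level key in branch.json (the table above only says the response includes it). The function names are hypothetical.

```python
import json
from pathlib import Path


def latest_run_dir(results_root: Path) -> Path:
    """Pick the run directory with the largest numeric (unix timestamp) name."""
    runs = [d for d in results_root.iterdir() if d.is_dir() and d.name.isdigit()]
    if not runs:
        raise FileNotFoundError(f"no run directories under {results_root}")
    return max(runs, key=lambda d: int(d.name))


def read_pause_ms(run_dir: Path) -> float:
    """Read pause_ms from branch.json (assumed to be a top-level key)."""
    with (run_dir / "branch.json").open() as fh:
        return json.load(fh)["pause_ms"]
```

For example, read_pause_ms(latest_run_dir(Path("recipes/coding-agent-fork/results"))) would return the pause recorded by the latest demo run, provided the file has the assumed layout.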

Key takeaway: bytes can’t fit in a prompt

To run 3 fix attempts via API-only parallelism, each request would need to carry the entire /workspace directory: source files, binary cache, populated __pycache__, and the 50 MiB vendored.bin. That is:

Impractical: 50 MiB is orders of magnitude more than LLM APIs accept as prompt context
Meaningless for binary blobs — the LLM doesn’t understand them
Wasteful — 3× the bytes transferred, 3× the context tokens
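A rough calculation makes the cost concrete. The sketch below assumes the blob would be base64-encoded to travel inside a text request, which is one common way to embed binary data but is not something the recipe specifies. The function names are hypothetical.

```python
MIB = 1024 * 1024


def base64_len(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of input."""
    # Every 3 input bytes become 4 output characters, rounded up.
    return 4 * ((n_bytes + 2) // 3)


def total_payload_bytes(blob_bytes: int, n_requests: int) -> int:
    """Bytes sent if every request carries its own base64 copy of the blob."""
    return n_requests * base64_len(blob_bytes)


blob = 50 * MIB
print(f"one request:    {base64_len(blob) / MIB:.1f} MiB")
print(f"three requests: {total_payload_bytes(blob, 3) / MIB:.1f} MiB")
```

Under that assumption, each request carries about 67 MiB of encoded blob before any source files or prompt text are added, and three attempts send that amount three times.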

forkd’s BRANCH primitive sidesteps this entirely. Children inherit the source’s address space copy-on-write. The 50 MiB blob appears in each child the moment they’re spawned — no transfer, no re-download, no re-computation.

Writable state that forks inherit must live in /tmp (tmpfs), not on the rootfs ext4. The rootfs file is shared (loop-mounted) across all sandboxes from the same snapshot, so concurrent writes from multiple children to the same on-disk inode would corrupt the journal. Always use /tmp for the writable agent workspace in forkd recipes.
