How forkd works: warm-fork, copy-on-write, and BRANCH

forkd’s core primitive is fork-from-warm: boot a Firecracker microVM once, load it with your runtime (Python + dependencies, a JIT-warmed JVM, a pre-loaded ML model), pause it to disk as a snapshot, then restore N independent children from that snapshot. Each child mmaps the parent’s memory.bin file with MAP_PRIVATE; the Linux kernel implements copy-on-write at the 4 KiB page level, so children share every unmodified resident page until they actually write to it. The result is KVM-grade hardware isolation at a per-child memory overhead of 0.12 MiB on a 512 MiB Python + numpy parent.

The warm-fork lifecycle

Boot

The parent Firecracker process is started with a BootConfig specifying the guest kernel, rootfs, vCPU count, and memory size. forkd issues InstanceStart through Firecracker’s Unix-socket API, the guest kernel boots, and PID 1 (forkd-init.sh) mounts pseudo-filesystems, fixes DNS, and launches the guest agent (forkd-agent.py) on TCP port 8888. This boot happens once per snapshot tag, never per child.
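As a rough illustration of the boot step, the sketch below lists the Firecracker API requests that a boot typically involves. The endpoint paths and field names are the standard Firecracker ones; the function name `boot_requests`, the file paths, and the default boot arguments are assumptions made for this example and are not forkd's actual BootConfig.

```python
import json


def boot_requests(kernel, rootfs, vcpus, mem_mib,
                  boot_args="console=ttyS0 reboot=k panic=1"):
    """Return the (method, path, body) sequence that boots one microVM.

    Hypothetical helper: it only builds the requests, it does not send them.
    """
    return [
        ("PUT", "/boot-source",
         {"kernel_image_path": kernel, "boot_args": boot_args}),
        ("PUT", "/drives/rootfs",
         {"drive_id": "rootfs", "path_on_host": rootfs,
          "is_root_device": True, "is_read_only": False}),
        ("PUT", "/machine-config",
         {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        # InstanceStart is the request that actually starts the guest vCPUs.
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]


# Placeholder paths; a real parent would point at its own kernel and rootfs.
for method, path, body in boot_requests("vmlinux", "rootfs.ext4", 2, 512):
    print(method, path, json.dumps(body))
```

Each tuple would be sent over the parent's API socket in order; only the last one starts the guest.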

Warmup

After boot, user-space warm-up runs inside the VM. For a Python parent this means import numpy, import torch, or whichever libraries your agent workload needs. Any work the parent does — JIT compilation, model weight loading, disk prefetch — lands in resident RAM pages that every future child will inherit without paying the cost again.
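A warm-up step for a Python parent can be as small as importing the libraries up front. The sketch below is one possible shape, assuming a hypothetical `warm_up` helper that runs inside the guest before the snapshot is taken; it is not forkd's own warm-up code.

```python
import importlib
import time


def warm_up(module_names):
    """Import each module once so its pages are resident before the snapshot.

    Returns a dict of module name -> import time in seconds, or None when the
    module is not installed in this image.
    """
    timings = {}
    for name in module_names:
        start = time.perf_counter()
        try:
            importlib.import_module(name)
        except ImportError:
            timings[name] = None  # missing in this image; skip it
            continue
        timings[name] = time.perf_counter() - start
    return timings
```

Calling `warm_up(["numpy", "torch"])` in the parent means children restored from the snapshot find those modules already loaded.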

Pause

forkd snapshot issues a PATCH /vm {"state": "Paused"} to the parent’s Firecracker socket. The guest vCPUs are halted; the VM is frozen in a deterministic state. The parent process keeps running but is no longer executing guest instructions.

Snapshot to disk

Firecracker’s PUT /snapshot/create writes two files to the snapshot directory:

memory.bin — the full guest physical memory image, one contiguous file.
vmstate — serialised vCPU register state, device state, and metadata.

These two files are the durable, reusable artifact. Snapshot creation takes a few seconds the first time; after that, forkd pack and forkd push can ship the snapshot as a single .tar.zst file (typically 23× compression: a 512 MiB memory.bin becomes ~22 MiB on disk).
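The pause and snapshot steps map onto two Firecracker API requests. The sketch below builds them under the assumption that the snapshot is a full snapshot and that the two files are named as described above; `snapshot_requests` is a hypothetical helper, not part of forkd.

```python
import os


def snapshot_requests(snapshot_dir):
    """Return the (method, path, body) pair that pauses a VM and snapshots it."""
    return [
        # Halt the guest vCPUs first; Firecracker only snapshots a paused VM.
        ("PATCH", "/vm", {"state": "Paused"}),
        ("PUT", "/snapshot/create", {
            "snapshot_type": "Full",
            "snapshot_path": os.path.join(snapshot_dir, "vmstate"),
            "mem_file_path": os.path.join(snapshot_dir, "memory.bin"),
        }),
    ]


def compression_ratio(raw_mib, packed_mib):
    """Size ratio between the raw memory image and the packed archive."""
    return raw_mib / packed_mib
```

With the figures quoted above, `compression_ratio(512, 22)` comes out slightly above 23, which matches the stated 23× compression.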

Restore N children (fork-out)

Snapshot::restore_many_with spawns N Firecracker processes in parallel. Each one receives:

PUT /snapshot/load with mem_backend.backend_type: "File" and MEMORY_LOAD_PRIVATE — Firecracker calls mmap(memory.bin, MAP_PRIVATE), not read(). The kernel maps the file pages into the child’s address space but does not copy them. All N children point at the same physical pages.
Placement into a dedicated cgroup v2 leaf (/sys/fs/cgroup/forkd/child-N/) with memory.max set to the configured quota.
Assignment to a pre-provisioned network namespace (forkd-child-N) with its own tap device, IP stack, and veth pair to the shared forkd-br0 bridge.

Wall-clock time for N=100 children on a 20 vCPU host: ~101 ms.
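To make the three per-child inputs concrete, the sketch below assembles them for one child. The `/snapshot/load` fields follow the Firecracker API and the cgroup and netns names follow the patterns given above; the `child_plan` helper and the `resume_vm: True` choice are assumptions for illustration.

```python
import os

MIB = 1024 * 1024


def child_plan(snapshot_dir, index, memory_limit_mib):
    """Describe what one restored child needs; nothing is executed here."""
    return {
        "load": ("PUT", "/snapshot/load", {
            "snapshot_path": os.path.join(snapshot_dir, "vmstate"),
            # A File backend lets the memory image be mapped rather than read.
            "mem_backend": {
                "backend_type": "File",
                "backend_path": os.path.join(snapshot_dir, "memory.bin"),
            },
            "resume_vm": True,  # assumption: resume right after load
        }),
        "cgroup_dir": f"/sys/fs/cgroup/forkd/child-{index}",
        # cgroup v2 memory.max is written as a byte count.
        "memory_max_bytes": memory_limit_mib * MIB,
        "netns": f"forkd-child-{index}",
    }


plans = [child_plan("/snap/py", i, 256) for i in range(3)]
```

Every plan points at the same memory.bin; only the cgroup leaf and network namespace differ per child.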

Copy-on-write memory model

After restore, each child has a MAP_PRIVATE mapping of memory.bin. The Linux page cache holds the physical pages backing that file. When a child reads a page, the kernel services the fault from the shared page cache, with no copy and no additional memory. When a child writes to a page for the first time, the kernel:

Allocates a new physical page for that child.
Copies the original page contents into the new page (copy-on-write fault).
Remaps the child’s virtual address to the new private page.

From this point the child owns that page independently; the original backing page remains shared among all other children that haven’t yet written to it. Measured overhead:

Metric	Value
Host memory delta per child (N=100, 512 MiB Python+numpy parent)	0.12 MiB
Firecracker process resident size before any guest state	~5 MiB
Wall-clock to fork N=100 children	101 ms

The 0.12 MiB per-child overhead figure covers only the pages that diverged during the measurement workload (import numpy; numpy.zeros(5).tolist()). Heavier agents will diverge more pages over time — the parent’s resident size sets the upper bound, not the lower bound. vCPU count and process count dominate capacity planning before memory does on typical workloads.
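The copy-on-write behaviour described in this section can be observed from user space with two private mappings of one file. This is a small stand-alone demonstration using Python's `mmap.ACCESS_COPY` (which maps with MAP_PRIVATE on Linux); the temporary file stands in for memory.bin and the two mappings stand in for two children.

```python
import mmap
import os
import tempfile


def cow_demo():
    """Write through one private mapping; show the other mapping and the file
    are unaffected. Returns (child_a_byte, child_b_byte, on_disk_byte)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"A" * mmap.PAGESIZE * 2)  # stand-in for memory.bin
        path = f.name
    try:
        with open(path, "rb") as fa, open(path, "rb") as fb:
            child_a = mmap.mmap(fa.fileno(), 0, access=mmap.ACCESS_COPY)
            child_b = mmap.mmap(fb.fileno(), 0, access=mmap.ACCESS_COPY)
            child_a[0:1] = b"Z"  # first write triggers a private page copy
            seen = (child_a[0:1], child_b[0:1])
            child_a.close()
            child_b.close()
        with open(path, "rb") as f:
            on_disk = f.read(1)
    finally:
        os.unlink(path)
    return seen + (on_disk,)


print(cow_demo())
```

Only the mapping that wrote sees the new byte; the second mapping and the file on disk still hold the original contents.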

The parent’s memory.bin is written to ext4 by default. The host page cache backs hot pages; with hugepages provisioned (512 × 2 MiB pages, per scripts/setup-host.sh), the kernel can back hot regions with 2 MiB TLB entries, reducing page-table pressure at high N.
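The figures in this section turn into rough planning numbers with simple arithmetic. The helpers below are hypothetical and only restate the quoted measurements: the light-workload per-child delta, the parent size as the worst case, and the number of page mappings needed at each page size.

```python
def child_memory_range_mib(n_children, diverged_mib_per_child, parent_mib):
    """Return (light-workload estimate, worst case) in MiB for N children.

    The worst case assumes every child eventually rewrites every parent page.
    """
    return (n_children * diverged_mib_per_child, n_children * parent_mib)


def pages_needed(region_mib, page_kib):
    """Number of pages of the given size needed to map a region."""
    return region_mib * 1024 // page_kib


light, worst = child_memory_range_mib(100, 0.12, 512)
small_pages = pages_needed(512, 4)     # 4 KiB pages for a 512 MiB image
huge_pages = pages_needed(512, 2048)   # 2 MiB pages for the same image
hugepage_pool_mib = 512 * 2            # 512 hugepages of 2 MiB each
```

A 512 MiB image needs 131072 mappings at 4 KiB but only 256 at 2 MiB, which is where the page-table saving comes from.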

BRANCH: forking a live running VM

BRANCH is the inverse of the warmup snapshot — instead of snapshotting a freshly booted parent, you snapshot a running sandbox mid-execution, then resume the source and fork children from the new snapshot. An agent can branch mid-thought: three children each receive a different steering hint while inheriting the same prior reasoning state and filesystem. forkd offers three BRANCH modes with different pause-window tradeoffs:

Mode	Source pause window	Total BRANCH API time	Notes
Full	29+ s (bandwidth-bound copy of full `memory.bin`)	29+ s	Baseline; not recommended for running agents
Diff	~150–205 ms (only dirty pages are diffed)	bandwidth-bound on background cp	Source pauses ~200 ms; background merge runs in parallel
Live (v0.4)	56 ms p50 / 64 ms p90	~70 ms with `wait: false`	Source pauses ~56 ms at p50; background copy is disk-independent

The Diff mode improvement over Full mode is 143× on a 4 GiB SSD source at idle (29.3 s → 205 ms). For a typical agent workload with 30–300 MiB of dirty pages, the reduction is 6–15×. v0.3.4 fixed a multi-BRANCH pause anomaly where repeated BRANCHes on the same parent ballooned to 2.7 s; a 30-line posix_fallocate fix keeps consecutive BRANCHes flat at ~150 ms (17.6× faster on the 6th consecutive BRANCH).

How Live BRANCH works (v0.4): The source VM must be spawned with live_fork: true, which backs guest RAM with a memfd shared between the Firecracker process and the controller. When BRANCH is issued:

The controller installs a UFFD_WP (userfaultfd write-protect) watch on the shared memfd — dirty pages will be captured out-of-band.
vCPUs are halted and the vmstate is dumped. The source’s pause window ends here (~56 ms p50).
The source vCPUs resume immediately.
In the background, dirty pages tracked by UFFD_WP are copied into the new snapshot’s memory image. The copy runs asynchronously — disk I/O does not extend the source’s downtime.

With wait: false, the BRANCH API call returns after ~70 ms (as soon as the source has resumed). Poll list_snapshots until status: "ready" to know when the background copy is complete and children can be forked from the new tag.
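The polling step can be written as a small loop. In the sketch below, `list_snapshots` is passed in as a callable because its real signature is not specified here; the assumption is that it returns a list of dicts carrying `tag` and `status` keys.

```python
import time


def wait_until_ready(list_snapshots, tag, timeout_s=30.0, interval_s=0.05):
    """Poll until the snapshot named `tag` reports status "ready".

    `list_snapshots` is any zero-argument callable returning a list of dicts;
    the "tag" and "status" key names are assumptions for this example.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        for snap in list_snapshots():
            if snap.get("tag") == tag and snap.get("status") == "ready":
                return snap
        if time.monotonic() >= deadline:
            raise TimeoutError(f"snapshot {tag!r} not ready after {timeout_s}s")
        time.sleep(interval_s)
```

Children should only be forked from the new tag after this returns, since the background copy may still be writing the memory image before then.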

Live BRANCH requires the vendored Firecracker fork at deeplethe/firecracker:forkd-v0.4-mem-backend-shared-v1.12. The fork adds mem_backend.backend_type: "Shared" with shared: true, which maps guest memory with MAP_SHARED; this is the one gap that could not be worked around in vanilla upstream Firecracker. An upstream proposal is open; once it lands, the vendor requirement goes away.

Architecture components

forkd is composed of four cooperating components:

`forkd` CLI

The forkd binary is the operator’s interface. Key verbs:

Verb	What it does
`forkd quickstart`	One-command preflight + snapshot + fork
`forkd doctor`	16-check host diagnostic with fix hints
`forkd snapshot`	Boot a parent, warm it, pause, write `memory.bin` + `vmstate`
`forkd fork`	Restore N children from a snapshot tag
`forkd bench`	Measure spawn, exec, BRANCH, and fan-out latency against a tag
`forkd pack` / `forkd unpack`	Bundle a snapshot (+ chain ancestors) into a single `.tar.zst`
`forkd push` / `forkd pull`	Publish to or fetch from the Snapshot Hub
`forkd images`	List local snapshots with sizes
`forkd snapshot-diff`	Build a diff-snapshot layer on top of an existing tag (v0.5)
`forkd snapshot-compact`	Flatten a deep diff chain into a single layer

`forkd-controller` daemon

The controller is a long-running daemon that owns the authoritative state of all snapshots and live sandboxes. It exposes:

REST API on 127.0.0.1:8889 (or a configured address) — POST /v1/sandboxes, GET /v1/sandboxes, POST /v1/sandboxes/:id/branch, DELETE /v1/sandboxes/:id.
Bearer-token auth via a secret loaded from /etc/forkd/token. Constant-time comparison guards against timing oracles.
Prometheus /metrics — forkd_snapshots_total, forkd_sandboxes_active, forkd_build_info.
Append-only JSON audit log at /var/log/forkd/audit.log — one JSON-Lines object per request with RFC3339 timestamp, method, path, status, latency in microseconds, and user-agent.
Graceful shutdown and reconciliation on restart (prunes sandbox entries whose Firecracker PID is gone from /proc).
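Two of the behaviours listed above, constant-time token comparison and the JSON-Lines audit record, are easy to sketch. The code below is an illustration only: the function names and the audit field names (`ts`, `latency_us`, and so on) are assumptions, not the controller's actual schema.

```python
import hmac
import json
from datetime import datetime, timezone


def token_ok(authorization_header, expected_token):
    """Check a 'Bearer <token>' header without leaking timing information."""
    prefix = "Bearer "
    if not authorization_header.startswith(prefix):
        return False
    presented = authorization_header[len(prefix):]
    # compare_digest takes time independent of where the inputs differ.
    return hmac.compare_digest(presented.encode(), expected_token.encode())


def audit_line(method, path, status, latency_us, user_agent, now=None):
    """Format one audit record as a single JSON line."""
    now = now or datetime.now(timezone.utc)
    return json.dumps({
        "ts": now.isoformat(),  # UTC offset form is valid RFC 3339
        "method": method,
        "path": path,
        "status": status,
        "latency_us": latency_us,
        "user_agent": user_agent,
    })
```

A plain `==` comparison can return early at the first differing byte, which is the timing oracle the constant-time comparison avoids.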

Run as a systemd service:

sudo install -m 0644 packaging/systemd/forkd-controller.service /etc/systemd/system/
sudo mkdir -p /etc/forkd
sudo bash -c 'head -c 32 /dev/urandom | base64 > /etc/forkd/token'
sudo chmod 600 /etc/forkd/token
sudo systemctl enable --now forkd-controller
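Once the service is running, a client needs the token from /etc/forkd/token on every request. The sketch below builds an authenticated request with the standard library; `controller_request` is a hypothetical helper, the default address is the one quoted above, and request-body schemas are not covered because they are not specified here.

```python
import json
import urllib.request


def controller_request(token, method, path, body=None,
                       base="http://127.0.0.1:8889"):
    """Build (but do not send) an authenticated controller request."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(base + path, data=data, method=method)
    req.add_header("Authorization", f"Bearer {token}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


# Usage on a host with the controller running (assumes the token file holds
# the token followed by a newline):
#   token = open("/etc/forkd/token").read().strip()
#   with urllib.request.urlopen(controller_request(token, "GET", "/v1/sandboxes")) as r:
#       print(r.status, r.read())
```

Because the token file is mode 600 and owned by root, the client has to run with enough privilege to read it.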

`forkd-vmm` library

The forkd-vmm crate is the Firecracker wrapper that all other components build on. It provides:

BootConfig — typed configuration for kernel, rootfs, vCPU count, memory, and network.
Vm — lifecycle methods: boot, pause, snapshot_to, resume, kill.
Snapshot — restore_many_with(n, opts) spawns N Firecracker processes in parallel, each with its own MAP_PRIVATE memory mapping.
ForkOpts — controls memory_limit_mib, per_child_netns, live_fork, and related options.
cgroup helpers — creates and populates cgroup v2 leaves under /sys/fs/cgroup/forkd/child-N/.
Network namespace plumbing — setns(2) into forkd-child-N for each child’s agent communication.
Raw HTTP/1.1 over Unix socket — typed wrappers around every Firecracker API endpoint without pulling in an async HTTP client.
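The last item, raw HTTP/1.1 over a Unix socket, needs very little machinery. forkd-vmm is a Rust crate; the sketch below shows the same idea in Python for brevity, with hypothetical helper names, and is not a translation of the crate's code.

```python
import json
import socket


def format_request(method, path, body=None):
    """Serialise one HTTP/1.1 request with a JSON body into bytes."""
    payload = b"" if body is None else json.dumps(body).encode()
    head = (
        f"{method} {path} HTTP/1.1\r\n"
        "Host: localhost\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode()
    return head + payload


def parse_head(head):
    """Return (status_code, content_length) from raw response header bytes."""
    lines = head.decode("latin-1").split("\r\n")
    status = int(lines[0].split()[1])
    length = 0
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-length":
            length = int(value.strip())
    return status, length


def unix_http(socket_path, method, path, body=None, timeout=5.0):
    """Send one request over a Unix socket; return (status, body_bytes)."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(socket_path)
        sock.sendall(format_request(method, path, body))
        buf = b""
        while b"\r\n\r\n" not in buf:
            chunk = sock.recv(4096)
            if not chunk:
                raise ConnectionError("socket closed before headers ended")
            buf += chunk
        head, _, rest = buf.partition(b"\r\n\r\n")
        status, length = parse_head(head)
        while len(rest) < length:
            chunk = sock.recv(4096)
            if not chunk:
                break
            rest += chunk
    return status, rest[:length]
```

For example, `unix_http("/path/to/firecracker.sock", "PATCH", "/vm", {"state": "Paused"})` would pause a VM, where the socket path is a placeholder.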

Guest agent (`forkd-agent.py`)

forkd-agent.py is a minimal TCP server running on port 8888 inside every sandbox (parent and child alike). It handles three message types:

ping: health check; the controller uses this to confirm a child is alive after restore.
exec: run a shell command and capture stdout, stderr, and exit code.
eval: evaluate a Python expression in the agent’s already-running interpreter instead of spawning a new process. This is the ~1 ms path vs ~96 ms for a cold subprocess.

forkd-init.sh (PID 1) mounts pseudo-filesystems, fixes /etc/resolv.conf, and launches forkd-agent.py before entering an idle wait loop.
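The three message types can be sketched as a single dispatch function. The wire format is not specified here, so the example assumes each message is a dict with a `type` key plus `cmd` or `expr`; those names and the reply shape are hypothetical and not forkd-agent.py's actual protocol.

```python
import subprocess

# Globals shared by every eval call, so state set up during warm-up persists.
NAMESPACE = {}


def handle(message):
    """Dispatch one decoded agent message and return a reply dict."""
    kind = message.get("type")
    if kind == "ping":
        return {"ok": True}
    if kind == "exec":
        # Cold path: spawns a shell for every call.
        proc = subprocess.run(message["cmd"], shell=True,
                              capture_output=True, text=True)
        return {"ok": True, "stdout": proc.stdout, "stderr": proc.stderr,
                "exit_code": proc.returncode}
    if kind == "eval":
        # Warm path: reuses the interpreter that is already running.
        try:
            return {"ok": True, "value": eval(message["expr"], NAMESPACE)}
        except Exception as exc:
            return {"ok": False, "error": repr(exc)}
    return {"ok": False, "error": f"unknown message type: {kind!r}"}
```

The gap between the two paths comes from process creation: `exec` pays for a new shell each time, while `eval` runs inside a process that already exists.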

System requirements

Requirement	Minimum	Recommended	Notes
Kernel	Linux 5.7	Linux 5.18+	5.18+ for automatic RNG re-seed via `vmgenid`
Architecture	x86_64	x86_64	aarch64 is tracked; not yet tested in CI
KVM	`/dev/kvm` present + writable	bare-metal	Nested virt works but adds overhead
cgroup	v2 unified hierarchy	v2	`mount -t cgroup2` must succeed
Firecracker	v1.7+	v1.10.1	v0.4 live-fork requires vendored fork
Network	iproute2, iptables	+ bridge-utils	tap + veth + MASQUERADE rule on `forkd-br0`
Per-child netns	`scripts/netns-setup.sh N`	—	One named netns per child, pre-provisioned
`uffd_wp`	`vm.unprivileged_userfaultfd=1` or `CAP_SYS_PTRACE`	—	Required for v0.4 live BRANCH only

All children must live on the same host as their parent snapshot. The MAP_PRIVATE CoW primitive requires the parent’s memory.bin to be accessible in the host’s page cache. Multi-host scheduling is a v1.x roadmap item.
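A few of the requirements in the table can be checked programmatically before anything else is installed. The sketch below is a hypothetical preflight covering three rows (kernel version, /dev/kvm access, cgroup v2); it is not the check list that forkd doctor runs.

```python
import os
import platform
import re

MIN_KERNEL = (5, 7)


def parse_kernel(release):
    """Extract (major, minor) from a release string like '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def preflight(release=None):
    """Return a dict of check name -> bool for the current host."""
    release = release or platform.release()
    return {
        "kernel>=5.7": parse_kernel(release) >= MIN_KERNEL,
        "kvm_writable": os.access("/dev/kvm", os.R_OK | os.W_OK),
        # This file exists only when the v2 unified hierarchy is mounted here.
        "cgroup_v2": os.path.exists("/sys/fs/cgroup/cgroup.controllers"),
    }
```

Running `preflight()` on the target host returns one boolean per check, which makes it easy to fail fast in provisioning scripts.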
