This runbook covers everything from the first host configuration through daily monitoring and the recovery procedures you will reach for when something goes wrong. Work through the bring-up section once per host; keep the failure-mode section bookmarked for on-call response. All commands assume a single-host deployment where the daemon binds on loopback (127.0.0.1:8889); adapt addresses if you have enabled non-loopback TLS.
1. Bring-up
Host prerequisites
Before installing the daemon, confirm the host meets the following requirements:
- x86_64 Linux, kernel 5.10 or newer (5.20+ recommended; vmgenid for automatic per-child RNG re-seed ships in 5.20)
- KVM available: /dev/kvm present, and your user in the kvm group
- Firecracker binary v1.7+ on $PATH. Use the vendored fork from deeplethe/firecracker:forkd-v0.4-mem-backend-shared-v1.12 for v0.4 live-fork support; forkd doctor checks the version and emits a specific fix hint if it’s wrong
- cgroup v2 unified hierarchy (mount -t cgroup2 should show at least one mountpoint)
- iproute2 installed
Run forkd doctor after completing the steps below. It runs 16 checks and emits a specific fix hint for each non-pass.
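The prerequisites above can also be checked by hand before the daemon is installed. The following is a minimal sketch, not a substitute for forkd doctor: the function names version_ge and preflight are illustrative and not part of forkd, and the kernel check compares only major.minor.

```shell
# Hypothetical preflight helpers; forkd doctor remains the authoritative check.

# version_ge A B: succeeds when version A >= B, comparing major.minor only.
# Expects at least "major.minor" in both arguments.
version_ge() {
  local a_major a_rest a_minor b_major b_rest b_minor
  a_major="${1%%.*}"; a_rest="${1#*.}"; a_minor="${a_rest%%[!0-9]*}"
  b_major="${2%%.*}"; b_rest="${2#*.}"; b_minor="${b_rest%%[!0-9]*}"
  [ "$a_major" -gt "$b_major" ] && return 0
  [ "$a_major" -eq "$b_major" ] && [ "${a_minor:-0}" -ge "${b_minor:-0}" ]
}

# preflight: print one FAIL line per unmet prerequisite; non-zero exit if any.
preflight() {
  local fail=0
  version_ge "$(uname -r)" 5.10 || { echo "FAIL kernel $(uname -r) is older than 5.10"; fail=1; }
  [ -e /dev/kvm ] || { echo "FAIL /dev/kvm not present"; fail=1; }
  id -nG | tr ' ' '\n' | grep -qx kvm || { echo "FAIL current user is not in the kvm group"; fail=1; }
  command -v firecracker >/dev/null || { echo "FAIL firecracker not on PATH"; fail=1; }
  mount -t cgroup2 | grep -q . || { echo "FAIL no cgroup2 mountpoint"; fail=1; }
  command -v ip >/dev/null || { echo "FAIL iproute2 (ip) not installed"; fail=1; }
  return "$fail"
}
```

Calling preflight prints nothing and returns 0 when every check passes. It does not verify the Firecracker version or the vendored fork; that is left to forkd doctor.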
One-shot install
Build and install binaries
(Optional) Configure TLS
Required for any non-loopback bind. Drop a cert.pem and key.pem from Let’s Encrypt or your internal CA into /etc/forkd/tls/ and add --tls-cert / --tls-key to the ExecStart line in the systemd unit before the next step.
2. Daily operations
Auth: token rotation
The daemon reads the bearer token once at startup from /etc/forkd/token. Rotate by writing a new value and restarting:
Existing sandboxes survive the restart; in-flight HTTP requests do not. Update any SDK clients or CI secrets that hold the old token value before restarting.
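The rotation described above can be sketched as a small helper that writes the new token atomically, so the daemon never reads a half-written file. This is a sketch under assumptions: write_token is a hypothetical helper, the 64-character hex format is an arbitrary choice (the required token format is not specified here), and the systemd unit name forkd in the usage comment is assumed.

```shell
# write_token PATH: generate a random token and install it atomically at PATH.
# Token format (64 hex characters, no trailing newline) is an assumption.
write_token() {
  local path="$1" tmp
  # Create the temp file next to the target so mv is a same-filesystem rename.
  tmp="$(mktemp "${path}.XXXXXX")" || return 1
  chmod 600 "$tmp"
  # 32 random bytes from the kernel CSPRNG, hex-encoded.
  od -An -N32 -tx1 /dev/urandom | tr -d ' \n' > "$tmp"
  mv "$tmp" "$path"
}

# Usage (as root; the unit name "forkd" is an assumption):
#   write_token /etc/forkd/token && systemctl restart forkd
```

Distribute the new value to SDK clients and CI secrets before the restart, since the daemon only picks up the file at startup.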
Metrics: what to scrape and suggested alerts
Scrape :8889/metrics from Prometheus. With a loopback-only bind you may not need a credential on the scrape; with a non-loopback bind, pass the bearer token as an Authorization: Bearer header in your Prometheus scrape_configs.
Stable metric names that should always be present:
| Metric | Description |
|---|---|
| forkd_build_info | Always 1. Labels carry version. Absence means the daemon is down. |
| forkd_snapshots_total | Number of registered snapshots. |
| forkd_sandboxes_active | Number of live child Firecracker processes. |
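Before wiring up alerts, it can be useful to confirm that the three stable metrics in the table are actually exposed. The sketch below reads a metrics response on stdin and prints any expected name that is missing. missing_metrics is a hypothetical helper, and the curl invocation in the comment assumes plain HTTP on loopback and an Authorization: Bearer header.

```shell
# missing_metrics: read Prometheus text exposition on stdin and print each
# expected metric name that has no sample line. Prints nothing if all exist.
missing_metrics() {
  local body name
  body="$(cat)"
  for name in forkd_build_info forkd_snapshots_total forkd_sandboxes_active; do
    # A sample line starts with the name followed by "{" (labels) or a space.
    printf '%s\n' "$body" | grep -Eq "^${name}[{ ]" || echo "$name"
  done
}

# Usage (header format and http scheme are assumptions):
#   curl -fsS -H "Authorization: Bearer $(cat /etc/forkd/token)" \
#     http://127.0.0.1:8889/metrics | missing_metrics
```

Comment lines such as # HELP and # TYPE are ignored because they do not start with the metric name.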
Audit log: format and rotation
The daemon appends one JSON line per request to /var/log/forkd/audit.log:
Rotate the file with logrotate. The daemon does not yet re-open the file on SIGHUP, so restart the daemon after a rotation:
Suggested logrotate stanza:
3. Failure modes and recovery
Daemon won’t start: “bind 127.0.0.1:8889” fails
Another process is already on port 8889. Find and stop it:
POST /v1/sandboxes returns 500 “restore_many: …”
This error almost always means one of:
Snapshot / kernel mismatch: the snapshot was created against a different Firecracker kernel version than what is available on this host. Re-create the snapshot:
No free space in /tmp: each child needs ~5 MiB of work-dir space. Check:
Host out of memory: the host has hit memory.max on /sys/fs/cgroup/forkd/ or simply has no free RAM. Check:
Lower memory_limit_mib per child, or reduce the number of concurrently active sandboxes.
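The disk and memory checks above come down to a little arithmetic. As a rough sketch, using the ~5 MiB per-child work-dir figure from this section, the helpers below estimate how many more children the scratch filesystem can hold. free_kib and children_fit_in are hypothetical names, and the estimate ignores anything else writing to /tmp.

```shell
# free_kib DIR: available space on DIR's filesystem in KiB (POSIX df format).
free_kib() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# children_fit_in KIB: number of ~5 MiB work dirs that fit in KIB of free space.
children_fit_in() {
  echo $(( $1 / (5 * 1024) ))
}

# Usage:
#   children_fit_in "$(free_kib /tmp)"          # headroom for new work dirs
#   cat /sys/fs/cgroup/forkd/memory.max         # cgroup limit ("max" = unlimited)
#   free -m                                     # host RAM still available
```

If the first number is smaller than the fan-out you are requesting, the restore_many failure is most likely a disk problem rather than a snapshot mismatch.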
Daemon not responding
Run forkd doctor
forkd doctor runs 16 checks including KVM availability, Firecracker binary version, cgroup v2 mount, netns provisioning, snapshot dir disk space, and controller reachability. Review every non-green line for a specific fix hint.
Children are alive but exec/eval times out
Most likely a network namespace mismatch. With per_child_netns: true, each child’s in-guest agent is reachable only from within its own netns. The daemon does the setns(2) on your behalf, but the host must have /var/run/netns/forkd-child-<i> provisioned. Re-run the netns setup script if you restarted networking or rebooted:
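To see which namespaces are missing before or after re-provisioning, a small check over /var/run/netns is enough. This is a minimal sketch: missing_netns is a hypothetical helper, and it assumes child indices start at 0 and run contiguously, which may not match your setup script.

```shell
# missing_netns COUNT [DIR]: print each forkd-child-<i> (i = 0..COUNT-1) that
# has no entry under DIR. DIR defaults to /var/run/netns.
# Assumption: child indices start at 0.
missing_netns() {
  local count="$1" dir="${2:-/var/run/netns}" i=0
  while [ "$i" -lt "$count" ]; do
    [ -e "$dir/forkd-child-$i" ] || echo "forkd-child-$i"
    i=$((i + 1))
  done
}

# Usage: list gaps among the first 100 child namespaces.
#   missing_netns 100
```

An empty result means every expected namespace entry exists; the timeouts then have another cause and forkd doctor is the next step.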
Child OOM: sandbox killed by kernel
A child sandbox was killed because its guest hit memory.max. Diagnose:
Raise memory_limit_mib when spawning, or reduce the number of concurrent children. The parent snapshot’s resident size is shared via CoW (measured at 0.12 MiB overhead per child at N=100), so the OOM is almost always caused by write amplification inside the guest, not the shared base.
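One way to confirm a cgroup-level OOM kill is the oom_kill counter in the cgroup v2 memory.events file. The sketch below parses that counter. oom_kills is a hypothetical helper, and using /sys/fs/cgroup/forkd/memory.events as the file to read is an assumption based on the cgroup path mentioned in this runbook; per-child cgroup paths may differ.

```shell
# oom_kills FILE: print the oom_kill counter from a cgroup v2 memory.events
# file, or 0 if the file has no such line.
oom_kills() {
  awk '$1 == "oom_kill" { print $2; found = 1 } END { if (!found) print 0 }' "$1"
}

# Usage (cgroup path is an assumption):
#   oom_kills /sys/fs/cgroup/forkd/memory.events
#   dmesg | grep -i 'killed process'    # kernel log view of the same event
```

A counter that increases between two readings means the kernel killed at least one process in that cgroup subtree in the interval.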
Snapshot directory full
Stale work dirs accumulate
Child work dirs in /tmp (or the daemon’s configured scratch path) can accumulate if sandboxes are killed without a clean shutdown. Prune with:
Firecracker binary mismatch
If forkd doctor reports a Firecracker version mismatch:
Run forkd doctor to confirm.
Sandboxes lost after host reboot
Firecracker processes do not survive host reboots. The daemon’s state file (/var/lib/forkd/state.json) persists, but on startup reconcile() checks /proc/<pid> for each registered sandbox and prunes entries whose process is gone. Snapshots (the paused disk images) do survive reboots and can be used to fork new children immediately. Create new sandboxes from the existing snapshot tags:
4. Upgrade procedure
Start the daemon
reconcile() runs on startup and prunes any sandbox entries whose Firecracker PID is no longer alive.
5. Backup: snapshot portability with forkd pack
Snapshot directories are self-contained and portable. Pack any registered snapshot into a single .tar.zst archive that bundles the vmstate, memory image, and a per-file sha256 manifest:
For a snapshot that is part of a diff chain, forkd pack bundles every ancestor in the chain into one tarball:
6. Capacity planning
On a 20 vCPU / 30 GiB host, forkd has been exercised at N=200 children sharing one snapshot in ~750 ms wall-clock. Sustained throughput depends on:
- KSM enabled and tuned (scripts/setup-host.sh sets sensible defaults)
- Parent snapshot memory image size: smaller parents fork faster
- per_child_netns: adds ~3 ms per child for netns setup and agent probe round-trip
- Snapshot chain depth: each additional link in a v0.5 diff chain adds ~450 ms to spawn time on ext4 (SHA-256 verification of the base layer)
Run bench/bench-spawn-100.sh on your own host to get hardware-specific numbers. For per-Pod sizing on Kubernetes, see the Enterprise deployment FAQ:
- ~1 actively-running agent per vCPU (compute-bound)
- ~50 idle-pooled agents per 8 GiB Pod RAM (process-state bottleneck)
- 0.12 MiB CoW overhead per child at N=100 — memory rarely caps fan-out
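The rules of thumb above translate into simple arithmetic. The following sketch turns them into helpers for a first-pass estimate. The function names are hypothetical, and scaling the 0.12 MiB figure linearly beyond N=100 is an assumption, since it was measured at that fan-out only.

```shell
# active_agents VCPUS: ~1 actively-running agent per vCPU.
active_agents() {
  echo "$1"
}

# idle_pool RAM_GIB: ~50 idle-pooled agents per 8 GiB of Pod RAM (rounded down).
idle_pool() {
  echo $(( $1 * 50 / 8 ))
}

# cow_overhead_mib N: estimated CoW overhead in MiB for N children at
# 0.12 MiB each (linear extrapolation from the N=100 measurement).
cow_overhead_mib() {
  LC_ALL=C awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 0.12 }'
}

# Usage:
#   active_agents 20       # compute-bound ceiling for a 20 vCPU Pod
#   idle_pool 16           # idle pool size for a 16 GiB Pod
#   cow_overhead_mib 100   # shared-base overhead at N=100
```

Treat the results as starting points and confirm them with the bench script on your own hardware.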