forkd operator runbook: bring-up, monitoring, and recovery

This runbook covers everything from the first host configuration through daily monitoring and the recovery procedures you will reach for when something goes wrong. Work through the bring-up section once per host; keep the failure-mode section bookmarked for on-call response. All commands assume a single-host deployment where the daemon binds on loopback (127.0.0.1:8889) — adapt addresses if you have enabled non-loopback TLS.

1. Bring-up

Host prerequisites

Before installing the daemon, confirm the host meets the following requirements:

x86_64 Linux, kernel 5.10 or newer (5.18+ recommended; the vmgenid driver behind automatic per-child RNG re-seed landed in 5.18)
KVM available: /dev/kvm present, and your user in the kvm group
Firecracker binary v1.7+ on $PATH — use the vendored fork from deeplethe/firecracker:forkd-v0.4-mem-backend-shared-v1.12 for v0.4 live-fork support; forkd doctor checks the version and emits a specific fix hint if it’s wrong
cgroup v2 unified hierarchy (mount -t cgroup2 should show at least one mountpoint)
iproute2 installed

Run forkd doctor after completing the steps below — it runs 16 checks and emits specific fix hints for each non-pass.

One-shot install

Host setup: KVM, Firecracker, network

sudo bash scripts/setup-host.sh          # KVM, Firecracker, KSM tuning
sudo bash scripts/netns-setup.sh 100     # provision 100 per-child netns

Build and install binaries

cargo build --release
sudo install -m 0755 target/release/forkd-controller /usr/local/bin/
sudo install -m 0755 target/release/forkd            /usr/local/bin/

Alternatively, extract the pre-built tarball from the GitHub releases page:

curl -sSL https://github.com/deeplethe/forkd/releases/download/v0.5.2/forkd-v0.5.2-x86_64-linux.tar.gz \
  | sudo tar -xz -C /usr/local/bin/

Install the systemd unit

sudo install -m 0644 packaging/systemd/forkd-controller.service /etc/systemd/system/

Create directories and generate the bearer token

sudo mkdir -p /etc/forkd /var/lib/forkd /var/log/forkd
sudo bash -c 'head -c 32 /dev/urandom | base64 > /etc/forkd/token'
sudo chmod 600 /etc/forkd/token

(Optional) Configure TLS

Required for any non-loopback bind. Drop a cert.pem and key.pem from Let’s Encrypt or your internal CA into /etc/forkd/tls/ and add --tls-cert / --tls-key to the ExecStart line in the systemd unit before the next step.

sudo mkdir -p /etc/forkd/tls
# place cert.pem + key.pem here, then edit the unit
# (in the drop-in, put an empty "ExecStart=" line before the new ExecStart):
sudo systemctl edit forkd-controller

Enable and start the daemon

sudo systemctl daemon-reload
sudo systemctl enable --now forkd-controller

Verify

curl http://127.0.0.1:8889/healthz
# {"ok":true}

curl http://127.0.0.1:8889/metrics
# forkd_sandboxes_active 0
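
Right after a start or restart the daemon may need a moment before /healthz answers. The following is a minimal sketch of a bounded poll, not part of forkd itself: wait_healthy, FORKD_URL, and FORKD_FETCH are names invented here, and the only thing taken from this runbook is the {"ok":true} response body.

```shell
# Poll /healthz until the body contains "ok":true, or give up.
# FORKD_URL and FORKD_FETCH are illustrative knobs, not forkd settings;
# FORKD_FETCH lets the loop be exercised without a running daemon.
wait_healthy() {
  tries="${1:-10}"   # maximum number of attempts
  delay="${2:-1}"    # seconds to sleep between attempts
  url="${FORKD_URL:-http://127.0.0.1:8889}/healthz"
  i=1
  while [ "$i" -le "$tries" ]; do
    # curl -f turns HTTP errors into a non-zero exit status.
    body="$(${FORKD_FETCH:-curl -fsS} "$url" 2>/dev/null)" || body=""
    case "$body" in
      *'"ok":true'*) echo "healthy after $i attempt(s)"; return 0 ;;
    esac
    i=$((i + 1))
    if [ "$i" -le "$tries" ]; then sleep "$delay"; fi
  done
  echo "not healthy after $tries attempt(s)" >&2
  return 1
}
```

Typical use is `sudo systemctl restart forkd-controller && wait_healthy 15 1`, which fails loudly instead of leaving a half-started daemon unnoticed.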

2. Daily operations

Auth: token rotation

The daemon reads the bearer token once at startup from /etc/forkd/token. Rotate by writing a new value and restarting:

sudo bash -c 'head -c 32 /dev/urandom | base64 > /etc/forkd/token'
sudo chmod 600 /etc/forkd/token
sudo systemctl restart forkd-controller

Existing sandboxes survive the restart; in-flight HTTP requests do not. Update any SDK clients or CI secrets that hold the old token value before restarting.
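
If you script the rotation, one option is to write the new token to a temporary file and rename it into place, so the token file is never empty or half-written. This is a sketch under that assumption; write_token is a hypothetical helper, not a forkd command, and it must run with enough privilege to write the target directory.

```shell
# Write a fresh 32-byte token to "$1" with mode 0600, replacing it atomically.
# The temp-file + mv approach is an illustrative choice, not forkd tooling.
write_token() {
  target="$1"
  dir="$(dirname "$target")"
  (
    # umask 077 keeps the temp file unreadable to group and others.
    umask 077
    tmp="$(mktemp "$dir/.token.XXXXXX")" || exit 1
    head -c 32 /dev/urandom | base64 > "$tmp" || { rm -f "$tmp"; exit 1; }
    # A rename within one directory replaces the old file in a single step.
    mv -f "$tmp" "$target"
  )
}
```

From a root shell, `write_token /etc/forkd/token && systemctl restart forkd-controller` performs the same rotation as the three commands above.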

Metrics: what to scrape and suggested alerts

Scrape :8889/metrics from Prometheus. With a loopback-only bind you may not need a credential on the scrape; with a non-loopback bind, send the token as an Authorization: Bearer header by setting authorization (or the older bearer_token_file) in your Prometheus scrape_configs. Stable metric names that should always be present:

Metric	Description
`forkd_build_info`	Always `1`. Labels carry `version`. Absence means the daemon is down.
`forkd_snapshots_total`	Number of registered snapshots.
`forkd_sandboxes_active`	Number of live child Firecracker processes.
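
For ad-hoc checks from a shell, the metrics endpoint can be read without Prometheus. The helper below is a small sketch that pulls one value out of the Prometheus text format; metric_value is a name made up for this example, and it assumes samples carry no trailing timestamp.

```shell
# Print the value of the first sample of metric "$1" from Prometheus text
# exposition format on stdin. Assumes samples carry no trailing timestamp.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      n = $0
      sub(/[{ ].*/, "", n)              # keep only the metric name
      if (n == name) { print $NF; found = 1; exit }
    }
    END { exit found ? 0 : 1 }          # non-zero when the metric is absent
  '
}
```

For example, `curl -fsS http://127.0.0.1:8889/metrics | metric_value forkd_sandboxes_active` prints the live child count, and the non-zero exit status on a missing metric makes it usable in health scripts.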

Suggested alerts:

- alert: ForkdDaemonDown
  expr: absent(forkd_build_info) == 1
  for: 1m
  annotations:
    summary: "forkd-controller is not reachable — no scrape for 1 minute"

- alert: ForkdSandboxSaturation
  expr: forkd_sandboxes_active > (count(node_cpu_seconds_total{mode="idle"}) * 0.8)
  for: 5m
  annotations:
    summary: "forkd sandbox count > 80% of host vCPU count for 5 minutes"

Audit log: format and rotation

The daemon appends one JSON line per request to /var/log/forkd/audit.log:

{"ts":"2026-05-12T07:12:34Z","method":"POST","path":"/v1/sandboxes","status":201,"latency_us":98342,"ua":"forkd-cli/0.1"}
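
Because every entry is a single flat JSON line, quick summaries need nothing beyond standard text tools. As a sketch, the hypothetical helper below counts requests per HTTP status; it assumes each line keeps the "status":<number> shape shown above and does not parse JSON in general.

```shell
# Summarise an audit log (JSON lines on stdin, or files given as arguments)
# as "<status> <count>" pairs, one per line, sorted by status code.
audit_status_counts() {
  sed -n 's/.*"status":\([0-9][0-9]*\).*/\1/p' "$@" \
    | sort -n \
    | uniq -c \
    | awk '{ print $2, $1 }'            # uniq -c prints count first; swap it
}
```

Running `audit_status_counts /var/log/forkd/audit.log` during an incident shows at a glance whether 500 responses are a handful or the majority.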

Rotate with logrotate. The daemon does not yet re-open the file on SIGHUP — after a rotation, restart the daemon:

sudo systemctl restart forkd-controller

A sample logrotate stanza:

/var/log/forkd/audit.log {
    daily
    rotate 30
    compress
    missingok
    postrotate
        systemctl restart forkd-controller
    endscript
}

3. Failure modes and recovery

Daemon won’t start: “bind 127.0.0.1:8889” fails

Another process is already on port 8889. Find and stop it:

sudo ss -ltnp 'sport = :8889'    # sudo so the owning process is shown
# If a stale daemon is listed:
sudo pkill -f forkd-controller
sudo systemctl start forkd-controller
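
When this check runs from a script, it helps to extract the owning process rather than read the ss output by eye. The sketch below parses `ss -ltnp` output; port_owner is a hypothetical helper, and it assumes the usual iproute2 column layout with a users:(("name",pid=N,...)) process field.

```shell
# From `ss -ltnp` output on stdin, print "<process> <pid>" for the listener
# bound to port "$1". Exits non-zero if nothing is listening on that port.
port_owner() {
  awk -v port="$1" '
    $1 == "LISTEN" && $4 ~ (":" port "$") {
      if (match($0, /users:\(\("[^"]+",pid=[0-9]+/)) {
        s = substr($0, RSTART, RLENGTH)
        sub(/^users:\(\("/, "", s)      # leaves: name",pid=N
        sub(/",pid=/, " ", s)           # leaves: name N
        print s
        found = 1
      }
    }
    END { exit found ? 0 : 1 }
  '
}
```

Used as `sudo ss -ltnp | port_owner 8889`, it matches the port exactly, so a listener on a port such as 18889 is not reported by mistake.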

`POST /v1/sandboxes` returns 500 “restore_many: …”

This error almost always means one of the following three causes.

Snapshot / kernel mismatch: the snapshot was created against a different Firecracker kernel version than the one available on this host. Re-create the snapshot:

forkd images                    # list registered snapshots
forkd rmi <tag>                 # remove the stale snapshot
# Rebuild from your Docker image or kernel + rootfs
sudo -E forkd from-image python:3.12-slim --tag pyagent

Out of disk on /tmp: each child needs ~5 MiB of work-dir space. Check:

df -h /tmp
forkd cleanup --yes             # prune stale work dirs

Out of memory: the host has hit memory.max on /sys/fs/cgroup/forkd/ or simply has no free RAM. Check:

grep -E 'MemAvailable|MemFree' /proc/meminfo
cat /sys/fs/cgroup/forkd/memory.current

Reduce memory_limit_mib per child, or reduce the number of concurrently active sandboxes.
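
To turn the memory check into a go/no-go before a large spawn, something like the sketch below may help. Both helper names are invented for this example, and the check is deliberately pessimistic: it multiplies the per-child limit by the child count and ignores CoW sharing, so it can refuse a spawn that would in practice succeed.

```shell
# Print MemAvailable in MiB from /proc/meminfo-formatted input on stdin.
mem_available_mib() {
  awk '$1 == "MemAvailable:" { printf "%d\n", $2 / 1024; found = 1 }
       END { exit found ? 0 : 1 }'
}

# Succeed if "$1" children with a "$2" MiB limit each fit in MemAvailable.
# Reads meminfo from stdin; returns 2 if MemAvailable cannot be found.
fits_in_memory() {
  avail="$(mem_available_mib)" || return 2
  [ $(( $1 * $2 )) -le "$avail" ]
}
```

For example, `fits_in_memory 50 256 < /proc/meminfo || echo "reduce n or memory_limit_mib"` flags a request for 50 children at 256 MiB each that exceeds what the host currently has available.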

Daemon not responding

Run forkd doctor

forkd doctor

forkd doctor runs 16 checks including KVM availability, Firecracker binary version, cgroup v2 mount, netns provisioning, snapshot dir disk space, and controller reachability. Review every non-green line for a specific fix hint.

Check systemd status and logs

sudo systemctl status forkd-controller
sudo journalctl -u forkd-controller -n 100 --no-pager

Restart the daemon

sudo systemctl restart forkd-controller
curl http://127.0.0.1:8889/healthz

On startup, the daemon runs reconcile(), which prunes registry entries whose Firecracker PID is no longer alive in /proc. Surviving snapshots remain registered.
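
The liveness test behind that pruning can be reproduced by hand when you want to know which entries a restart would drop. This is only an illustration of the /proc check described above, not the daemon's code; dead_pids and PROC_ROOT are names invented here, and a recycled PID would look alive to a check this simple.

```shell
# Print every PID from stdin (one per line) that has no /proc/<pid> entry.
# PROC_ROOT is an illustrative override so the check can run on a fixture.
dead_pids() {
  root="${PROC_ROOT:-/proc}"
  while read -r pid; do
    case "$pid" in
      ''|*[!0-9]*) continue ;;          # ignore blank or non-numeric lines
    esac
    [ -d "$root/$pid" ] || echo "$pid"
  done
}
```

Feed it a list of PIDs collected from wherever you track them, for example `printf '%s\n' 4242 4243 | dead_pids`.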

Children are alive but exec/eval times out

Most likely a network namespace mismatch. With per_child_netns: true, each child’s in-guest agent is reachable only from within its own netns. The daemon does the setns(2) on your behalf, but the host must have /var/run/netns/forkd-child-<i> provisioned. Re-run the netns setup script if you restarted networking or rebooted:

sudo bash scripts/netns-setup.sh 100   # re-provision for N=100 children
sudo systemctl restart forkd-controller
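
Before re-provisioning, you can confirm that namespaces are actually missing. The sketch below lists the expected names that are absent; missing_netns and NETNS_DIR are hypothetical, and the index range is passed explicitly because this example does not assume whether netns-setup.sh numbers children from 0 or from 1.

```shell
# List expected per-child netns names that are absent from NETNS_DIR.
# $1 and $2 are the first and last child index to check (inclusive).
missing_netns() {
  first="$1"; last="$2"
  dir="${NETNS_DIR:-/var/run/netns}"
  i="$first"
  while [ "$i" -le "$last" ]; do
    [ -e "$dir/forkd-child-$i" ] || echo "forkd-child-$i"
    i=$((i + 1))
  done
}
```

An empty result from `missing_netns 0 99` (or `1 100`, depending on the numbering on your host) means the timeout has another cause.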

Child OOM: sandbox killed by kernel

A child sandbox was killed because its guest hit memory.max. Diagnose:

# Check the cgroup accounting for recent OOM events
sudo dmesg | grep -i oom | tail -20
sudo cat /sys/fs/cgroup/forkd/memory.events

Increase memory_limit_mib when spawning, or reduce the number of concurrent children. The parent snapshot’s resident size is shared via CoW (measured at 0.12 MiB overhead per child at N=100) — the OOM is almost always caused by write amplification inside the guest, not the shared base.
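
The memory.events file is a short list of "key value" lines, and the oom_kill key counts processes killed in that cgroup. A small sketch for reading it from a script follows; oom_kill_count is a name chosen for this example.

```shell
# Print the oom_kill counter from cgroup v2 memory.events content on stdin.
# Exits non-zero if the key is missing (for example on a cgroup v1 host).
oom_kill_count() {
  awk '$1 == "oom_kill" { print $2; found = 1 }
       END { exit found ? 0 : 1 }'
}
```

Recording `sudo cat /sys/fs/cgroup/forkd/memory.events | oom_kill_count` before and after a workload shows whether that workload triggered new kills, which the raw counter alone does not.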

Snapshot directory full

forkd images          # list snapshots with sizes
forkd rmi <tag>       # delete a snapshot
# For a chain with dependents, use --cascade:
forkd rmi <tag> --cascade
# Prune stale work dirs:
forkd cleanup --yes

Check disk usage on the snapshot root:

du -sh /var/lib/forkd/snapshots/*
df -h /var/lib/forkd

Stale work dirs accumulate

Child work dirs in /tmp (or the daemon’s configured scratch path) can accumulate if sandboxes are killed without a clean shutdown. Prune with:

forkd cleanup --yes

Firecracker binary mismatch

If forkd doctor reports a Firecracker version mismatch:

firecracker --version
# Compare against what forkd expects (v1.7+ for base; deeplethe fork for v0.4 live-fork)

Replace the binary with the correct version and re-run forkd doctor to confirm.

Sandboxes lost after host reboot

Firecracker processes do not survive host reboots. The daemon’s state file (/var/lib/forkd/state.json) persists, but on startup reconcile() checks /proc/<pid> for each registered sandbox and prunes entries whose process is gone. Snapshots (the paused VM state and memory image stored on disk) do survive reboots and can be used to fork new children immediately. Create new sandboxes from the existing snapshot tags:

TOKEN="$(sudo cat /etc/forkd/token)"   # bearer token written during bring-up
curl -s \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -X POST http://127.0.0.1:8889/v1/sandboxes \
  -d '{"snapshot_tag":"pyagent","n":10,"per_child_netns":true}'
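
If several tags need to be brought back after a reboot, the request can be wrapped in a loop. The following is a sketch, not a forkd feature: sandbox_request_body and respawn are hypothetical helpers, only the three JSON fields from the request above are emitted, and the restriction on tag characters is an assumption made here to keep the hand-built JSON safe.

```shell
# Build the JSON body for POST /v1/sandboxes from a tag and a child count.
sandbox_request_body() {
  tag="$1"; n="$2"
  case "$n" in
    ''|*[!0-9]*) echo "n must be a non-negative integer" >&2; return 1 ;;
  esac
  case "$tag" in
    ''|*[!A-Za-z0-9._-]*) echo "unexpected characters in tag" >&2; return 1 ;;
  esac
  printf '{"snapshot_tag":"%s","n":%s,"per_child_netns":true}\n' "$tag" "$n"
}

# Re-create children for each "tag:n" argument. Needs TOKEN in the environment.
respawn() {
  for spec in "$@"; do
    body="$(sandbox_request_body "${spec%%:*}" "${spec##*:}")" || return 1
    curl -fsS \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -X POST http://127.0.0.1:8889/v1/sandboxes \
      -d "$body" || return 1
  done
}
```

For example, `respawn pyagent:10 py-pandas:4` issues one request per tag and stops at the first failure.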

4. Upgrade procedure

Stop the daemon

sudo systemctl stop forkd-controller

Live sandboxes are killed. The snapshot registry on disk survives.

Install the new binary

# From pre-built tarball:
curl -sSL https://github.com/deeplethe/forkd/releases/download/v0.5.2/forkd-v0.5.2-x86_64-linux.tar.gz \
  | sudo tar -xz -C /usr/local/bin/

# Or from source:
cargo build --release
sudo install -m 0755 target/release/forkd-controller /usr/local/bin/
sudo install -m 0755 target/release/forkd            /usr/local/bin/

Start the daemon

sudo systemctl start forkd-controller

reconcile() runs on startup and prunes any sandbox entries whose Firecracker PID is no longer alive.

Verify the new version

curl http://127.0.0.1:8889/version
# {"version":"0.5.2","api":"v1"}

5. Backup: snapshot portability with `forkd pack`

Snapshot directories are self-contained and portable. Pack any registered snapshot into a single .tar.zst archive that bundles the vmstate, memory image, and a per-file sha256 manifest:

# On the source host:
forkd pack --tag pyagent --out pyagent.forkd-snapshot.tar.zst
# Typical compression: 23× (512 MiB memory.bin → ~22 MiB on disk)

# Transfer to another host, then restore:
forkd unpack pyagent.forkd-snapshot.tar.zst
# Integrity is verified via the manifest's sha256s on unpack.

# Fork immediately on the destination host:
sudo -E forkd fork --tag pyagent -n 10 --per-child-netns

For chains built with diff snapshots (v0.5), forkd pack bundles every ancestor in the chain into one tarball:

forkd pack --tag py-pandas --out py-pandas-chain.tar.zst
forkd unpack py-pandas-chain.tar.zst   # restores all chain links

Back up snapshot directories to object storage (S3, R2, GCS) using forkd push and restore with forkd pull. Integrity is verified on unpack — snapshots can be treated as immutable artefacts.
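
To illustrate what a per-file sha256 manifest buys you, the sketch below builds and checks one with sha256sum for a directory you copy by other means. The manifest format here is the plain sha256sum line format, which is an assumption for this example and is not claimed to match what forkd pack writes; make_manifest and verify_manifest are hypothetical helpers that assume file names without whitespace.

```shell
# Write a per-file sha256 manifest for directory "$1" to stdout.
# Uses the sha256sum line format, an assumption made for this sketch.
make_manifest() {
  ( cd "$1" && find . -type f | sort | xargs sha256sum )
}

# Verify directory "$1" against manifest file "$2"; non-zero on any mismatch.
verify_manifest() {
  # Resolve the manifest to an absolute path before changing directory.
  manifest="$(cd "$(dirname "$2")" && pwd)/$(basename "$2")"
  ( cd "$1" && sha256sum -c "$manifest" >/dev/null )
}
```

A changed or truncated memory image then fails verification instead of surfacing later as a restore error.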

6. Capacity planning

On a 20 vCPU / 30 GiB host, forkd has been exercised at N=200 children sharing one snapshot in ~750 ms wall-clock. Sustained throughput depends on:

KSM enabled and tuned (scripts/setup-host.sh sets sensible defaults)
Parent snapshot memory image size — smaller parents fork faster
per_child_netns — adds ~3 ms per child for netns setup and agent probe round-trip
Snapshot chain depth — each additional link in a v0.5 diff chain adds ~450 ms to spawn time on ext4 (SHA-256 verification of the base layer)

Run bench/bench-spawn-100.sh on your own host to get hardware-specific numbers. For per-Pod sizing on Kubernetes, see the Enterprise deployment FAQ:

~1 actively-running agent per vCPU (compute-bound)
~50 idle-pooled agents per 8 GiB Pod RAM (process-state bottleneck)
0.12 MiB CoW overhead per child at N=100 — memory rarely caps fan-out
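
The rules of thumb in this section reduce to simple arithmetic. The helpers below are a sketch with invented names that only restate the figures quoted above (about 50 idle agents per 8 GiB, 0.12 MiB CoW overhead per child, about 450 ms per extra chain link on ext4); they are starting estimates, not measurements of your host.

```shell
# Idle-pooled agents that fit in "$1" GiB of RAM (~50 per 8 GiB, rounded down).
idle_capacity() { echo $(( $1 * 50 / 8 )); }

# CoW overhead in MiB for "$1" children at 0.12 MiB each (measured at N=100).
cow_overhead_mib() { awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 0.12 }'; }

# Added spawn latency in ms for "$1" extra diff-chain links (~450 ms each on ext4).
chain_penalty_ms() { echo $(( $1 * 450 )); }
```

On the 30 GiB host mentioned above, `idle_capacity 30` gives 187 idle agents and `cow_overhead_mib 200` gives 24.0 MiB, which shows why memory rarely caps fan-out while chain depth (`chain_penalty_ms 2` is 900 ms) can dominate spawn time.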
