This runbook covers everything from the first host configuration through daily monitoring and the recovery procedures you will reach for when something goes wrong. Work through the bring-up section once per host; keep the failure-mode section bookmarked for on-call response. All commands assume a single-host deployment where the daemon binds on loopback (127.0.0.1:8889) — adapt addresses if you have enabled non-loopback TLS.

1. Bring-up

Host prerequisites

Before installing the daemon, confirm the host meets the following requirements:
  • x86_64 Linux, kernel 5.10 or newer (5.18+ recommended: the vmgenid driver that provides automatic per-child RNG re-seed landed in Linux 5.18)
  • KVM available: /dev/kvm present, and your user in the kvm group
  • Firecracker binary v1.7+ on $PATH — use the vendored fork from deeplethe/firecracker:forkd-v0.4-mem-backend-shared-v1.12 for v0.4 live-fork support; forkd doctor checks the version and emits a specific fix hint if it’s wrong
  • cgroup v2 unified hierarchy (mount -t cgroup2 should show at least one mountpoint)
  • iproute2 installed
Run forkd doctor after completing the steps below — it runs 16 checks and emits specific fix hints for each non-pass.
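Before the daemon (and with it forkd doctor) is installed, the prerequisites above can be spot-checked by hand. The following is a minimal sketch, not part of forkd: the function names kernel_at_least, report, and preflight are made up for illustration, and the checks only mirror the list above.

```shell
#!/bin/sh
# Hypothetical pre-install check; forkd doctor remains the authoritative test.

# kernel_at_least MAJOR.MINOR [RELEASE]: succeed if RELEASE (default: uname -r)
# is at least MAJOR.MINOR.
kernel_at_least() {
    req_major=${1%%.*}
    req_minor=${1#*.}
    rel=${2:-$(uname -r)}
    major=${rel%%.*}
    rest=${rel#*.}
    minor=${rest%%[!0-9]*}      # keep only the leading digits of the minor part
    [ "$major" -gt "$req_major" ] ||
        { [ "$major" -eq "$req_major" ] && [ "$minor" -ge "$req_minor" ]; }
}

# report LABEL COMMAND...: run COMMAND quietly and print ok/FAIL next to LABEL.
report() {
    label=$1; shift
    if "$@" >/dev/null 2>&1; then
        printf 'ok    %s\n' "$label"
    else
        printf 'FAIL  %s\n' "$label"
    fi
}

# preflight: one line per prerequisite from the list above.
preflight() {
    report "kernel >= 5.10"        kernel_at_least 5.10
    report "/dev/kvm present"      test -c /dev/kvm
    report "user in kvm group"     sh -c 'id -nG | tr " " "\n" | grep -qx kvm'
    report "firecracker on PATH"   command -v firecracker
    report "cgroup v2 mounted"     sh -c 'mount -t cgroup2 | grep -q .'
    report "iproute2 (ip) on PATH" command -v ip
}
```

Source the file and call preflight to print the six results. The sketch does not check the Firecracker version, so it does not replace forkd doctor.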

One-shot install

Step 1. Host setup: KVM, Firecracker, network

sudo bash scripts/setup-host.sh          # KVM, Firecracker, KSM tuning
sudo bash scripts/netns-setup.sh 100     # provision 100 per-child netns
Step 2. Build and install binaries

cargo build --release
sudo install -m 0755 target/release/forkd-controller /usr/local/bin/
sudo install -m 0755 target/release/forkd            /usr/local/bin/
Alternatively, extract the pre-built tarball from the GitHub releases page:
curl -sSL https://github.com/deeplethe/forkd/releases/download/v0.5.2/forkd-v0.5.2-x86_64-linux.tar.gz \
  | sudo tar -xz -C /usr/local/bin/
Step 3. Install the systemd unit

sudo install -m 0644 packaging/systemd/forkd-controller.service /etc/systemd/system/
Step 4. Create directories and generate the bearer token

sudo mkdir -p /etc/forkd /var/lib/forkd /var/log/forkd
sudo bash -c 'umask 077; head -c 32 /dev/urandom | base64 > /etc/forkd/token'   # umask keeps the new file private from the moment it is created
sudo chmod 600 /etc/forkd/token
Step 5 (optional). Configure TLS

Required for any non-loopback bind. Drop a cert.pem and key.pem from Let’s Encrypt or your internal CA into /etc/forkd/tls/ and add --tls-cert / --tls-key to the ExecStart line in the systemd unit before the next step.
sudo mkdir -p /etc/forkd/tls
# place cert.pem + key.pem here, then edit the unit:
sudo systemctl edit forkd-controller
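Note that systemctl edit writes a drop-in override rather than changing the installed unit, and systemd only replaces an inherited ExecStart when the drop-in first clears it with an empty ExecStart= line. A sketch of generating such a drop-in, assuming you copy the current command line from systemctl cat forkd-controller. The helper name tls_dropin is hypothetical, and passing the path as a separate argument after --tls-cert and --tls-key is an assumption to verify against the daemon's own usage output.

```shell
# tls_dropin EXEC_LINE CERT KEY: print a systemd drop-in that re-declares
# ExecStart with the TLS flags appended. EXEC_LINE is the existing ExecStart
# value, copied verbatim from `systemctl cat forkd-controller`.
# Assumption: the flags accept the file path as the following argument.
tls_dropin() {
    printf '[Service]\n'
    printf 'ExecStart=\n'        # an empty assignment clears the inherited command
    printf 'ExecStart=%s --tls-cert %s --tls-key %s\n' "$1" "$2" "$3"
}
```

Paste the output of a call such as tls_dropin "<existing ExecStart value>" /etc/forkd/tls/cert.pem /etc/forkd/tls/key.pem into the editor that systemctl edit opens.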
Step 6. Enable and start the daemon

sudo systemctl daemon-reload
sudo systemctl enable --now forkd-controller
Step 7. Verify

curl http://127.0.0.1:8889/healthz
# {"ok":true}

curl http://127.0.0.1:8889/metrics
# forkd_sandboxes_active 0
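In provisioning scripts the health check can race the daemon's startup, so a single curl may fail even though the service is about to come up. A minimal retry sketch follows; the helper name retry and the attempt and delay values are illustrative choices, not forkd defaults.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...: run COMMAND until it succeeds or
# ATTEMPTS runs are used up; return the exit status of the last run.
retry() {
    attempts=$1; delay=$2; shift 2
    n=1
    while :; do
        "$@" && return 0
        status=$?                                  # exit status of the failed COMMAND
        [ "$n" -ge "$attempts" ] && return "$status"
        n=$((n + 1))
        sleep "$delay"
    done
}
```

For example, retry 10 1 curl -fsS http://127.0.0.1:8889/healthz waits up to roughly ten seconds for the endpoint to answer.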

2. Daily operations

Auth: token rotation

The daemon reads the bearer token once at startup from /etc/forkd/token. Rotate by writing a new value and restarting:
sudo bash -c 'head -c 32 /dev/urandom | base64 > /etc/forkd/token'
sudo chmod 600 /etc/forkd/token
sudo systemctl restart forkd-controller
Existing sandboxes survive the restart; in-flight HTTP requests do not. Update any SDK clients or CI secrets that hold the old token value before restarting.
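Redirecting straight into /etc/forkd/token leaves a truncated file for a brief moment, and a newly created file takes its mode from the umask until chmod runs. One way to avoid both is to write a temporary file under a restrictive umask and rename it into place. A sketch, with write_token as a hypothetical helper name:

```shell
# write_token PATH: generate a fresh 32-byte token and install it at PATH
# atomically, with mode 0600. Runs in a subshell so the umask does not leak.
write_token() {
    (
        umask 077
        tmp=$(mktemp "$1.XXXXXX") || exit 1
        head -c 32 /dev/urandom | base64 > "$tmp" || { rm -f "$tmp"; exit 1; }
        mv -f "$tmp" "$1"        # rename within one directory is atomic
    )
}
```

Run as root, write_token /etc/forkd/token followed by systemctl restart forkd-controller performs the same rotation as the commands above.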

Metrics: what to scrape and suggested alerts

Scrape :8889/metrics from Prometheus. With a loopback-only bind you may not need a credential on the scrape; with a non-loopback bind, send the token in an Authorization: Bearer header, for example through the authorization block of the job in scrape_configs. Stable metric names that should always be present:
Metric | Description
--- | ---
forkd_build_info | Always 1. Labels carry version. Absence means the daemon is down.
forkd_snapshots_total | Number of registered snapshots.
forkd_sandboxes_active | Number of live child Firecracker processes.
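For a quick check without Prometheus, the metrics endpoint can be read directly. A sketch, assuming the standard Prometheus text format (one name{labels} value sample per line, no trailing timestamp); metric_value is a hypothetical helper, not a forkd command.

```shell
# metric_value NAME: read Prometheus text format on stdin and print the value
# of the first sample whose metric name is NAME (labels are ignored).
# Exits non-zero if the metric is absent.
metric_value() {
    awk -v want="$1" '
        /^#/ { next }                      # skip HELP/TYPE comment lines
        {
            name = $1
            sub(/[{].*/, "", name)         # drop the {label="..."} part
            if (name == want) { print $NF; found = 1; exit }
        }
        END { exit !found }
    '
}
```

Example: curl -s http://127.0.0.1:8889/metrics | metric_value forkd_sandboxes_active prints the current sandbox count.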
Suggested alerts:
- alert: ForkdDaemonDown
  expr: absent(forkd_build_info) == 1
  for: 1m
  annotations:
    summary: "forkd-controller is not reachable — no scrape for 1 minute"

- alert: ForkdSandboxSaturation
  expr: forkd_sandboxes_active > scalar(count(node_cpu_seconds_total{mode="idle"})) * 0.8
  for: 5m
  annotations:
    summary: "forkd sandbox count > 80% of host vCPU count for 5 minutes"

Audit log: format and rotation

The daemon appends one JSON line per request to /var/log/forkd/audit.log:
{"ts":"2026-05-12T07:12:34Z","method":"POST","path":"/v1/sandboxes","status":201,"latency_us":98342,"ua":"forkd-cli/0.1"}
Rotate with logrotate. The daemon does not yet re-open the file on SIGHUP — after a rotation, restart the daemon:
sudo systemctl restart forkd-controller
A sample logrotate stanza:
/var/log/forkd/audit.log {
    daily
    rotate 30
    compress
    missingok
    postrotate
        systemctl restart forkd-controller
    endscript
}
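Because each audit entry is a single JSON line, simple questions can be answered with line-oriented tools before reaching for a log pipeline. A sketch that counts entries by status class, assuming the compact "status":NNN encoding shown in the sample entry; count_status is a hypothetical helper.

```shell
# count_status CLASS FILE: count audit entries whose HTTP status starts with
# the digit CLASS (5 for 5xx, 4 for 4xx, ...). Matches the compact
# "status":NNN form only. Like grep -c, it prints 0 and returns non-zero
# when nothing matches.
count_status() {
    grep -cE "\"status\":$1[0-9]{2}[,}]" "$2"
}
```

For example, count_status 5 /var/log/forkd/audit.log reports how many requests in the current log file ended in a server error.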

3. Failure modes and recovery

Daemon won’t start: “bind 127.0.0.1:8889” fails

Another process is already on port 8889. Find and stop it:
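sudo ss -ltnp | grep ':8889'    # root is needed to see other users' process names
# If a stale daemon is listed:
sudo pkill -f '[f]orkd-controller'   # the brackets stop the pattern matching the sudo command line itself
sudo systemctl start forkd-controller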

POST /v1/sandboxes returns 500 “restore_many: …”

This error almost always has one of three causes.
Snapshot / kernel mismatch: the snapshot was created against a different Firecracker kernel version than the one available on this host. Re-create the snapshot:
forkd images                    # list registered snapshots
forkd rmi <tag>                 # remove the stale snapshot
# Rebuild from your Docker image or kernel + rootfs
sudo -E forkd from-image python:3.12-slim --tag pyagent
Out of disk on /tmp: each child needs ~5 MiB of work-dir space. Check:
df -h /tmp
forkd cleanup --yes             # prune stale work dirs
Out of memory: the host has hit memory.max on /sys/fs/cgroup/forkd/ or simply has no free RAM. Check:
grep -E 'MemAvailable|MemFree' /proc/meminfo
cat /sys/fs/cgroup/forkd/memory.current
Reduce memory_limit_mib per child, or reduce the number of concurrently active sandboxes.

Daemon not responding

Step 1. Run forkd doctor

forkd doctor
forkd doctor runs 16 checks including KVM availability, Firecracker binary version, cgroup v2 mount, netns provisioning, snapshot dir disk space, and controller reachability. Review every non-green line for a specific fix hint.
Step 2. Check systemd status and logs

sudo systemctl status forkd-controller
sudo journalctl -u forkd-controller -n 100 --no-pager
Step 3. Restart the daemon

sudo systemctl restart forkd-controller
curl http://127.0.0.1:8889/healthz
On startup, the daemon runs reconcile() which prunes registry entries whose Firecracker PID is no longer alive in /proc. Surviving snapshots remain registered.
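The liveness test behind reconcile() can be approximated from the shell, which helps when checking by hand whether a sandbox PID still exists. This is a sketch of the idea as described above (a /proc/<pid> existence check), not the daemon's actual code; alive_pids is a hypothetical name.

```shell
# alive_pids: read PIDs (one per line) on stdin and print those that still
# have a /proc/<pid> directory. Linux-only.
alive_pids() {
    while read -r pid; do
        case $pid in ''|*[!0-9]*) continue ;; esac   # ignore non-numeric input
        [ -d "/proc/$pid" ] && echo "$pid"
    done
    return 0
}
```

Feeding it a list of PIDs, for example printf '%s\n' 1234 5678 | alive_pids, prints only the ones that are still running.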

Children are alive but exec/eval times out

Most likely a network namespace mismatch. With per_child_netns: true, each child’s in-guest agent is reachable only from within its own netns. The daemon does the setns(2) on your behalf, but the host must have /var/run/netns/forkd-child-<i> provisioned. Re-run the netns setup script if you restarted networking or rebooted:
sudo bash scripts/netns-setup.sh 100   # re-provision for N=100 children
sudo systemctl restart forkd-controller
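To confirm that provisioning is the problem before re-running the script, count the namespaces that are actually present. A sketch, assuming the forkd-child-<i> naming given above; count_child_netns is a hypothetical helper and makes no assumption about whether <i> starts at 0 or 1.

```shell
# count_child_netns [DIR]: count forkd-child-* entries in DIR
# (default: /var/run/netns).
count_child_netns() {
    dir=${1:-/var/run/netns}
    n=0
    for entry in "$dir"/forkd-child-*; do
        [ -e "$entry" ] && n=$((n + 1))    # an unmatched glob stays literal and is skipped
    done
    echo "$n"
}
```

If count_child_netns prints fewer than the number of children you intend to run, the namespaces need to be provisioned again.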

Child OOM: sandbox killed by kernel

A child sandbox was killed because its guest hit memory.max. Diagnose:
# Kernel log: recent OOM-killer activity
sudo dmesg | grep -i oom | tail -20
# cgroup accounting: cumulative oom and oom_kill counters
sudo cat /sys/fs/cgroup/forkd/memory.events
Increase memory_limit_mib when spawning, or reduce the number of concurrent children. The parent snapshot’s resident size is shared via CoW (measured at 0.12 MiB overhead per child at N=100) — the OOM is almost always caused by write amplification inside the guest, not the shared base.
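The counters in memory.events are cumulative, so a single reading says little; comparing the oom_kill value before and after a workload shows whether that workload triggered kills. A sketch for extracting the counter, assuming the cgroup v2 "key value" line format; oom_kills is a hypothetical helper.

```shell
# oom_kills FILE: print the oom_kill counter from a cgroup v2 memory.events
# file. Exits non-zero if the key is missing.
oom_kills() {
    awk '$1 == "oom_kill" { print $2; found = 1 } END { exit !found }' "$1"
}
```

Example: run oom_kills /sys/fs/cgroup/forkd/memory.events as root, spawn the children, then run it again and compare the two numbers.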

Snapshot directory full

forkd images          # list snapshots with sizes
forkd rmi <tag>       # delete a snapshot
# For a chain with dependents, use --cascade:
forkd rmi <tag> --cascade
# Prune stale work dirs:
forkd cleanup --yes
Check disk usage on the snapshot root:
du -sh /var/lib/forkd/snapshots/*
df -h /var/lib/forkd
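For a cron job or pre-spawn guard, the same check can be reduced to a single number. A sketch using POSIX df output; disk_use_pct is a hypothetical helper, and any threshold you compare against is your own choice rather than a forkd setting.

```shell
# disk_use_pct PATH: print the used-space percentage (integer, no % sign) of
# the filesystem that holds PATH.
disk_use_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example, [ "$(disk_use_pct /var/lib/forkd)" -lt 90 ] || forkd cleanup --yes prunes stale work dirs once usage reaches 90%.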

Stale work dirs accumulate

Child work dirs in /tmp (or the daemon’s configured scratch path) can accumulate if sandboxes are killed without a clean shutdown. Prune with:
forkd cleanup --yes

Firecracker binary mismatch

If forkd doctor reports a Firecracker version mismatch:
firecracker --version
# Compare against what forkd expects (v1.7+ for base; deeplethe fork for v0.4 live-fork)
Replace the binary with the correct version and re-run forkd doctor to confirm.

Sandboxes lost after host reboot

Firecracker processes do not survive host reboots. The daemon’s state file (/var/lib/forkd/state.json) persists, but on startup reconcile() checks /proc/<pid> for each registered sandbox and prunes entries whose process is gone. Snapshots (the saved vmstate and memory image on disk) do survive reboots and can be used to fork new children immediately. Create new sandboxes from the existing snapshot tags:
# Read the bearer token generated during bring-up
TOKEN=$(sudo cat /etc/forkd/token)
curl -s \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -X POST http://127.0.0.1:8889/v1/sandboxes \
  -d '{"snapshot_tag":"pyagent","n":10,"per_child_netns":true}'

4. Upgrade procedure

Step 1. Stop the daemon

sudo systemctl stop forkd-controller
Live sandboxes are killed. The snapshot registry on disk survives.
Step 2. Install the new binary

# From pre-built tarball:
curl -sSL https://github.com/deeplethe/forkd/releases/download/v0.5.2/forkd-v0.5.2-x86_64-linux.tar.gz \
  | sudo tar -xz -C /usr/local/bin/

# Or from source:
cargo build --release
sudo install -m 0755 target/release/forkd-controller /usr/local/bin/
sudo install -m 0755 target/release/forkd            /usr/local/bin/
Step 3. Start the daemon

sudo systemctl start forkd-controller
reconcile() runs on startup and prunes any sandbox entries whose Firecracker PID is no longer alive.
Step 4. Verify the new version

curl http://127.0.0.1:8889/version
# {"version":"0.5.2","api":"v1"}
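In an upgrade script the version comparison can be automated rather than read by eye. A sketch, assuming the compact JSON shape shown above; json_version is a hypothetical helper, and a real JSON parser such as jq is the more robust choice where it is installed.

```shell
# json_version: read the /version response on stdin and print the value of
# its "version" field. Prints nothing if the field is absent.
json_version() {
    sed -n 's/.*"version":"\([^"]*\)".*/\1/p'
}
```

Example: [ "$(curl -s http://127.0.0.1:8889/version | json_version)" = "0.5.2" ] || echo "unexpected version" >&2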

5. Backup: snapshot portability with forkd pack

Snapshot directories are self-contained and portable. Pack any registered snapshot into a single .tar.zst archive that bundles the vmstate, memory image, and a per-file sha256 manifest:
# On the source host:
forkd pack --tag pyagent --out pyagent.forkd-snapshot.tar.zst
# Typical compression: 23× (512 MiB memory.bin → ~22 MiB on disk)

# Transfer to another host, then restore:
forkd unpack pyagent.forkd-snapshot.tar.zst
# Integrity is verified via the manifest's sha256s on unpack.

# Fork immediately on the destination host:
sudo -E forkd fork --tag pyagent -n 10 --per-child-netns
For chains built with diff snapshots (v0.5), forkd pack bundles every ancestor in the chain into one tarball:
forkd pack --tag py-pandas --out py-pandas-chain.tar.zst
forkd unpack py-pandas-chain.tar.zst   # restores all chain links
Back up snapshot directories to object storage (S3, R2, GCS) using forkd push and restore with forkd pull. Integrity is verified on unpack — snapshots can be treated as immutable artefacts.

6. Capacity planning

On a 20 vCPU / 30 GiB host, forkd has spawned N=200 children from a single snapshot in ~750 ms wall-clock. Sustained throughput depends on:
  • KSM enabled and tuned (scripts/setup-host.sh sets sensible defaults)
  • Parent snapshot memory image size — smaller parents fork faster
  • per_child_netns — adds ~3 ms per child for netns setup and agent probe round-trip
  • Snapshot chain depth — each additional link in a v0.5 diff chain adds ~450 ms to spawn time on ext4 (SHA-256 verification of the base layer)
Run bench/bench-spawn-100.sh on your own host to get hardware-specific numbers. For per-Pod sizing on Kubernetes, the Enterprise deployment FAQ gives these rules of thumb:
  • ~1 actively-running agent per vCPU (compute-bound)
  • ~50 idle-pooled agents per 8 GiB Pod RAM (process-state bottleneck)
  • 0.12 MiB CoW overhead per child at N=100 — memory rarely caps fan-out
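The rules of thumb above reduce to simple arithmetic. The sketch below only restates them; the function names are hypothetical, and applying the 0.12 MiB figure (measured at N=100) to other values of N is an extrapolation rather than a measured result.

```shell
# pod_capacity VCPUS RAM_GIB: print rule-of-thumb agent counts for one Pod.
pod_capacity() {
    active=$1                      # ~1 actively running agent per vCPU
    idle=$(( $2 * 50 / 8 ))        # ~50 idle-pooled agents per 8 GiB of RAM
    echo "active=$active idle=$idle"
}

# cow_overhead_mib N: print the total CoW overhead in MiB for N children,
# at 0.12 MiB per child.
cow_overhead_mib() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 0.12 }'
}
```

For an 8 vCPU / 16 GiB Pod, pod_capacity 8 16 prints active=8 idle=100, and cow_overhead_mib 100 prints 12.0.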
