
Ployz is a primitive orchestration core for small clusters. Its architectural bet is simple: small-scale infrastructure gets better when the system exposes real operational primitives instead of hiding them behind policy engines, controllers, and reconciler loops. Every state change is an explicit operation — add a machine, deploy a workload, migrate state, branch an environment, promote, roll back — with visible preconditions, a bounded effect, a clear result, and a way to verify what happened.

The core model

Ployz stores enough durable state to execute and explain explicit operations. It does not store a standing desired-state document that background controllers continuously reconcile. Policy belongs at decision time. The operator decides to add, migrate, deploy, branch, promote, roll back, or remove. Ployz executes that operation with clear preconditions and returns a result the operator can inspect. There are no autoscalers, hidden schedulers, or self-healing loops silently rewriting cluster truth. A mutating operation is foreground work with an audience. It should:
  1. Inspect current intent and live preconditions
  2. Build a plan when the operation has meaningful choices
  3. Fail before mutation when preconditions are missing
  4. Execute bounded steps against concrete participants
  5. Commit durable facts at the point of no return
  6. Report cleanup, partial progress, or failure explicitly
  7. Leave enough state for safe retry or operator repair
Retries must not turn uncertainty into success. A failed operation remains a fact until a later operation resolves it.
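
A minimal sketch of that contract in code, assuming hypothetical names (Operation, Plan, OpOutcome) rather than Ployz's actual API:

```rust
// Illustrative sketch only: names and types are assumptions, not Ployz's real API.

/// A plan of bounded steps against concrete participants.
pub struct Plan {
    pub steps: Vec<String>,
}

/// Preconditions that were missing, reported before any mutation happens.
pub struct PreconditionError {
    pub missing: Vec<String>,
}

/// The explicit outcome of a mutating operation.
pub enum OpOutcome {
    /// Durable facts committed at the point of no return.
    Committed { facts: Vec<String> },
    /// Failure stays a fact: partial progress and cleanup are reported,
    /// and enough state is left for safe retry or operator repair.
    Failed { completed_steps: usize, cleanup: String },
}

pub trait Operation {
    /// Steps 1-3: inspect intent and live preconditions, build a plan,
    /// and fail here, before mutation, if preconditions are missing.
    fn prepare(&self) -> Result<Plan, PreconditionError>;

    /// Steps 4-7: execute bounded steps, commit durable facts, and report
    /// cleanup, partial progress, or failure explicitly.
    fn execute(&self, plan: Plan) -> OpOutcome;
}
```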

Three kinds of state

Ployz separates three kinds of truth:
  • Intent — what an operator explicitly asked the cluster to do
  • Status — durable lifecycle facts emitted by operations
  • Observation — live reachability, health, capacity, and freshness checked at decision time
Durable state never encodes inferred liveness. Observations may be cached for diagnostics, but they do not silently become cluster policy.
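
A rough sketch of how that separation could be expressed in code; the type and field names below are illustrative assumptions, not Ployz's actual schema:

```rust
// Illustrative sketch: field and variant names are assumptions, not Ployz's schema.

/// Intent: what an operator explicitly asked the cluster to do. Durable.
struct Intent {
    workload: String,
    requested_replicas: u32,
}

/// Status: durable lifecycle facts emitted by operations. Durable.
enum Status {
    Deployed { revision: String },
    MigrationFailed { reason: String },
}

/// Observation: reachability, health, capacity, freshness. Checked live at
/// decision time; cached at most for diagnostics, never stored as policy.
struct Observation {
    reachable: bool,
    healthy: bool,
    checked_at_unix: u64,
}
```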

Disposable daemon, durable data plane

ployzd is a disposable control plane. It can crash, upgrade, or restart without disrupting the data plane. WireGuard tunnels stay up, NATS keeps serving state, the gateway keeps proxying, DNS keeps resolving, and workload containers keep running. On startup, the daemon adopts what is already running and only recreates managed infrastructure whose identity has drifted.

System boundaries

Ployz is organized into four layers that interact through explicit contracts.

Operator surfaces

CLI, SDK, API, cloud UI, and agents. All are consumers of the same primitive surface; none is the source of cluster truth. Operations need structured output, typed failures, idempotent retry behavior, and explicit verification hooks.
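
As a rough illustration of what that contract could look like from the consumer side (all names below are assumptions, not the real SDK types), every operation resolves to either durable facts with verification hooks or a typed failure:

```rust
// Hypothetical consumer-facing shape; names are assumptions, not the SDK's real types.

/// Typed failures let the CLI, SDK, API, UI, and agents react programmatically.
enum OpError {
    PreconditionFailed { missing: Vec<String> },
    PartialProgress { completed_steps: usize, repair_hint: String },
}

/// Structured output: durable facts plus explicit checks the caller can rerun
/// to verify the outcome, which also keeps retries easy to reason about.
struct OpResult {
    facts: Vec<String>,
    verification_checks: Vec<String>,
}

type OpResponse = Result<OpResult, OpError>;
```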

Orchestration kernel

Owns product semantics: machine membership, placement, deploy lifecycle, migration, transfer, branch, promote, rollback, coordination, and diagnostic policy. Depends on narrow contracts for runtime, store, network, and storage.

Runtime and substrate backends

Own substrate mechanics: Docker or host runtime operations, WireGuard setup, NATS process management, ZFS or other storage, gateway and DNS process supervision, and eBPF or bridge networking. Backends implement explicit contracts. They do not decide product policy.

Data plane services

The set of services that must keep serving last good state when ployzd is absent: workload containers, WireGuard mesh, NATS, gateway, DNS, and storage datasets. Daemon restart must not restart workloads.

Core domains

Code is organized by domain, not by adapter pattern.
  • machine: Machine identity, membership, join, update, remove, and operator surfaces for capacity
  • mesh: WireGuard overlay lifecycle, peer state, subnet coordination, and mesh phase state
  • store: Durable cluster facts, subscriptions, locks, streams, KV records, and memory/NATS implementations
  • coordination: Leases, participant commands, explicit foreground coordination, and failure reporting
  • deploy: Preview, placement, participant probing, apply, commit, cleanup, and deploy lifecycle facts
  • runtime: Local container/process operations through narrow backend contracts
  • storage: Volume creation, snapshot, clone, transfer, receive, migration, and rollback mechanics
  • routing: Route facts, gateway projection, DNS projection, and freshness handling
  • services: Long-lived sidecar supervision for NATS, gateway, DNS, and supporting processes
  • daemon: Composition root, request handling, startup adoption, and operation dispatch
  • SDK/API: External command surface and structured request/response types
WireGuard implementations live under the mesh domain because mesh owns overlay lifecycle. Store backends live under the store domain because store owns distributed state. Runtime backends live below the orchestration kernel because runtime mechanics are not product policy.

Runtime targets

Runtime selection happens at the daemon composition root. Core domains receive explicit backends instead of matching on an operator-facing mode enum.
  • Docker (User service mode): Docker-backed mesh/store/sidecars with loopback control-plane binding
  • Host (User service mode): Host-backed mesh/store, child-process sidecars, overlay control-plane binding
  • Host (System service mode): Host-backed mesh/store, system-managed sidecars, overlay control-plane binding
Memory is test-only. It is not an operator-facing runtime and does not shape the daemon’s public API.
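
A sketch of what that composition could look like, under the assumption of hypothetical backend traits and constructors; only the composition root inspects the target and service mode:

```rust
// Illustrative wiring only: trait and type names here are assumptions.

trait MeshBackend { /* WireGuard overlay mechanics */ }
trait SidecarSupervisor { /* NATS, gateway, and DNS process supervision */ }

struct DockerMesh;
struct HostMesh;
struct DockerSidecars;
struct ChildProcessSidecars;
struct SystemServiceSidecars;

impl MeshBackend for DockerMesh {}
impl MeshBackend for HostMesh {}
impl SidecarSupervisor for DockerSidecars {}
impl SidecarSupervisor for ChildProcessSidecars {}
impl SidecarSupervisor for SystemServiceSidecars {}

#[derive(Clone, Copy)]
enum RuntimeTarget { Docker, Host }

#[derive(Clone, Copy)]
enum ServiceMode { User, System }

/// The daemon's composition root is the only place that matches on the
/// operator-facing target and mode; core domains receive these backends
/// as explicit dependencies and never see the enums.
fn compose(
    target: RuntimeTarget,
    mode: ServiceMode,
) -> (Box<dyn MeshBackend>, Box<dyn SidecarSupervisor>) {
    let mesh: Box<dyn MeshBackend> = match target {
        RuntimeTarget::Docker => Box::new(DockerMesh),
        RuntimeTarget::Host => Box::new(HostMesh),
    };
    let sidecars: Box<dyn SidecarSupervisor> = match (target, mode) {
        (RuntimeTarget::Docker, _) => Box::new(DockerSidecars),
        (RuntimeTarget::Host, ServiceMode::User) => Box::new(ChildProcessSidecars),
        (RuntimeTarget::Host, ServiceMode::System) => Box::new(SystemServiceSidecars),
    };
    (mesh, sidecars)
}
```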

Docker runtime on macOS

The daemon runs on the macOS host. Everything else runs inside Docker Desktop’s Linux VM. NATS, gateway, and DNS bind on the node’s overlay IPv6 address so other mesh nodes can reach them directly. In the Docker runtime they share the ployz-networking network namespace to access wg0.
macOS host                         Docker Desktop VM
+----------------+                 +------------------------------+
| ployzd daemon  |                 | ployz-networking container   |
|                |  WG bridge      |   wg0 overlay interface      |
| OverlayBridge  +---------------->|                              |
|                |                 | nats-server                  |
| NATS bridge    +---------------->| ployz-gateway                |
|                |                 | ployz-dns                    |
|                |                 | workload containers          |
+----------------+                 +------------------------------+
OverlayBridge uses userspace WireGuard and a smoltcp TCP stack to bridge the macOS host to the container overlay network. eBPF TC classifiers intercept and redirect traffic at the kernel level where the runtime supports it.

Upgrade and adoption contract

The daemon separates ephemeral control-plane work from persistent data-plane services.
  • Workloads: Never touched by daemon restart
  • Gateway: Adopted if running and config matches; recreated on drift
  • DNS: Adopted if running and config matches; recreated on drift
  • NATS: Adopted if running and parent netns unchanged; recreated on drift
  • WireGuard: Adopted if healthy
  • CLI RPC, remote deploy, and background command listeners: Ephemeral, restarted with the daemon
All managed infrastructure follows the same adopt-first lifecycle: inspect what is already running, compare identity against the full expected specification, adopt matching infrastructure without touching it, and recreate missing or drifted infrastructure with visible status. Docker containers carry identity as labels such as ployz.config-hash and ployz.parent-container-id. System services compare rendered unit identity.
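
A sketch of the adopt-or-recreate decision under stated assumptions: the label keys come from this page, while the types and the decide function are hypothetical:

```rust
use std::collections::HashMap;

// Illustrative sketch: label keys come from the docs above; types are assumptions.

struct ExpectedSpec {
    config_hash: String,
    parent_container_id: String,
}

struct RunningContainer {
    labels: HashMap<String, String>,
}

enum Adoption {
    /// Identity matches the full expected specification: leave it untouched.
    Adopt,
    /// Missing or drifted: recreate, with the reason surfaced as visible status.
    Recreate(&'static str),
}

fn decide(expected: &ExpectedSpec, running: Option<&RunningContainer>) -> Adoption {
    match running {
        None => Adoption::Recreate("missing"),
        Some(container) => {
            let hash_matches =
                container.labels.get("ployz.config-hash") == Some(&expected.config_hash);
            let parent_matches = container.labels.get("ployz.parent-container-id")
                == Some(&expected.parent_container_id);
            if hash_matches && parent_matches {
                Adoption::Adopt
            } else {
                Adoption::Recreate("identity drift")
            }
        }
    }
}
```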

Explore further

Cluster coordination with NATS

How NATS acts as the control-plane substrate: streams, KV buckets, leases, distributed locks, and the commitments that prevent split-brain and hidden state changes.

Routing, gateway, and DNS

How deploy truth is modeled, how the apply flow commits facts at points of no return, and how the gateway and DNS rebuild from durable routing state.

Design test

When evaluating a proposed change to Ployz architecture, apply these tests:
  • Ployz primitives are explicit commands with visible preconditions and bounded effects. If a feature encodes decisions into the cluster so that they happen without the operator choosing them, it is adding policy, not a primitive. Prefer the primitive.
  • A well-formed operation inspects intent and live preconditions first, builds a plan, and fails cleanly before touching anything if preconditions are not met. An operation that starts mutating before validating creates partial-state problems that are harder to recover from than a clean upfront failure.
  • Stored state should represent what an operator asked for and what explicitly happened. Health, reachability, and freshness are observed live at decision time. Storing inferred liveness as cluster truth leads to stale state serving silently — the worst failure class.
  • The system should be fully legible from its observable state. A verifiable result means the operator (or any automation) can confirm the outcome by reading durable facts, not by knowing that a reconciler will eventually make it true. If verification requires waiting for background convergence, the primitive is not done.
  • ployzd is disposable. Any design that causes a daemon restart to interrupt WireGuard, NATS, the gateway, DNS, or running workloads has broken the separation between control plane and data plane. The daemon adopts; it does not own the data plane’s lifecycle.
  • A developer running ployzctl dev on a Mac and a fleet operator running production share the same primitives. There is no dev-mode shortcut and no cloud-only mechanism. If a feature requires a separate model for one of these surfaces, the primitive needs to be strengthened, not forked.
If a proposed change fails any of these tests, the design is probably adding orchestration machinery where Ployz should be adding a better primitive.
