Uncloud uses a decentralized approach to state management. Instead of a central control plane, every machine maintains its own copy of the cluster state and synchronizes changes with peers. This architecture favors Availability and Partition tolerance over Consistency in the CAP theorem. The cluster remains operational even when machines are offline or network partitions occur.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/psviderski/uncloud/llms.txt
Use this file to discover all available pages before exploring further.
Corrosion Distributed Database
Uncloud’s state management is powered by Corrosion, a distributed SQLite database built by Fly.io. Corrosion runs on every machine and provides:- Distributed SQLite: Each machine runs its own SQLite database
- Real-time replication: Changes propagate to other machines in subseconds
- Gossip protocol: Machines exchange state updates through peer-to-peer gossip
- Schema management: Automatic database migrations across the cluster
Corrosion Architecture
On each machine, Corrosion runs as a systemd service:- Listens on the management IP: Uses the IPv6 management address for gossip traffic
- Stores data locally: Maintains a SQLite database at
/var/lib/uncloud/corrosion/store.db - Gossips with peers: Exchanges updates with other machines in the cluster
- Provides an API: Exposes a local API for the Uncloud daemon to query and update state
CRDT-Based Synchronization
Corrosion uses Conflict-free Replicated Data Types (CRDTs) to handle concurrent updates across machines. CRDTs are special data structures that can be modified independently on different machines and are mathematically guaranteed to converge to the same state when all updates have been applied.How CRDTs Work
Consider this scenario:- Machine A and Machine B both have service
webwith 2 replicas - Network partition separates the machines
- User scales service on Machine A to 3 replicas
- User scales service on Machine B to 4 replicas
- Network partition heals
CRDT Limitations
CRDTs provide automatic conflict resolution, but the “merged” state might not match what either user intended. In the example above:- Both users see their update succeed immediately (Availability)
- The final replica count will be consistent across machines (eventual Consistency)
- But the final count (3 or 4) depends on timestamps and may surprise both users
- Making most operations imperative rather than declarative
- Returning errors immediately when an operation can’t complete
- Directing operations to specific machines when possible
Eventual Consistency Model
Uncloud’s state is eventually consistent. This means:- Machines may have slightly different views of the cluster at any given moment
- All machines will converge to the same state after all updates have propagated
- Propagation typically happens in milliseconds to seconds
What This Means in Practice
When you runuc deploy, here’s what happens:
- Immediate execution: The CLI connects to a machine and executes the deployment
- Local state update: That machine updates its Corrosion database
- Gossip propagation: Corrosion gossips the changes to other machines
- Convergence: Within seconds, all machines have the updated state
uc ls immediately after uc deploy while connected to a different machine, you might briefly see the old state. Wait a second and run it again, and you’ll see the new containers.
Observing State Propagation
You can see state synchronization in action:Handling Network Partitions
Network partitions are inevitable in distributed systems. A partition occurs when machines can’t communicate with each other due to network failures, firewall issues, or machine crashes.Partition Detection
Corrosion detects partitions through its gossip protocol. When a machine stops receiving gossip messages from a peer, it marks that peer as potentially unreachable. You can see peer status in Corrosion logs:Operating in a Partition
When a partition occurs:- Each partition remains functional: Machines in each partition can still serve traffic and accept commands
- State diverges: Changes in one partition don’t reach the other partition
- Services continue: Containers keep running and serving requests
- Partition A: Machines in US East
- Partition B: Machines in EU West
- Deploy new services
- Scale existing services
- Access services running in their partition
Partition Healing
When network connectivity restores:- Gossip resumes: Machines start exchanging updates again
- State reconciliation: CRDTs merge the diverged states
- Convergence: All machines converge to the same state
State Reconciliation
Reconciliation is the process of merging diverged states after a partition or when a machine rejoins the cluster.Reconciliation Process
- Detect divergence: Machines compare their state versions using vector clocks
- Exchange missing updates: Each machine sends updates the other hasn’t seen
- Apply CRDT merge: Conflicting updates are resolved using CRDT rules
- Converge: Both machines reach the same final state
State Consistency Checks
While state is eventually consistent, Uncloud takes steps to ensure data integrity:- Schema versioning: Database schema changes are versioned and applied consistently
- Transaction boundaries: Related changes are grouped to prevent partial updates
- Validation: Invalid states are rejected before replication
Dealing with State Conflicts
If you suspect state inconsistencies across machines:- Check Corrosion sync status on each machine
- Review recent Corrosion logs for errors or warnings
- Wait for convergence (usually a few seconds)
- Verify consistency by running
uc pson different machines
- Network connectivity problems between machines
- Corrosion service failures
- Clock skew across machines (can affect timestamp-based CRDT operations)
Performance Characteristics
Write Performance
Writes happen locally and then replicate:- Local write latency: 1-10ms (SQLite write)
- Replication latency: 10-100ms (depending on network)
- Convergence time: 100-500ms (for small clusters)
Read Performance
Reads are always local:- Read latency: Less than 1ms (SQLite query)
- Consistency: May be slightly stale if recent writes haven’t propagated
Scalability
Corrosion’s gossip protocol scales well for small to medium clusters:- Optimal cluster size: 3-20 machines
- Maximum tested: 50+ machines
- Gossip overhead: Increases with cluster size (O(n) bandwidth per machine)
State Storage Location
Corrosion stores all cluster state in:- Machine registry and network configuration
- Service definitions and replica status
- Volume metadata
- DNS records
- Deployment history
Monitoring State Health
To monitor the health of state replication:- Regular gossip messages with all peers
- Recent successful sync operations
- No persistent errors or warnings
- Database size growing reasonably with cluster activity
Backup and Recovery
Since every machine has a full copy of cluster state, the cluster is inherently resilient to machine failures. To back up cluster state:- Stop the Corrosion service
- Replace the SQLite database file
- Start the Corrosion service
- Wait for reconciliation with other machines
- Multiple machines have redundant copies
- State can be recreated from machine configs
- Service definitions are typically stored in git (compose files)
Comparison to Traditional Architectures
| Aspect | Uncloud (Decentralized) | Kubernetes (Centralized) |
|---|---|---|
| Control plane | None (peer-to-peer) | etcd cluster (requires quorum) |
| Consistency | Eventual | Strong (linearizable reads) |
| Partition tolerance | Full (AP system) | Degraded (CP system) |
| Write path | Local → replicate | Consensus → local |
| Read path | Local (may be stale) | Local (always fresh) |
| Operational complexity | Low | High (manage etcd quorum) |
