Distributed State Management

Uncloud uses a decentralized approach to state management. Instead of a central control plane, every machine maintains its own copy of the cluster state and synchronizes changes with peers. This architecture favors Availability and Partition tolerance over Consistency in the CAP theorem. The cluster remains operational even when machines are offline or network partitions occur.

Corrosion Distributed Database

Uncloud’s state management is powered by Corrosion, a distributed SQLite database built by Fly.io. Corrosion runs on every machine and provides:

Distributed SQLite: Each machine runs its own SQLite database
Real-time replication: Changes typically propagate to other machines in under a second
Gossip protocol: Machines exchange state updates through peer-to-peer gossip
Schema management: Automatic database migrations across the cluster

You can think of Corrosion as “SQLite with built-in multi-master replication.”

Corrosion Architecture

On each machine, Corrosion runs as a systemd service:

# Check Corrosion status
sudo systemctl status uncloud-corrosion

# View Corrosion logs
sudo journalctl -u uncloud-corrosion -f

The Corrosion process:

Listens on the management IP: Uses the IPv6 management address for gossip traffic
Stores data locally: Maintains a SQLite database at /var/lib/uncloud/corrosion/store.db
Gossips with peers: Exchanges updates with other machines in the cluster
Provides an API: Exposes a local API for the Uncloud daemon to query and update state

The daemon never writes to the SQLite file directly. All interactions happen through Corrosion’s API, which handles conflict resolution and replication.

CRDT-Based Synchronization

Corrosion uses Conflict-free Replicated Data Types (CRDTs) to handle concurrent updates across machines. CRDTs are special data structures that can be modified independently on different machines and are mathematically guaranteed to converge to the same state when all updates have been applied.

How CRDTs Work

Consider this scenario:

Machine A and Machine B both have service web with 2 replicas
Network partition separates the machines
User scales service on Machine A to 3 replicas
User scales service on Machine B to 4 replicas
Network partition heals

With traditional databases, this would be a conflict that requires manual resolution. With CRDTs, the two machines can automatically merge their states using predefined rules (for example, “last write wins”, where the update with the highest version or timestamp is kept).
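
To make the merge rule concrete, here is a minimal shell sketch of a last-write-wins register. It is an illustration only: the version|site|value record format and the lww_merge function are hypothetical and do not reflect Corrosion's storage format or API. Each write carries a version number; on merge the highest version wins, and a tie falls back to the site name so that every machine picks the same winner.

```shell
#!/bin/sh
# Hypothetical last-write-wins (LWW) merge, not Corrosion's real implementation.
# A record is "version|site|value", for example "6|machine-b|4".

lww_merge() {
    # Sort by version (numeric), then by site name, and keep the last line.
    # The result does not depend on the order of the arguments.
    printf '%s\n' "$@" | LC_ALL=C sort -t '|' -k1,1n -k2,2 | tail -n 1
}

# The scenario above: A scaled to 3 at version 5, B scaled to 4 at version 6.
merged=$(lww_merge "5|machine-a|3" "6|machine-b|4")
echo "replicas after merge: ${merged##*|}"
```

Because the function is commutative and idempotent, both machines can run it on whatever records they receive, in any order, and still end up with the same value.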

CRDT Limitations

CRDTs provide automatic conflict resolution, but the “merged” state might not match what either user intended. In the example above:

Both users see their update succeed immediately (Availability)
The final replica count will be consistent across machines (eventual Consistency)
But the final count (3 or 4) depends on timestamps, so one user’s change is silently overwritten and the result may surprise them

This is the trade-off of a decentralized system. Uncloud mitigates this by:

Making most operations imperative rather than declarative
Returning errors immediately when an operation can’t complete
Directing operations to specific machines when possible

Eventual Consistency Model

Uncloud’s state is eventually consistent. This means:

Machines may have slightly different views of the cluster at any given moment
All machines will converge to the same state after all updates have propagated
Propagation typically happens in milliseconds to seconds

What This Means in Practice

When you run uc deploy, here’s what happens:

Immediate execution: The CLI connects to a machine and executes the deployment
Local state update: That machine updates its Corrosion database
Gossip propagation: Corrosion gossips the changes to other machines
Convergence: Within seconds, all machines have the updated state

If you run uc ps immediately after uc deploy while connected to a different machine, you might briefly see the old state. Wait a second and run it again, and you’ll see the new containers.

Observing State Propagation

You can see state synchronization in action:

# Terminal 1: Connect to machine A
uc context use machine-a
uc deploy

# Terminal 2: Connect to machine B immediately after
uc context use machine-b
uc ps  # May show old state
sleep 2
uc ps  # Should show new state

In practice, propagation is fast enough that you rarely notice the delay.
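
In scripts, the fixed sleep 2 from the example above can be replaced with a polling loop that stops as soon as the new state is visible. The wait_until helper below is a hypothetical sketch, not part of the Uncloud CLI. The commented usage line assumes that uc ps output contains the service name web, which is an assumption about the output format.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]

wait_until() (
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Run the command; leave the subshell with success on the first pass.
        "$@" && exit 0
        # Sleep between attempts, but not after the final one.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    exit 1
)

# Example (assumes "web" appears in the listing once the state has propagated):
# wait_until 10 1 sh -c 'uc ps | grep -q web' || echo "state did not converge" >&2
```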

Handling Network Partitions

Network partitions are inevitable in distributed systems. A partition occurs when machines can’t communicate with each other due to network failures, firewall issues, or machine crashes.

Partition Detection

Corrosion detects partitions through its gossip protocol. When a machine stops receiving gossip messages from a peer, it marks that peer as potentially unreachable. You can see peer status in Corrosion logs:

sudo journalctl -u uncloud-corrosion | grep -i "peer"
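
The idea of marking a silent peer as potentially unreachable can be sketched as a timeout check. This is a generic illustration of timeout-based failure detection, not Corrosion's actual detector. The peer_status function, its status names, and the thresholds are all hypothetical.

```shell
#!/bin/sh
# Hypothetical timeout-based peer classification (times are epoch seconds).
# Usage: peer_status LAST_SEEN NOW SUSPECT_AFTER DOWN_AFTER

peer_status() (
    last_seen=$1
    now=$2
    suspect_after=$3
    down_after=$4
    # How long the peer has been silent.
    silence=$((now - last_seen))
    if [ "$silence" -ge "$down_after" ]; then
        echo down
    elif [ "$silence" -ge "$suspect_after" ]; then
        echo suspect
    else
        echo alive
    fi
)

# A peer last heard from 10 seconds ago, with thresholds of 5 and 30 seconds:
peer_status 100 110 5 30
```

The intermediate suspect state gives a slow peer time to respond before it is treated as unreachable.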

Operating in a Partition

When a partition occurs:

Each partition remains functional: Machines in each partition can still serve traffic and accept commands
State diverges: Changes in one partition don’t reach the other partition
Services continue: Containers keep running and serving requests

For example, if your cluster splits into two partitions:

Partition A: Machines in US East
Partition B: Machines in EU West

Users in each region can still:

Deploy new services
Scale existing services
Access services running in their partition

They just can’t see or control services in the other partition.

Partition Healing

When network connectivity is restored:

Gossip resumes: Machines start exchanging updates again
State reconciliation: CRDTs merge the diverged states
Convergence: All machines converge to the same state

The reconciliation process is automatic. You don’t need to manually resolve conflicts or rebuild state.

If you made conflicting changes during a partition (for example, deployed different services with the same name in each partition), the CRDT merge rules will pick one version. The other may disappear after reconciliation. Always check cluster state after a partition heals.

State Reconciliation

Reconciliation is the process of merging diverged states after a partition or when a machine rejoins the cluster.

Reconciliation Process

Detect divergence: Machines compare their state versions using vector clocks
Exchange missing updates: Each machine sends updates the other hasn’t seen
Apply CRDT merge: Conflicting updates are resolved using CRDT rules
Converge: Both machines reach the same final state
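
The divergence check in the first step can be illustrated with a textbook vector clock comparison. The sketch below is not Corrosion's internal bookkeeping. The vc_compare function and the machine=counter,machine=counter format are hypothetical. A result of before or after means one side only needs the missing updates, while concurrent means both sides changed independently and a CRDT merge is required.

```shell
#!/bin/sh
# Hypothetical vector clock comparison. A clock is "name=count,name=count".
# Prints how clock A relates to clock B: equal, before, after, or concurrent.

vc_compare() {
    awk -v a="$1" -v b="$2" '
    # Parse "k=v,k=v" into an associative array.
    function load(s, arr,    n, parts, i, kv) {
        n = split(s, parts, ",")
        for (i = 1; i <= n; i++) {
            split(parts[i], kv, "=")
            arr[kv[1]] = kv[2] + 0
        }
    }
    BEGIN {
        load(a, va); load(b, vb)
        for (k in va) keys[k] = 1
        for (k in vb) keys[k] = 1
        for (k in keys) {
            # A machine missing from a clock counts as 0.
            x = (k in va) ? va[k] : 0
            y = (k in vb) ? vb[k] : 0
            if (x < y) a_behind = 1
            if (x > y) b_behind = 1
        }
        if (a_behind && b_behind) print "concurrent"
        else if (a_behind) print "before"
        else if (b_behind) print "after"
        else print "equal"
    }'
}

# Each side of a partition recorded one update the other has not seen:
vc_compare "machine-a=2,machine-b=1" "machine-a=1,machine-b=2"
```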

This all happens automatically in the background. You’ll see reconciliation activity in the Corrosion logs:

sudo journalctl -u uncloud-corrosion | grep -iE "sync|reconcile"

State Consistency Checks

While state is eventually consistent, Uncloud takes steps to ensure data integrity:

Schema versioning: Database schema changes are versioned and applied consistently
Transaction boundaries: Related changes are grouped to prevent partial updates
Validation: Invalid states are rejected before replication

Dealing with State Conflicts

If you suspect state inconsistencies across machines:

Check Corrosion status on each machine (sudo systemctl status uncloud-corrosion)
Review recent Corrosion logs for errors or warnings
Wait for convergence (usually a few seconds)
Verify consistency by running uc ps on different machines
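
The last step above can be scripted by saving the listing from each machine to a file and comparing the files. The views_match helper is a hypothetical sketch. It assumes the listing is stable between runs, which may not hold if the output includes values such as container uptime.

```shell
#!/bin/sh
# Hypothetical helper: succeed when two saved listings contain the same lines,
# regardless of row order.

views_match() {
    [ "$(LC_ALL=C sort "$1")" = "$(LC_ALL=C sort "$2")" ]
}

# Example, using the context and listing commands shown earlier:
# uc context use machine-a && uc ps > /tmp/view-a.txt
# uc context use machine-b && uc ps > /tmp/view-b.txt
# views_match /tmp/view-a.txt /tmp/view-b.txt || diff /tmp/view-a.txt /tmp/view-b.txt
```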

If issues persist, check for:

Network connectivity problems between machines
Corrosion service failures
Clock skew across machines (can affect timestamp-based CRDT operations)
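
A rough way to spot clock skew is to compare epoch timestamps taken on two machines. The clock_skew helper below is a hypothetical sketch, and the SSH target in the commented example is a placeholder. The SSH round trip adds its own delay, so this only reveals skew on the order of seconds.

```shell
#!/bin/sh
# Hypothetical helper: absolute difference between two epoch timestamps.

clock_skew() (
    diff=$(($1 - $2))
    # Make the difference non-negative.
    if [ "$diff" -lt 0 ]; then
        diff=$((0 - diff))
    fi
    echo "$diff"
)

# Example (user@machine-b is a placeholder SSH target):
# clock_skew "$(date +%s)" "$(ssh user@machine-b date +%s)"
```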

Performance Characteristics

Write Performance

Writes happen locally and then replicate:

Local write latency: 1-10 ms (SQLite write)
Replication latency: 10-100 ms (depending on network)
Convergence time: 100-500 ms (for small clusters)

Large state changes (for example, deploying many services) may take longer to fully propagate.

Read Performance

Reads are always local:

Read latency: Less than 1 ms (SQLite query)
Consistency: May be slightly stale if recent writes haven’t propagated

For the freshest data, query the machine that performed the write.

Scalability

Corrosion’s gossip protocol scales well for small to medium clusters:

Optimal cluster size: 3-20 machines
Maximum tested: 50+ machines
Gossip overhead: Increases with cluster size (O(n) bandwidth per machine)

For very large clusters, you may need to partition services across multiple independent Uncloud clusters.

State Storage Location

Corrosion stores all cluster state in:

/var/lib/uncloud/corrosion/
├── config.toml          # Corrosion configuration
├── store.db             # SQLite database file
├── schema.sql           # Database schema
└── subscriptions/       # Replication metadata

The SQLite database contains:

Machine registry and network configuration
Service definitions and replica status
Volume metadata
DNS records
Deployment history

Never manually edit the SQLite database or Corrosion configuration files. Always use the Uncloud CLI to modify cluster state. Direct edits can cause replication conflicts and state corruption.

Monitoring State Health

To monitor the health of state replication:

# Check Corrosion service status
sudo systemctl status uncloud-corrosion

# View recent replication activity
sudo journalctl -u uncloud-corrosion --since "5 minutes ago"

# Check database size
du -h /var/lib/uncloud/corrosion/store.db

Healthy replication shows:

Regular gossip messages with all peers
Recent successful sync operations
No persistent errors or warnings
Database size growing reasonably with cluster activity
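
To turn the “no persistent errors or warnings” check into a number you can track, you could count suspicious log lines. The count_problems helper is a hypothetical sketch. Matching the words error and warn is a heuristic, and the exact wording of Corrosion's log messages is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: count stdin lines that mention an error or warning.

count_problems() {
    # grep -c prints the number of matching lines; it exits non-zero when the
    # count is 0, so "|| true" keeps the function's exit status clean.
    grep -ciE 'error|warn' || true
}

# Example, using the log command shown above:
# sudo journalctl -u uncloud-corrosion --since "5 minutes ago" | count_problems
```

A count that stays above zero across several checks is a reason to read the matching lines in full.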

Backup and Recovery

Since every machine has a full copy of cluster state, the cluster is inherently resilient to machine failures. To back up cluster state:

# On any machine via SSH
sudo cp /var/lib/uncloud/corrosion/store.db ~/cluster-backup-$(date +%Y%m%d).db
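
A backup is only useful if the copy is intact. The sketch below wraps the copy in a hypothetical backup_file function that verifies the result byte for byte. It assumes the database file is not being modified while the copy runs, and it is not a substitute for a SQLite-level backup of a live database.

```shell
#!/bin/sh
# Hypothetical helper: copy a file into DEST_DIR with a dated name and verify it.
# Usage: backup_file SOURCE DEST_DIR   (prints the backup path on success)

backup_file() (
    src=$1
    dest_dir=$2
    dest="$dest_dir/cluster-backup-$(date +%Y%m%d).db"
    cp "$src" "$dest" || exit 1
    # cmp -s exits 0 only when both files are byte-for-byte identical.
    if ! cmp -s "$src" "$dest"; then
        echo "backup verification failed: $dest" >&2
        exit 1
    fi
    printf '%s\n' "$dest"
)

# Example (run as root so the database file is readable):
# backup_file /var/lib/uncloud/corrosion/store.db "$HOME"
```

If the source file changes between the copy and the check, cmp reports a mismatch, which is a signal to retry the backup.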

To restore state (rarely needed):

Stop the Corrosion service
Replace the SQLite database file
Start the Corrosion service
Wait for reconciliation with other machines

In practice, you rarely need backups because:

Multiple machines have redundant copies
State can be recreated from machine configs
Service definitions are typically stored in git (compose files)

Comparison to Traditional Architectures

Aspect	Uncloud (Decentralized)	Kubernetes (Centralized)
Control plane	None (peer-to-peer)	etcd cluster (requires quorum)
Consistency	Eventual	Strong (linearizable reads)
Partition tolerance	Full (AP system)	Degraded (CP system)
Write path	Local → replicate	Consensus → local
Read path	Local (may be stale)	API server backed by etcd (fresh by default)
Operational complexity	Low	High (manage etcd quorum)

Uncloud trades strong consistency for operational simplicity. For most applications, eventual consistency is perfectly acceptable and the operational benefits are substantial.
