Failure Modes
Distributed Walrus uses Raft consensus to tolerate node failures. Understanding how the system behaves under different failure scenarios is critical for operations.
Fault Tolerance Overview
| Cluster Size | Tolerates | Quorum | Available With |
|---|---|---|---|
| 3 nodes | 1 node failure | 2 of 3 | ✅ 2 nodes |
| 5 nodes | 2 node failures | 3 of 5 | ✅ 3 nodes |
| 7 nodes | 3 node failures | 4 of 7 | ✅ 4 nodes |
In general, for an N-node cluster:
- Quorum size: ⌊N/2⌋ + 1
- Tolerated failures: ⌊N/2⌋
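As a quick sanity check, the arithmetic above in a few lines of Rust:

```rust
// Quorum arithmetic from the formulas above: integer division gives
// floor(N/2), so quorum = N/2 + 1 and tolerated failures = N/2.
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

fn tolerated_failures(n: usize) -> usize {
    n / 2
}

fn main() {
    for n in [3, 5, 7] {
        println!(
            "{n}-node cluster: quorum {} of {n}, tolerates {} failure(s)",
            quorum(n),
            tolerated_failures(n)
        );
    }
}
```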
Single Node Failure
The most common failure scenario: one node crashes or becomes network-partitioned.
Scenario: Follower Node Fails
Initial state:
- Node 1: Raft leader, leads `logs:1`
- Node 2: Follower, leads `logs:2`
- Node 3: Follower, leads `logs:3`
Why doesn't automatic rollover happen?
The system doesn’t automatically reassign segments from failed nodes because:
- Raft membership is the source of truth: Until the node is removed from Raft voters, it’s still considered part of the cluster
- Prevents premature reassignment: If Node 2 comes back in 30 seconds, we avoid unnecessary data movement
- Operator control: Forces explicit decision about whether the failure is temporary or permanent
Impact while Node 2 is down:
- Reads to sealed segments on Node 2: FAIL (data is unavailable)
- Writes to active segment on Node 2: FAIL (leader is down)
- Reads/writes to segments on Node 1 and Node 3: SUCCEED (unaffected)
- Metadata operations: SUCCEED (quorum is maintained)
Short Outage (Node 2 restarts)
Node 2 comes back online, syncs the Raft log from the leader, and resumes serving `logs:2`.
Downtime: 0-5 minutes (depending on restart time).
Prolonged Outage (Node 2 is gone)
The operator explicitly removes Node 2 from the cluster. Removing a node requires Raft API access, which is not exposed via the client protocol yet; a sketch follows the list below.
Effect:
- Membership becomes [1, 3]
- Quorum is now 2 of 2 (both must be alive)
- Future rollovers rotate between Node 1 and Node 3
- Sealed segments on Node 2 remain inaccessible (data loss)
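A minimal sketch of what that removal looks like from the operator's side, assuming a hypothetical `RaftAdminClient` with a `change_membership` call (the real management interface is not exposed yet):

```rust
// Hypothetical admin-side sketch of node removal; `RaftAdminClient`
// and `change_membership` are illustrative names, not the real
// interface (which is not exposed via the client protocol yet).
struct RaftAdminClient;

impl RaftAdminClient {
    /// Propose a new voter set through the Raft leader (stub).
    fn change_membership(&self, voters: &[u64]) -> Result<(), String> {
        println!("proposing voter set {voters:?}");
        Ok(())
    }
}

fn main() -> Result<(), String> {
    let admin = RaftAdminClient;
    // Drop Node 2: membership becomes [1, 3] and quorum becomes 2 of 2.
    admin.change_membership(&[1, 3])
}
```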
Scenario: Leader Node Fails
Initial state:
- Node 1: Raft leader
- Node 2: Follower
- Node 3: Follower
Leader election time: Typically 1-5 seconds depending on election timeout configuration and network latency.
- Metadata operations: Briefly unavailable (1-5 seconds) during election, then resume
- Writes to Node 1’s segments: FAIL until Node 1 recovers or segments are reassigned
- Writes to Node 2 and Node 3’s segments: SUCCEED after election completes
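Because metadata operations resume once the election completes, clients can ride out the window with retries. A sketch under assumed client names (`MetadataClient` and `get_topic_owner` are illustrative, not the actual API):

```rust
// Hedged client-side pattern for riding out the 1-5 second election
// window: retry metadata operations with backoff rather than failing
// fast. All names here are illustrative assumptions.
use std::{thread, time::Duration};

enum MetaError {
    NoLeader,
    Other(String),
}

struct MetadataClient;

impl MetadataClient {
    fn get_topic_owner(&self, _topic: &str) -> Result<u64, MetaError> {
        Err(MetaError::NoLeader) // stub: election in progress
    }
}

fn get_owner_with_retry(client: &MetadataClient, topic: &str) -> Result<u64, MetaError> {
    let mut delay = Duration::from_millis(100);
    for _ in 0..8 {
        match client.get_topic_owner(topic) {
            // No leader yet: back off and retry while the election runs.
            Err(MetaError::NoLeader) => {
                thread::sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(2));
            }
            other => return other,
        }
    }
    Err(MetaError::NoLeader)
}

fn main() {
    let client = MetadataClient;
    if get_owner_with_retry(&client, "logs").is_err() {
        eprintln!("metadata still unavailable after retries");
    }
}
```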
Multiple Node Failures
Two Nodes Fail (3-Node Cluster)
Scenario:
- Nodes 2 and 3 crash simultaneously
- Only Node 1 remains
Recovery options:
- Restart Failed Nodes
- Force Single-Node Mode (Data Loss Risk)
Restart Failed Nodes: bring Node 2 or Node 3 back online to restore quorum.
Effect:
- Quorum restored (2 of 3)
- Raft elects new leader
- Cluster resumes operation
- Data on Node 1 is preserved
Network Partition (Split Brain Prevention)
Scenario: A network partition splits the cluster into two groups.

| Partition | Nodes | Quorum | Status |
|---|---|---|---|
| A | Node 1, Node 2 | ✅ 2 of 3 | Operational |
| B | Node 3 | ❌ 1 of 3 | Unavailable |
Partition A (majority):
- Elects a leader (Node 1 or Node 2)
- Continues accepting reads/writes
- Can commit metadata changes
Partition B (minority):
- Cannot elect a leader (no quorum)
- Rejects all writes
- Cannot serve reads that require forwarding
Split-brain protection: only one partition (the majority) can operate, which prevents conflicting writes.
When the partition heals:
- Partition B did NOT accept writes (no quorum)
- Partition A’s writes are authoritative
- No conflict resolution needed
Data Loss Scenarios
Sealed Segment on Failed Node
Problem:
- Segment `logs:1` sealed on Node 2 (1,000,000 entries)
- Node 2's disk fails (data unrecoverable)
- Reads for entries 0-999,999 fail
- Distributed Walrus does NOT replicate sealed segment data across nodes
- Each sealed segment exists only on its original leader
- If that node’s disk fails, the data is lost
Mitigation options:
- External Backups
- Disk RAID
- Object Storage Archival
- Replication (Future)
External Backups: periodically snapshot the `user_data/` directory and restore from backup after node recovery.
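A minimal sketch of the snapshot step, assuming a plain recursive directory copy; a production setup would prefer filesystem snapshots or object-storage uploads:

```rust
// Minimal external-backup sketch: recursively copy user_data/ to a
// backup location. Paths and cadence are assumptions. Note that
// naively copying a live WAL directory can race with writes, so run
// this against a stopped node or a filesystem snapshot.
use std::{fs, io, path::Path};

fn snapshot_dir(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            snapshot_dir(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    snapshot_dir(Path::new("user_data"), Path::new("backups/user_data"))
}
```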
Active Segment on Failed Node
Problem:
- Segment `logs:5` active on Node 1 (500,000 entries written)
- Node 1 crashes before the segment is sealed
- In-progress writes may be lost
What is safe:
- Walrus WAL is durable (fsynced to disk)
- Entries written before crash are recoverable
- Raft metadata correctly reflects sealed segments
What may be lost:
- Entries in OS page cache not yet fsynced (typically <100ms worth)
- In-flight writes that received `OK` but crashed before fsync
Example: after Node 1 restarts:
- Node 1's offset tracker: reset to 0 (in-memory state lost)
- Actual WAL entries: 499,950 (recovered from disk)
- Monitor loop recounts: queries Walrus `get_topic_size()`
The system safely handles restarts. Offset tracking is best-effort for rollover triggers; actual entry counts come from Walrus at recovery time.
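A sketch of that recount; only `get_topic_size()` is named on this page, so the `Walrus` handle here is an illustrative stand-in:

```rust
// Sketch of the recovery recount: the in-memory offset tracker is
// gone after a crash, so the monitor loop re-derives entry counts
// from the durable WAL. The `Walrus` stub is illustrative; only
// `get_topic_size` comes from the docs.
struct Walrus;

impl Walrus {
    /// Durable entry count for a topic (stub value from the example).
    fn get_topic_size(&self, _topic: &str) -> u64 {
        499_950
    }
}

fn main() {
    let walrus = Walrus;
    // Trust the WAL, not the lost in-memory counter (which reset to 0).
    let recovered = walrus.get_topic_size("logs:5");
    println!("logs:5: recovered {recovered} entries from disk");
}
```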
Operational Procedures
Graceful Node Shutdown
To minimize disruption:
Trigger Rollover (If Leader)
If the node leads any active segments, wait for or manually trigger rollover:
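A hedged sketch of the shutdown sequence; the rollover interface is not specified here, so `AdminClient`, `topics_led_by`, and `trigger_rollover` are illustrative names:

```rust
// Hedged shutdown sequence; all names are assumptions standing in
// for whatever interface performs the rollover.
struct AdminClient;

impl AdminClient {
    /// Topics whose active segment this node currently leads (stub).
    fn topics_led_by(&self, _node: u64) -> Vec<String> {
        vec!["logs:2".to_string()]
    }
    /// Ask the cluster to roll the active segment to another leader.
    fn trigger_rollover(&self, topic: &str) {
        println!("rolling {topic} over to a new leader");
    }
}

fn main() {
    let admin = AdminClient;
    let node = 2;
    for topic in admin.topics_led_by(node) {
        // Hand off leadership first so writers fail over cleanly
        // instead of erroring against a dead leader.
        admin.trigger_rollover(&topic);
    }
    println!("node {node}: safe to stop the process");
}
```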
Planned Node Replacement
Replace Node 2 with a new Node 4:
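A hedged sketch of the sequence, reusing the illustrative `RaftAdminClient` shape from the removal sketch above (repeated so this stands alone); adding the new node before removing the old one keeps a majority alive throughout:

```rust
// Hypothetical replacement sequence; `RaftAdminClient` and
// `change_membership` are illustrative names, not the actual API.
struct RaftAdminClient;

impl RaftAdminClient {
    fn change_membership(&self, voters: &[u64]) -> Result<(), String> {
        println!("proposing voter set {voters:?}");
        Ok(())
    }
}

fn main() -> Result<(), String> {
    let admin = RaftAdminClient;
    admin.change_membership(&[1, 2, 3, 4])?; // add Node 4 as a voter
    admin.change_membership(&[1, 3, 4]) // then remove Node 2
}
```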
Note: Sealed segments on Node 2 become unavailable. Restore from backups if needed.
Disaster Recovery
Scenario: All nodes crash simultaneously (data center power loss, etc.).
- All Disks Intact
- Partial Disk Failure
- All Disks Lost
All Disks Intact: restart all nodes in any order.
Recovery:
- Raft metadata loaded from disk
- State machine restored
- Cluster resumes from last committed state
- No data loss (Walrus WAL is durable)
Monitoring and Alerts
Set up alerts for these conditions:
Raft Health Checks
- `current_leader == null` for >10 seconds → No leader elected
- `current_leader` differs across nodes → Split brain or stale state
- `state == "Candidate"` for >30 seconds → Election failing
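A minimal per-node check for the first and third conditions, assuming a status struct with `current_leader` and `state` fields (the endpoint shape is an assumption; the thresholds come from the list above):

```rust
// Minimal alert sketch for the Raft health conditions above. Only
// the field names and thresholds come from the list; how the status
// is fetched is left as an assumption.
use std::time::{Duration, Instant};

struct NodeStatus {
    current_leader: Option<u64>,
    state: &'static str, // "Leader" | "Follower" | "Candidate"
}

fn check_raft_health(status: &NodeStatus, leaderless_since: Instant, candidate_since: Instant) {
    if status.current_leader.is_none() && leaderless_since.elapsed() > Duration::from_secs(10) {
        eprintln!("ALERT: no leader elected for >10s");
    }
    if status.state == "Candidate" && candidate_since.elapsed() > Duration::from_secs(30) {
        eprintln!("ALERT: election failing (Candidate for >30s)");
    }
    // Comparing `current_leader` across all nodes (the split-brain
    // check) belongs in the aggregator, not in this per-node check.
}

fn main() {
    let status = NodeStatus { current_leader: None, state: "Candidate" };
    let now = Instant::now();
    check_raft_health(&status, now, now);
}
```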
Node Availability
- Client port unreachable → Node down
- Raft port unreachable → Node partitioned
Segment Leader Distribution
- Skewed distribution (e.g., Node 1 leads 80% of segments) → Load imbalance
- Leader for active segment is down → Writes failing
Disk Usage
- Disk >80% full → Risk of write failures
- One node significantly larger → Uneven segment distribution
Testing Failure Scenarios
Use the test suite to validate recovery; a hypothetical failure-injection sketch follows.
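If you are writing your own failure-injection tests, one possible shape (the stub `TestCluster` stands in for the real harness, which is not shown here):

```rust
// Hypothetical failure-injection test shape; `TestCluster` is a stub
// for the real harness. The assertions mirror the fault-tolerance
// table: a 3-node cluster keeps quorum with one node down.
struct TestCluster {
    alive: Vec<bool>,
}

impl TestCluster {
    fn start(n: usize) -> Self {
        Self { alive: vec![true; n] }
    }
    fn kill_node(&mut self, id: usize) {
        self.alive[id - 1] = false;
    }
    fn restart_node(&mut self, id: usize) {
        self.alive[id - 1] = true;
    }
    fn has_quorum(&self) -> bool {
        let up = self.alive.iter().filter(|a| **a).count();
        up >= self.alive.len() / 2 + 1
    }
}

#[test]
fn follower_crash_and_recovery() {
    let mut cluster = TestCluster::start(3);
    cluster.kill_node(2);
    assert!(cluster.has_quorum()); // 2 of 3: still operational
    cluster.kill_node(3);
    assert!(!cluster.has_quorum()); // 1 of 3: unavailable
    cluster.restart_node(3);
    assert!(cluster.has_quorum()); // quorum restored
}
```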
Best Practices
Deploy in Multiple Availability Zones
Distribute nodes across AZs to tolerate zone failures.
Trade-off: higher Raft latency (cross-AZ network), but better availability.
Automate Node Replacement
Use orchestration (Kubernetes, Nomad) to automatically replace failed nodes.
Monitor Raft Lag
Track `last_log_index` vs `last_applied` on followers; high lag indicates network issues or a slow disk.
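A minimal lag check; how the two counters are fetched from each follower is left as an assumption:

```rust
// Minimal lag check. The `last_log_index` / `last_applied` names come
// from the text above; the metrics transport is an assumption.
struct FollowerMetrics {
    last_log_index: u64,
    last_applied: u64,
}

fn check_lag(node: u64, m: &FollowerMetrics, threshold: u64) {
    let lag = m.last_log_index.saturating_sub(m.last_applied);
    if lag > threshold {
        // Persistent high lag points at network issues or a slow disk.
        eprintln!("WARN: node {node} apply lag is {lag} entries");
    }
}

fn main() {
    let m = FollowerMetrics { last_log_index: 1_500, last_applied: 200 };
    check_lag(2, &m, 1_000);
}
```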
Set Up Log Aggregation
Centralize logs from all nodes; this makes diagnosing failures across nodes easier.
Next Steps
Deployment Guide
Review deployment best practices
Segment Management
Understand rollover and leadership