Effective monitoring is crucial for operating Walrus in production. This guide covers the metrics available, how to access them, and what to monitor for healthy cluster operation.
METRICS Command
The primary way to monitor a Walrus cluster is through the METRICS command, which returns a JSON snapshot of Raft consensus state and cluster health.
Using the CLI
Example Output
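An illustrative snapshot assembled from the fields documented below; exact field names and nesting may differ by version:

```json
{
  "state": "Leader",
  "current_leader": 1,
  "current_term": 7,
  "last_log_index": 1524,
  "last_applied": 1524,
  "replication": {
    "2": { "match_index": 1524 },
    "3": { "match_index": 1519 }
  },
  "membership": {
    "voters": [1, 2, 3],
    "learners": []
  }
}
```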
Key Metrics to Monitor
Cluster Health Metrics
`state` — Current Raft state of this node. Valid values: `Leader`, `Follower`, `Candidate`.
- Expected: One node reports `Leader`; all others report `Follower`.
- Alert if: Multiple nodes report `Leader` (split-brain) or all nodes report `Candidate` (no quorum).

`current_leader` — Node ID of the current Raft leader.
- Expected: All nodes report the same leader ID.
- Alert if: Nodes disagree on the leader, or the value is `null` (no leader elected).

`current_term` — Current Raft term number; increases with each leader election.
- Expected: Increases slowly over time (only during planned restarts or failures).
- Alert if: Rapidly increasing (indicates frequent elections / network issues).
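Taken together, these rules can be checked programmatically. A sketch, assuming one parsed METRICS JSON object per node (the `node_id` key and the per-follower shape of `replication` are assumptions):

```python
def check_cluster(snapshots):
    """Evaluate one METRICS snapshot per node against the alert rules above."""
    alerts = []
    # Exactly one Leader is healthy; more than one means split-brain.
    leaders = [s["node_id"] for s in snapshots if s["state"] == "Leader"]
    if len(leaders) > 1:
        alerts.append(f"split-brain: multiple leaders {leaders}")
    if all(s["state"] == "Candidate" for s in snapshots):
        alerts.append("no quorum: every node is a candidate")
    if len({s.get("current_leader") for s in snapshots}) > 1:
        alerts.append("nodes disagree on current_leader")
    # Replication lag: only the leader carries the replication map.
    for s in snapshots:
        for follower, repl in (s.get("replication") or {}).items():
            lag = s["last_log_index"] - repl["match_index"]
            if lag > 100:
                alerts.append(f"follower {follower} lagging by {lag} entries")
    return alerts

healthy = [
    {"node_id": 1, "state": "Leader", "current_leader": 1,
     "last_log_index": 1524,
     "replication": {"2": {"match_index": 1524}, "3": {"match_index": 1519}}},
    {"node_id": 2, "state": "Follower", "current_leader": 1},
    {"node_id": 3, "state": "Follower", "current_leader": 1},
]
print(check_cluster(healthy))  # -> []
```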
Replication Health
`replication` — Per-follower replication status (only present on the leader).
- Expected: `match_index` close to `last_log_index` (< 100 entries behind).
- Alert if: Large gap between `match_index` and `last_log_index` (follower lagging).

`last_log_index` — Index of the last entry in the Raft log.
- Expected: Monotonically increasing as cluster metadata changes occur.
`last_applied` — Index of the last entry applied to the state machine.
- Expected: Equal to `last_log_index` (all committed entries applied).
- Alert if: `last_applied` significantly behind `last_log_index`.

Membership Configuration
Current cluster membership.
- Expected: All expected nodes present in `voters`.
- Alert if: Missing nodes or unexpected learners.

STATE Command (Topic Metadata)
Use the STATE command to inspect individual topic configuration and segment status:
Example Output
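An illustrative response assembled from the fields documented below (the `topic`, `current_segment`, and `segment_leader` names are assumptions; `sealed_segments` and `segment_leaders` are described next):

```json
{
  "topic": "events",
  "current_segment": 4,
  "segment_leader": 2,
  "sealed_segments": { "0": 99987, "1": 100000, "2": 99992, "3": 99975 },
  "segment_leaders": { "0": 1, "1": 2, "2": 3, "3": 1 }
}
```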
Topic Health Metrics
Active segment ID currently accepting writes.
- Expected: Increases over time as segments roll over.

Node ID responsible for the current segment.
- Expected: Rotates across cluster nodes with each rollover.

`sealed_segments` — Map of segment ID to final entry count for completed segments.
- Expected: Entry counts close to `WALRUS_MAX_SEGMENT_ENTRIES`.

`segment_leaders` — Historical record of which node led each segment.
- Use case: Identify where to read sealed segment data.
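As a quick sanity check, sealed-segment entry counts can be compared against the configured maximum; a sketch (the constant's value here is a placeholder for your actual `WALRUS_MAX_SEGMENT_ENTRIES` setting):

```python
# Placeholder for this deployment's WALRUS_MAX_SEGMENT_ENTRIES value.
MAX_SEGMENT_ENTRIES = 100_000

def early_rollovers(sealed_segments, threshold=0.5):
    """Return IDs of sealed segments whose final entry count is well below
    the configured maximum -- a hint that segments are rolling over early."""
    return [seg for seg, count in sealed_segments.items()
            if count < MAX_SEGMENT_ENTRIES * threshold]

print(early_rollovers({"0": 99987, "1": 12000}))  # -> ['1']
```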
System-Level Monitoring
Log Monitoring
Monitor logs for errors and warnings using standard log aggregation tools.

Critical Log Patterns
INFO patterns indicating normal operation:
- "Node 1 booting" - Startup
- "Registered node address" - Successful registration
- "Monitor loop started" - Background tasks running
- "Client listener bound" - Ready to accept connections
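A minimal sketch of matching these patterns over a batch of log lines, suitable for a log-pipeline hook (the ERROR/WARN regex is an assumption about Walrus's log format):

```python
import re

# Lines worth alerting on, vs. the normal-operation INFO markers above.
CRITICAL = re.compile(r"\b(ERROR|WARN)\b")
HEALTHY = [
    "Node 1 booting",
    "Registered node address",
    "Monitor loop started",
    "Client listener bound",
]

def scan(lines):
    """Return (critical_line_count, healthy_markers_seen) for a log batch."""
    errors = sum(1 for ln in lines if CRITICAL.search(ln))
    seen = {m for m in HEALTHY for ln in lines if m in ln}
    return errors, seen
```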
Disk Space Monitoring
Monitor disk usage for the data directory.

Network Monitoring
Monitor inter-node communication:

- Raft leader → followers: Regular heartbeats (every ~150ms)
- Client → any node: Request/response traffic
- Non-leader → leader: Forwarded write operations
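Both the disk and network checks above can be scripted with the standard library alone; a sketch (the data-directory path and port numbers you pass in are deployment-specific):

```python
import shutil
import socket

def disk_usage_percent(path):
    """Percent of the filesystem used at `path` (point it at the data directory)."""
    u = shutil.disk_usage(path)
    return 100.0 * u.used / u.total

def port_listening(host, port, timeout=1.0):
    """True if something accepts TCP connections on (host, port),
    e.g. a node's Raft or client port."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

# e.g. disk_usage_percent("/var/lib/walrus"); port_listening("10.0.0.2", 9001)
# (path and port above are assumptions, not Walrus defaults)
```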
Prometheus Integration (Future)
While Walrus doesn’t currently expose Prometheus metrics directly, you can build a scraper around the METRICS command.

Health Checks
Implement health checks for load balancers and orchestration systems:

Basic Health Check
Leader Check
Write Health Check
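The first two checks can be sketched as below. The wire protocol shown (a newline-delimited text command answered with one JSON document) is an assumption; a write health check would additionally append to a dedicated canary topic and verify the acknowledgement, which depends on protocol details not covered here.

```python
import json
import socket

def fetch_metrics(host, port, timeout=2.0):
    """Issue METRICS over the client port and parse the JSON reply.
    Protocol framing is an assumption; adapt to the real client protocol."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"METRICS\n")
        data = sock.recv(65536)  # sketch: assumes the reply fits one recv
    return json.loads(data)

def basic_health_check(host, port):
    """Liveness: the node answers METRICS with a sane Raft state."""
    try:
        return fetch_metrics(host, port).get("state") in ("Leader", "Follower")
    except (OSError, ValueError):
        return False

def leader_check(host, port):
    """Readiness for writes: passes only on the node reporting Leader."""
    try:
        return fetch_metrics(host, port).get("state") == "Leader"
    except (OSError, ValueError):
        return False
```

Wire `basic_health_check` into a generic load-balancer probe; reserve `leader_check` for pools that must route writes straight to the leader.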
Monitoring Dashboards
Essential Metrics Dashboard
Create a monitoring dashboard with these panels:

Cluster Overview
- Raft state: Visual indicator per node (Leader/Follower/Candidate)
- Current term: Line graph showing election frequency
- Leader ID: Single value showing current leader
- Quorum status: Boolean (healthy/unhealthy)
Replication Status
- Replication lag: Per-follower lag in log entries
- Last applied index: Ensure state machine is caught up
- Snapshot status: Last snapshot index and term
Topic Health
- Active segments: Current segment ID per topic
- Segment distribution: Leader distribution across nodes
- Rollover frequency: Segments created over time
System Resources
- Disk usage: Per-node data directory size
- Network traffic: Raft + client port traffic rates
- Log errors: Count of ERROR log lines per minute
Alerting Rules
Configure alerts for critical conditions:

Critical Alerts
Warning Alerts
Investigation Recommended
- Frequent leader elections (> 1 per hour)
- Follower replication lag > 100 entries
- Disk usage > 70%
- Elevated error log rates
- Slow rollover processing
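The numeric thresholds above translate directly into code; a sketch over values you would sample from METRICS and the OS:

```python
def investigation_needed(elections_last_hour, max_follower_lag, disk_used_fraction):
    """Apply the 'investigation recommended' thresholds listed above."""
    findings = []
    if elections_last_hour > 1:
        findings.append("frequent leader elections")
    if max_follower_lag > 100:
        findings.append("follower replication lag above 100 entries")
    if disk_used_fraction > 0.70:
        findings.append("disk usage above 70%")
    return findings

print(investigation_needed(0, 12, 0.45))  # -> []
```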
Monitoring Best Practices
Track trends over time
Watch for gradual degradation: increasing lag, slower rollovers, growing disk usage
Troubleshooting with Metrics
See the Troubleshooting guide for specific issues, but here are quick diagnostic patterns:

| Symptom | Check These Metrics | Likely Cause |
|---|---|---|
| Writes failing | current_leader, state | No leader elected |
| Slow reads | segment_leaders, STATE | Reading from wrong node |
| Frequent elections | current_term (rapidly increasing) | Network instability |
| Lag warnings | replication[N].match_index | Follower overloaded |
| Out of disk | Disk usage + sealed_segments | Retention not configured |
Next Steps
Performance Tuning
Optimize cluster performance
Troubleshooting
Diagnose and fix common issues