
Walrus is designed for high-performance streaming workloads. This guide covers optimization strategies, performance characteristics, and tuning parameters to maximize throughput and minimize latency.

Performance Overview

Benchmark Results

Walrus delivers exceptional performance compared to similar systems:

Without Fsync (Maximum Throughput)

| System  | Avg Throughput (writes/s) | Avg Bandwidth (MB/s) | Max Throughput (writes/s) | Max Bandwidth (MB/s) |
|---------|---------------------------|----------------------|---------------------------|----------------------|
| Walrus  | 1,205,762                 | 876.22               | 1,593,984                 | 1,158.62             |
| Kafka   | 1,112,120                 | 808.33               | 1,424,073                 | 1,035.74             |
| RocksDB | 432,821                   | 314.53               | 1,000,000                 | 726.53               |

With Fsync (Durability Enabled)

| System  | Avg Throughput (writes/s) | Avg Bandwidth (MB/s) | Max Throughput (writes/s) | Max Bandwidth (MB/s) |
|---------|---------------------------|----------------------|---------------------------|----------------------|
| RocksDB | 5,222                     | 3.79                 | 10,486                    | 7.63                 |
| Walrus  | 4,980                     | 3.60                 | 11,389                    | 8.19                 |
| Kafka   | 4,921                     | 3.57                 | 11,224                    | 8.34                 |
The benchmarks compare a single Kafka broker (no replication) and RocksDB’s WAL against Walrus using plain pwrite() syscalls. With io_uring batching enabled, Walrus can achieve significantly higher throughput.

io_uring: The Performance Multiplier

The most significant performance optimization in Walrus is io_uring support on Linux. io_uring provides:
  • Batched I/O submission: Submit multiple operations in a single syscall
  • Kernel-level parallelism: Concurrent disk operations
  • Reduced context switches: Fewer transitions between user and kernel space
  • Zero-copy operations: Direct memory access for reads/writes

Enabling io_uring

io_uring is enabled by default on Linux when using the FD (file descriptor) backend:
use walrus_rust::Walrus;

// io_uring is automatically used for batch operations on Linux
let wal = Walrus::new()?;

// Batch write - uses io_uring on Linux
let batch = vec![b"msg1".as_slice(), b"msg2".as_slice()];
wal.batch_append_for_topic("events", &batch)?;
Only disable io_uring if you encounter compatibility issues:
# Environment variable
export WALRUS_DISABLE_IO_URING=1

Or in code:
use walrus_rust::disable_fd_backend;
disable_fd_backend();
Disabling io_uring can reduce batch operation throughput by 3-10x. Only disable for debugging or compatibility reasons.

io_uring Requirements

  • OS: Linux kernel 5.1+ (5.6+ recommended)
  • Libraries: liburing installed
  • Architecture: x86_64, ARM64, or other supported platforms
Check io_uring availability:
# Check kernel version
uname -r
# Should be >= 5.1.0

# Check if liburing is available
ldconfig -p | grep liburing

Batch Operations

Batch operations are the key to high throughput. They allow Walrus to amortize overhead across multiple entries.

Batch Writes

use walrus_rust::Walrus;

let wal = Walrus::new()?;

// Atomic batch write (all-or-nothing)
let batch: Vec<&[u8]> = vec![
    b"entry 1",
    b"entry 2",
    b"entry 3",
    // ... up to 2,000 entries
];

wal.batch_append_for_topic("events", &batch)?;
Performance impact:
  • Batch write: 1 io_uring submission for 1,000 entries
  • Single writes: 1,000 separate syscalls
  • Speedup: 3-10x depending on entry size
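
To see the effect in isolation, the sketch below times 1,000 single-entry calls against one 1,000-entry call, using only the batch_append_for_topic API shown above (topic names and counts here are arbitrary):

use std::time::Instant;
use walrus_rust::Walrus;

let wal = Walrus::new()?;
let entries: Vec<&[u8]> = vec![b"payload"; 1_000];

// One entry per call: one submission per entry
let start = Instant::now();
for entry in &entries {
    wal.batch_append_for_topic("bench-single", &[*entry])?;
}
let single = start.elapsed();

// All 1,000 entries in one call: a single io_uring submission on Linux
let start = Instant::now();
wal.batch_append_for_topic("bench-batch", &entries)?;
let batched = start.elapsed();

println!("single: {:?}, batched: {:?}", single, batched);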

Batch Reads

use walrus_rust::Walrus;

let wal = Walrus::new()?;

// Read up to 1MB of data in one call
let max_bytes = 1024 * 1024;
let entries = wal.batch_read_for_topic("events", max_bytes, true)?;

// Process entries
for entry in entries {
    println!("Read {} bytes", entry.data.len());
}
Best practices:
  • Set max_bytes based on your processing buffer size
  • At least 1 entry is always returned if available
  • Maximum 2,000 entries per call (configurable limit)
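
A consumer can drain a topic by looping until a call returns no entries. A minimal sketch, assuming an empty result means the reader has caught up:

use walrus_rust::Walrus;

let wal = Walrus::new()?;
let max_bytes = 1024 * 1024; // match your processing buffer size

loop {
    let entries = wal.batch_read_for_topic("events", max_bytes, true)?;
    if entries.is_empty() {
        break; // caught up; sleep or wait for a notification before polling again
    }
    for entry in entries {
        // process entry.data here
        let _ = entry.data.len();
    }
}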

Batch Size Limits

MAX_BATCH_ENTRIES (integer, default: 2000)
Maximum entries per batch operation (read or write). Why 2,000? The default io_uring submission queue size is 2,047 entries; staying below this ensures all operations fit in a single submission.

MAX_BATCH_BYTES (integer, default: 10GB)
Maximum total payload size per batch write. Why 10GB? Large enough for most use cases while keeping memory usage bounded during the planning phase.
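
If a producer accumulates more than 2,000 entries, it can split the work with slice chunks so each call stays within MAX_BATCH_ENTRIES. A sketch; note that atomicity then applies per chunk, not across the whole set:

use walrus_rust::Walrus;

const MAX_BATCH_ENTRIES: usize = 2_000;

let wal = Walrus::new()?;
let entries: Vec<&[u8]> = vec![b"payload"; 5_000];

// Each chunk becomes one batch call; each chunk is atomic on its own
for chunk in entries.chunks(MAX_BATCH_ENTRIES) {
    wal.batch_append_for_topic("events", chunk)?;
}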

Segment Size Tuning

Segment size affects rollover frequency and load distribution across nodes.

Default Configuration

export WALRUS_MAX_SEGMENT_ENTRIES=1000000  # 1 million entries
With 1KB average entry size:
  • Segment size: ~1GB
  • Rollover time at 10k writes/sec: ~100 seconds
  • Leadership rotates every segment

Tuning Guidelines

Increase segment size to reduce rollover overhead:
export WALRUS_MAX_SEGMENT_ENTRIES=5000000  # 5 million entries
Benefits:
  • Fewer Raft consensus operations
  • Less frequent leadership rotation
  • Reduced metadata updates
Tradeoffs:
  • Longer time to distribute load to new nodes
  • Larger disk space per segment
Decrease segment size for faster load distribution:
export WALRUS_MAX_SEGMENT_ENTRIES=500000  # 500k entries
Benefits:
  • More frequent leadership rotation
  • Faster load distribution across nodes
  • Quicker adaptation to cluster changes
Tradeoffs:
  • More rollover overhead
  • More frequent Raft consensus operations
Decrease segment entries to maintain reasonable segment sizes:
# With 100KB average entry size
export WALRUS_MAX_SEGMENT_ENTRIES=100000  # Still ~10GB per segment
Rule of thumb: Target 1-10GB per segment regardless of entry count.
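
The arithmetic behind that rule of thumb is simple. A back-of-envelope sketch, where the entry size is a placeholder for your measured workload:

// Pick WALRUS_MAX_SEGMENT_ENTRIES so each segment lands in the 1-10GB range
let avg_entry_bytes: u64 = 4 * 1024;                     // measured average entry size
let target_segment_bytes: u64 = 4 * 1024 * 1024 * 1024;  // 4GB target segment
let max_segment_entries = target_segment_bytes / avg_entry_bytes;
assert_eq!(max_segment_entries, 1_048_576);
println!("export WALRUS_MAX_SEGMENT_ENTRIES={}", max_segment_entries);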

Monitor Interval Tuning

export WALRUS_MONITOR_CHECK_MS=10000  # Check every 10 seconds (default)
Faster rollover detection (lower values):
  • Reduce for development/testing: WALRUS_MONITOR_CHECK_MS=1000
  • Keep default (10s) for production - frequent checks add overhead
Slower rollover detection (higher values):
  • Increase if rollover overhead is noticeable: WALRUS_MONITOR_CHECK_MS=30000
  • Don’t set it too high, as long intervals delay load balancing

Fsync Scheduling

Fsync scheduling controls the durability vs. throughput tradeoff.

Strategies

use walrus_rust::{FsyncSchedule, ReadConsistency, Walrus};

// Fsync every 200ms (default)
let wal = Walrus::with_consistency_and_schedule(
    ReadConsistency::StrictlyAtOnce,
    FsyncSchedule::Milliseconds(200)
)?;
Characteristics:
  • Batches fsyncs across multiple writes
  • Good balance of durability and throughput
  • Risk: Up to 200ms of data loss on crash
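
The performance checklist at the end of this guide mentions disabling fsync entirely for testing. Assuming that variant is spelled FsyncSchedule::NoFsync (verify against the walrus_rust API docs), it would look like:

use walrus_rust::{FsyncSchedule, ReadConsistency, Walrus};

// Testing/benchmarks only: without fsync, a crash can lose an unbounded
// amount of buffered data. Variant name assumed, not confirmed.
let wal = Walrus::with_consistency_and_schedule(
    ReadConsistency::StrictlyAtOnce,
    FsyncSchedule::NoFsync,
)?;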

Read Consistency Tuning

Read consistency affects how often read cursors are persisted.

StrictlyAtOnce (Default)

use walrus_rust::{ReadConsistency, Walrus};

let wal = Walrus::with_consistency(ReadConsistency::StrictlyAtOnce)?;
Behavior:
  • Every read checkpoint is immediately persisted
  • Guarantees exactly-once delivery after restart
  • Higher durability, moderate read throughput
Use when: Duplicate processing is unacceptable

AtLeastOnce (Higher Throughput)

let wal = Walrus::with_consistency(
    ReadConsistency::AtLeastOnce { persist_every: 1000 }
)?;
Behavior:
  • Persist cursor every N reads
  • After restart, may replay up to N entries
  • Lower durability, higher read throughput
Use when: Duplicate processing is acceptable, throughput is critical

Tuning persist_every

// Low latency, low replay risk
ReadConsistency::AtLeastOnce { persist_every: 100 }

// High throughput, accept more replay
ReadConsistency::AtLeastOnce { persist_every: 10000 }

// Maximum throughput, large replay window
ReadConsistency::AtLeastOnce { persist_every: 100000 }
Balance based on your processing cost:
  • Expensive processing: Lower persist_every (100-1000)
  • Cheap processing: Higher persist_every (10000+)
  • Idempotent processing: Maximum persist_every
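
One way to keep this knob tunable without recompiling is to have the application read persist_every from its own environment variable. A sketch where WALRUS_PERSIST_EVERY is a hypothetical name parsed by your code, not a variable Walrus itself reads:

use walrus_rust::{ReadConsistency, Walrus};

// Hypothetical application-level knob; Walrus does not read this variable itself
let persist_every: usize = std::env::var("WALRUS_PERSIST_EVERY")
    .ok()
    .and_then(|v| v.parse().ok())
    .unwrap_or(1_000); // moderate default

let wal = Walrus::with_consistency(ReadConsistency::AtLeastOnce { persist_every })?;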

Storage Backend Selection

FD Backend (Default)

use walrus_rust::enable_fd_backend;

enable_fd_backend(); // Default, explicit call not needed
Advantages:
  • io_uring support on Linux
  • Better batch operation performance
  • O_SYNC support for SyncEach mode
Disadvantages:
  • Unix-only (not available on Windows)

Mmap Backend

use walrus_rust::disable_fd_backend;

disable_fd_backend(); // Switch to mmap
Advantages:
  • Cross-platform (works on Windows)
  • Direct memory access for reads
  • No file descriptor limits
Disadvantages:
  • No io_uring support
  • Lower batch operation performance
  • More complex memory management
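
Since the FD backend is Unix-only, a portable binary can pick its backend at startup. A sketch using the two toggles shown above:

// Fall back to mmap on Windows; keep the FD backend (and io_uring) elsewhere
#[cfg(windows)]
walrus_rust::disable_fd_backend();
#[cfg(not(windows))]
walrus_rust::enable_fd_backend(); // the default; the explicit call is optional

let wal = walrus_rust::Walrus::new()?;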

Distributed Cluster Tuning

Lease Synchronization

The lease synchronization loop runs every 100ms by default. This interval is hardcoded but strikes a good balance:
  • Fast enough for quick leader transitions
  • Low enough overhead (roughly 0.1% CPU per node)
If you need faster leader transitions, you’ll need to modify the source code in controller/mod.rs. However, values below 50ms may increase CPU usage noticeably.

Raft Configuration

Most Raft parameters are configured through Octopii (the Raft implementation); the key tuning points are:

heartbeat_interval (duration, default: 150ms)
How often the leader sends heartbeats to followers. Tuning: lower values detect failures faster but increase network traffic.

election_timeout (duration, default: 300-600ms)
How long a follower waits before starting an election. Tuning: higher values reduce spurious elections, lower values speed up failure recovery.

System-Level Optimizations

Linux I/O Scheduler

# Check current scheduler
cat /sys/block/sda/queue/scheduler

# Set to none (for NVMe) or mq-deadline (for SSD)
echo none | sudo tee /sys/block/sda/queue/scheduler
Recommended schedulers:
  • NVMe drives: none (no scheduler overhead)
  • SSD: mq-deadline or bfq
  • HDD: deadline (HDDs themselves are not recommended for Walrus)

File System Tuning

# Mount with noatime for better performance
mount -o remount,noatime,nodiratime /data

# XFS is recommended for large files
mkfs.xfs -f /dev/sdb
mount -o noatime,nodiratime /dev/sdb /data
Recommended file systems:
  1. XFS - Best for large sequential writes
  2. ext4 - Good all-around performance
  3. ZFS - Advanced features, higher overhead

CPU Governor

# Set CPU to performance mode
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Performance Profiling

Measuring Throughput

use std::time::Instant;
use walrus_rust::Walrus;

let wal = Walrus::new()?;
let batch: Vec<&[u8]> = vec![b"test"; 1000];

let start = Instant::now();
for _ in 0..100 {
    wal.batch_append_for_topic("bench", &batch)?;
}
let elapsed = start.elapsed();

let total_entries = 100 * 1000;
let throughput = total_entries as f64 / elapsed.as_secs_f64();
println!("Throughput: {:.0} entries/sec", throughput);

Profiling with perf

# Record performance data
sudo perf record -g cargo run --release --bin your-app

# Analyze results
sudo perf report
Look for hotspots in:
  • io_uring submission/completion
  • Memory allocation
  • Serialization/deserialization

Performance Checklist

1. Enable io_uring: ensure Linux kernel >= 5.6 and the FD backend is enabled.
2. Use batch operations: replace single writes with batches (up to 2,000 entries).
3. Tune the fsync schedule: set to 200-500ms for most workloads, NoFsync for testing.
4. Choose a consistency model: use AtLeastOnce with a high persist_every for maximum throughput.
5. Adjust segment size: balance rollover frequency against load distribution needs.
6. Optimize the system: configure the I/O scheduler, file system, and CPU governor.

Next Steps

  • Monitoring: monitor performance metrics in production
  • Troubleshooting: diagnose performance issues
