The chunker module splits large files into smaller chunks (blobs) for deduplication and efficient storage. Rustic supports two chunking algorithms, and the choice is configured per repository.
Overview
Chunking converts files into variable-size or fixed-size pieces:
File (10 MB) → [Chunk 1][Chunk 2][Chunk 3] → Blobs
                1.2 MB   850 KB   2.1 MB
Benefits:
- Deduplication: Identical chunks across files stored once
- Efficient updates: Only modified chunks need re-upload
- Parallelization: Process multiple chunks concurrently
Chunker Types
use rustic_core::repofile::configfile::Chunker;
pub enum Chunker {
Rabin, // Content-defined (default)
FixedSize, // Fixed-size chunks
}
Configured in ConfigFile:
let config = ConfigFile {
chunker: Some(Chunker::Rabin),
chunk_size: Some(1024 * 1024), // 1 MiB average
chunk_min_size: Some(512 * 1024), // 512 KiB min
chunk_max_size: Some(8 * 1024 * 1024), // 8 MiB max
..Default::default()
};
Rabin Chunker (Content-Defined)
Uses rolling hash fingerprints to find chunk boundaries based on content.
How It Works
- Sliding window: Compute hash over last 64 bytes
- Boundary detection: A chunk ends when hash & mask == 0
- Min/max enforcement: Respect size constraints
use rustic_core::chunker::rabin::ChunkIter;
use rustic_cdc::{Rabin64, RollingHash64};
// Create Rabin chunker
let poly = 0x3DA3358B4DC173;
let rabin = Rabin64::new_with_polynom(poly);
let chunker = ChunkIter::new(
rabin,
1024 * 1024, // chunk_size (must be power of 2)
512 * 1024, // chunk_min_size
8 * 1024 * 1024, // chunk_max_size
file_reader,
file_size_hint,
)?;
// Iterate over chunks
for chunk in chunker {
let data = chunk?;
process_chunk(data);
}
Chunk Boundary Detection
The split mask determines average chunk size:
// chunk_size must be power of 2
let split_mask = chunk_size - 1;
// Boundary when low bits are zero
if (rabin.hash & split_mask) == 0 {
// End current chunk
}
Example (chunk_size = 1 MiB = 2^20):
- mask = 0xFFFFF (20 bits)
- Probability of a boundary at any given byte = 1 / 2^20 (about one in a million)
- Average chunk size ≈ 1 MiB
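The mask arithmetic above can be checked with a few lines of standard-library Rust. This is a minimal sketch; the helper names split_mask, mask_bits and is_boundary are made up for illustration and are not part of rustic_core or rustic_cdc.

```rust
/// Returns the split mask for a power-of-two average chunk size,
/// or `None` if `chunk_size` is not a power of two (including zero).
fn split_mask(chunk_size: u64) -> Option<u64> {
    if chunk_size.is_power_of_two() {
        Some(chunk_size - 1)
    } else {
        None
    }
}

/// Number of low hash bits that must all be zero at a chunk boundary.
fn mask_bits(chunk_size: u64) -> Option<u32> {
    split_mask(chunk_size).map(u64::count_ones)
}

/// True when every hash bit selected by `mask` is zero.
fn is_boundary(hash: u64, mask: u64) -> bool {
    hash & mask == 0
}
```

For chunk_size = 1 MiB this gives a mask of 0xFFFFF with 20 bits set, so a uniformly distributed hash hits a boundary at one position in 2^20 on average.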
Size Constraints
// Minimum size: always read this much first
reader.read_exact(&mut chunk[..chunk_min_size])?;
// Then look for boundary
loop {
if chunk.len() >= chunk_max_size {
break; // Force chunk end
}
if (rabin.hash & split_mask) == 0 {
break; // Natural boundary
}
// Keep reading...
}
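The loop above can be tried end to end without rustic_cdc. The sketch below is a stand-in, not rustic's implementation: it swaps the Rabin fingerprint for a simple gear-style rolling hash (shift left, then add a per-byte pseudo-random value), and the names gear and cdc_chunks are hypothetical.

```rust
/// Maps a byte to a pseudo-random 64-bit value (splitmix64 finalizer).
/// Stand-in for a lookup table; NOT the Rabin polynomial arithmetic.
fn gear(byte: u8) -> u64 {
    let mut z = u64::from(byte).wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Splits `data` into content-defined chunks.
/// `avg` must be a power of two, with 0 < min <= avg <= max.
fn cdc_chunks(data: &[u8], avg: usize, min: usize, max: usize) -> Vec<&[u8]> {
    assert!(avg.is_power_of_two() && min > 0 && min <= avg && avg <= max);
    let mask = (avg - 1) as u64;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let mut hash = 0u64;
        let mut end = start;
        while end < data.len() {
            // Old bytes shift out of the 64-bit state, so the hash "rolls".
            hash = (hash << 1).wrapping_add(gear(data[end]));
            end += 1;
            let len = end - start;
            if len >= max {
                break; // forced chunk end
            }
            if len >= min && hash & mask == 0 {
                break; // natural boundary
            }
        }
        chunks.push(&data[start..end]);
        start = end;
    }
    chunks
}
```

Every chunk is at most max bytes long, and every chunk except possibly the last is at least min bytes long. Concatenating the chunks reproduces the input.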
Advantages
- Resilient to edits: Insertions/deletions only affect nearby chunks
- Better deduplication: Same content = same chunks regardless of position
- Variable sizes: Adapts to content structure
Configuration
use rustic_core::repofile::configfile::{Chunker, ConfigFile};
let config = ConfigFile {
chunker: Some(Chunker::Rabin),
chunker_polynomial: "3da3358b4dc173".to_string(),
chunk_size: Some(1024 * 1024), // 1 MiB average
chunk_min_size: Some(512 * 1024), // 512 KiB min
chunk_max_size: Some(8 * 1024 * 1024), // 8 MiB max
..Default::default()
};
Requirements:
- chunk_size must be a power of 2
- chunk_min_size ≤ chunk_size
- chunk_max_size ≥ chunk_size
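These three requirements are easy to check before writing a configuration. The following is a sketch only; validate_chunk_sizes and ChunkSizeError are hypothetical names, and rustic_core may validate these values differently.

```rust
/// Reasons a set of chunk sizes violates the requirements listed above.
#[derive(Debug, PartialEq)]
enum ChunkSizeError {
    NotPowerOfTwo(usize),
    MinTooLarge,
    MaxTooSmall,
}

/// Checks: chunk_size is a power of 2, min <= chunk_size, max >= chunk_size.
fn validate_chunk_sizes(
    chunk_size: usize,
    chunk_min_size: usize,
    chunk_max_size: usize,
) -> Result<(), ChunkSizeError> {
    if !chunk_size.is_power_of_two() {
        return Err(ChunkSizeError::NotPowerOfTwo(chunk_size));
    }
    if chunk_min_size > chunk_size {
        return Err(ChunkSizeError::MinTooLarge);
    }
    if chunk_max_size < chunk_size {
        return Err(ChunkSizeError::MaxTooSmall);
    }
    Ok(())
}
```

With the values from the configuration above (1 MiB, 512 KiB, 8 MiB) the check passes. A chunk_size of 1,000,000 is rejected because it is not a power of 2.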
Fixed-Size Chunker
Splits files into equal-size chunks (the last chunk may be shorter).
Usage
use rustic_core::chunker::fixed_size::ChunkIter;
let chunk_size = 1024 * 1024; // 1 MiB
let chunker = ChunkIter::new(
chunk_size,
file_reader,
file_size_hint,
);
// Iterate over chunks
for chunk in chunker {
let data = chunk?;
assert!(data.len() <= chunk_size);
}
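To show what such an iterator does internally, here is a minimal standard-library sketch. FixedChunks is a hypothetical type written for this page, not the rustic_core ChunkIter, and it ignores the size hint.

```rust
use std::io::{self, Read};

/// Yields consecutive chunks of `chunk_size` bytes from any reader.
struct FixedChunks<R> {
    reader: R,
    chunk_size: usize,
    done: bool,
}

impl<R: Read> FixedChunks<R> {
    fn new(reader: R, chunk_size: usize) -> Self {
        assert!(chunk_size > 0, "chunk_size must be non-zero");
        Self { reader, chunk_size, done: false }
    }
}

impl<R: Read> Iterator for FixedChunks<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.done {
            return None;
        }
        let mut buf = Vec::with_capacity(self.chunk_size);
        // `take` caps the read at one chunk; `read_to_end` keeps reading
        // through short reads until the cap or end of input is reached.
        let limit = self.chunk_size as u64;
        let read = self.reader.by_ref().take(limit).read_to_end(&mut buf);
        match read {
            Ok(0) => {
                self.done = true;
                None
            }
            Ok(n) => {
                // A short chunk can only happen at end of input.
                if n < self.chunk_size {
                    self.done = true;
                }
                Some(Ok(buf))
            }
            Err(e) => {
                self.done = true;
                Some(Err(e))
            }
        }
    }
}
```

Reading 10 bytes with a chunk size of 4 yields chunks of 4, 4 and 2 bytes.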
Advantages
- Simple: Predictable chunk sizes
- Fast: No hash computation
- Deterministic: Same file = same chunks every time
Disadvantages
- Poor deduplication: Insertions shift all following chunks
- Alignment-dependent: Identical data deduplicates only when it sits at the same chunk-aligned offset
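The effect of an insertion on fixed-size chunks can be measured directly. The sketch below uses only the standard library; shared_fixed_chunks is a hypothetical helper that counts how many chunks of a new file version already exist in the old one.

```rust
use std::collections::HashSet;

/// Counts the fixed-size chunks of `new` that also occur in `old`.
fn shared_fixed_chunks(old: &[u8], new: &[u8], chunk_size: usize) -> usize {
    // Chunks are compared by content, as a deduplicating store would.
    let known: HashSet<&[u8]> = old.chunks(chunk_size).collect();
    new.chunks(chunk_size).filter(|c| known.contains(c)).count()
}
```

For a 64-byte file of distinct bytes and 8-byte chunks, prepending a single byte leaves no chunk in common with the old version, while appending a byte keeps all 8 original chunks.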
When to Use
- Small repositories (deduplication less important)
- Append-only files (no insertions)
- Maximum performance (no hashing overhead)
Chunking in Practice
During Backup
// Read file and chunk it
let file = File::open(path)?;
let file_size = file.metadata()?.len();
// Create chunker based on config
let chunker = match config.chunker() {
Chunker::Rabin => {
let rabin = Rabin64::new_with_polynom(config.poly()?);
ChunkIter::new_rabin(
rabin,
config.chunk_size(),
config.chunk_min_size(),
config.chunk_max_size(),
file,
file_size as usize,
)?
}
Chunker::FixedSize => {
ChunkIter::new_fixed(
config.chunk_size(),
file,
file_size as usize,
)
}
};
// Process chunks
let mut blob_ids = Vec::new();
for chunk in chunker {
let data = chunk?;
let blob_id = hash(&data).into();
// Check if blob exists
if !index.has_blob(&blob_id) {
packer.add(&data)?;
}
blob_ids.push(blob_id);
}
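The check-then-store loop above can be modelled with a hash map. This is a simplified sketch under stated assumptions: BlobStore and blob_id are hypothetical, the store lives in memory, and the standard library's DefaultHasher stands in for the cryptographic hash a real repository would use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in blob ID: a 64-bit hash instead of a cryptographic digest.
type BlobId = u64;

fn blob_id(data: &[u8]) -> BlobId {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

/// In-memory stand-in for the index plus packer.
#[derive(Default)]
struct BlobStore {
    blobs: HashMap<BlobId, Vec<u8>>,
}

impl BlobStore {
    /// Stores each distinct chunk once and returns the ordered blob IDs
    /// that describe the file's content.
    fn backup_file<'a>(
        &mut self,
        chunks: impl IntoIterator<Item = &'a [u8]>,
    ) -> Vec<BlobId> {
        chunks
            .into_iter()
            .map(|chunk| {
                let id = blob_id(chunk);
                // Only copy the data if this blob is not stored yet.
                self.blobs.entry(id).or_insert_with(|| chunk.to_vec());
                id
            })
            .collect()
    }
}
```

A file made of the chunks "aaaa", "bbbb", "aaaa" produces three blob IDs but only two stored blobs. A second file that reuses "bbbb" adds only its new chunks.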
Chunk Storage
Chunks become data blobs:
// Each chunk = one blob
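// next() yields an Option of a Result: unwrap the Option, then propagate the error
let chunk_data = chunker.next().expect("chunker yielded no chunk")?;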
let blob_id = hash(&chunk_data).into();
// Store in pack file
let encrypted = key.encrypt_data(&chunk_data)?;
packer.add_blob(blob_id, encrypted)?;
// Record in file node
node.content = Some(blob_ids);
Chunk Statistics
Distribution
Rabin chunking produces variable chunk sizes that fall between the configured minimum and maximum:
Min: 512 KiB ━━━━━━━━━━━━━━━━━━━━━━
Avg: 1 MiB ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Max: 8 MiB ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
└──────────────────────────────────┘
Chunk size range
Smaller chunks:
- ✓ Better deduplication
- ✗ More blobs to manage
- ✗ Higher index size
- ✗ More overhead
Larger chunks:
- ✓ Fewer blobs
- ✓ Lower overhead
- ✗ Less deduplication
- ✗ Larger transfers for small changes
The default (1 MiB average) balances these trade-offs.
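The blob-count side of this trade-off is plain arithmetic. The sketch below is a rough estimate only: both function names are hypothetical, and bytes_per_entry is an assumed parameter, not a figure taken from rustic's index format.

```rust
/// Approximate number of blobs needed for `total_bytes` of unique data.
fn estimated_blob_count(total_bytes: u64, avg_chunk_size: u64) -> u64 {
    assert!(avg_chunk_size > 0);
    // Round up so a trailing partial chunk still counts as one blob.
    (total_bytes + avg_chunk_size - 1) / avg_chunk_size
}

/// Approximate index size, given an assumed per-blob entry size.
fn estimated_index_bytes(
    total_bytes: u64,
    avg_chunk_size: u64,
    bytes_per_entry: u64, // assumption chosen by the caller
) -> u64 {
    estimated_blob_count(total_bytes, avg_chunk_size) * bytes_per_entry
}
```

For 100 GiB of unique data, a 1 MiB average gives about 102,400 blobs, while a 256 KiB average gives about 409,600. A fourfold smaller chunk size means four times as many index entries.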
Chunker Polynomial
The polynomial affects chunk boundaries:
// Random polynomial generation
use rustic_cdc::{Polynom, Polynom64};
let poly = Polynom64::generate_irreducible_random(42)?;
println!("Polynomial: 0x{:x}", poly);
Recommendation: Use the default polynomial unless you have specific needs. Different polynomials produce different chunk boundaries, breaking deduplication across repositories.
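The configuration stores the polynomial as a hexadecimal string, while the chunker examples use an integer. A minimal sketch of converting between the two with the standard library follows; parse_polynomial and format_polynomial are hypothetical helpers, not rustic_core functions.

```rust
use std::num::ParseIntError;

/// Parses a hex polynomial string, with or without a leading "0x".
fn parse_polynomial(hex: &str) -> Result<u64, ParseIntError> {
    let digits = hex.strip_prefix("0x").unwrap_or(hex);
    u64::from_str_radix(digits, 16)
}

/// Formats a polynomial as lowercase hex without a prefix.
fn format_polynomial(poly: u64) -> String {
    format!("{:x}", poly)
}
```

The string "3da3358b4dc173" parses to 0x3DA3358B4DC173, the value used in the Rabin example on this page, and formatting that value gives the same string back.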
Compatibility
Chunking is per-repository:
- Set once during repo init
- Cannot be changed later
- Mixing chunkers breaks deduplication
// Check compatibility before copy/merge
if !source_config.has_same_chunker(&dest_config) {
return Err("Incompatible chunking configuration");
}
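What "same chunker" means can be spelled out as a field-by-field comparison. The types below are hypothetical stand-ins for the relevant configuration fields, and the rule that a fixed-size chunker ignores the polynomial and the min/max sizes is an assumption of this sketch, not a statement about rustic_core's has_same_chunker.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChunkerKind {
    Rabin,
    FixedSize,
}

/// Hypothetical subset of a repository config that affects chunk boundaries.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ChunkerParams {
    kind: ChunkerKind,
    polynomial: u64,
    chunk_size: usize,
    chunk_min_size: usize,
    chunk_max_size: usize,
}

impl ChunkerParams {
    /// True when both configs would cut identical data at identical offsets.
    fn has_same_chunker(&self, other: &Self) -> bool {
        if self.kind != other.kind || self.chunk_size != other.chunk_size {
            return false;
        }
        match self.kind {
            // Assumption: fixed-size boundaries depend on chunk_size only.
            ChunkerKind::FixedSize => true,
            ChunkerKind::Rabin => {
                self.polynomial == other.polynomial
                    && self.chunk_min_size == other.chunk_min_size
                    && self.chunk_max_size == other.chunk_max_size
            }
        }
    }
}
```

Two Rabin configurations that differ only in the polynomial compare as incompatible, which matches the recommendation to keep the default polynomial.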
rustic_cdc Integration
Rustic uses the rustic_cdc crate:
use rustic_cdc::{Rabin64, RollingHash64};
let mut rabin = Rabin64::new_with_polynom(poly);
// Reset and prefill window
rabin.reset_and_prefill_window(initial_data.iter().copied());
// Slide over bytes
for byte in data {
rabin.slide(byte);
if (rabin.hash & split_mask) == 0 {
// Chunk boundary
}
}
Performance Tips
- Buffer size: Use 4 KiB reads (default)
- Size hints: Provide accurate file size for memory allocation
- Parallel chunking: Process multiple files concurrently
- Reuse chunkers: Avoid recreating Rabin state
// Efficient: reuse chunker state
let mut rabin = Rabin64::new_with_polynom(poly);
for file in files {
let chunker = ChunkIter::new(rabin.clone(), ...);
process_file(chunker)?;
}