Overview
Chunking converts files into variable-size or fixed-size pieces:
- Deduplication: Identical chunks across files stored once
- Efficient updates: Only modified chunks need re-upload
- Parallelization: Process multiple chunks concurrently
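The deduplication benefit can be illustrated with a small sketch. The `ChunkStore` type and its `put` method are hypothetical names invented for this example, and `DefaultHasher` is only a stand-in for the cryptographic hash a real repository would use for chunk IDs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical in-memory chunk store keyed by a chunk ID.
/// ASSUMPTION: `DefaultHasher` stands in for a cryptographic hash.
#[derive(Default)]
pub struct ChunkStore {
    chunks: HashMap<u64, Vec<u8>>,
}

impl ChunkStore {
    /// Stores `chunk` only if it has not been seen before.
    /// Returns the chunk ID and whether the chunk was newly stored.
    pub fn put(&mut self, chunk: &[u8]) -> (u64, bool) {
        let mut hasher = DefaultHasher::new();
        chunk.hash(&mut hasher);
        let id = hasher.finish();

        let is_new = !self.chunks.contains_key(&id);
        if is_new {
            self.chunks.insert(id, chunk.to_vec());
        }
        (id, is_new)
    }

    /// Number of distinct chunks currently stored.
    pub fn len(&self) -> usize {
        self.chunks.len()
    }
}
```

Calling `put` twice with the same bytes returns the same ID and stores the data once, which is what lets two files share a chunk.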
Chunker Types
ConfigFile:
Rabin Chunker (Content-Defined)
Uses rolling hash fingerprints to find chunk boundaries based on content.
How It Works
- Sliding window: Compute hash over last 64 bytes
- Boundary detection: A boundary is declared when hash & mask == 0
- Min/max enforcement: Respect size constraints
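The three steps above can be sketched as follows. This is a minimal illustration using a plain polynomial rolling hash, not the Rabin fingerprint itself; the function name `cdc_boundaries` and the `BASE` multiplier are assumptions made for the example.

```rust
/// Window length in bytes, matching the 64-byte sliding window described above.
const WINDOW: usize = 64;
/// ASSUMPTION: arbitrary odd multiplier for a simple polynomial rolling hash.
const BASE: u64 = 0x0000_0100_0000_01b3;

/// Returns the end offsets of each chunk in `data`.
/// A cut is made when `(hash & mask) == 0` and the chunk has at least
/// `min_size` bytes, or unconditionally once it reaches `max_size` bytes.
pub fn cdc_boundaries(data: &[u8], mask: u64, min_size: usize, max_size: usize) -> Vec<usize> {
    // BASE^WINDOW, used to remove the byte that leaves the window.
    let mut base_pow = 1u64;
    for _ in 0..WINDOW {
        base_pow = base_pow.wrapping_mul(BASE);
    }

    let mut cuts = Vec::new();
    let mut start = 0usize; // start offset of the current chunk
    let mut hash = 0u64;

    for (i, &byte) in data.iter().enumerate() {
        // Slide the window: add the incoming byte...
        hash = hash.wrapping_mul(BASE).wrapping_add(byte as u64);
        // ...and drop the byte that fell out of the last WINDOW bytes.
        if i >= start + WINDOW {
            let out = data[i - WINDOW] as u64;
            hash = hash.wrapping_sub(out.wrapping_mul(base_pow));
        }

        let len = i + 1 - start;
        let at_boundary = len >= min_size && (hash & mask) == 0;
        if at_boundary || len >= max_size {
            cuts.push(i + 1);
            start = i + 1;
            hash = 0; // restart the window for the next chunk
        }
    }

    // The trailing bytes form the final (possibly short) chunk.
    if start < data.len() {
        cuts.push(data.len());
    }
    cuts
}
```

Because the cut decision depends only on the bytes inside the window, an edit early in a file tends to change only the chunks near it; later boundaries usually realign.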
Chunk Boundary Detection
The split mask determines average chunk size:
- mask = 0xFFFFF (20 bits)
- Probability of a boundary at any given byte = 1 / 2^20 ≈ 1 in 1M
- Average chunk size ≈ 1 MiB
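The relationship between mask and average size can be checked with a short sketch. The helper names `split_mask` and `expected_chunk_size` are hypothetical, and the estimate ignores the effect of minimum and maximum size limits.

```rust
/// Returns the split mask for a power-of-two target average chunk size,
/// or `None` if the size is not a power of two.
pub fn split_mask(avg_chunk_size: u64) -> Option<u64> {
    avg_chunk_size
        .is_power_of_two()
        .then(|| avg_chunk_size - 1)
}

/// Expected average chunk size implied by a mask, ignoring min/max limits.
/// Each set bit halves the boundary probability, so n bits give 2^n bytes.
/// Assumes the mask has fewer than 64 bits set.
pub fn expected_chunk_size(mask: u64) -> u64 {
    1u64 << mask.count_ones()
}
```

For the mask shown above, `expected_chunk_size(0xFFFFF)` evaluates to 1,048,576 bytes, which is 1 MiB.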
Size Constraints
Advantages
- Resilient to edits: Insertions/deletions only affect nearby chunks
- Better deduplication: Same content = same chunks regardless of position
- Variable sizes: Adapts to content structure
Configuration
- chunk_size must be a power of 2
- chunk_min_size ≤ chunk_size
- chunk_max_size ≥ chunk_size
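These constraints could be checked up front along the following lines. The function `validate_chunk_config` and the error type are illustrative assumptions, not part of rustic's API; only the three rules themselves come from the text above.

```rust
/// Hypothetical error type for the three configuration rules.
#[derive(Debug, PartialEq)]
pub enum ChunkConfigError {
    /// chunk_size is not a power of 2.
    NotPowerOfTwo,
    /// chunk_min_size is greater than chunk_size.
    MinTooLarge,
    /// chunk_max_size is smaller than chunk_size.
    MaxTooSmall,
}

/// Validates the chunker size settings against the rules listed above.
pub fn validate_chunk_config(
    chunk_size: usize,
    chunk_min_size: usize,
    chunk_max_size: usize,
) -> Result<(), ChunkConfigError> {
    if !chunk_size.is_power_of_two() {
        return Err(ChunkConfigError::NotPowerOfTwo);
    }
    if chunk_min_size > chunk_size {
        return Err(ChunkConfigError::MinTooLarge);
    }
    if chunk_max_size < chunk_size {
        return Err(ChunkConfigError::MaxTooSmall);
    }
    Ok(())
}
```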
Fixed-Size Chunker
Splits files into equal-size chunks (except the last).
Usage
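As a minimal sketch of fixed-size splitting (the helper `fixed_chunks` is a hypothetical name, not rustic's API), the standard library's `slice::chunks` already yields equal-size pieces with a shorter final piece:

```rust
/// Splits `data` into pieces of `chunk_size` bytes; the last piece may be shorter.
/// Panics if `chunk_size` is zero.
pub fn fixed_chunks(data: &[u8], chunk_size: usize) -> Vec<&[u8]> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    data.chunks(chunk_size).collect()
}
```

Splitting 10 bytes with a chunk size of 4 yields pieces of 4, 4, and 2 bytes.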
Advantages
- Simple: Predictable chunk sizes
- Fast: No hash computation
- Deterministic: Same file = same chunks every time
Disadvantages
- Poor deduplication: Insertions shift all following chunks
- File-level only: Can’t deduplicate parts of files
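The shifting effect can be demonstrated with a small sketch. The helper `shared_fixed_chunks` is a hypothetical name; it counts how many fixed-size chunks of a modified file already exist in the original.

```rust
use std::collections::HashSet;

/// Counts the fixed-size chunks of `b` that also occur among the chunks of `a`.
pub fn shared_fixed_chunks(a: &[u8], b: &[u8], chunk_size: usize) -> usize {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let seen: HashSet<&[u8]> = a.chunks(chunk_size).collect();
    b.chunks(chunk_size).filter(|c| seen.contains(c)).count()
}
```

With 64 distinct bytes and 8-byte chunks, appending one byte leaves all 8 original chunks reusable, while prepending one byte shifts every boundary and leaves none.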
When to Use
- Small repositories (deduplication less important)
- Append-only files (no insertions)
- Maximum performance (no hashing overhead)
Chunking in Practice
During Backup
Chunk Storage
Chunks become data blobs:
Chunk Statistics
Distribution
Rabin chunking produces a distribution of chunk sizes:
Performance Impact
Smaller chunks:
- ✓ Better deduplication
- ✗ More blobs to manage
- ✗ Higher index size
- ✗ More overhead
Larger chunks:
- ✓ Fewer blobs
- ✓ Lower overhead
- ✗ Less deduplication
- ✗ Larger transfers for small changes
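The blob-count side of this trade-off is simple arithmetic. The sketch below uses a hypothetical helper, `estimated_blob_count`, and ignores deduplication, so it gives an upper-bound estimate only.

```rust
/// Estimates how many chunks (blobs) a data set produces for a given
/// average chunk size, rounding up. Ignores deduplication.
pub fn estimated_blob_count(total_bytes: u64, avg_chunk_size: u64) -> u64 {
    assert!(avg_chunk_size > 0, "avg_chunk_size must be non-zero");
    (total_bytes + avg_chunk_size - 1) / avg_chunk_size
}
```

For 100 GiB of data, a 1 MiB average gives about 102,400 blobs, while a 64 KiB average gives about 1,638,400, sixteen times as many entries to index.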
Chunker Polynomial
The polynomial affects chunk boundaries:
Compatibility
Chunking is per-repository:
- Set once during repo init
- Cannot be changed later
- Mixing chunkers breaks deduplication
rustic_cdc Integration
Rustic uses the rustic_cdc crate:
Performance Tips
- Buffer size: Use 4 KiB reads (default)
- Size hints: Provide accurate file size for memory allocation
- Parallel chunking: Process multiple files concurrently
- Reuse chunkers: Avoid recreating Rabin state
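The parallel chunking tip might look like the sketch below, which uses scoped threads from the standard library. The function `chunk_counts_parallel` is a hypothetical name, files are modelled as in-memory byte vectors, and fixed-size splitting stands in for whichever chunker is configured.

```rust
use std::thread;

/// Chunks each file on its own thread and returns the chunk count per file,
/// in the same order as the input.
/// ASSUMPTION: one thread per file is fine for a sketch; a real
/// implementation would more likely use a bounded worker pool.
pub fn chunk_counts_parallel(files: &[Vec<u8>], chunk_size: usize) -> Vec<usize> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    thread::scope(|s| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = files
            .iter()
            .map(|file| s.spawn(move || file.chunks(chunk_size).count()))
            .collect();
        // Join in input order to keep results aligned with `files`.
        handles
            .into_iter()
            .map(|h| h.join().expect("chunking thread panicked"))
            .collect()
    })
}
```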
See Also
- Blob Types - Chunk storage as blobs
- Repository Files - ConfigFile chunker settings
- Crypto - Hashing chunks for IDs