Why Deduplication Matters
Deduplication can reduce storage by 50-90% for typical backup scenarios by storing each unique piece of data only once.
Without Deduplication
With Deduplication
Content-Defined Chunking
Instead of splitting files at fixed offsets, CDC splits based on file content.
Fixed-Size Chunking
❌ Split every 1 MB regardless of content. Problem: inserting data shifts every later chunk, so all chunks change!
Content-Defined Chunking
✅ Split based on data patterns. Benefit: inserts only affect nearby chunks, so only one chunk is new!
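The shifting problem is easy to demonstrate. A toy sketch in plain Python (8-byte chunks instead of 1 MB; the helper name `fixed_chunks` is illustrative, not part of any real chunker API):

```python
def fixed_chunks(data: bytes, size: int = 8) -> list:
    """Split data at fixed offsets, ignoring content (toy chunk size)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

original = b"abcdefghijklmnopqrstuvwxyz" * 4
modified = b"X" + original          # insert a single byte at the front

before = fixed_chunks(original)
after = fixed_chunks(modified)

# Every boundary shifts by one byte, so no chunk survives unchanged:
shared = set(before) & set(after)   # empty set
```

With content-defined boundaries the cut points move with the data instead, so only the chunk containing the insert would change.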
How Rabin Chunking Works
Rabin fingerprinting uses a rolling hash to find natural split points.
Find Cut Points
When the rolling hash matches a pattern (the split mask), a chunk boundary is created. With the default settings, this produces chunks with an average size of ~1 MB.
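The mechanism can be sketched with a simple Rabin-Karp style rolling hash. This is not rustic's actual Rabin-polynomial implementation, and the window and mask sizes are toy values (average chunk ~64 bytes instead of ~1 MB), but the cut-point logic is the same: a boundary wherever the low bits of the hash are zero.

```python
import random

B = 257               # hash base
W = 16                # rolling-hash window in bytes
MOD = 1 << 32
BW = pow(B, W, MOD)   # B**W, used to drop the byte leaving the window
MASK = (1 << 6) - 1   # 6 mask bits -> a cut on average every 2**6 = 64 bytes

def cut_points(data: bytes) -> list:
    """Offsets where the low bits of the rolling hash are all zero.
    Real chunkers additionally enforce minimum and maximum chunk sizes."""
    cuts, h = [], 0
    for i, byte in enumerate(data):
        h = (h * B + byte) % MOD              # slide the new byte in
        if i >= W:
            h = (h - data[i - W] * BW) % MOD  # slide the old byte out
        if (h & MASK) == 0:                   # hash matches the split mask
            cuts.append(i + 1)
    return cuts

def chunks(data: bytes) -> list:
    bounds = [0] + cut_points(data) + [len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]

random.seed(0)
doc = bytes(random.getrandbits(8) for _ in range(4096))
edited = doc[:2000] + b"INSERTED" + doc[2000:]

# Cut points depend only on the last W bytes, so chunks away from the
# edit are byte-identical in both versions and deduplicate:
shared = set(chunks(doc)) & set(chunks(edited))
```

Because each hash value depends only on the current window, every cut point more than a window past the insertion reappears (merely shifted), so almost all chunks are shared between the two versions.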
Rabin Polynomial
The chunker uses an irreducible polynomial for Rabin fingerprinting.
Chunk Size Configuration
Chunk sizes affect deduplication efficiency and performance:

| Parameter | Default | Description |
|---|---|---|
| `chunk_size` | 1 MB | Average chunk size (must be a power of 2) |
| `chunk_min_size` | 512 KB | Minimum chunk size |
| `chunk_max_size` | 8 MB | Maximum chunk size |
Choosing Chunk Sizes
- Smaller Chunks (512 KB avg)
- Default Chunks (1 MB avg)
- Larger Chunks (2-4 MB avg)
Pros:
- Better deduplication (finer granularity)
- More efficient for small changes

Cons:
- More chunks = larger index
- Higher memory usage
- More overhead
Example Configuration
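As an illustration only: the fragment below uses the parameter names from the table above, but the exact file format and option names accepted by a given rustic version may differ, so treat this as a hypothetical sketch rather than a copy-paste configuration.

```toml
# hypothetical chunker settings, named after the table above
chunk_size = "1MB"        # average chunk size (power of 2)
chunk_min_size = "512kB"  # never cut a chunk smaller than this
chunk_max_size = "8MB"    # force a cut at this size even without a hash match
```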
Deduplication Process
1. Chunking
Large files are split into chunks.
2. Content Addressing
Each chunk gets a unique ID from its SHA-256 hash. Identical content always produces the same ID, regardless of:
- File name or path
- Modification time
- Location in repository
- Which backup it came from
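Content addressing is easy to see with Python's standard `hashlib` (the variable names are illustrative; real chunk IDs come from the chunk bytes exactly as shown):

```python
import hashlib

# The same bytes, saved under two different names at two different times:
chunk_from_home = b"example chunk contents"
chunk_from_etc = bytes(b"example chunk contents")

id_a = hashlib.sha256(chunk_from_home).hexdigest()
id_b = hashlib.sha256(chunk_from_etc).hexdigest()
# id_a == id_b: the ID depends only on content, so the chunk is stored once.

# Any change to the content produces a completely different ID:
id_c = hashlib.sha256(b"Example chunk contents").hexdigest()
```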
3. Deduplication Check
Before storing, check whether the chunk already exists.
4. Packing
New chunks are packed together for efficient storage.
Deduplication Statistics
The backup summary shows deduplication effectiveness.
Calculating Deduplication Ratio
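One common way to express it: the fraction of raw data that deduplication avoided storing. The function name and the example numbers below are hypothetical, chosen only to illustrate the arithmetic:

```python
def dedup_ratio(raw_bytes: int, stored_bytes: int) -> float:
    """Fraction of the raw data that deduplication avoided storing."""
    return 1.0 - stored_bytes / raw_bytes

# Hypothetical numbers: 512 GiB of raw backup data, 64 GiB of unique chunks:
ratio = dedup_ratio(512, 64)   # 0.875, i.e. 87.5% of the data deduplicated
```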
Global Deduplication
rustic_core deduplicates across all snapshots:
Within Files
Identical chunks within a single file are deduplicated.
Example: sparse files, repeated patterns
Across Files
Identical chunks in different files are deduplicated.
Example: copies of files, similar documents
Across Snapshots
Chunks from different backups are deduplicated.
Example: unchanged files in incremental backups
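A toy model of cross-snapshot deduplication: all snapshots share one repository index, so a second backup only stores chunks whose IDs are not yet known. The function and data names are illustrative, not rustic_core's API:

```python
import hashlib

repo_index = set()   # chunk IDs in the repository, shared by all snapshots

def backup(chunks):
    """Store one snapshot; return (new chunks, deduplicated chunks)."""
    new = 0
    for chunk in chunks:
        cid = hashlib.sha256(chunk).digest()
        if cid not in repo_index:   # only unseen content is stored
            repo_index.add(cid)
            new += 1
    return new, len(chunks) - new

# Two nightly backups of the same machine; only one file changed:
monday = backup([b"/bin data", b"/lib data", b"config v1"])    # (3, 0)
tuesday = backup([b"/bin data", b"/lib data", b"config v2"])   # (1, 2)
```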
Deduplication Example
Backing up 3 similar Linux machines: most OS and application files are identical across machines, so shared chunks are stored only once.
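A back-of-the-envelope calculation makes the effect concrete. All numbers here are hypothetical (40 GB per machine, 90% of data shared), chosen only to show the arithmetic:

```python
machines = 3
data_per_machine_gb = 40    # hypothetical size per machine
shared_fraction = 0.9       # hypothetical: 90% of files identical across machines

without_dedup = machines * data_per_machine_gb   # 120 GB stored

# The first machine stores everything; each additional machine only
# adds its unique (non-shared) data:
unique_per_extra_machine = data_per_machine_gb * (1 - shared_fraction)
with_dedup = data_per_machine_gb + (machines - 1) * unique_per_extra_machine
# ~48 GB stored instead of 120 GB
```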
Trade-offs
Storage vs Memory
Better deduplication requires larger indexes. Index size grows with:
- Number of unique chunks
- Smaller chunk sizes (more chunks)
- Repository age (accumulated data)
Chunk Size vs Dedup Ratio
Smaller chunks = better deduplication but higher overhead:
| Chunk Size | Dedup Ratio | Index Size | Performance |
|---|---|---|---|
| 256 KB | 95% | Large | Slower |
| 512 KB | 93% | Medium | Good |
| 1 MB | 90% | Small | Fast |
| 2 MB | 85% | Smaller | Faster |
| 4 MB | 80% | Smallest | Fastest |
Exact numbers depend on data characteristics. These are representative values.
CPU vs Storage
CDC requires computing rolling hashes.

Rabin chunking:
- CPU cost: moderate (polynomial math)
- Benefit: excellent deduplication
- Hardware acceleration: available on modern CPUs (carry-less multiply instructions)

Fixed-size chunking:
- CPU cost: minimal (just counting bytes)
- Benefit: lower overhead
- Trade-off: poor deduplication when files change
Advanced: Rabin Polynomial Math
The Rabin chunker uses polynomial arithmetic in GF(2).
Generating Random Polynomials
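For illustration, the underlying math can be sketched from scratch; this is not rustic's actual implementation. A polynomial over GF(2) is encoded as a Python int (bit k = coefficient of x^k). For a polynomial f of prime degree d (e.g. 53, as used by restic-compatible chunkers) with f(0) ≠ 0 and f(1) ≠ 0, f is irreducible exactly when x^(2^d) ≡ x (mod f):

```python
import random

def mulmod(a: int, b: int, f: int) -> int:
    """Carry-less multiply a*b in GF(2)[x], reduced mod f (a, b < 2**deg(f))."""
    d = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a          # GF(2) addition is XOR
        b >>= 1
        a <<= 1
        if (a >> d) & 1:    # keep deg(a) < d by XOR-ing (subtracting) f
            a ^= f
    return r

def is_irreducible(f: int) -> bool:
    """Irreducibility test for f of *prime* degree d over GF(2)."""
    d = f.bit_length() - 1
    if f & 1 == 0:                   # x divides f (constant term is 0)
        return False
    if bin(f).count("1") % 2 == 0:   # (x + 1) divides f, since f(1) = 0
        return False
    t = 2                            # the polynomial "x"
    for _ in range(d):               # square d times: x -> x^(2^d) mod f
        t = mulmod(t, t, f)
    return t == 2                    # x^(2^d) == x (mod f)

def random_irreducible(degree: int = 53) -> int:
    """Random search for an irreducible polynomial of the given prime degree."""
    while True:
        f = (1 << degree) | random.getrandbits(degree) | 1  # monic, f(0) != 0
        if is_irreducible(f):
            return f
```

For example, `is_irreducible(0b1011)` is true (x^3 + x + 1 is irreducible), while `is_irreducible(0b1111)` is false, since x^3 + x^2 + x + 1 = (x + 1)^3 over GF(2).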
rustic can generate irreducible polynomials for new repositories.
Monitoring Deduplication
Track deduplication efficiency over time.
See Also
- Repository: how deduplicated data is organized
- Encryption: how encryption preserves deduplication
- Backends: where deduplicated packs are stored
- Snapshots: how snapshots reference deduplicated chunks