rustic_core uses content-defined chunking (CDC) with Rabin fingerprinting to efficiently deduplicate data across and within backups.

Why Deduplication Matters

Deduplication can reduce storage by 50-90% for typical backup scenarios by storing each unique piece of data only once.

Without Deduplication

Backup 1: [File A] [File B] [File C]          = 3 GB
Backup 2: [File A] [File B*] [File C] [File D] = 3.5 GB
Total: 6.5 GB stored

With Deduplication

Backup 1: [File A] [File B] [File C]          = 3 GB  
Backup 2: [File B* changed] [File D]          = 0.5 GB (reuses A and C)
Total: 3.5 GB stored (46% reduction)

Content-Defined Chunking

Instead of splitting files at fixed offsets, CDC splits based on file content:

Fixed-Size Chunking

❌ Split every 1 MB regardless of content.
Problem: inserting data shifts all subsequent chunks:
Before: [AAAA][BBBB][CCCC]
After:  [xAAA][ABBB][BCCC][C...]
All chunks changed!

Content-Defined Chunking

✅ Split based on data patterns.
Benefit: inserts only affect nearby chunks:
Before: [AAAA][BBBB][CCCC]
After:  [x][AAAA][BBBB][CCCC]
Only one new chunk!
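The shift-resistance property can be demonstrated with a toy comparison. The "content-defined" rule here (cut after every space byte) is a made-up stand-in for the real rolling hash, and the helper names are illustrative, but the effect of a one-byte insert is the same:

```rust
use std::collections::HashSet;

/// Fixed-size chunking: split every `n` bytes, ignoring content.
fn fixed_chunks(data: &[u8], n: usize) -> Vec<Vec<u8>> {
    data.chunks(n).map(|c| c.to_vec()).collect()
}

/// Toy content-defined chunking: cut after every space byte.
/// (Real CDC cuts where a rolling hash matches a mask instead.)
fn cdc_chunks(data: &[u8]) -> Vec<Vec<u8>> {
    let mut chunks = Vec::new();
    let mut current = Vec::new();
    for &byte in data {
        current.push(byte);
        if byte == b' ' {
            chunks.push(std::mem::take(&mut current)); // cut point
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

/// Count how many chunks of `b` already exist in `a` (deduplicated).
fn shared(a: &[Vec<u8>], b: &[Vec<u8>]) -> usize {
    let known: HashSet<_> = a.iter().collect();
    b.iter().filter(|c| known.contains(c)).count()
}

fn main() {
    let before = b"aaaa bbbb cccc dddd ";
    let after = b"xaaaa bbbb cccc dddd "; // one byte inserted at the front

    // Fixed-size: every boundary shifts, so no chunk is reused.
    assert_eq!(shared(&fixed_chunks(before, 5), &fixed_chunks(after, 5)), 0);
    // Content-defined: only the first chunk changes; three are reused.
    assert_eq!(shared(&cdc_chunks(before), &cdc_chunks(after)), 3);
}
```

With fixed-size chunking the single inserted byte invalidates all four chunks; with content-defined chunking three of the four survive and deduplicate.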

How Rabin Chunking Works

Rabin fingerprinting uses a rolling hash to find natural split points:

1. Rolling Hash

Compute a polynomial hash over a sliding window (64 bytes):
use rustic_cdc::Rabin64;

let poly = 0x003D_A335_8B4D_C173;  // Irreducible polynomial
let rabin = Rabin64::new_with_polynom(6, &poly);  // 2^6 = 64-byte window

2. Find Cut Points

When the hash matches a pattern (split mask), create a chunk:
let split_mask = chunk_size - 1;  // e.g., 0xFFFFF for 1MB

for byte in data {
    rabin.slide(byte);
    if (rabin.hash & split_mask) == 0 {
        // Split here!
        break;
    }
}
The pattern match creates chunks of average size ~1MB.

3. Size Boundaries

Enforce minimum and maximum chunk sizes:
  • Min size (512 KB): Prevent tiny chunks
  • Max size (8 MB): Force split if no pattern found
if size < min_size {
    continue;  // Keep reading
}
if size >= max_size {
    break;     // Force split
}
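The three steps above can be combined into one loop. This is a from-scratch sketch, not rustic_core's implementation: a trivial byte-fed hash stands in for the Rabin rolling hash, and the sizes are scaled down so it runs on tiny inputs (the real defaults are 512 KB / 1 MB / 8 MB):

```rust
/// Sketch of CDC with min/average/max bounds. Returns the cut offsets.
fn chunk_boundaries(data: &[u8], min: usize, avg: usize, max: usize) -> Vec<usize> {
    assert!(avg.is_power_of_two());
    let mask = (avg as u64) - 1; // e.g. 0xFFFFF for a 1 MB average
    let mut cuts = Vec::new();
    let (mut hash, mut len) = (0u64, 0usize);
    for (i, &byte) in data.iter().enumerate() {
        len += 1;
        hash = hash.wrapping_mul(31).wrapping_add(byte as u64); // toy rolling hash
        if len < min {
            continue; // below minimum size: keep reading
        }
        if (hash & mask) == 0 || len >= max {
            cuts.push(i + 1); // pattern matched, or maximum size forced a split
            hash = 0;
            len = 0;
        }
    }
    if len > 0 {
        cuts.push(data.len()); // final partial chunk
    }
    cuts
}

fn main() {
    let data: Vec<u8> = (0..200).map(|i| (i * 7 % 251) as u8).collect();
    let cuts = chunk_boundaries(&data, 8, 16, 32);
    // Every chunk is at most `max` bytes; all but the last are at least `min`.
    assert!(cuts.windows(2).all(|w| w[1] - w[0] <= 32));
    println!("{} chunks", cuts.len());
}
```

The min/max bounds guarantee every chunk (except possibly the last) lands in the configured size window, regardless of what the hash does.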

Rabin Polynomial

The chunker uses an irreducible polynomial for Rabin fingerprinting:
pub struct ConfigFile {
    pub chunker_polynomial: String,  // "3da3358b4dc173" (hex)
    pub chunk_size: Option<usize>,   // 1048576 (1 MB average)
    pub chunk_min_size: Option<usize>,  // 524288 (512 KB)
    pub chunk_max_size: Option<usize>,  // 8388608 (8 MB)
}
The polynomial is stored in the repository config. All backups must use the same polynomial for deduplication to work.

Chunk Size Configuration

Chunk sizes affect deduplication efficiency and performance:
Parameter        Default   Description
chunk_size       1 MB      Average chunk size (must be a power of 2)
chunk_min_size   512 KB    Minimum chunk size
chunk_max_size   8 MB      Maximum chunk size
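The "power of 2" requirement exists because the split mask is derived as `chunk_size - 1`, which only forms the needed all-ones bit pattern for powers of two. A hypothetical helper (not part of rustic_core's API) makes this concrete:

```rust
/// Derive the split mask from the configured average chunk size.
/// Only powers of two give an all-ones mask, hence the restriction.
fn split_mask(avg_chunk_size: u64) -> u64 {
    assert!(avg_chunk_size.is_power_of_two(), "chunk_size must be a power of 2");
    avg_chunk_size - 1
}

fn main() {
    assert_eq!(split_mask(1024 * 1024), 0xFFFFF);      // 1 MB default
    assert_eq!(split_mask(2 * 1024 * 1024), 0x1FFFFF); // 2 MB
    println!("ok");
}
```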

Choosing Chunk Sizes

Smaller chunks:
Pros:
  • Better deduplication (finer granularity)
  • More efficient for small changes
Cons:
  • More chunks = larger index
  • Higher memory usage
  • More per-chunk overhead
Best for: databases, logs, frequently changing files

Example Configuration

use rustic_core::ConfigOptions;

let config_opts = ConfigOptions {
    chunker: Some(Chunker::Rabin),
    chunk_size: Some(2 * 1024 * 1024),      // 2 MB average
    chunk_min_size: Some(1 * 1024 * 1024),  // 1 MB min
    chunk_max_size: Some(16 * 1024 * 1024), // 16 MB max
    ..Default::default()
};
Chunk sizes are set at repository creation and cannot be changed. Choose carefully!

Deduplication Process

1. Chunking

Large files are split into chunks:
use rustic_core::chunker::ChunkIter;

let chunker = ChunkIter::from_config(&config, file_reader, file_size)?;

for chunk in chunker {
    let chunk_data = chunk?;
    // Process chunk...
}

2. Content Addressing

Each chunk gets a unique ID from its SHA-256 hash:
use rustic_core::crypto::hasher::hash;

let chunk_id = hash(&chunk_data);  // SHA-256
Identical content always produces the same ID, regardless of:
  • File name or path
  • Modification time
  • Location in repository
  • Which backup it came from
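This property is easy to see in code. rustic uses SHA-256; the standard library's `DefaultHasher` stands in below so the example needs no external crates. The point is that only the chunk bytes feed the hash, never the name, path, or mtime:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative content addressing: hash only the chunk *bytes*.
/// (rustic_core uses SHA-256 here, not DefaultHasher.)
fn chunk_id(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // The same bytes coming from two different "files" get the same ID...
    assert_eq!(chunk_id(b"hello world"), chunk_id(b"hello world"));
    // ...while a one-byte difference yields a different ID.
    assert_ne!(chunk_id(b"hello world"), chunk_id(b"hello world!"));
    println!("identical content -> identical chunk ID");
}
```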

3. Deduplication Check

Before storing, check if chunk already exists:
// Look up chunk in index
if let Some(index_entry) = index.get_id(BlobType::Data, &chunk_id) {
    // Chunk exists! Skip upload
    statistics.files_unmodified += 1;
} else {
    // New chunk, need to save
    save_chunk(&chunk_id, &chunk_data)?;
    statistics.data_added += chunk_data.len();
}
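The whole check can be simulated end to end with an in-memory set standing in for the repository index (here the chunk bytes themselves serve as the ID). This mirrors the two-backup example from the top of the page:

```rust
use std::collections::HashSet;

/// Feed chunks through a dedup check; return bytes actually stored.
fn dedup_store<'a>(chunks: impl IntoIterator<Item = &'a [u8]>) -> usize {
    let mut index: HashSet<&'a [u8]> = HashSet::new();
    let mut stored = 0;
    for chunk in chunks {
        // `insert` returns false when the chunk ID is already indexed
        if index.insert(chunk) {
            stored += chunk.len(); // new chunk: would be uploaded
        }
    }
    stored
}

fn main() {
    // Backup 1: chunks A, B, C.  Backup 2: A and C unchanged, B modified, D new.
    let chunks: [&[u8]; 7] =
        [b"AAAA", b"BBBB", b"CCCC", b"AAAA", b"BBB!", b"CCCC", b"DDDD"];
    // 28 bytes processed, but only 20 stored (A and C reused in backup 2)
    assert_eq!(dedup_store(chunks), 20);
    println!("ok");
}
```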

4. Packing

New chunks are packed together for efficient storage:
// Multiple chunks -> single pack file  
let pack = Packer::new(
    be.clone(),
    BlobType::Data,
    indexer.clone(),
    config,
    total_size,
)?;

for chunk in new_chunks {
    pack.add(chunk_id, chunk_data)?;
}

let pack_id = pack.finalize()?;

Deduplication Statistics

The backup summary shows deduplication effectiveness:
pub struct SnapshotSummary {
    pub data_added: u64,         // Total uncompressed bytes
    pub data_added_packed: u64,   // After dedup + compression
    
    pub data_added_files: u64,    // New/changed file bytes
    pub data_added_files_packed: u64,  // Actual stored
}

Example Output

Files:       15,234 total (42 new, 156 modified)
Size:        2.1 GB processed
Added:       512 MB to repository (75% dedup + compression)
Unchanged:   15,036 files reused from previous backup

Calculating Deduplication Ratio

let dedup_ratio = 1.0 - (summary.data_added_packed as f64 
                        / summary.data_added as f64);

println!("Deduplication saved {:.1}%", dedup_ratio * 100.0);
// Output: "Deduplication saved 75.6%"

Global Deduplication

rustic_core deduplicates across all snapshots:

1. Within Files

Identical chunks within a single file are deduplicated. Example: sparse files, repeated patterns.

2. Across Files

Identical chunks in different files are deduplicated. Example: copies of files, similar documents.

3. Across Snapshots

Chunks from different backups are deduplicated. Example: unchanged files in incremental backups.

4. Across Sources

Different backup sources can share chunks. Example: backing up multiple machines with similar OS/software.

Deduplication Example

Backing up 3 similar Linux machines:
Machine 1: 50 GB -> 50 GB stored
Machine 2: 50 GB -> +5 GB stored (90% dedup)
Machine 3: 50 GB -> +5 GB stored (90% dedup)

Total: 150 GB data -> 60 GB stored (60% savings)
Most OS and application files are identical across machines!

Trade-offs

Better deduplication requires larger indexes. Index size grows with:
  • Number of unique chunks
  • Smaller chunk sizes (more chunks)
  • Repository age (accumulated data)
Memory usage:
// Full index loads all blob metadata
let repo = repo.to_indexed()?;  // High memory

// ID-only index for backups
let repo = repo.to_indexed_ids()?;  // Low memory
Smaller chunks = better deduplication but higher overhead:
Chunk Size   Dedup Ratio   Index Size   Performance
256 KB       95%           Large        Slower
512 KB       93%           Medium       Good
1 MB         90%           Small        Fast
2 MB         85%           Smaller      Faster
4 MB         80%           Smallest     Fastest
Exact numbers depend on data characteristics. These are representative values.
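The index-size column follows directly from chunk count. A rough back-of-the-envelope sketch, assuming ~48 bytes per index entry (a 32-byte SHA-256 ID plus pack/offset/length metadata; this figure is an illustration, not rustic's exact layout):

```rust
/// Rough index-size estimate: one ~48-byte entry per unique chunk.
fn index_bytes(repo_bytes: u64, avg_chunk: u64) -> u64 {
    let chunks = repo_bytes / avg_chunk;
    chunks * 48
}

fn main() {
    let tb = 1u64 << 40; // 1 TB of unique data
    // 1 MB chunks: ~1M entries -> ~48 MB of index
    println!("1 MB chunks:   {} MB", index_bytes(tb, 1 << 20) / (1 << 20));
    // 256 KB chunks: 4x the entries -> ~192 MB of index
    println!("256 KB chunks: {} MB", index_bytes(tb, 256 << 10) / (1 << 20));
}
```

Quartering the chunk size quadruples the number of index entries, which is the "larger index, higher memory" trade-off in concrete terms.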
CDC requires computing rolling hashes. Rabin chunking:
  • CPU cost: Moderate (polynomial math)
  • Benefit: Excellent deduplication
  • Hardware acceleration: carry-less multiply instructions on modern CPUs can speed up the GF(2) math
Alternative: Fixed-size chunking
  • CPU cost: Minimal (just counting)
  • Benefit: Lower overhead
  • Trade-off: Poor deduplication with file changes
pub enum Chunker {
    Rabin,      // Content-defined (default)
    FixedSize,  // Fixed boundaries
}

Advanced: Rabin Polynomial Math

The Rabin chunker uses polynomial arithmetic in GF(2):
pub trait PolynomExtend {
    fn irreducible(&self) -> bool;  // Check if polynomial is irreducible
    fn gcd(self, other: Self) -> Self;  // Greatest common divisor
    fn mulmod(self, other: Self, modulo: Self) -> Self;  // Multiply mod polynomial
}
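To make the GF(2) arithmetic concrete, here is a from-scratch sketch of carry-less multiplication and reduction. It is not rustic_core's implementation (the `PolynomExtend` methods above operate on `u64`), but it is the same math: addition is XOR, so there are no carries between bit positions:

```rust
/// Degree of a GF(2) polynomial stored as bits (-1 for the zero polynomial).
fn deg(p: u128) -> i32 {
    127 - p.leading_zeros() as i32
}

/// Carry-less multiplication: XOR together shifted copies of `a`,
/// one per set bit of `b`.
fn gf2_mul(a: u64, b: u64) -> u128 {
    let mut result = 0u128;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            result ^= (a as u128) << i;
        }
    }
    result
}

/// Long division with XOR: reduce `a` modulo the polynomial `m`.
fn gf2_mod(mut a: u128, m: u128) -> u128 {
    let dm = deg(m);
    while deg(a) >= dm {
        a ^= m << (deg(a) - dm) as u32; // cancel the leading term
    }
    a
}

fn main() {
    // (x + 1)^2 = x^2 + 1 over GF(2): the middle 2x term cancels (1 ^ 1 = 0)
    assert_eq!(gf2_mul(0b11, 0b11), 0b101);
    // x^2 + 1 mod (x^2 + x + 1) = x, since x^2 ≡ x + 1
    assert_eq!(gf2_mod(0b101, 0b111), 0b10);
    println!("ok");
}
```

`mulmod` as in the trait above is then just `gf2_mod(gf2_mul(a, b), m)`; irreducibility testing builds on these same primitives.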

Generating Random Polynomials

rustic can generate irreducible polynomials for new repositories:
use rustic_core::chunker::rabin::random_poly;

// Generate random irreducible polynomial of degree 53
let poly = random_poly()?;

// Use in repository config  
let config = ConfigFile::new(2, repo_id, poly);
Using different polynomials prevents deduplication between repositories, which can be useful for security (prevents fingerprinting attacks).

Monitoring Deduplication

Track deduplication efficiency over time:
use rustic_core::commands::repoinfo::RepoFileInfos;

let infos = repo.infos_files()?;

println!("Total packs: {}", infos.packs.len());
println!("Total blobs: {}", infos.blobs);
println!("Total size: {} bytes", infos.total_size);

// Calculate overall space savings (dedup + compression)
let compression_ratio = infos.total_size_compressed as f64 
                       / infos.total_size as f64;
println!("Overall compression: {:.1}%", (1.0 - compression_ratio) * 100.0);

See Also

Repository

How deduplicated data is organized

Encryption

How encryption preserves deduplication

Backends

Where deduplicated packs are stored

Snapshots

How snapshots reference deduplicated chunks
