Chunking Algorithms

The chunker module splits large files into smaller chunks (blobs) for deduplication and efficient storage. Rustic supports two chunking algorithms; the choice is configured per repository.

Overview

Chunking converts files into variable or fixed-size pieces:

File (10 MB)  →  [Chunk 1][Chunk 2][Chunk 3] ...  →  Blobs
                   1.2 MB   850 KB   2.1 MB   ...

Benefits:

Deduplication: Identical chunks across files are stored once
Efficient updates: Only modified chunks need to be re-uploaded
Parallelization: Process multiple chunks concurrently
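The deduplication benefit can be sketched with a content-addressed map. This is a minimal illustration, not rustic's storage layer: `BlobStore`, `add_file`, and `chunk_id` are hypothetical names, and `DefaultHasher` stands in for the cryptographic hash a real repository would use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in content ID. A real repository would use a cryptographic hash;
/// `DefaultHasher` only keeps this sketch dependency-free.
fn chunk_id(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical in-memory blob store keyed by content ID.
#[derive(Default)]
struct BlobStore {
    blobs: HashMap<u64, Vec<u8>>,
}

impl BlobStore {
    /// Stores each distinct chunk once and returns the IDs describing the file.
    fn add_file(&mut self, chunks: &[&[u8]]) -> Vec<u64> {
        chunks
            .iter()
            .map(|chunk| {
                let id = chunk_id(chunk);
                // Only the first occurrence of a chunk is copied into the store.
                self.blobs.entry(id).or_insert_with(|| chunk.to_vec());
                id
            })
            .collect()
    }
}
```

Two files that share a header and a footer chunk end up with four stored blobs instead of six; each file keeps its own list of three IDs.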

Chunker Types

use rustic_core::repofile::configfile::Chunker;

pub enum Chunker {
    Rabin,      // Content-defined (default)
    FixedSize,  // Fixed-size chunks
}

Configured in ConfigFile:

let config = ConfigFile {
    chunker: Some(Chunker::Rabin),
    chunk_size: Some(1024 * 1024),      // 1 MiB average
    chunk_min_size: Some(512 * 1024),   // 512 KiB min
    chunk_max_size: Some(8 * 1024 * 1024), // 8 MiB max
    ..Default::default()
};

Rabin Chunker (Content-Defined)

Uses rolling hash fingerprints to find chunk boundaries based on content.

How It Works

Sliding window: Compute hash over last 64 bytes
Boundary detection: When hash & mask == 0
Min/max enforcement: Respect size constraints
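The sliding-window step can be tried without any crates. The sketch below is a stand-in under stated assumptions: it uses a plain polynomial rolling hash with wrapping 64-bit arithmetic rather than a Rabin fingerprint, and `RollingHash`, `hash_stream`, `WINDOW`, and `BASE` are hypothetical names. What it shares with the chunker described here is the property that the hash depends only on the last 64 bytes.

```rust
const WINDOW: usize = 64;
// Arbitrary odd multiplier (the 64-bit FNV prime); an assumption of this sketch.
const BASE: u64 = 0x0000_0100_0000_01B3;

/// Polynomial rolling hash over the last `WINDOW` bytes (arithmetic mod 2^64).
struct RollingHash {
    window: [u8; WINDOW],
    pos: usize,
    hash: u64,
    base_pow: u64, // BASE^WINDOW, used to cancel the outgoing byte
}

impl RollingHash {
    fn new() -> Self {
        let mut base_pow = 1u64;
        for _ in 0..WINDOW {
            base_pow = base_pow.wrapping_mul(BASE);
        }
        Self { window: [0; WINDOW], pos: 0, hash: 0, base_pow }
    }

    /// Pushes one byte into the window and removes the byte that falls out.
    fn slide(&mut self, byte: u8) {
        let outgoing = self.window[self.pos];
        self.window[self.pos] = byte;
        self.pos = (self.pos + 1) % WINDOW;
        self.hash = self
            .hash
            .wrapping_mul(BASE)
            .wrapping_add(u64::from(byte))
            .wrapping_sub(self.base_pow.wrapping_mul(u64::from(outgoing)));
    }
}

/// Feeds a whole byte stream through a fresh hash and returns the final value.
fn hash_stream(data: &[u8]) -> u64 {
    let mut rolling = RollingHash::new();
    for &byte in data {
        rolling.slide(byte);
    }
    rolling.hash
}
```

Because the outgoing byte is cancelled on every step, two streams that end in the same 64 bytes produce the same hash no matter what came before. That is what lets boundaries follow the content rather than the file offset.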

use rustic_core::chunker::rabin::ChunkIter;
use rustic_cdc::{Rabin64, RollingHash64};

// Create Rabin chunker
let poly = 0x3DA3358B4DC173;
let rabin = Rabin64::new_with_polynom(poly);

let chunker = ChunkIter::new(
    rabin,
    1024 * 1024,     // chunk_size (must be power of 2)
    512 * 1024,      // chunk_min_size
    8 * 1024 * 1024, // chunk_max_size
    file_reader,
    file_size_hint,
)?;

// Iterate over chunks
for chunk in chunker {
    let data = chunk?;
    process_chunk(data);
}

Chunk Boundary Detection

The split mask determines average chunk size:

// chunk_size must be power of 2
let split_mask = chunk_size - 1;

// Boundary when low bits are zero
if (rabin.hash & split_mask) == 0 {
    // End current chunk
}

Example (chunk_size = 1 MiB = 2^20):

mask = 0xFFFFF (20 bits)
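Probability of a boundary at any given byte = 1 / 2^20 ≈ 1 in 1M
Average distance scanned before a boundary ≈ 1 MiB (the nominal average; chunk_min_size and chunk_max_size shift the actual mean)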
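The arithmetic in this example can be checked with a few lines of Rust. The helper names `split_mask` and `boundary_probability` are hypothetical, and the probability assumes the low hash bits are uniformly distributed.

```rust
/// Returns the split mask for a power-of-two target size, or `None` otherwise.
fn split_mask(chunk_size: u64) -> Option<u64> {
    chunk_size.is_power_of_two().then(|| chunk_size - 1)
}

/// Chance that a uniformly distributed hash has all masked bits equal to zero.
fn boundary_probability(mask: u64) -> f64 {
    1.0 / (mask as f64 + 1.0)
}
```

For `chunk_size = 1 << 20` the mask is `0xFFFFF` with 20 bits set, and the per-byte boundary probability is exactly 1 / 1,048,576. A size such as 1,000,000 is rejected because subtracting one would not yield a contiguous run of low bits.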

Size Constraints

// Minimum size: always read this much first
reader.read_exact(&mut chunk[..chunk_min_size])?;

// Then look for boundary
loop {
    if chunk.len() >= chunk_max_size {
        break; // Force chunk end
    }
    if (rabin.hash & split_mask) == 0 {
        break; // Natural boundary
    }
    // Keep reading...
}

Advantages

Resilient to edits: Insertions/deletions only affect nearby chunks
Better deduplication: Same content = same chunks regardless of position
Variable sizes: Adapts to content structure
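The size constraints and the resilience to edits can be explored with a toy chunker. This sketch is not the Rabin implementation: it swaps in a Gear-style rolling hash (shift left, add a per-byte pseudo-random value) to stay dependency-free, and `gear`, `cdc_chunks`, and `reused_chunks` are hypothetical names. The min/max handling follows the loop shown under Size Constraints.

```rust
use std::collections::HashSet;

/// Per-byte pseudo-random value (SplitMix64 finalizer), standing in for a Gear table.
fn gear(byte: u8) -> u64 {
    let mut z = u64::from(byte).wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Toy content-defined chunker returning the chunks as slices of `data`.
fn cdc_chunks(data: &[u8], min: usize, avg: usize, max: usize) -> Vec<&[u8]> {
    assert!(avg.is_power_of_two() && min <= avg && avg <= max);
    let mask = (avg - 1) as u64;
    let mut chunks = Vec::new();
    let mut rest = data;
    while !rest.is_empty() {
        // Never read past the maximum size or the end of the input.
        let limit = rest.len().min(max);
        let mut hash = 0u64;
        let mut len = 0;
        while len < limit {
            // Shifting left ages old bytes out of the low bits of the hash.
            hash = (hash << 1).wrapping_add(gear(rest[len]));
            len += 1;
            // A boundary only counts once the minimum size has been reached.
            if len >= min && (hash & mask) == 0 {
                break;
            }
        }
        let (chunk, tail) = rest.split_at(len);
        chunks.push(chunk);
        rest = tail;
    }
    chunks
}

/// Counts chunks of `after` whose content already occurs among the chunks of `before`.
fn reused_chunks(before: &[u8], after: &[u8], min: usize, avg: usize, max: usize) -> usize {
    let known: HashSet<&[u8]> = cdc_chunks(before, min, avg, max).into_iter().collect();
    cdc_chunks(after, min, avg, max)
        .into_iter()
        .filter(|chunk| known.contains(chunk))
        .count()
}
```

With small parameters such as `min = 16`, `avg = 64`, `max = 256`, prepending a single byte to pseudo-random input typically changes only the first chunk or two; the chunker then lands on the same content boundaries again and the remaining chunks are reused.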

Configuration

use rustic_core::repofile::configfile::{Chunker, ConfigFile};

let config = ConfigFile {
    chunker: Some(Chunker::Rabin),
    chunker_polynomial: "3da3358b4dc173".to_string(),
    chunk_size: Some(1024 * 1024),         // 1 MiB average
    chunk_min_size: Some(512 * 1024),      // 512 KiB min
    chunk_max_size: Some(8 * 1024 * 1024), // 8 MiB max
    ..Default::default()
};

Requirements:

chunk_size must be a power of 2
chunk_min_size ≤ chunk_size
chunk_max_size ≥ chunk_size
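These three requirements are easy to check up front. The following is a sketch only: `ChunkSizes`, `SizeError`, and `validate` are hypothetical names for illustration and are not part of rustic_core.

```rust
/// Hypothetical bundle of the three size settings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ChunkSizes {
    min: usize,
    avg: usize,
    max: usize,
}

/// One variant per requirement that can be violated.
#[derive(Debug, PartialEq, Eq)]
enum SizeError {
    NotPowerOfTwo,
    MinAboveAverage,
    MaxBelowAverage,
}

/// Returns the sizes unchanged if they satisfy all three requirements.
fn validate(sizes: ChunkSizes) -> Result<ChunkSizes, SizeError> {
    if !sizes.avg.is_power_of_two() {
        return Err(SizeError::NotPowerOfTwo);
    }
    if sizes.min > sizes.avg {
        return Err(SizeError::MinAboveAverage);
    }
    if sizes.max < sizes.avg {
        return Err(SizeError::MaxBelowAverage);
    }
    Ok(sizes)
}
```

The values used in the configuration above (512 KiB, 1 MiB, 8 MiB) pass; an average of 1,000,000 bytes fails the power-of-two check.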

Fixed-Size Chunker

Splits files into equal-size chunks; only the last chunk of a file may be shorter.

Usage

use rustic_core::chunker::fixed_size::ChunkIter;

let chunk_size = 1024 * 1024; // 1 MiB

let chunker = ChunkIter::new(
    chunk_size,
    file_reader,
    file_size_hint,
);

// Iterate over chunks
for chunk in chunker {
    let data = chunk?;
    assert!(data.len() <= chunk_size);
}

Advantages

Simple: Predictable chunk sizes
Fast: No hash computation
Deterministic: Same file = same chunks every time

Disadvantages

Poor deduplication: An insertion or deletion shifts every following chunk boundary
Offset-dependent: Identical content deduplicates only when it falls on the same chunk boundaries
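The shifting problem is easy to reproduce with slices. This sketch uses the standard library's `chunks` method instead of the crate's `ChunkIter`, and `fixed_chunks` and `shared_chunks` are hypothetical helper names.

```rust
use std::collections::HashSet;

/// Splits `data` into `chunk_size`-byte pieces; the last piece may be shorter.
fn fixed_chunks(data: &[u8], chunk_size: usize) -> Vec<&[u8]> {
    data.chunks(chunk_size).collect()
}

/// Counts chunks of `after` whose content already occurs among the chunks of `before`.
fn shared_chunks(before: &[u8], after: &[u8], chunk_size: usize) -> usize {
    let known: HashSet<&[u8]> = fixed_chunks(before, chunk_size).into_iter().collect();
    fixed_chunks(after, chunk_size)
        .into_iter()
        .filter(|chunk| known.contains(chunk))
        .count()
}
```

Take 1024 bytes that cycle through the values 0 to 255 and a chunk size of 64. Appending three bytes leaves all 16 original chunks reusable, while prepending a single byte moves every boundary by one position and leaves none.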

When to Use

Small repositories (deduplication less important)
Append-only files (no insertions)
Maximum performance (no hashing overhead)

Chunking in Practice

During Backup

// Read file and chunk it
let file = File::open(path)?;
let file_size = file.metadata()?.len();

// Create chunker based on config
let chunker = match config.chunker() {
    Chunker::Rabin => {
        let rabin = Rabin64::new_with_polynom(config.poly()?);
        ChunkIter::new_rabin(
            rabin,
            config.chunk_size(),
            config.chunk_min_size(),
            config.chunk_max_size(),
            file,
            file_size as usize,
        )?
    }
    Chunker::FixedSize => {
        ChunkIter::new_fixed(
            config.chunk_size(),
            file,
            file_size as usize,
        )
    }
};

// Process chunks
let mut blob_ids = Vec::new();
for chunk in chunker {
    let data = chunk?;
    let blob_id = hash(&data).into();
    
    // Check if blob exists
    if !index.has_blob(&blob_id) {
        packer.add(&data)?;
    }
    
    blob_ids.push(blob_id);
}

Chunk Storage

Chunks become data blobs:

// Each chunk = one blob
let chunk_data = chunker.next().transpose()?.expect("no chunk left");
let blob_id = hash(&chunk_data).into();

// Store in pack file
let encrypted = key.encrypt_data(&chunk_data)?;
packer.add_blob(blob_id, encrypted)?;

// Record in file node
node.content = Some(blob_ids);

Chunk Statistics

Distribution

With the settings above, Rabin chunk sizes fall between the configured minimum and maximum (the final chunk of a file can be shorter):

Min: 512 KiB  ━━━━━━━━━━━━━━━━━━━━━━
Avg: 1 MiB    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Max: 8 MiB    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
              └──────────────────────────────────┘
              Chunk size range

Performance Impact

Smaller chunks:

✓ Better deduplication
✗ More blobs to manage
✗ Higher index size
✗ More overhead

Larger chunks:

✓ Fewer blobs
✓ Lower overhead
✗ Less deduplication
✗ Larger transfers for small changes

The default (1 MiB average) balances these trade-offs.
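A back-of-the-envelope estimate makes the trade-off concrete. The helpers below are hypothetical, and the bytes-per-index-entry figure is a placeholder parameter chosen by the caller, not a measured value from rustic.

```rust
/// Rough number of blobs for `total_bytes` of unique data at a given average chunk size.
fn estimated_blobs(total_bytes: u64, avg_chunk: u64) -> u64 {
    // Round up so a partial trailing chunk still counts as one blob.
    (total_bytes + avg_chunk - 1) / avg_chunk
}

/// Rough index size, assuming a caller-supplied cost per index entry.
fn estimated_index_bytes(total_bytes: u64, avg_chunk: u64, bytes_per_entry: u64) -> u64 {
    estimated_blobs(total_bytes, avg_chunk) * bytes_per_entry
}
```

For 1 TiB of unique data, a 1 MiB average gives about 1,048,576 blobs, while a 256 KiB average gives about 4,194,304, so the index grows by the same factor of four.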

Chunker Polynomial

The polynomial affects chunk boundaries:

// Random polynomial generation
use rustic_cdc::{Polynom, Polynom64};

let poly = Polynom64::generate_irreducible_random(42)?;
println!("Polynomial: 0x{:x}", poly);

Recommendation: Use the default polynomial unless you have specific needs. Different polynomials produce different chunk boundaries, breaking deduplication across repositories.

Compatibility

Chunking is per-repository:

Set once during repo init
Cannot be changed later
Mixing chunkers breaks deduplication

// Check compatibility before copy/merge
if !source_config.has_same_chunker(&dest_config) {
    return Err("Incompatible chunking configuration");
}
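What such a compatibility check might compare can be sketched as follows. The types `ChunkerKind` and `ChunkerParams` and the function `check_compatible` are illustrative assumptions; they do not describe the actual rustic_core types or the real behavior of `has_same_chunker`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChunkerKind {
    Rabin,
    FixedSize,
}

/// Hypothetical snapshot of the chunking-related repository settings.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ChunkerParams {
    kind: ChunkerKind,
    polynomial: u64, // only meaningful for Rabin
    size: usize,
    min_size: usize, // only meaningful for Rabin
    max_size: usize, // only meaningful for Rabin
}

impl ChunkerParams {
    /// Two configurations cut identical chunks only if every relevant field matches.
    fn has_same_chunker(&self, other: &Self) -> bool {
        match (self.kind, other.kind) {
            (ChunkerKind::Rabin, ChunkerKind::Rabin) => self == other,
            (ChunkerKind::FixedSize, ChunkerKind::FixedSize) => self.size == other.size,
            _ => false,
        }
    }
}

/// Refuses a copy or merge when the two repositories would chunk differently.
fn check_compatible(src: &ChunkerParams, dst: &ChunkerParams) -> Result<(), &'static str> {
    if src.has_same_chunker(dst) {
        Ok(())
    } else {
        Err("Incompatible chunking configuration")
    }
}
```

In this sketch a differing polynomial alone is enough to make two Rabin configurations incompatible, which matches the recommendation to keep the default polynomial.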

rustic_cdc Integration

Rustic uses the rustic_cdc crate:

use rustic_cdc::{Rabin64, RollingHash64};

let mut rabin = Rabin64::new_with_polynom(poly);

// Reset and prefill window
rabin.reset_and_prefill_window(initial_data.iter().copied());

// Slide over bytes
for byte in data {
    rabin.slide(byte);
    
    if (rabin.hash & split_mask) == 0 {
        // Chunk boundary
    }
}

Performance Tips

Buffer size: Use 4 KiB reads (default)
Size hints: Provide accurate file size for memory allocation
Parallel chunking: Process multiple files concurrently
Reuse chunkers: Avoid recreating Rabin state

// Efficient: reuse chunker state
let rabin = Rabin64::new_with_polynom(poly);

for file in files {
    let chunker = ChunkIter::new(rabin.clone(), ...);
    process_file(chunker)?;
}
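The parallel-chunking tip can be illustrated with scoped threads from the standard library. This is a simplified sketch: it spawns one thread per in-memory buffer and uses fixed-size splitting via `chunks`, whereas a real backup would more likely use a bounded worker pool and the configured chunker. The function name `chunk_counts_parallel` is hypothetical.

```rust
use std::thread;

/// Counts the fixed-size chunks of every buffer, one scoped thread per buffer.
/// `chunk_size` must be non-zero.
fn chunk_counts_parallel(files: &[Vec<u8>], chunk_size: usize) -> Vec<usize> {
    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = files
            .iter()
            .map(|file| scope.spawn(move || file.chunks(chunk_size).count()))
            .collect();
        // Join in spawn order so results line up with the input order.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("chunking thread panicked"))
            .collect()
    })
}
```

Scoped threads let the workers borrow the buffers directly, so no data is copied or wrapped in `Arc` just to be chunked.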
