

Chunking is the process of combining segments into semantically meaningful units for embedding and retrieval. Chunkr uses a sophisticated hierarchical chunking algorithm that respects document structure.

What is Chunking?

Chunking groups related segments together while:
  • ✅ Respecting document hierarchy (titles, sections)
  • ✅ Keeping related elements together (captions with images)
  • ✅ Honoring target length constraints
  • ✅ Maintaining semantic coherence
  • ✅ Optionally excluding headers and footers
Chunking happens after segmentation and segment processing. Each chunk contains one or more fully processed segments with their content, HTML, markdown, and optional LLM output.

Chunking Strategies

Chunkr supports two chunking approaches: hierarchical semantic chunking (target_length > 0) and one segment per chunk (target_length: 0).

Semantic chunking with hierarchy awareness

When target_length > 0, Chunkr uses hierarchical chunking:
{
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Word",
    "ignore_headers_and_footers": true
  }
}
Features:
  • Breaks at semantic boundaries (titles, sections)
  • Keeps captions with their images/tables
  • Respects document hierarchy
  • Targets specified token/word count
  • Never splits individual segments
Use cases:
  • RAG (Retrieval Augmented Generation)
  • Semantic search
  • Document Q&A
  • Content that needs context preservation

Hierarchical Chunking Algorithm

Hierarchy Levels

Chunkr assigns hierarchy levels to segment types:
function getHierarchyLevel(segmentType: SegmentType): number {
  switch (segmentType) {
    case 'Title':         return 3;  // Highest
    case 'SectionHeader': return 2;  // Medium
    default:              return 1;  // Lowest
  }
}
Chunking rules:
  • A new chunk starts when hierarchy increases, i.e. a segment ranks higher than the one before it
  • Example: body Text (1) followed by a SectionHeader (2) starts a new chunk at the SectionHeader
  • Segments at the same or lower hierarchy continue the current chunk if space allows
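Ignoring target length and pairing for a moment, the hierarchy rule alone can be sketched as follows (a minimal illustration with hypothetical helper names; `chunkByHierarchy` is not part of the Chunkr API):

```typescript
type SegmentType = 'Title' | 'SectionHeader' | 'Text' | string;

function getHierarchyLevel(segmentType: SegmentType): number {
  switch (segmentType) {
    case 'Title':         return 3;  // Highest
    case 'SectionHeader': return 2;  // Medium
    default:              return 1;  // Lowest
  }
}

// Sketch: a hierarchy increase relative to the previous segment starts a new chunk.
function chunkByHierarchy(types: SegmentType[]): SegmentType[][] {
  const chunks: SegmentType[][] = [];
  let current: SegmentType[] = [];
  let prevLevel = 0;
  for (const t of types) {
    const level = getHierarchyLevel(t);
    if (current.length > 0 && level > prevLevel) {
      chunks.push(current);  // close the current chunk at the hierarchy jump
      current = [];
    }
    current.push(t);
    prevLevel = level;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

For example, `chunkByHierarchy(['Text', 'Text', 'SectionHeader', 'Text'])` breaks at the SectionHeader, yielding two chunks.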

Pairing Logic

Chunkr keeps related elements together: Caption-Picture Pairing:
// Caption followed by a Picture
if (segment.type === 'Caption' && next.type === 'Picture') {
  // Keep the pair together, starting a new chunk if they don't fit
  if (currentLength + segmentLength + nextLength > targetLength) {
    startNewChunk();
  }
  addToChunk(segment);  // the caption
  addToChunk(next);     // the picture
  markAsPaired(segment, next);
}

// Picture followed by a Caption
if (segment.type === 'Picture' && next.type === 'Caption') {
  // Same pairing logic
}

// Tables also pair with captions
if (segment.type === 'Table' && next.type === 'Caption') {
  // Same pairing logic
}
Pairing rules:
  • Caption + Picture stay together
  • Table + Caption stay together
  • Picture + Caption stay together
  • Once paired, elements are marked to prevent re-pairing
  • Pairing takes precedence over target length

Length Calculation

Chunk length is calculated using the configured tokenizer:
// Calculate segment length (async because HuggingFace tokenizers load lazily)
async function countEmbedWords(
  segment: Segment,
  config: Configuration
): Promise<number> {
  const embedContent = getEmbedContent(segment, config);

  switch (config.chunk_processing.tokenizer) {
    case 'Word':
      return embedContent.split(/\s+/).length;

    case 'Cl100kBase': {
      const encoder = cl100k_base();
      return encoder.encode(embedContent).length;
    }

    case 'xlm-roberta-base':
    case 'bert-base-uncased':
    default: {  // Named or custom HuggingFace tokenizer
      const tokenizer = await loadTokenizer(config.chunk_processing.tokenizer);
      return tokenizer.encode(embedContent).length;
    }
  }
}

// Chunk length = sum of segment lengths
const lengths = await Promise.all(
  chunk.segments.map(s => countEmbedWords(s, config))
);
chunk.chunk_length = lengths.reduce((a, b) => a + b, 0);
Length calculation uses the embed content, which is determined by embed_sources configuration. See Embed Sources below.

Configuration

Chunk Processing Settings

{
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Word",
    "ignore_headers_and_footers": true
  }
}
target_length
number
default:"512"
Target number of tokens/words per chunk. Behavior:
  • 0: One segment per chunk
  • > 0: Combine segments up to this length
Recommendations:
  • 256-512: Standard for most RAG applications
  • 512-1024: Longer context, fewer chunks
  • 128-256: Shorter chunks, more granular
Individual segments are never split, so chunks may exceed target length if a single segment is longer.
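This guarantee can be sketched as a placement rule (a hypothetical helper, not part of the Chunkr API): a segment always fits into an empty chunk, even when it alone exceeds target_length.

```typescript
// Decide whether a segment goes into the current chunk or opens a new one.
function placeSegment(
  currentLength: number,
  segmentLength: number,
  targetLength: number
): 'current' | 'new' {
  // Never split a segment: an oversized segment simply gets its own chunk
  if (currentLength === 0) return 'current';
  if (currentLength + segmentLength > targetLength) return 'new';
  return 'current';
}
```

So an 800-token segment against a 512 target still lands in a single (oversized) chunk, while a 200-token segment arriving at a 400-token chunk starts a new one.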
tokenizer
TokenizerType
default:"Word"
Tokenizer for length calculation. Options:
  • "Word": Simple whitespace tokenization (fastest)
  • "Cl100kBase": OpenAI tokenizer (GPT-3.5, GPT-4, ada-002)
  • "xlm-roberta-base": Multilingual RoBERTa
  • "bert-base-uncased": BERT base
  • Any HuggingFace tokenizer (e.g., "Qwen/Qwen-tokenizer")
Matching your embedding model:
{
  "tokenizer": "Cl100kBase"  // For text-embedding-ada-002
}
Use the same tokenizer as your embedding model for accurate chunk sizing.
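One way to keep the two aligned is a small lookup from embedding model to tokenizer setting (a hypothetical helper; `tokenizerFor` is not part of the Chunkr API, and the model-name prefixes are assumptions):

```typescript
// Pick a tokenizer setting that matches the embedding model in use.
function tokenizerFor(embeddingModel: string): string {
  // OpenAI embedding models (ada-002, text-embedding-3-*) use cl100k_base
  if (embeddingModel.startsWith('text-embedding')) return 'Cl100kBase';
  if (embeddingModel === 'xlm-roberta-base') return 'xlm-roberta-base';
  if (embeddingModel === 'bert-base-uncased') return 'bert-base-uncased';
  return 'Word';  // fall back to whitespace counting
}
```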
ignore_headers_and_footers
boolean
default:"true"
Whether to exclude page headers and footers from chunks. Why exclude?
  • Headers/footers repeat across pages
  • They break reading order
  • They add noise to semantic search
When to include:
  • Document metadata is important
  • Page numbers are relevant
  • Custom header/footer processing needed

Embed Sources

Control what content goes into the chunk’s embed field:
{
  "segment_processing": {
    "Text": {
      "embed_sources": ["Content", "LLM"]
    },
    "Table": {
      "embed_sources": ["Content"]
    }
  }
}
Available sources:
  • "Content": Main content field (HTML or Markdown based on format)
  • "LLM": Custom LLM-generated content (if configured)
  • "HTML": (Deprecated) HTML representation
  • "Markdown": (Deprecated) Markdown representation
How it works:
// Generate embed text for chunk
chunk.embed = chunk.segments
  .map(segment => {
    // embed_sources comes from the segment type's segment_processing config
    const embedSources =
      config.segment_processing[segment.segment_type].embed_sources;
    const sources = [];

    for (const source of embedSources) {
      if (source === 'Content') sources.push(segment.content);
      if (source === 'LLM' && segment.llm) sources.push(segment.llm);
      if (source === 'HTML') sources.push(segment.html);
      if (source === 'Markdown') sources.push(segment.markdown);
    }

    return sources.join('\n');
  })
  .join('\n');
The order of embed_sources determines the order of content in the embed field.
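As a small, self-contained illustration of that ordering (a hypothetical helper mirroring the logic above, not the Chunkr API):

```typescript
interface Seg {
  content: string;
  llm?: string;
}

// Build the embed text for one segment, honoring the order of embed_sources.
function buildEmbed(seg: Seg, sources: Array<'Content' | 'LLM'>): string {
  const parts: string[] = [];
  for (const s of sources) {
    if (s === 'Content') parts.push(seg.content);
    if (s === 'LLM' && seg.llm) parts.push(seg.llm);
  }
  return parts.join('\n');
}
```

With `['LLM', 'Content']`, an LLM-generated summary lands before the raw content in the embed text; with `['Content', 'LLM']`, after it.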

Chunk Structure

Each chunk contains:
interface Chunk {
  chunk_id: string;        // UUID
  chunk_length: number;    // Token/word count
  segments: Segment[];     // Array of segments
  embed?: string;          // Combined content for embedding
}

Example Chunk

{
  "chunk_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "chunk_length": 245,
  "segments": [
    {
      "segment_id": "seg-1",
      "segment_type": "Title",
      "text": "Introduction",
      "content": "# Introduction",
      "page_number": 1,
      // ... other fields
    },
    {
      "segment_id": "seg-2",
      "segment_type": "Text",
      "text": "This paper introduces a novel approach...",
      "content": "This paper introduces a novel approach...",
      "page_number": 1,
      // ... other fields
    }
  ],
  "embed": "# Introduction\n\nThis paper introduces a novel approach..."
}

Common Patterns

RAG Application

{
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Cl100kBase",
    "ignore_headers_and_footers": true
  },
  "segment_processing": {
    "Text": {
      "embed_sources": ["Content"]
    },
    "Table": {
      "strategy": "LLM",
      "embed_sources": ["Content"]
    },
    "Picture": {
      "embed_sources": ["Content", "LLM"]
    }
  }
}
Embed the chunk.embed field for semantic search.
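A minimal sketch of preparing those embed inputs before calling an embedding model (hypothetical types and helper; assumes chunks were already fetched from the API):

```typescript
interface Chunk {
  chunk_id: string;
  chunk_length: number;
  embed?: string;
}

// Gather non-empty embed texts, keyed by chunk_id, ready to send to an
// embedding model in one batch.
function embedInputs(chunks: Chunk[]): Array<{ id: string; text: string }> {
  return chunks
    .filter(c => c.embed !== undefined && c.embed.trim().length > 0)
    .map(c => ({ id: c.chunk_id, text: c.embed as string }));
}
```

Keeping the chunk_id alongside each text lets you map embedding vectors back to chunks when storing them in a vector database.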

Element Extraction

{
  "chunk_processing": {
    "target_length": 0  // One segment per chunk
  },
  "segmentation_strategy": "LayoutAnalysis"
}
Process chunks based on segment_type:
const tables = chunks
  .filter(c => c.segments[0].segment_type === 'Table')
  .map(c => c.segments[0]);

Long-Form Content

{
  "chunk_processing": {
    "target_length": 1024,
    "tokenizer": "Word",
    "ignore_headers_and_footers": true
  }
}
Larger chunks for documents with long, connected passages.

Multilingual Documents

{
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "xlm-roberta-base",  // Multilingual
    "ignore_headers_and_footers": true
  }
}

Performance Considerations

Caching

Chunkr caches token counts for performance:
// LRU cache with 10,000 entries
const WORD_COUNT_CACHE = new LRUCache(10000);

function countEmbedWords(segment, config) {
  const cacheKey = `${segment.segment_id}-${config.chunk_processing.tokenizer}`;
  
  // Check cache first
  if (WORD_COUNT_CACHE.has(cacheKey)) {
    return WORD_COUNT_CACHE.get(cacheKey);
  }
  
  // Calculate and cache
  const count = calculateTokenCount(segment, config);
  WORD_COUNT_CACHE.set(cacheKey, count);
  return count;
}

Parallel Processing

Token counting is parallelized:
// Pre-calculate all token counts in parallel
await Promise.all(
  segments.map(segment => countEmbedWords(segment, config))
);

// Then chunk sequentially (order matters)
for (const segment of segments) {
  // Token count already cached
  addToChunk(segment);
}

Tokenizer Performance

Tokenizer         Speed          Accuracy         Use Case
Word              ⚡⚡⚡ Fastest   Approximate      Quick estimates
Cl100kBase        ⚡⚡ Fast       OpenAI models    GPT embeddings
xlm-roberta-base  ⚡ Moderate    Multilingual     Multi-language
Custom HF         ⚡ Moderate    Model-specific   Match your model

Edge Cases

Segments Longer Than Target

Segments are never split, so chunks may exceed target length:
// If a single segment is 800 tokens and target is 512:
chunk = {
  chunk_length: 800,  // Exceeds target
  segments: [longSegment]
}

Empty Chunks

Pages containing only headers and footers produce no chunks when ignore_headers_and_footers: true:
// Page with only header and footer
segments = [
  { type: 'PageHeader', ... },
  { type: 'PageFooter', ... }
];

// Results in zero chunks (both ignored)
chunks = [];

Caption Without Picture

Unpaired elements are treated as regular segments:
segments = [
  { type: 'Caption', text: 'Figure 1: ...' },
  { type: 'Text', text: 'More content' }
];

// No pairing, chunked normally
chunk = {
  segments: [caption, text],
  chunk_length: combined_length
};

Debugging Chunks

Inspect chunking results:
// Log chunk structure
console.log(`Total chunks: ${chunks.length}`);

chunks.forEach((chunk, i) => {
  console.log(`\nChunk ${i + 1}:`);
  console.log(`  Length: ${chunk.chunk_length} tokens`);
  console.log(`  Segments: ${chunk.segments.length}`);
  
  chunk.segments.forEach(seg => {
    console.log(`    - ${seg.segment_type}: "${seg.text.slice(0, 50)}..."`);
  });
});

// Check for outliers
const avgLength = chunks.reduce((sum, c) => sum + c.chunk_length, 0) / chunks.length;
const outliers = chunks.filter(c => 
  c.chunk_length > avgLength * 2 || 
  c.chunk_length < avgLength * 0.5
);

console.log(`\nOutliers: ${outliers.length}`);

Next Steps

Segmentation

Understand how segments are created

Pipelines

See chunking in the full pipeline

API Reference

Complete API documentation

Overview

Back to concepts overview
