

Chunking is the process of combining segments into semantically meaningful units for embedding and retrieval. Chunkr uses a sophisticated hierarchical chunking algorithm that respects document structure.

What is Chunking?

Chunking groups related segments together while:
  • ✅ Respecting document hierarchy (titles, sections)
  • ✅ Keeping related elements together (captions with images)
  • ✅ Honoring target length constraints
  • ✅ Maintaining semantic coherence
  • ✅ Optionally excluding headers and footers
Chunking happens after segmentation and segment processing. Each chunk contains one or more fully processed segments with their content, HTML, markdown, and optional LLM output.

Chunking Strategies

Chunkr supports two chunking approaches: hierarchical semantic chunking (target_length > 0) and one segment per chunk (target_length: 0).

Semantic chunking with hierarchy awareness

When target_length > 0, Chunkr uses hierarchical chunking:
{
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Word",
    "ignore_headers_and_footers": true
  }
}
Features:
  • Breaks at semantic boundaries (titles, sections)
  • Keeps captions with their images/tables
  • Respects document hierarchy
  • Targets specified token/word count
  • Never splits individual segments
Use cases:
  • RAG (Retrieval Augmented Generation)
  • Semantic search
  • Document Q&A
  • Content that needs context preservation

Hierarchical Chunking Algorithm

Hierarchy Levels

Chunkr assigns hierarchy levels to segment types:
function getHierarchyLevel(segmentType: SegmentType): number {
  switch (segmentType) {
    case 'Title':         return 3;  // Highest
    case 'SectionHeader': return 2;  // Medium
    default:              return 1;  // Lowest
  }
}
Chunking rules:
  • A new chunk starts when hierarchy increases, i.e. a segment ranks higher than the one before it
  • Example: body Text (1) followed by a SectionHeader (2) starts a new chunk at the SectionHeader
  • Segments at the same or lower hierarchy continue the current chunk if space allows
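Ignoring target length and pairing for a moment, the hierarchy rule alone can be sketched as follows (a minimal illustration with hypothetical helper names; `chunkByHierarchy` is not part of the Chunkr API):

```typescript
type SegmentType = 'Title' | 'SectionHeader' | 'Text' | string;

function getHierarchyLevel(segmentType: SegmentType): number {
  switch (segmentType) {
    case 'Title':         return 3;  // Highest
    case 'SectionHeader': return 2;  // Medium
    default:              return 1;  // Lowest
  }
}

// Sketch: a hierarchy increase relative to the previous segment starts a new chunk.
function chunkByHierarchy(types: SegmentType[]): SegmentType[][] {
  const chunks: SegmentType[][] = [];
  let current: SegmentType[] = [];
  let prevLevel = 0;
  for (const t of types) {
    const level = getHierarchyLevel(t);
    if (current.length > 0 && level > prevLevel) {
      chunks.push(current);  // close the current chunk at the hierarchy jump
      current = [];
    }
    current.push(t);
    prevLevel = level;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

For example, `chunkByHierarchy(['Text', 'Text', 'SectionHeader', 'Text'])` breaks at the SectionHeader, yielding two chunks.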

Pairing Logic

Chunkr keeps related elements together: Caption-Picture Pairing:
// Caption followed by a Picture
if (segment.type === 'Caption' && next.type === 'Picture') {
  // Keep the pair together, starting a new chunk if they don't fit
  if (currentLength + segmentLength + nextLength > targetLength) {
    startNewChunk();
  }
  addToChunk(segment);  // the caption
  addToChunk(next);     // the picture
  markAsPaired(segment, next);
}

// Picture followed by a Caption
if (segment.type === 'Picture' && next.type === 'Caption') {
  // Same pairing logic
}

// Tables also pair with captions
if (segment.type === 'Table' && next.type === 'Caption') {
  // Same pairing logic
}
Pairing rules:
  • Caption + Picture stay together
  • Table + Caption stay together
  • Picture + Caption stay together
  • Once paired, elements are marked to prevent re-pairing
  • Pairing takes precedence over target length

Length Calculation

Chunk length is calculated using the configured tokenizer:
// Calculate segment length (async because HuggingFace tokenizers load lazily)
async function countEmbedWords(
  segment: Segment,
  config: Configuration
): Promise<number> {
  const embedContent = getEmbedContent(segment, config);

  switch (config.chunk_processing.tokenizer) {
    case 'Word':
      return embedContent.split(/\s+/).length;

    case 'Cl100kBase': {
      const encoder = cl100k_base();
      return encoder.encode(embedContent).length;
    }

    case 'xlm-roberta-base':
    case 'bert-base-uncased':
    default: {  // Named or custom HuggingFace tokenizer
      const tokenizer = await loadTokenizer(config.chunk_processing.tokenizer);
      return tokenizer.encode(embedContent).length;
    }
  }
}

// Chunk length = sum of segment lengths
const lengths = await Promise.all(
  chunk.segments.map(s => countEmbedWords(s, config))
);
chunk.chunk_length = lengths.reduce((a, b) => a + b, 0);
Length calculation uses the embed content, which is determined by embed_sources configuration. See Embed Sources below.

Configuration

Chunk Processing Settings

{
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Word",
    "ignore_headers_and_footers": true
  }
}
target_length
number
default:"512"
Target number of tokens/words per chunk. Behavior:
  • 0: One segment per chunk
  • > 0: Combine segments up to this length
Recommendations:
  • 256-512: Standard for most RAG applications
  • 512-1024: Longer context, fewer chunks
  • 128-256: Shorter chunks, more granular
Individual segments are never split, so chunks may exceed target length if a single segment is longer.
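This guarantee can be sketched as a placement rule (a hypothetical helper, not part of the Chunkr API): a segment always fits into an empty chunk, even when it alone exceeds target_length.

```typescript
// Decide whether a segment goes into the current chunk or opens a new one.
function placeSegment(
  currentLength: number,
  segmentLength: number,
  targetLength: number
): 'current' | 'new' {
  // Never split a segment: an oversized segment simply gets its own chunk
  if (currentLength === 0) return 'current';
  if (currentLength + segmentLength > targetLength) return 'new';
  return 'current';
}
```

So an 800-token segment against a 512 target still lands in a single (oversized) chunk, while a 200-token segment arriving at a 400-token chunk starts a new one.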
tokenizer
TokenizerType
default:"Word"
Tokenizer for length calculation. Options:
  • "Word": Simple whitespace tokenization (fastest)
  • "Cl100kBase": OpenAI tokenizer (GPT-3.5, GPT-4, ada-002)
  • "xlm-roberta-base": Multilingual RoBERTa
  • "bert-base-uncased": BERT base
  • Any HuggingFace tokenizer (e.g., "Qwen/Qwen-tokenizer")
Matching your embedding model:
{
  "tokenizer": "Cl100kBase"  // For text-embedding-ada-002
}
Use the same tokenizer as your embedding model for accurate chunk sizing.
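One way to keep the two aligned is a small lookup from embedding model to tokenizer setting (a hypothetical helper; `tokenizerFor` is not part of the Chunkr API, and the model-name prefixes are assumptions):

```typescript
// Pick a tokenizer setting that matches the embedding model in use.
function tokenizerFor(embeddingModel: string): string {
  // OpenAI embedding models (ada-002, text-embedding-3-*) use cl100k_base
  if (embeddingModel.startsWith('text-embedding')) return 'Cl100kBase';
  if (embeddingModel === 'xlm-roberta-base') return 'xlm-roberta-base';
  if (embeddingModel === 'bert-base-uncased') return 'bert-base-uncased';
  return 'Word';  // fall back to whitespace counting
}
```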
ignore_headers_and_footers
boolean
default:"true"
Whether to exclude page headers and footers from chunks. Why exclude?
  • Headers/footers repeat across pages
  • They break reading order
  • They add noise to semantic search
When to include:
  • Document metadata is important
  • Page numbers are relevant
  • Custom header/footer processing needed

Embed Sources

Control what content goes into the chunk’s embed field:
{
  "segment_processing": {
    "Text": {
      "embed_sources": ["Content", "LLM"]
    },
    "Table": {
      "embed_sources": ["Content"]
    }
  }
}
Available sources:
  • "Content": Main content field (HTML or Markdown based on format)
  • "LLM": Custom LLM-generated content (if configured)
  • "HTML": (Deprecated) HTML representation
  • "Markdown": (Deprecated) Markdown representation
How it works:
// Generate embed text for chunk
chunk.embed = chunk.segments
  .map(segment => {
    // embed_sources comes from the segment type's segment_processing config
    const embedSources =
      config.segment_processing[segment.segment_type].embed_sources;
    const sources = [];

    for (const source of embedSources) {
      if (source === 'Content') sources.push(segment.content);
      if (source === 'LLM' && segment.llm) sources.push(segment.llm);
      if (source === 'HTML') sources.push(segment.html);
      if (source === 'Markdown') sources.push(segment.markdown);
    }

    return sources.join('\n');
  })
  .join('\n');
The order of embed_sources determines the order of content in the embed field.
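As a small, self-contained illustration of that ordering (a hypothetical helper mirroring the logic above, not the Chunkr API):

```typescript
interface Seg {
  content: string;
  llm?: string;
}

// Build the embed text for one segment, honoring the order of embed_sources.
function buildEmbed(seg: Seg, sources: Array<'Content' | 'LLM'>): string {
  const parts: string[] = [];
  for (const s of sources) {
    if (s === 'Content') parts.push(seg.content);
    if (s === 'LLM' && seg.llm) parts.push(seg.llm);
  }
  return parts.join('\n');
}
```

With `['LLM', 'Content']`, an LLM-generated summary lands before the raw content in the embed text; with `['Content', 'LLM']`, after it.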

Chunk Structure

Each chunk contains:
interface Chunk {
  chunk_id: string;        // UUID
  chunk_length: number;    // Token/word count
  segments: Segment[];     // Array of segments
  embed?: string;          // Combined content for embedding
}

Example Chunk

{
  "chunk_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "chunk_length": 245,
  "segments": [
    {
      "segment_id": "seg-1",
      "segment_type": "Title",
      "text": "Introduction",
      "content": "# Introduction",
      "page_number": 1,
      // ... other fields
    },
    {
      "segment_id": "seg-2",
      "segment_type": "Text",
      "text": "This paper introduces a novel approach...",
      "content": "This paper introduces a novel approach...",
      "page_number": 1,
      // ... other fields
    }
  ],
  "embed": "# Introduction\n\nThis paper introduces a novel approach..."
}

Common Patterns

RAG Application

{
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Cl100kBase",
    "ignore_headers_and_footers": true
  },
  "segment_processing": {
    "Text": {
      "embed_sources": ["Content"]
    },
    "Table": {
      "strategy": "LLM",
      "embed_sources": ["Content"]
    },
    "Picture": {
      "embed_sources": ["Content", "LLM"]
    }
  }
}
Embed the chunk.embed field for semantic search.
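A minimal sketch of preparing those embed inputs before calling an embedding model (hypothetical types and helper; assumes chunks were already fetched from the API):

```typescript
interface Chunk {
  chunk_id: string;
  chunk_length: number;
  embed?: string;
}

// Gather non-empty embed texts, keyed by chunk_id, ready to send to an
// embedding model in one batch.
function embedInputs(chunks: Chunk[]): Array<{ id: string; text: string }> {
  return chunks
    .filter(c => c.embed !== undefined && c.embed.trim().length > 0)
    .map(c => ({ id: c.chunk_id, text: c.embed as string }));
}
```

Keeping the chunk_id alongside each text lets you map embedding vectors back to chunks when storing them in a vector database.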

Element Extraction

{
  "chunk_processing": {
    "target_length": 0  // One segment per chunk
  },
  "segmentation_strategy": "LayoutAnalysis"
}
Process chunks based on segment_type:
const tables = chunks
  .filter(c => c.segments[0].segment_type === 'Table')
  .map(c => c.segments[0]);

Long-Form Content

{
  "chunk_processing": {
    "target_length": 1024,
    "tokenizer": "Word",
    "ignore_headers_and_footers": true
  }
}
Larger chunks for documents with long, connected passages.

Multilingual Documents

{
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "xlm-roberta-base",  // Multilingual
    "ignore_headers_and_footers": true
  }
}

Performance Considerations

Caching

Chunkr caches token counts for performance:
// LRU cache with 10,000 entries
const WORD_COUNT_CACHE = new LRUCache(10000);

function countEmbedWords(segment, config) {
  const cacheKey = `${segment.segment_id}-${config.chunk_processing.tokenizer}`;
  
  // Check cache first
  if (WORD_COUNT_CACHE.has(cacheKey)) {
    return WORD_COUNT_CACHE.get(cacheKey);
  }
  
  // Calculate and cache
  const count = calculateTokenCount(segment, config);
  WORD_COUNT_CACHE.set(cacheKey, count);
  return count;
}

Parallel Processing

Token counting is parallelized:
// Pre-calculate all token counts in parallel
await Promise.all(
  segments.map(segment => countEmbedWords(segment, config))
);

// Then chunk sequentially (order matters)
for (const segment of segments) {
  // Token count already cached
  addToChunk(segment);
}

Tokenizer Performance

Tokenizer         Speed          Accuracy         Use Case
Word              ⚡⚡⚡ Fastest   Approximate      Quick estimates
Cl100kBase        ⚡⚡ Fast       OpenAI models    GPT embeddings
xlm-roberta-base  ⚡ Moderate    Multilingual     Multi-language
Custom HF         ⚡ Moderate    Model-specific   Match your model

Edge Cases

Segments Longer Than Target

Segments are never split, so chunks may exceed target length:
// If a single segment is 800 tokens and target is 512:
chunk = {
  chunk_length: 800,  // Exceeds target
  segments: [longSegment]
}

Empty Chunks

Pages containing only headers and footers produce no chunks when ignore_headers_and_footers: true:
// Page with only header and footer
segments = [
  { type: 'PageHeader', ... },
  { type: 'PageFooter', ... }
];

// Results in zero chunks (both ignored)
chunks = [];

Caption Without Picture

Unpaired elements are treated as regular segments:
segments = [
  { type: 'Caption', text: 'Figure 1: ...' },
  { type: 'Text', text: 'More content' }
];

// No pairing, chunked normally
chunk = {
  segments: [caption, text],
  chunk_length: combined_length
};

Debugging Chunks

Inspect chunking results:
// Log chunk structure
console.log(`Total chunks: ${chunks.length}`);

chunks.forEach((chunk, i) => {
  console.log(`\nChunk ${i + 1}:`);
  console.log(`  Length: ${chunk.chunk_length} tokens`);
  console.log(`  Segments: ${chunk.segments.length}`);
  
  chunk.segments.forEach(seg => {
    console.log(`    - ${seg.segment_type}: "${seg.text.slice(0, 50)}..."`);
  });
});

// Check for outliers
const avgLength = chunks.reduce((sum, c) => sum + c.chunk_length, 0) / chunks.length;
const outliers = chunks.filter(c => 
  c.chunk_length > avgLength * 2 || 
  c.chunk_length < avgLength * 0.5
);

console.log(`\nOutliers: ${outliers.length}`);

Next Steps

Segmentation

Understand how segments are created

Pipelines

See chunking in the full pipeline

API Reference

Complete API documentation

Overview

Back to concepts overview
