Chunking is the process of combining segments into semantically meaningful units for embedding and retrieval. Chunkr uses a hierarchical chunking algorithm that respects document structure.
What is Chunking?
Chunking groups related segments together while:
- ✅ Respecting document hierarchy (titles, sections)
- ✅ Keeping related elements together (captions with images)
- ✅ Honoring target length constraints
- ✅ Maintaining semantic coherence
- ✅ Optionally excluding headers and footers
Chunking happens after segmentation and segment processing. Each chunk contains one or more fully processed segments with their content, HTML, markdown, and optional LLM output.
Chunking Strategies
Chunkr supports two chunking approaches:
- Hierarchical Chunking: semantic chunking with hierarchy awareness, used when target_length > 0
- Segment-Level Chunks: one segment per chunk, used when target_length is 0
Hierarchical chunking features:
- Breaks at semantic boundaries (titles, sections)
- Keeps captions with their images/tables
- Respects document hierarchy
- Targets specified token/word count
- Never splits individual segments
Best for:
- RAG (Retrieval Augmented Generation)
- Semantic search
- Document Q&A
- Content that needs context preservation
Hierarchical Chunking Algorithm
Hierarchy Levels
Chunkr assigns a hierarchy level to each segment type and uses it to decide where chunks break:
- New chunks start when hierarchy increases
- Example: Title(3) → SectionHeader(2) → starts new chunk
- Lower hierarchy continues current chunk if space allows
Algorithm Flow
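The flow described in this section can be approximated with a short Python sketch. This is not Chunkr's implementation: the Segment class, the default level of 1 for non-heading types, and the simplification that every Title or SectionHeader opens a new chunk are assumptions made for illustration, and lengths are whitespace word counts.

```python
from dataclasses import dataclass

# Only Title(3) and SectionHeader(2) come from the text above;
# treating every other segment type as level 1 is an assumption.
HIERARCHY = {"Title": 3, "SectionHeader": 2}


@dataclass
class Segment:
    segment_type: str
    content: str


def length(segment: Segment) -> int:
    """Whitespace word count, standing in for the configured tokenizer."""
    return len(segment.content.split())


def hierarchical_chunks(segments: list[Segment], target_length: int) -> list[list[Segment]]:
    chunks: list[list[Segment]] = []
    current: list[Segment] = []
    current_len = 0
    for seg in segments:
        is_heading = HIERARCHY.get(seg.segment_type, 1) > 1
        too_long = current_len + length(seg) > target_length
        # Close the current chunk at a heading, or when the segment would not fit.
        if current and (is_heading or too_long):
            chunks.append(current)
            current, current_len = [], 0
        # Segments are never split, so an oversized one ends up alone in its chunk.
        current.append(seg)
        current_len += length(seg)
    if current:
        chunks.append(current)
    return chunks
```

With target_length=5, a Title followed by three words of text shares one chunk, while an eight-word segment is emitted whole even though it exceeds the target.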
Pairing Logic
Chunkr keeps related elements together, for example a caption with its picture:
- Caption + Picture stay together
- Table + Caption stay together
- Picture + Caption stay together
- Once paired, elements are marked to prevent re-pairing
- Pairing takes precedence over target length
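The pairing rules above can be sketched as follows. The function name, the use of plain type strings, and the assumption that only directly adjacent segments are paired are illustrative choices, not details confirmed by Chunkr.

```python
# Ordered (first, second) type pairs taken from the list above.
PAIRS = {("Caption", "Picture"), ("Table", "Caption"), ("Picture", "Caption")}


def pair_captions(segment_types: list[str]) -> list[list[int]]:
    """Group segment indices so that each listed pair stays together."""
    paired: set[int] = set()  # indices already consumed by a pair
    groups: list[list[int]] = []
    for i, kind in enumerate(segment_types):
        if i in paired:
            continue  # already attached to the previous segment
        nxt = i + 1
        if nxt < len(segment_types) and (kind, segment_types[nxt]) in PAIRS:
            paired.update((i, nxt))  # mark both to prevent re-pairing
            groups.append([i, nxt])
        else:
            groups.append([i])  # unpaired elements behave like regular segments
    return groups
```

Each returned group would then be treated as a single unit when filling a chunk, which is how pairing can take precedence over the target length.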
Length Calculation
Chunk length is calculated using the configured tokenizer. The calculation uses the embed content, which is determined by the embed_sources configuration. See Embed Sources below.
Configuration
Chunk Processing Settings
target_length
Target number of tokens/words per chunk.
Behavior:
- 0: One segment per chunk
- Greater than 0: Combine segments up to this length
Typical values:
- 256-512: Standard for most RAG applications
- 512-1024: Longer context, fewer chunks
- 128-256: Shorter chunks, more granular
tokenizer
Tokenizer for length calculation.
Options:
- "Word": Simple whitespace tokenization (fastest)
- "Cl100kBase": OpenAI tokenizer (GPT-3.5, GPT-4, ada-002)
- "xlm-roberta-base": Multilingual RoBERTa
- "bert-base-uncased": BERT base
- Any Hugging Face tokenizer (e.g., "Qwen/Qwen-tokenizer")
ignore_headers_and_footers
Whether to exclude page headers and footers from chunks.
Why exclude?
- Headers/footers repeat across pages
- They break reading order
- They add noise to semantic search
When to include:
- Document metadata is important
- Page numbers are relevant
- Custom header/footer processing needed
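To make the effect of these settings concrete, here is a small Python sketch that applies them to a list of segments. The settings dict layout, the helper names, and the PageHeader/PageFooter type names are assumptions for illustration; only the "Word" tokenizer is implemented, since the others need an external tokenizer library.

```python
HEADER_FOOTER_TYPES = {"PageHeader", "PageFooter"}  # assumed segment type names


def count_tokens(text: str, tokenizer: str = "Word") -> int:
    """Length of text under the named tokenizer (only "Word" is sketched)."""
    if tokenizer == "Word":
        # Whitespace tokenization: runs of whitespace count as one separator.
        return len(text.split())
    raise NotImplementedError(f"tokenizer {tokenizer!r} is not sketched here")


def chunkable_segments(segments: list[dict], ignore_headers_and_footers: bool) -> list[dict]:
    """Drop page headers and footers when the setting is enabled."""
    if not ignore_headers_and_footers:
        return list(segments)
    return [s for s in segments if s["segment_type"] not in HEADER_FOOTER_TYPES]


def total_length(segments: list[dict], settings: dict) -> int:
    """Combined length of the segments that would actually be chunked."""
    kept = chunkable_segments(segments, settings["ignore_headers_and_footers"])
    return sum(count_tokens(s["content"], settings["tokenizer"]) for s in kept)
```

Comparing total_length against settings["target_length"] shows whether a group of segments would fit in one chunk under the chosen tokenizer.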
Embed Sources
Control what content goes into the chunk’s embed field:
- "Content": Main content field (HTML or Markdown based on format)
- "LLM": Custom LLM-generated content (if configured)
- "HTML": (Deprecated) HTML representation
- "Markdown": (Deprecated) Markdown representation
The order of embed_sources determines the order of content in the embed field.
Chunk Structure
Each chunk contains one or more fully processed segments, along with the embed field used for retrieval.
Example Chunk
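As an illustration only, the sketch below models a chunk in Python and assembles its embed text from embed_sources. Apart from embed and segment_type, the class and field names, the newline separator, and the build_embed helper are assumptions rather than Chunkr's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Segment:
    segment_type: str
    content: str = ""          # main content (HTML or Markdown)
    llm: Optional[str] = None  # optional LLM-generated content


@dataclass
class Chunk:
    segments: list = field(default_factory=list)
    embed: str = ""


def build_embed(segments, embed_sources=("Content",)) -> str:
    """Concatenate each segment's sources in the configured order."""
    parts = []
    for seg in segments:
        for source in embed_sources:
            if source == "Content":
                text = seg.content
            elif source == "LLM":
                text = seg.llm
            else:
                text = None  # deprecated sources are ignored in this sketch
            if text:
                parts.append(text)
    return "\n".join(parts)
```

For a segment with both fields set, embed_sources=("LLM", "Content") places the LLM text ahead of the main content.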
Common Patterns
RAG Application
Use the chunk.embed field for semantic search.
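A minimal retrieval sketch over embed text might look like the following. The bag-of-words vectors are a toy stand-in for a real embedding model, and the chunk dicts and function names are hypothetical.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(query: str, chunks: list[dict], k: int = 3) -> list[dict]:
    """Rank chunks by similarity between the query and each chunk's embed text."""
    q = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c["embed"])), reverse=True)
    return ranked[:k]
```

In practice the vectors would be computed once per chunk and stored in a vector index instead of being recomputed per query.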
Element Extraction
Extract specific elements by filtering segments on segment_type.
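One way to pull out a single element type is a plain filter over every chunk's segments. The dict layout (a "segments" list on each chunk) and the function name are assumptions for this sketch.

```python
def extract_elements(chunks: list[dict], segment_type: str) -> list[dict]:
    """Collect every segment of the given type across all chunks, in order."""
    return [
        seg
        for chunk in chunks
        for seg in chunk["segments"]
        if seg["segment_type"] == segment_type
    ]
```

Calling extract_elements(chunks, "Table") would return only the table segments, leaving the chunks themselves unchanged.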
Long-Form Content
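For long documents, the 512-1024 range listed under target_length trades granularity for fewer chunks. The arithmetic can be sketched as below; the settings dict and the helper are hypothetical, and the estimate assumes segments pack evenly, which real documents rarely do.

```python
import math

# Hypothetical settings dict; the key name mirrors the option described on this page.
LONG_FORM = {"target_length": 1024}


def estimate_chunk_count(total_tokens: int, target_length: int) -> int:
    """Rough chunk count if segments filled every chunk up to the target."""
    return math.ceil(total_tokens / target_length)
```

Under this estimate, a 20,000-token document yields about 79 chunks at a target of 256 and about 20 at 1024.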
Multilingual Documents
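A short sketch of why the tokenizer matters for multilingual text: whitespace ("Word") tokenization counts a sentence written without spaces as a single word, so a multilingual tokenizer such as "xlm-roberta-base" is the more natural choice. The settings dict and helper name are hypothetical.

```python
# Hypothetical settings dict; "xlm-roberta-base" is the multilingual option listed on this page.
MULTILINGUAL = {"tokenizer": "xlm-roberta-base"}


def word_length(text: str) -> int:
    """Length under whitespace ("Word") tokenization."""
    return len(text.split())


english = "This is a short test sentence"
chinese = "这是一个简短的测试句子"  # the same idea, written without spaces
```

Here word_length(english) is 6 while word_length(chinese) is 1, so a word-based target_length would badly underestimate the size of the Chinese text.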
Performance Considerations
Caching
Chunkr caches token counts for performance.
Parallel Processing
Token counting is parallelized.
Tokenizer Performance
| Tokenizer | Speed | Accuracy | Use Case |
|---|---|---|---|
| Word | ⚡⚡⚡ Fastest | Approximate | Quick estimates |
| Cl100kBase | ⚡⚡ Fast | OpenAI models | GPT embeddings |
| xlm-roberta | ⚡ Moderate | Multilingual | Multi-language |
| Custom HF | ⚡ Moderate | Model-specific | Match your model |
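The caching and parallel counting described in this section can be illustrated with Python's standard library. This is a sketch of the general technique, not Chunkr's implementation; the function names, cache size, and worker count are arbitrary choices, and a word count stands in for a real tokenizer.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=10_000)
def cached_count(text: str) -> int:
    """Word-count tokenizer with memoization, so repeated text is cheap to recount."""
    return len(text.split())


def count_all(texts: list[str], workers: int = 4) -> list[int]:
    """Count tokens for many segments concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(cached_count, texts))
```

In CPython, threads mainly help when the tokenizer releases the GIL, as native tokenizer libraries often do; for pure-Python counting the cache is the larger win.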
Edge Cases
Segments Longer Than Target
Segments are never split, so chunks may exceed target length.
Empty Chunks
Chunks with only headers/footers may be empty when ignore_headers_and_footers is true.
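Downstream code may want to skip such chunks before embedding. A minimal guard, assuming chunks are dicts with "segments" and "embed" keys (a hypothetical layout), could be:

```python
def non_empty(chunks: list[dict]) -> list[dict]:
    """Keep chunks that have at least one segment and non-blank embed text."""
    return [
        c
        for c in chunks
        if c.get("segments") and (c.get("embed") or "").strip()
    ]
```

Filtering this way avoids sending blank strings to an embedding model.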
Caption Without Picture
Unpaired elements are treated as regular segments.
Debugging Chunks
Inspect chunking results.
Next Steps
- Segmentation: Understand how segments are created
- Pipelines: See chunking in the full pipeline
- API Reference: Complete API documentation
- Overview: Back to concepts overview