Chunkr provides flexible chunking strategies to combine document segments into optimal chunks for retrieval-augmented generation (RAG) and embedding systems.
Understanding Chunks vs Segments
Segments are individual layout elements (paragraphs, tables, images) detected during analysis. Chunks are groups of segments combined for embedding.
- Segment: A single structural element (e.g., one paragraph, one table)
- Chunk: One or more segments grouped together based on your target_length
Chunk Processing Configuration
Configure chunking behavior through the chunk_processing parameter.
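A minimal sketch of the configuration, shown as a plain request payload. Only target_length is named explicitly on this page; the tokenizer and ignore_headers_and_footers key names are assumptions based on the section titles below and may differ in the real API:

```python
# Hypothetical chunk_processing payload (key names other than
# `target_length` are assumed from this page's section titles).
chunk_processing = {
    "target_length": 512,                # approximate max tokens per chunk (default)
    "tokenizer": "Word",                 # default length measure
    "ignore_headers_and_footers": True,  # default: exclude headers/footers from chunks
}

config = {"chunk_processing": chunk_processing}
```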
Target Length
Controls the approximate size of each chunk:
- 512 tokens (Default)
- 1024 tokens
- 256 tokens
- 0 (Single Segment)
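These options drive a simple greedy packing rule: whole segments are appended to a chunk until adding the next one would exceed target_length. A rough sketch of that behavior (illustration only, not Chunkr's actual implementation):

```python
def pack_segments(segment_token_counts, target_length=512):
    """Greedily group whole segments into chunks of at most
    `target_length` tokens. Segments are never split; a segment
    larger than the target becomes its own chunk.
    Illustration only -- not Chunkr's actual code."""
    chunks, current, current_len = [], [], 0
    for n in segment_token_counts:
        # Flush the current chunk if this segment would overflow it.
        if current and current_len + n > target_length:
            chunks.append(current)
            current, current_len = [], 0
        current.append(n)
        current_len += n
    if current:
        chunks.append(current)
    return chunks

# Three 200-token segments with target_length=512: the first two fit
# together, the third starts a new chunk.
print(pack_segments([200, 200, 200], 512))  # [[200, 200], [200]]
```

With target_length=0, the overflow check fires for every segment after the first, so each segment becomes its own chunk, matching the "Single Segment" option above.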
Chunkr never breaks segments apart - they remain intact within chunks. The target_length is the maximum size, and Chunkr will fit as many complete segments as possible without exceeding it.

Tokenizer Selection
Choose how text length is measured:
- Word (Default)
- Cl100kBase
- XlmRobertaBase
- BertBaseUncased
- Custom HuggingFace Tokenizer
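For example, a built-in tokenizer versus a custom HuggingFace tokenizer (the "tokenizer" key name and the HuggingFace repo id are illustrative assumptions):

```python
# Built-in tokenizer choice vs. a custom HuggingFace tokenizer id.
# The "tokenizer" key name and payload shape are assumptions.
openai_style = {"target_length": 512, "tokenizer": "Cl100kBase"}

# Any HuggingFace tokenizer repo id (hypothetical example id).
custom_hf = {"target_length": 512, "tokenizer": "intfloat/e5-base-v2"}
```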
Ignore Headers and Footers
Enabled by default (true). When enabled, page headers and footers are excluded from chunking but remain available in the output if needed.
Segment Processing
Control how individual segments are processed and what content is included in chunks.

Content Format Selection
Each segment type can be configured with a preferred output format.
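A sketch of a per-segment-type configuration, keyed by the segment types listed in the table below. The "format" and "strategy" field names are assumptions based on this page's Format Options and Generation Strategy lists:

```python
# Hypothetical segment_processing payload keyed by segment type.
# Field names "format"/"strategy" are assumed from this page.
segment_processing = {
    "Table": {"format": "Html", "strategy": "LLM"},      # matches the documented defaults
    "Text": {"format": "Markdown", "strategy": "Auto"},
}

config = {"segment_processing": segment_processing}
```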
Format Options
- Markdown
- Html
Generation Strategy
- Auto
- LLM
Embed Sources
Control which content is included in the chunk's embed field:
- Content: The generated content (HTML or Markdown based on format)
- LLM: Custom LLM-generated output (when configured)
- HTML: ⚠️ Deprecated - use Content with format: Html
- Markdown: ⚠️ Deprecated - use Content with format: Markdown
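A sketch of a per-segment embed configuration; the "embed_sources" key name is an assumption about the payload shape:

```python
# Hypothetical per-segment configuration with embed sources.
# The "embed_sources" key name is assumed; array order controls
# the sequence of content in the embed field.
table_segment = {
    "format": "Html",
    "embed_sources": ["Content", "LLM"],  # content first, then LLM output
}
```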
The order of sources in the array determines the sequence in the embed field. For example, ["Content", "LLM"] means content appears first, followed by LLM output.

Complete Example
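A full configuration sketch combining the options above into one payload. Field names follow this page's terminology; the exact payload or SDK shape may differ, so treat this as an assumption-laden outline rather than a verbatim API call:

```python
# Complete configuration sketch (key names assumed from this page).
config = {
    "chunk_processing": {
        "target_length": 1024,
        "tokenizer": "Cl100kBase",
        "ignore_headers_and_footers": True,
    },
    "segment_processing": {
        "Table": {
            "format": "Html",
            "strategy": "LLM",
            "embed_sources": ["Content", "LLM"],
        },
        "Picture": {"format": "Markdown", "strategy": "LLM"},
        "Text": {"format": "Markdown", "strategy": "Auto"},
    },
}
```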
Segment Types
All available segment types you can configure:

| Segment Type | Default Format | Default Strategy | Description |
|---|---|---|---|
| Title | Markdown | Auto | Document titles |
| SectionHeader | Markdown | Auto | Section headings |
| Text | Markdown | Auto | Body paragraphs |
| ListItem | Markdown | Auto | Bullet/numbered lists |
| Table | Html | LLM | Tables and grids |
| Picture | Markdown | LLM | Images and figures |
| Caption | Markdown | Auto | Image/table captions |
| Formula | Markdown | LLM | Mathematical formulas |
| Footnote | Markdown | Auto | Footnotes |
| PageHeader | Markdown | Auto | Page headers |
| PageFooter | Markdown | Auto | Page footers |
| Page | Markdown | LLM | Full page (when using Page strategy) |
Chunking Strategies by Use Case
- RAG for Q&A Systems
- Long-Form Document Analysis
- Fine-Grained Semantic Search
- Segment-Level Processing
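As a starting point, these use cases map naturally onto the documented target_length options. The values below are suggestions inferred from the options above, not prescribed settings; tune against your own documents:

```python
# Suggested starting points per use case (values from this page's
# target-length options; the mapping itself is a suggestion).
use_case_settings = {
    "rag_qa": {"target_length": 512},               # default; balanced precision/context
    "long_form_analysis": {"target_length": 1024},  # larger, context-rich chunks
    "fine_grained_search": {"target_length": 256},  # smaller, more precise chunks
    "segment_level": {"target_length": 0},          # each segment is its own chunk
}
```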
Best Practices
- Match tokenizer to your embedding model - Use Cl100kBase for OpenAI models, the corresponding tokenizer for others
- Start with defaults - The default 512 tokens works well for most cases
- Consider your retrieval strategy - Smaller chunks for precise retrieval, larger for context-rich answers
- Use segment types strategically - Configure different formats for different content (HTML for tables, Markdown for text)
- Test with your data - Optimal settings vary by document type and use case
Next Steps
- Learn about VLM processing for enhanced content generation
- See processing documents for core API usage
- Review the migration guide for recent API changes