Overview
Segment processing controls the post-processing, formatting, and generation of content for each segment type detected in your documents. You can configure output formats (HTML or Markdown), generation strategies (Auto or LLM), image cropping, and custom LLM prompts.
Segment Types
Chunkr detects and processes these segment types:
- Title - Document titles
- SectionHeader - Section and subsection headers
- Text - Regular paragraph text
- ListItem - List items (bulleted or numbered)
- Table - Tables and tabular data
- Picture - Images, charts, diagrams
- Caption - Image and table captions
- Formula - Mathematical formulas and equations
- Footnote - Footnotes and references
- PageHeader - Page headers
- PageFooter - Page footers
- Page - Full page content (when using the Page segmentation strategy)
Configuration Parameters
Each segment type can be configured with these parameters:

Output format for the segment (the format setting):
- Html - HTML formatted content
- Markdown - Markdown formatted content
Default: Html.

Content generation strategy:
- Auto - Use heuristics and rules (fast, no LLM cost)
- LLM - Use Chunkr fine-tuned models (higher quality, requires LLM)
For Table, Formula, and Page segments, the default is LLM.

Image cropping behavior:
- Auto - Crop only when needed for post-processing
- All - Always crop to the segment bounding box
- None - Never crop
The cropped image is available in the image field.

Custom prompt for LLM-based generation. Use this to provide specific instructions for how the segment should be processed.

Content sources to include in the chunk’s embed field (embed_sources):
- Content - The primary content (uses the format setting)
- LLM - LLM-generated content
- HTML - (deprecated) HTML content
- Markdown - (deprecated) Markdown content

Whether to provide the full page image as context for LLM generation (extended_context). Useful for segments that need broader context.
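The parameters above can be sketched as a small validated container. This is only an illustration: `SegmentSettings` is a local helper written for this example, not an SDK import. The names `format`, `embed_sources`, and `extended_context` appear on this page, while `strategy`, `crop_image`, and `llm` are assumed key names. Unset fields are left out so that the service's own defaults apply.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional

FORMATS = {"Html", "Markdown"}
STRATEGIES = {"Auto", "LLM"}
CROP_MODES = {"Auto", "All", "None"}
EMBED_SOURCES = {"Content", "LLM", "HTML", "Markdown"}  # HTML and Markdown are deprecated


def _check(name, value, allowed):
    """Raise if a value was provided and is not one of the allowed options."""
    if value is not None and value not in allowed:
        raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")


@dataclass
class SegmentSettings:
    """Illustrative settings for one segment type (local helper, not an SDK class)."""
    format: Optional[str] = None
    strategy: Optional[str] = None            # assumed key name
    crop_image: Optional[str] = None          # assumed key name
    llm: Optional[str] = None                 # assumed key name for the custom prompt
    embed_sources: Optional[List[str]] = None
    extended_context: Optional[bool] = None

    def __post_init__(self):
        _check("format", self.format, FORMATS)
        _check("strategy", self.strategy, STRATEGIES)
        _check("crop_image", self.crop_image, CROP_MODES)
        for source in self.embed_sources or []:
            _check("embed_sources", source, EMBED_SOURCES)

    def to_dict(self):
        """Return only the fields that were explicitly set."""
        return {key: value for key, value in asdict(self).items() if value is not None}


table = SegmentSettings(format="Html", strategy="LLM", embed_sources=["Content"])
print(table.to_dict())
```

Validating values locally catches typos such as `"HTML"` for the format before a request is sent.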
Basic Examples
Default Configuration
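A minimal sketch of relying on the defaults: if no per-segment overrides are sent, the service applies its own settings (for example, the LLM strategy for Table, Formula, and Page segments, as noted above). The payload keys `file` and `segment_processing`, the helper name, and the example URL are assumptions made for illustration.

```python
import json


def build_task_payload(file_url, segment_processing=None):
    """Build a request body; omit overrides so the service applies its defaults."""
    payload = {"file": file_url}  # "file" is an assumed key name
    if segment_processing:
        # "segment_processing" is an assumed key name for per-segment overrides
        payload["segment_processing"] = segment_processing
    return payload


# Default configuration: no overrides are sent at all.
default_payload = build_task_payload("https://example.com/report.pdf")
print(json.dumps(default_payload, indent=2))
```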
HTML Output
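As a rough sketch, HTML output can be requested per segment type by setting `format` to `Html`. The layout below (one entry per segment type, keyed by the type name) and the helper name are assumptions for illustration.

```python
import json


def html_output(segment_types):
    """Return overrides that request HTML content for the given segment types."""
    return {segment_type: {"format": "Html"} for segment_type in segment_types}


segment_processing = html_output(["Title", "SectionHeader", "Text", "ListItem"])
print(json.dumps(segment_processing, indent=2))
```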
LLM-Based Generation
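A short sketch of switching selected segment types to LLM-based generation while leaving plain text on the faster Auto strategy. The key name `strategy` and the per-type layout are assumptions; `Auto` and `LLM` are the values described above.

```python
import json

# "strategy" is an assumed key name; "LLM" and "Auto" are the documented values.
segment_processing = {
    "Picture": {"strategy": "LLM"},
    "Formula": {"strategy": "LLM"},
    "Text": {"strategy": "Auto"},
}

# List the segment types that will call an LLM, which is useful for cost estimates.
llm_segments = sorted(
    name for name, cfg in segment_processing.items() if cfg["strategy"] == "LLM"
)
print(json.dumps(segment_processing, indent=2))
print("Segments that incur LLM cost:", llm_segments)
```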
Segment-Specific Configuration
Tables
Tables default to HTML format with LLM generation for best quality.
Options:
- Html format preserves table structure better
- Markdown format for simpler tables
- Auto strategy for basic tables (faster, no LLM cost)
- LLM strategy for complex tables with merged cells, etc.
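The table options above could be turned into a simple selection rule. This is a sketch under assumptions: the helper name and its `has_merged_cells` flag are hypothetical, and `strategy` is an assumed key name.

```python
def table_settings(has_merged_cells):
    """Pick table settings based on how complex the table is expected to be."""
    if has_merged_cells:
        # Complex tables: HTML keeps structure, LLM handles merged cells.
        return {"format": "Html", "strategy": "LLM"}
    # Basic tables: Markdown is simpler, Auto avoids LLM cost.
    return {"format": "Markdown", "strategy": "Auto"}


segment_processing = {"Table": table_settings(has_merged_cells=True)}
print(segment_processing)
```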
Advanced Features
Custom LLM Prompts
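Provide specific instructions for how segments should be processed.
Embedding Configuration
Control what content is included in chunk embeddings. Sources can be used in three ways:
- Content Only - Embeds the primary content, as determined by the format setting
- LLM Only - Embeds the LLM-generated content
- Combined Sources - Embeds more than one source together
Extended Context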
Provide full page context for better LLM understanding. This is useful for:
- Tables that reference surrounding content
- Formulas with context-dependent notation
- Images that need page layout understanding
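A minimal sketch of enabling extended context for the segment types listed above. Because the full page image is context for LLM generation, the helper also sets the LLM strategy. The helper name and the `strategy` key are assumptions; `extended_context` is the name used on this page.

```python
def enable_extended_context(segment_processing, segment_types):
    """Return a copy of the config with full-page context enabled for some types."""
    updated = {name: dict(cfg) for name, cfg in segment_processing.items()}
    for name in segment_types:
        cfg = updated.setdefault(name, {})
        cfg["strategy"] = "LLM"          # extended context feeds LLM generation
        cfg["extended_context"] = True
    return updated


base = {"Table": {"format": "Html"}}
with_context = enable_extended_context(base, ["Table", "Formula"])
print(with_context)
```

The original dictionary is left untouched, so the same base configuration can be reused for documents that do not need the extra context.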
Complete Configuration Example
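The following sketch combines the options from this page into one request body: output format, generation strategy, image cropping, custom prompts, embedding sources, and extended context. The key names `strategy`, `crop_image`, `llm`, `file`, and `segment_processing`, the prompt texts, and the example URL are all assumptions for illustration.

```python
import json

segment_processing = {
    "Table": {
        "format": "Html",
        "strategy": "LLM",
        "llm": "Preserve merged cells and keep header rows intact.",  # custom prompt
        "embed_sources": ["Content", "LLM"],                          # combined sources
        "extended_context": True,
    },
    "Picture": {
        "strategy": "LLM",
        "crop_image": "All",                                          # always crop
        "llm": "Describe the chart, including axis labels and trends.",
        "embed_sources": ["LLM"],                                     # LLM only
    },
    "Formula": {"format": "Markdown", "strategy": "LLM"},
    "Text": {"format": "Markdown", "strategy": "Auto"},
}

payload = {
    "file": "https://example.com/report.pdf",  # placeholder document URL
    "segment_processing": segment_processing,
}

body = json.dumps(payload, indent=2)
print(body)
```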
Output Fields
Each processed segment includes a content field, which contains the formatted output based on your format setting. The deprecated html and markdown fields are still available for backwards compatibility.
Best Practices
- Use Auto strategy for simple segments
  - Faster processing
  - No LLM costs
  - Good for text, headers, lists
- Use LLM strategy for complex segments
  - Tables with complex structure
  - Mathematical formulas
  - Images requiring description
- Match format to your use case
  - Html for tables and structured content
  - Markdown for general text and readability
- Configure embed_sources carefully
  - Include only necessary sources
  - Reduces token usage for embeddings
  - Improves retrieval relevance
- Use extended_context sparingly
  - Higher LLM costs
  - Longer processing time
  - Only when context is critical
- Test custom prompts
  - Start with default prompts
  - Iterate based on output quality
  - Be specific in instructions
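The first three practices can be sketched as a small rule that picks a strategy and format per segment type. The helper name and the `strategy` key are assumptions, and the grouping of types simply mirrors the guidance above rather than any documented behavior.

```python
LLM_TYPES = {"Table", "Formula", "Picture"}  # complex segments benefit from the LLM strategy
HTML_TYPES = {"Table"}                       # structured content reads better as HTML


def recommended_settings(segment_type):
    """Suggest a strategy and format for one segment type."""
    return {
        "strategy": "LLM" if segment_type in LLM_TYPES else "Auto",
        "format": "Html" if segment_type in HTML_TYPES else "Markdown",
    }


segment_types = ["Title", "Text", "ListItem", "Table", "Formula", "Picture"]
segment_processing = {name: recommended_settings(name) for name in segment_types}
for name, settings in segment_processing.items():
    print(f"{name}: {settings}")
```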