Chunkr is a document intelligence API that transforms unstructured documents into structured, searchable data. It uses a multi-stage pipeline to extract, analyze, and organize document content.
Core Concepts
Chunkr processes documents through several key stages:
Segmentation
Detect and classify layout elements like tables, images, and text blocks
OCR
Extract text from images and documents with optical character recognition
Pipelines
Orchestrate processing steps from document upload to final output
Chunking
Combine segments into semantically meaningful chunks for embedding
Document Processing Flow
The typical document processing flow in Chunkr follows these steps:
1. Document Upload
Documents can be uploaded in various formats, including PDF, images (JPEG, PNG), and office documents. All non-PDF formats are automatically converted to PDF for processing.
2. Page Conversion
PDF pages are converted to high-quality images for OCR and segmentation. The resolution can be controlled with the high_resolution parameter:
- Standard resolution: Faster processing
- High resolution: Better accuracy for complex layouts (~7 seconds latency per page)
3. Text Extraction
Text is extracted using OCR (Optical Character Recognition). Chunkr supports multiple OCR strategies:
- All: Process all pages with OCR (~0.5 seconds per page)
- Auto: Use existing text layer when available, apply OCR only when needed
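The difference between the two strategies can be illustrated with a minimal sketch. It is not Chunkr's implementation: the Page type, the needs_ocr helper, and the rule "a page with no non-whitespace embedded text needs OCR" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class OcrStrategy(Enum):
    ALL = "All"    # run OCR on every page
    AUTO = "Auto"  # reuse an existing text layer when there is one


@dataclass
class Page:
    """Hypothetical page record, not a Chunkr type."""
    number: int
    text_layer: str  # text embedded in the PDF page; empty for scanned pages


def needs_ocr(page: Page, strategy: OcrStrategy) -> bool:
    """Return True when the page should be sent to the OCR engine."""
    if strategy is OcrStrategy.ALL:
        return True
    # Auto (assumed rule): only pages without usable embedded text fall back to OCR.
    return not page.text_layer.strip()


pages = [Page(1, "Quarterly report"), Page(2, ""), Page(3, "   ")]
auto_pages = [p.number for p in pages if needs_ocr(p, OcrStrategy.AUTO)]
all_pages = [p.number for p in pages if needs_ocr(p, OcrStrategy.ALL)]
```

With Auto, only the pages lacking a text layer pay the per-page OCR cost, which is why it tends to be the cheaper choice for digitally generated PDFs.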
4. Layout Analysis
The segmentation engine detects and classifies layout elements. Each detected element is described by:
- Bounding box: Position and dimensions on the page
- OCR results: Extracted text with confidence scores
- Type classification: Element type (table, image, text, etc.)
- Content: Processed HTML, Markdown, or LLM-generated output
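The four attributes listed above can be pictured as a small record. The sketch below is illustrative only: the class names, field names, and type strings are hypothetical and do not reproduce Chunkr's actual schema. It also assumes that "normalized coordinates" means fractions of the page width and height with the origin at the top-left, which the page does not spell out.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    # Assumed convention: fractions of the page size, origin at the top-left.
    left: float
    top: float
    width: float
    height: float

    def to_pixels(self, page_width: int, page_height: int) -> tuple[int, int, int, int]:
        """Scale to pixel (left, top, right, bottom) for a rendered page image."""
        return (
            round(self.left * page_width),
            round(self.top * page_height),
            round((self.left + self.width) * page_width),
            round((self.top + self.height) * page_height),
        )


@dataclass
class Segment:
    """Hypothetical segment record mirroring the attributes listed above."""
    segment_type: str      # type classification, e.g. "Table", "Picture", "Text"
    bbox: BoundingBox      # position and dimensions on the page
    ocr_text: str          # extracted text
    ocr_confidence: float  # OCR confidence score
    content: str           # processed HTML, Markdown, or LLM-generated output


seg = Segment("Table", BoundingBox(0.1, 0.25, 0.8, 0.5), "Q1 Q2", 0.97, "<table>...</table>")
pixel_box = seg.bbox.to_pixels(page_width=1000, page_height=2000)
```

Storing coordinates relative to the page size keeps a box valid regardless of the resolution at which the page image was rendered; conversion to pixels happens only when a crop is needed.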
5. Segment Processing
Segments are post-processed to generate structured content:
- Auto generation: Heuristic-based HTML/Markdown conversion
- LLM generation: Fine-tuned models for tables, formulas, and complex elements
- Image cropping: Extract segment images for visual elements
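One way to read the split between auto and LLM generation is as a per-segment dispatch on element type. The following sketch assumes hypothetical type names ("Table", "Formula", "Title", "ListItem") and a deliberately tiny rule set; it stands in for the idea of heuristic conversion and is not the converter Chunkr ships.

```python
# Assumed set of element types that benefit from model-based generation.
LLM_TYPES = {"Table", "Formula"}


def pick_generation(segment_type: str) -> str:
    """Choose 'llm' for complex elements and 'auto' (heuristics) otherwise."""
    return "llm" if segment_type in LLM_TYPES else "auto"


def heuristic_markdown(segment_type: str, text: str) -> str:
    """Very small rule-based Markdown conversion for simple elements."""
    if segment_type == "Title":
        return f"# {text}"
    if segment_type == "ListItem":
        return f"- {text}"
    return text


choices = {t: pick_generation(t) for t in ["Text", "Table", "Formula", "Title"]}
title_md = heuristic_markdown("Title", "Introduction")
```

Routing only complex elements to a model keeps the cheap heuristic path for the bulk of a document's text.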
6. Chunking
Segments are combined into chunks based on semantic boundaries and target length. The chunking algorithm:
- Respects document hierarchy (titles, sections)
- Keeps related elements together (captions with images)
- Honors target token/word count limits
- Optionally ignores headers and footers
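The four rules above can be combined into a simple greedy pass over the segments. This is a sketch under stated assumptions, not Chunkr's algorithm: the Seg type, the kind names, the word-based length measure, and the rule that a caption always follows its picture even when the budget is exceeded are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Seg:
    """Hypothetical segment: a kind label plus its text."""
    kind: str  # "Title", "Text", "Picture", "Caption", "PageHeader", "PageFooter"
    text: str


def chunk_segments(segments, target_words, ignore_headers_and_footers=True):
    """Greedily group segments into chunks of roughly target_words words."""
    chunks, current, current_words = [], [], 0
    for seg in segments:
        if ignore_headers_and_footers and seg.kind in {"PageHeader", "PageFooter"}:
            continue
        words = len(seg.text.split())
        # Keep a caption with the picture directly before it, even over budget.
        caption_of_prev = (
            seg.kind == "Caption" and bool(current) and current[-1].kind == "Picture"
        )
        starts_section = seg.kind == "Title"  # respect document hierarchy
        over_budget = current_words + words > target_words
        if current and not caption_of_prev and (starts_section or over_budget):
            chunks.append(current)
            current, current_words = [], 0
        current.append(seg)
        current_words += words
    if current:
        chunks.append(current)
    return chunks


segments = [
    Seg("PageHeader", "ACME Corp"),
    Seg("Title", "Results"),
    Seg("Text", "Revenue grew in every region"),
    Seg("Picture", ""),
    Seg("Caption", "Figure 1 revenue by region"),
    Seg("Title", "Outlook"),
    Seg("Text", "Growth is expected to continue"),
]
chunks = chunk_segments(segments, target_words=8)
kinds = [[s.kind for s in c] for c in chunks]
```

In this example the caption pushes the first chunk past the 8-word target but still stays with its picture, and the second title opens a new chunk regardless of remaining budget.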
7. Output Generation
The final output includes:
Data Model
Segment
A segment represents a single layout element:
Chunk
A chunk contains one or more segments:
Bounding Box
All spatial information uses normalized coordinates:
Configuration
Chunkr’s behavior is controlled through the Configuration object:
Next Steps
Learn about Pipelines
Understand the processing pipeline and configuration
Explore Segmentation
Deep dive into layout analysis strategies
OCR Strategies
Learn about text extraction methods
Chunking Algorithm
Understand how segments are combined into chunks