The Configuration object controls how Chunkr processes your documents. It includes settings for OCR, segmentation, chunking, and post-processing.
Configuration Fields
ocr_strategy
OcrStrategy
Controls the Optical Character Recognition (OCR) strategy.
All - Processes all pages with OCR (Latency penalty: ~0.5 seconds per page)
Auto - Selectively applies OCR only to pages with missing or low-quality text. When a text layer is present, its bounding boxes are used
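The latency trade-off between the two strategies can be sketched numerically. The helper below is illustrative only; the ~0.5 s/page figure comes from the description of `All` above, and the 0.0 for `Auto` is a best-case lower bound, since `Auto` may still OCR some pages:

```python
def estimate_ocr_latency_s(page_count: int, strategy: str) -> float:
    """Rough OCR latency estimate (sketch, not an API call).

    `All` pays ~0.5 s per page; `Auto` is modeled as 0.0, i.e. the
    best case where every page already has a usable text layer.
    """
    if strategy == "All":
        return page_count * 0.5
    if strategy == "Auto":
        return 0.0
    raise ValueError(f"unknown OcrStrategy: {strategy}")

print(estimate_ocr_latency_s(40, "All"))  # a 40-page document adds ~20 s with All
```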
segmentation_strategy
SegmentationStrategy
required
Controls the segmentation strategy.
LayoutAnalysis - Analyzes pages for layout elements (e.g., Table, Picture, Formula, etc.) using bounding boxes. Provides fine-grained segmentation and better chunking
Page - Treats each page as a single segment. Faster, but skips layout-element detection and supports only simple chunking
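As a concrete sketch, the two strategies reduce to a one-field choice in the configuration payload. The JSON casing below mirrors the field names in this reference but is an assumption, not verified against the live API:

```python
def make_segmentation_config(fast: bool) -> dict:
    """Build a minimal configuration fragment (assumed payload shape).

    `Page` trades layout-aware chunking for speed; `LayoutAnalysis`
    detects layout elements such as Table, Picture, and Formula.
    """
    return {
        "segmentation_strategy": "Page" if fast else "LayoutAnalysis",
    }

print(make_segmentation_config(fast=False))
```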
chunk_processing
ChunkProcessing
Controls the chunking and post-processing of each chunk.
target_length
The target number of tokens in each chunk. If 0, each chunk will contain a single segment.
ignore_headers_and_footers
Whether to ignore headers and footers in the chunking process. This is recommended as headers and footers break reading order across pages.
tokenizer
TokenizerType
default: "Word"
The tokenizer to use for the chunking process.
Can be either a predefined tokenizer or any Hugging Face tokenizer ID.
Predefined tokenizers:
Word - Split text by word boundaries
Cl100kBase - For OpenAI models (GPT-3.5, GPT-4, text-embedding-ada-002)
xlm-roberta-base - For RoBERTa-based multilingual models
bert-base-uncased - BERT base uncased tokenizer
Custom tokenizers:
You can also specify any Hugging Face tokenizer by providing its model ID as a string (e.g., "facebook/bart-large", "Qwen/Qwen-tokenizer").
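Because the tokenizer field accepts either a predefined name or a raw Hugging Face ID, a client ends up treating it as a plain string. A hedged sketch of assembling the `chunk_processing` section (the predefined list is taken from above; the helper and payload shape are hypothetical):

```python
# Predefined tokenizer names listed in this reference.
PREDEFINED_TOKENIZERS = {"Word", "Cl100kBase", "xlm-roberta-base", "bert-base-uncased"}

def make_chunk_processing(target_length: int, tokenizer: str = "Word") -> dict:
    """Assemble a chunk_processing dict (assumed field names).

    Any tokenizer string not in the predefined set is passed through
    unchanged as a Hugging Face model ID.
    """
    kind = "predefined" if tokenizer in PREDEFINED_TOKENIZERS else "Hugging Face ID"
    print(f"using {kind} tokenizer: {tokenizer}")
    return {
        "target_length": target_length,
        "ignore_headers_and_footers": True,  # recommended, per the note above
        "tokenizer": tokenizer,
    }

make_chunk_processing(512, "Qwen/Qwen-tokenizer")
```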
segment_processing
SegmentProcessing
required
Controls the post-processing of each segment type. See Segment Processing for detailed configuration options.
high_resolution
boolean
default: true
required
Whether to use high-resolution images for cropping and post-processing. (Latency penalty: ~7 seconds per page)
error_handling
ErrorHandlingStrategy
required
Controls how errors are handled during processing.
Fail - Stops processing and fails the task when any error occurs
Continue - Attempts to continue processing despite non-critical errors (e.g., LLM refusals)
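The two strategies differ in what happens mid-run. The toy model below sketches the semantics only, not the actual implementation: `Fail` raises on the first error, while `Continue` records non-critical errors and keeps going:

```python
def process_pages(pages, strategy="Fail"):
    """Toy model of ErrorHandlingStrategy semantics (illustrative).

    A falsy page simulates a non-critical error, e.g. an LLM refusal.
    Returns (results, indices_of_failed_pages).
    """
    results, errors = [], []
    for i, page in enumerate(pages):
        if not page:
            if strategy == "Fail":
                raise RuntimeError(f"page {i} failed; task aborted")
            errors.append(i)  # Continue: record and move on
            continue
        results.append(page.upper())
    return results, errors

print(process_pages(["a", "", "c"], strategy="Continue"))  # (['A', 'C'], [1])
```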
llm_processing
LlmProcessing
Controls the LLM used for the task.
model_id
The ID of the model to use for the task. If not provided, the default model is used; see the documentation for available models.
fallback_strategy
FallbackStrategy
default: "Default"
The fallback strategy to use for LLMs in the task.
None - No fallback will be used
Default - Use the system default fallback model
Model(string) - Use a specific model as fallback
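The three `FallbackStrategy` variants can be modeled as a tagged value. The resolver below is a sketch under two assumptions: that `Model(string)` serializes as an object carrying the model ID, and that the system default resolves to whatever the deployment configures (`SYSTEM_DEFAULT` is a placeholder, not a real model ID):

```python
from typing import Optional

SYSTEM_DEFAULT = "system-default-model"  # placeholder, deployment-specific

def resolve_fallback(strategy) -> Optional[str]:
    """Map a FallbackStrategy value to a model ID, or None for no fallback.

    Assumed serialization: "None", "Default", or {"Model": "<model-id>"}.
    """
    if strategy == "None":
        return None
    if strategy == "Default":
        return SYSTEM_DEFAULT
    if isinstance(strategy, dict) and "Model" in strategy:
        return strategy["Model"]  # Model(string) variant
    raise ValueError(f"unknown FallbackStrategy: {strategy!r}")

print(resolve_fallback({"Model": "my-fallback-model"}))
```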
max_completion_tokens
The maximum number of tokens to generate.
temperature
The temperature to use for the LLM.
expires_in
The number of seconds until the task is deleted. Expired tasks can no longer be updated, polled, or accessed via the web interface.
The presigned URL of the input file.
Deprecated Fields
DEPRECATED: Use chunk_processing.target_length instead.
DEPRECATED: The extracted JSON schema from the document.
DEPRECATED: Model selection.