The Configuration object controls how Chunkr processes your documents. It includes settings for OCR, segmentation, chunking, and post-processing.
Configuration Fields
ocr_strategy
OcrStrategy
Controls the Optical Character Recognition (OCR) strategy.
All - Processes all pages with OCR (Latency penalty: ~0.5 seconds per page)
Auto - Selectively applies OCR only to pages with missing or low-quality text. When a text layer is present, its bounding boxes are used
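The latency trade-off between the two strategies can be sketched numerically. The helper below is illustrative only; the ~0.5 s/page figure comes from the description of `All` above, and the 0.0 for `Auto` is a best-case lower bound, since `Auto` may still OCR some pages:

```python
def estimate_ocr_latency_s(page_count: int, strategy: str) -> float:
    """Rough OCR latency estimate (sketch, not an API call).

    `All` pays ~0.5 s per page; `Auto` is modeled as 0.0, i.e. the
    best case where every page already has a usable text layer.
    """
    if strategy == "All":
        return page_count * 0.5
    if strategy == "Auto":
        return 0.0
    raise ValueError(f"unknown OcrStrategy: {strategy}")

print(estimate_ocr_latency_s(40, "All"))  # a 40-page document adds ~20 s with All
```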
segmentation_strategy
SegmentationStrategy
required
Controls the segmentation strategy.
LayoutAnalysis - Analyzes pages for layout elements (e.g., Table, Picture, Formula, etc.) using bounding boxes. Provides fine-grained segmentation and better chunking
Page - Treats each page as a single segment. Faster, but skips layout-element detection and supports only simple chunking
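As a concrete sketch, the two strategies reduce to a one-field choice in the configuration payload. The JSON casing below mirrors the field names in this reference but is an assumption, not verified against the live API:

```python
def make_segmentation_config(fast: bool) -> dict:
    """Build a minimal configuration fragment (assumed payload shape).

    `Page` trades layout-aware chunking for speed; `LayoutAnalysis`
    detects layout elements such as Table, Picture, and Formula.
    """
    return {
        "segmentation_strategy": "Page" if fast else "LayoutAnalysis",
    }

print(make_segmentation_config(fast=False))
```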
chunk_processing
ChunkProcessing
Controls the chunking and post-processing of each chunk.
target_length
The target number of tokens in each chunk. If 0, each chunk will contain a single segment.
ignore_headers_and_footers
Whether to ignore headers and footers in the chunking process. This is recommended as headers and footers break reading order across pages.
tokenizer
TokenizerType
default: "Word"
The tokenizer to use for the chunking process.
Can be either a predefined tokenizer or any Hugging Face tokenizer ID.
Predefined tokenizers:
Word - Split text by word boundaries
Cl100kBase - For OpenAI models (GPT-3.5, GPT-4, text-embedding-ada-002)
xlm-roberta-base - For RoBERTa-based multilingual models
bert-base-uncased - BERT base uncased tokenizer
Custom tokenizers:
You can also specify any Hugging Face tokenizer by providing its model ID as a string (e.g., "facebook/bart-large", "Qwen/Qwen-tokenizer").
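Because the tokenizer field accepts either a predefined name or a raw Hugging Face ID, a client ends up treating it as a plain string. A hedged sketch of assembling the `chunk_processing` section (the predefined list is taken from above; the helper and payload shape are hypothetical):

```python
# Predefined tokenizer names listed in this reference.
PREDEFINED_TOKENIZERS = {"Word", "Cl100kBase", "xlm-roberta-base", "bert-base-uncased"}

def make_chunk_processing(target_length: int, tokenizer: str = "Word") -> dict:
    """Assemble a chunk_processing dict (assumed field names).

    Any tokenizer string not in the predefined set is passed through
    unchanged as a Hugging Face model ID.
    """
    kind = "predefined" if tokenizer in PREDEFINED_TOKENIZERS else "Hugging Face ID"
    print(f"using {kind} tokenizer: {tokenizer}")
    return {
        "target_length": target_length,
        "ignore_headers_and_footers": True,  # recommended, per the note above
        "tokenizer": tokenizer,
    }

make_chunk_processing(512, "Qwen/Qwen-tokenizer")
```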
segment_processing
SegmentProcessing
required
Controls the post-processing of each segment type. See Segment Processing for detailed configuration options.
high_resolution
boolean
default: true
required
Whether to use high-resolution images for cropping and post-processing. (Latency penalty: ~7 seconds per page)
error_handling
ErrorHandlingStrategy
required
Controls how errors are handled during processing.
Fail - Stops processing and fails the task when any error occurs
Continue - Attempts to continue processing despite non-critical errors (e.g., LLM refusals)
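The two strategies differ in what happens mid-run. The toy model below sketches the semantics only, not the actual implementation: `Fail` raises on the first error, while `Continue` records non-critical errors and keeps going:

```python
def process_pages(pages, strategy="Fail"):
    """Toy model of ErrorHandlingStrategy semantics (illustrative).

    A falsy page simulates a non-critical error, e.g. an LLM refusal.
    Returns (results, indices_of_failed_pages).
    """
    results, errors = [], []
    for i, page in enumerate(pages):
        if not page:
            if strategy == "Fail":
                raise RuntimeError(f"page {i} failed; task aborted")
            errors.append(i)  # Continue: record and move on
            continue
        results.append(page.upper())
    return results, errors

print(process_pages(["a", "", "c"], strategy="Continue"))  # (['A', 'C'], [1])
```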
llm_processing
LlmProcessing
Controls the LLM used for the task.
model_id
The ID of the model to use for the task. If not provided, the default model is used; see the documentation for available models.
fallback_strategy
FallbackStrategy
default: "Default"
The fallback strategy to use for LLMs in the task.
None - No fallback will be used
Default - Use the system default fallback model
Model(string) - Use a specific model as fallback
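The three `FallbackStrategy` variants can be modeled as a tagged value. The resolver below is a sketch under two assumptions: that `Model(string)` serializes as an object carrying the model ID, and that the system default resolves to whatever the deployment configures (`SYSTEM_DEFAULT` is a placeholder, not a real model ID):

```python
from typing import Optional

SYSTEM_DEFAULT = "system-default-model"  # placeholder, deployment-specific

def resolve_fallback(strategy) -> Optional[str]:
    """Map a FallbackStrategy value to a model ID, or None for no fallback.

    Assumed serialization: "None", "Default", or {"Model": "<model-id>"}.
    """
    if strategy == "None":
        return None
    if strategy == "Default":
        return SYSTEM_DEFAULT
    if isinstance(strategy, dict) and "Model" in strategy:
        return strategy["Model"]  # Model(string) variant
    raise ValueError(f"unknown FallbackStrategy: {strategy!r}")

print(resolve_fallback({"Model": "my-fallback-model"}))
```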
max_completion_tokens
The maximum number of tokens to generate.
temperature
The temperature to use for the LLM.
expires_in
The number of seconds until the task is deleted. Expired tasks can no longer be updated, polled, or accessed via the web interface.
The presigned URL of the input file.
Deprecated Fields
DEPRECATED: Use chunk_processing.target_length instead.
DEPRECATED: The extracted JSON schema from the document.
DEPRECATED: Model selection.