
Overview

Task settings control how documents are processed, chunked, and managed throughout their lifecycle. These settings include chunk processing, expiration, resolution, and error handling strategies.

Configuration Structure

Task configuration is provided when creating a new processing task:
{
  "file": "base64_encoded_file_or_url",
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Word",
    "ignore_headers_and_footers": true
  },
  "high_resolution": true,
  "expires_in": 3600,
  "error_handling": "Continue",
  "llm_processing": {
    "model_id": "gpt-4o",
    "temperature": 0.0
  }
}

Chunk Processing

Controls how document segments are grouped into chunks for retrieval and embedding.

Parameters

chunk_processing.target_length
integer
default:512
Target number of tokens per chunk. If set to 0, each chunk contains exactly one segment.
How it works:
  • Segments are combined until reaching the target length
  • Individual segments are never split (they remain intact)
  • Final chunks may exceed the target slightly to include complete segments
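The grouping rule above can be sketched as a simple greedy loop. This is an illustration of the documented behavior, not Chunkr's actual implementation; the whitespace token counter stands in for the configured tokenizer.

```python
# Greedy grouping: append whole segments until the running token count
# reaches target_length; a chunk may slightly exceed the target because
# the segment that crosses it is kept intact. target_length == 0 yields
# one segment per chunk.
def group_segments(segments, target_length, count_tokens=lambda s: len(s.split())):
    if target_length == 0:
        return [[seg] for seg in segments]
    chunks, current, current_len = [], [], 0
    for seg in segments:
        current.append(seg)
        current_len += count_tokens(seg)
        if current_len >= target_length:
            chunks.append(current)
            current, current_len = [], 0
    if current:
        chunks.append(current)
    return chunks
```

Note that the second chunk boundary falls only after a complete segment, which is why real chunks can run a few tokens over the target.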
chunk_processing.tokenizer
string | enum
default:"Word"
Tokenizer for measuring chunk length. Supports:
Predefined tokenizers:
  • Word - Simple whitespace-based tokenization
  • Cl100kBase - OpenAI tokenizer (GPT-3.5, GPT-4, text-embedding-ada-002)
  • xlm-roberta-base - Multilingual RoBERTa tokenizer
  • bert-base-uncased - BERT base uncased tokenizer
Custom HuggingFace tokenizers:
  • Any valid HuggingFace model ID (e.g., "Qwen/Qwen-tokenizer", "facebook/bart-large")
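For intuition, the Word tokenizer simply counts whitespace-delimited tokens, as sketched below. The subword tokenizers (Cl100kBase, xlm-roberta-base, bert-base-uncased) generally report more tokens for the same text, so the same target_length produces shorter chunks with them.

```python
# Illustrative only: how the "Word" tokenizer measures chunk length.
# Subword tokenizers would require the corresponding tokenizer library
# and typically split words into multiple tokens.
def word_token_count(text: str) -> int:
    return len(text.split())
```

Matching the tokenizer to your downstream embedding model keeps chunk sizes aligned with that model's context accounting.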
chunk_processing.ignore_headers_and_footers
boolean
default:true
Whether to exclude page headers and footers from chunks.
Recommended: Keep this true, as headers and footers often break reading order across pages.

Examples

{
  "chunk_processing": {
    "target_length": 0
  }
}
Each chunk contains exactly one segment. Useful for:
  • Fine-grained retrieval
  • Segment-level processing
  • Maximum precision in search

Processing Strategies

OCR Strategy

ocr_strategy
enum
default:"All"
Controls Optical Character Recognition behavior:
  • All - Process all pages with OCR (Latency: ~0.5s per page)
  • Auto - Selectively apply OCR only to pages with missing or low-quality text
{
  "ocr_strategy": "All"
}
Auto mode uses existing text layers when available, falling back to OCR only when needed.
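A hypothetical sketch of the Auto decision: reuse the page's embedded text layer when it looks usable, otherwise fall back to OCR. The actual quality heuristics are internal to Chunkr; the character threshold here is an assumption for illustration.

```python
# Decide whether a page needs OCR under the "Auto" strategy (sketch).
# page_text is the page's embedded text layer, or None if absent.
def needs_ocr(page_text, min_chars=20):
    if page_text is None:
        return True                    # no text layer: OCR required
    stripped = page_text.strip()
    return len(stripped) < min_chars   # near-empty layer: likely a scanned page
```

Because Auto skips OCR on pages with good text layers, it can cut per-page latency substantially on born-digital PDFs.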

Segmentation Strategy

segmentation_strategy
enum
default:"LayoutAnalysis"
Controls document segmentation:
  • LayoutAnalysis - Detect layout elements (tables, pictures, formulas) with bounding boxes. Provides fine-grained segmentation.
  • Page - Treat each page as a single segment. Faster but without layout element detection.
{
  "segmentation_strategy": "LayoutAnalysis"
}

Error Handling

error_handling
enum
default:"Fail"
Controls how errors are handled during processing:
  • Fail - Stop processing and fail the task on any error
  • Continue - Attempt to continue despite non-critical errors (e.g., LLM refusals, rate limits)
{
  "error_handling": "Fail"
}
Use Fail when:
  • Complete accuracy is critical
  • You want to manually review and fix errors
  • Processing can be safely retried
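The difference between the two modes can be sketched as a per-item loop, assuming a pipeline that processes segments independently. This is an illustration of the semantics, not Chunkr's internals.

```python
# "Fail": the first error aborts the whole task.
# "Continue": non-critical errors are recorded and processing moves on,
# so one bad segment does not sink the batch.
def run_pipeline(items, process, error_handling="Fail"):
    results, errors = [], []
    for item in items:
        try:
            results.append(process(item))
        except Exception as exc:
            if error_handling == "Fail":
                raise
            errors.append((item, str(exc)))  # Continue: log and keep going
    return results, errors
```

With Continue, inspect the returned errors (or the task logs) afterward to catch partial failures.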

Resolution Settings

high_resolution
boolean
default:true
Whether to use high-resolution images for cropping and post-processing.
Trade-offs:
  • true - Better quality for image segments, tables, and formulas (Latency: ~7s per page)
  • false - Faster processing with standard resolution
{
  "high_resolution": true
}

Task Expiration

expires_in
integer
Number of seconds until the task is deleted. Expired tasks cannot be:
  • Updated
  • Polled
  • Accessed via web interface
If not specified, uses the system default from JOB__EXPIRATION_TIME environment variable.
{
  "expires_in": 3600
}
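The resolution order described above (explicit value first, then the environment default) can be sketched as follows; the assumption that JOB__EXPIRATION_TIME holds an integer number of seconds is mine, not stated by the source.

```python
import os

# Resolve the effective task expiration: an explicit expires_in wins;
# otherwise fall back to the JOB__EXPIRATION_TIME environment variable.
def resolve_expiration(expires_in=None):
    if expires_in is not None:
        return expires_in
    default = os.environ.get("JOB__EXPIRATION_TIME")
    return int(default) if default is not None else None
```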

LLM Processing

See the LLM Models page for detailed LLM configuration.
llm_processing.model_id
string
ID of the model to use (from your models.yaml). If not provided, the default model is used.
llm_processing.temperature
float
Temperature for LLM generation. Range: 0.0 (deterministic) to 2.0 (creative).
llm_processing.max_completion_tokens
integer
Maximum tokens in LLM responses. Limits output length and cost.
llm_processing.fallback_strategy
enum
default:"Default"
Fallback behavior when primary LLM fails:
  • None - No fallback
  • Default - Use configured fallback model
  • Model("id") - Use specific model
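A small parser for the three fallback_strategy forms listed above might look like this. The exact wire format of the Model("id") variant is an assumption for illustration; consult the LLM Models page for the authoritative format.

```python
import re

# Parse a fallback_strategy value into (kind, model_id-or-None).
# Accepts "None", "Default", and Model("some-model-id").
def parse_fallback(value):
    if value in ("None", "Default"):
        return (value, None)
    match = re.fullmatch(r'Model\("([^"]+)"\)', value)
    if match:
        return ("Model", match.group(1))
    raise ValueError(f"unrecognized fallback_strategy: {value}")
```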

Complete Example

{
  "file": "base64_encoded_file_or_url",
  "file_name": "research_paper.pdf",
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Cl100kBase",
    "ignore_headers_and_footers": true
  },
  "segment_processing": {
    "table": {
      "format": "Html",
      "strategy": "LLM"
    }
  },
  "ocr_strategy": "All",
  "segmentation_strategy": "LayoutAnalysis",
  "high_resolution": true,
  "error_handling": "Continue",
  "expires_in": 86400,
  "llm_processing": {
    "model_id": "gpt-4o",
    "temperature": 0.0,
    "max_completion_tokens": 4096,
    "fallback_strategy": "Default"
  }
}

Task Status

Tasks progress through these states:
  • Starting - Task queued and initializing
  • Processing - Active processing
  • Succeeded - Completed successfully
  • Failed - Encountered an error
  • Cancelled - Manually cancelled
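A client typically polls until the task reaches one of the terminal states above. The sketch below assumes a get_status callable standing in for your actual API call (a hypothetical helper, not part of the Chunkr SDK).

```python
import time

# Terminal states from the list above; Starting and Processing are transient.
TERMINAL = {"Succeeded", "Failed", "Cancelled"}

def wait_for_task(get_status, interval=1.0, timeout=300.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in TERMINAL:
            return status
        time.sleep(interval)
    raise TimeoutError("task did not reach a terminal state in time")
```

Remember that expired tasks cannot be polled, so the polling timeout should stay well under expires_in.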

Best Practices

  1. Match tokenizer to your embedding model
    • Use Cl100kBase for OpenAI embeddings
    • Use model-specific tokenizers for other embeddings
  2. Set appropriate chunk sizes
    • Smaller chunks (256-512) for precise retrieval
    • Larger chunks (1024+) for more context
  3. Use Continue error handling for batch processing
    • Prevents individual failures from blocking entire batches
    • Review logs for partial failures
  4. Configure expiration based on usage
    • Short expiration (1 hour) for temporary processing
    • Long expiration (7+ days) for production results
    • Monitor storage usage
  5. Enable high resolution selectively
    • Use for documents with important visual elements
    • Disable for text-heavy documents to improve speed
