
Overview

Task settings control how documents are processed, chunked, and managed throughout their lifecycle. These settings include chunk processing, expiration, resolution, and error handling strategies.

Configuration Structure

Task configuration is provided when creating a new processing task:
{
  "file": "base64_encoded_file_or_url",
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Word",
    "ignore_headers_and_footers": true
  },
  "high_resolution": true,
  "expires_in": 3600,
  "error_handling": "Continue",
  "llm_processing": {
    "model_id": "gpt-4o",
    "temperature": 0.0
  }
}

Chunk Processing

Controls how document segments are grouped into chunks for retrieval and embedding.

Parameters

chunk_processing.target_length
integer
default:512
Target number of tokens per chunk. If set to 0, each chunk contains exactly one segment.
How it works:
  • Segments are combined until reaching the target length
  • Individual segments are never split (they remain intact)
  • Final chunks may exceed the target slightly to include complete segments
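The grouping rule above can be sketched as a simple greedy loop. This is an illustration of the documented behavior, not Chunkr's actual implementation; the whitespace token counter stands in for the configured tokenizer.

```python
# Greedy grouping: append whole segments until the running token count
# reaches target_length; a chunk may slightly exceed the target because
# the segment that crosses it is kept intact. target_length == 0 yields
# one segment per chunk.
def group_segments(segments, target_length, count_tokens=lambda s: len(s.split())):
    if target_length == 0:
        return [[seg] for seg in segments]
    chunks, current, current_len = [], [], 0
    for seg in segments:
        current.append(seg)
        current_len += count_tokens(seg)
        if current_len >= target_length:
            chunks.append(current)
            current, current_len = [], 0
    if current:
        chunks.append(current)
    return chunks
```

Note that the second chunk boundary falls only after a complete segment, which is why real chunks can run a few tokens over the target.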
chunk_processing.tokenizer
string | enum
default:"Word"
Tokenizer for measuring chunk length. Supports:
Predefined tokenizers:
  • Word - Simple whitespace-based tokenization
  • Cl100kBase - OpenAI tokenizer (GPT-3.5, GPT-4, text-embedding-ada-002)
  • xlm-roberta-base - Multilingual RoBERTa tokenizer
  • bert-base-uncased - BERT base uncased tokenizer
Custom HuggingFace tokenizers:
  • Any valid HuggingFace model ID (e.g., "Qwen/Qwen-tokenizer", "facebook/bart-large")
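For intuition, the Word tokenizer simply counts whitespace-delimited tokens, as sketched below. The subword tokenizers (Cl100kBase, xlm-roberta-base, bert-base-uncased) generally report more tokens for the same text, so the same target_length produces shorter chunks with them.

```python
# Illustrative only: how the "Word" tokenizer measures chunk length.
# Subword tokenizers would require the corresponding tokenizer library
# and typically split words into multiple tokens.
def word_token_count(text: str) -> int:
    return len(text.split())
```

Matching the tokenizer to your downstream embedding model keeps chunk sizes aligned with that model's context accounting.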
chunk_processing.ignore_headers_and_footers
boolean
default:true
Whether to exclude page headers and footers from chunks.
Recommended: Keep this true, as headers and footers often break reading order across pages.

Examples

{
  "chunk_processing": {
    "target_length": 0
  }
}
Each chunk contains exactly one segment. Useful for:
  • Fine-grained retrieval
  • Segment-level processing
  • Maximum precision in search

Processing Strategies

OCR Strategy

ocr_strategy
enum
default:"All"
Controls Optical Character Recognition behavior:
  • All - Process all pages with OCR (Latency: ~0.5s per page)
  • Auto - Selectively apply OCR only to pages with missing or low-quality text
{
  "ocr_strategy": "All"
}
Auto mode uses existing text layers when available, falling back to OCR only when needed.
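A hypothetical sketch of the Auto decision: reuse the page's embedded text layer when it looks usable, otherwise fall back to OCR. The actual quality heuristics are internal to Chunkr; the character threshold here is an assumption for illustration.

```python
# Decide whether a page needs OCR under the "Auto" strategy (sketch).
# page_text is the page's embedded text layer, or None if absent.
def needs_ocr(page_text, min_chars=20):
    if page_text is None:
        return True                    # no text layer: OCR required
    stripped = page_text.strip()
    return len(stripped) < min_chars   # near-empty layer: likely a scanned page
```

Because Auto skips OCR on pages with good text layers, it can cut per-page latency substantially on born-digital PDFs.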

Segmentation Strategy

segmentation_strategy
enum
default:"LayoutAnalysis"
Controls document segmentation:
  • LayoutAnalysis - Detect layout elements (tables, pictures, formulas) with bounding boxes. Provides fine-grained segmentation.
  • Page - Treat each page as a single segment. Faster but without layout element detection.
{
  "segmentation_strategy": "LayoutAnalysis"
}

Error Handling

error_handling
enum
default:"Fail"
Controls how errors are handled during processing:
  • Fail - Stop processing and fail the task on any error
  • Continue - Attempt to continue despite non-critical errors (e.g., LLM refusals, rate limits)
{
  "error_handling": "Fail"
}
Use Fail when:
  • Complete accuracy is critical
  • You want to manually review and fix errors
  • Processing can be safely retried
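The difference between the two modes can be sketched as a per-item loop, assuming a pipeline that processes segments independently. This is an illustration of the semantics, not Chunkr's internals.

```python
# "Fail": the first error aborts the whole task.
# "Continue": non-critical errors are recorded and processing moves on,
# so one bad segment does not sink the batch.
def run_pipeline(items, process, error_handling="Fail"):
    results, errors = [], []
    for item in items:
        try:
            results.append(process(item))
        except Exception as exc:
            if error_handling == "Fail":
                raise
            errors.append((item, str(exc)))  # Continue: log and keep going
    return results, errors
```

With Continue, inspect the returned errors (or the task logs) afterward to catch partial failures.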

Resolution Settings

high_resolution
boolean
default:true
Whether to use high-resolution images for cropping and post-processing.
Trade-offs:
  • true - Better quality for image segments, tables, and formulas (Latency: ~7s per page)
  • false - Faster processing with standard resolution
{
  "high_resolution": true
}

Task Expiration

expires_in
integer
Number of seconds until the task is deleted. Expired tasks cannot be:
  • Updated
  • Polled
  • Accessed via web interface
If not specified, uses the system default from JOB__EXPIRATION_TIME environment variable.
{
  "expires_in": 3600
}
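The resolution order described above (explicit value first, then the environment default) can be sketched as follows; the assumption that JOB__EXPIRATION_TIME holds an integer number of seconds is mine, not stated by the source.

```python
import os

# Resolve the effective task expiration: an explicit expires_in wins;
# otherwise fall back to the JOB__EXPIRATION_TIME environment variable.
def resolve_expiration(expires_in=None):
    if expires_in is not None:
        return expires_in
    default = os.environ.get("JOB__EXPIRATION_TIME")
    return int(default) if default is not None else None
```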

LLM Processing

See the LLM Models page for detailed LLM configuration.
llm_processing.model_id
string
ID of the model to use (from your models.yaml). If not provided, the default model is used.
llm_processing.temperature
float
Temperature for LLM generation. Range: 0.0 (deterministic) to 2.0 (creative).
llm_processing.max_completion_tokens
integer
Maximum tokens in LLM responses. Limits output length and cost.
llm_processing.fallback_strategy
enum
default:"Default"
Fallback behavior when primary LLM fails:
  • None - No fallback
  • Default - Use configured fallback model
  • Model("id") - Use specific model
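A small parser for the three fallback_strategy forms listed above might look like this. The exact wire format of the Model("id") variant is an assumption for illustration; consult the LLM Models page for the authoritative format.

```python
import re

# Parse a fallback_strategy value into (kind, model_id-or-None).
# Accepts "None", "Default", and Model("some-model-id").
def parse_fallback(value):
    if value in ("None", "Default"):
        return (value, None)
    match = re.fullmatch(r'Model\("([^"]+)"\)', value)
    if match:
        return ("Model", match.group(1))
    raise ValueError(f"unrecognized fallback_strategy: {value}")
```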

Complete Example

{
  "file": "base64_encoded_file_or_url",
  "file_name": "research_paper.pdf",
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Cl100kBase",
    "ignore_headers_and_footers": true
  },
  "segment_processing": {
    "table": {
      "format": "Html",
      "strategy": "LLM"
    }
  },
  "ocr_strategy": "All",
  "segmentation_strategy": "LayoutAnalysis",
  "high_resolution": true,
  "error_handling": "Continue",
  "expires_in": 86400,
  "llm_processing": {
    "model_id": "gpt-4o",
    "temperature": 0.0,
    "max_completion_tokens": 4096,
    "fallback_strategy": "Default"
  }
}

Task Status

Tasks progress through these states:
  • Starting - Task queued and initializing
  • Processing - Active processing
  • Succeeded - Completed successfully
  • Failed - Encountered an error
  • Cancelled - Manually cancelled
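A client typically polls until the task reaches one of the terminal states above. The sketch below assumes a get_status callable standing in for your actual API call (a hypothetical helper, not part of the Chunkr SDK).

```python
import time

# Terminal states from the list above; Starting and Processing are transient.
TERMINAL = {"Succeeded", "Failed", "Cancelled"}

def wait_for_task(get_status, interval=1.0, timeout=300.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in TERMINAL:
            return status
        time.sleep(interval)
    raise TimeoutError("task did not reach a terminal state in time")
```

Remember that expired tasks cannot be polled, so the polling timeout should stay well under expires_in.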

Best Practices

  1. Match tokenizer to your embedding model
    • Use Cl100kBase for OpenAI embeddings
    • Use model-specific tokenizers for other embeddings
  2. Set appropriate chunk sizes
    • Smaller chunks (256-512) for precise retrieval
    • Larger chunks (1024+) for more context
  3. Use Continue error handling for batch processing
    • Prevents individual failures from blocking entire batches
    • Review logs for partial failures
  4. Configure expiration based on usage
    • Short expiration (1 hour) for temporary processing
    • Long expiration (7+ days) for production results
    • Monitor storage usage
  5. Enable high resolution selectively
    • Use for documents with important visual elements
    • Disable for text-heavy documents to improve speed
