Overview

Chunkr is a document intelligence API that transforms unstructured documents into structured, searchable data. It uses a multi-stage pipeline to extract, analyze, and organize document content.

Core Concepts

Chunkr processes documents through several key stages:

Segmentation

Detect and classify layout elements like tables, images, and text blocks

OCR

Extract text from images and documents with optical character recognition

Pipelines

Orchestrate processing steps from document upload to final output

Chunking

Combine segments into semantically meaningful chunks for embedding

Document Processing Flow

The typical document processing flow in Chunkr follows these steps:

1. Document Upload

Documents can be uploaded in various formats including PDF, images (JPEG, PNG), and office documents. All non-PDF formats are automatically converted to PDF for processing.

2. Page Conversion

PDF pages are converted to high-quality images for OCR and segmentation. The resolution can be controlled with the high_resolution parameter:

Standard resolution: Faster processing
High resolution: Better accuracy for complex layouts, at a cost of roughly 7 seconds of latency per page

3. Text Extraction

Text is extracted using OCR (Optical Character Recognition). Chunkr supports multiple OCR strategies:

All: Process all pages with OCR (~0.5 seconds per page)
Auto: Use existing text layer when available, apply OCR only when needed
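The difference between the two strategies can be read as a per-page decision. The sketch below is an illustration only, not Chunkr's implementation: `PageInfo`, `hasTextLayer`, and `shouldRunOcr` are hypothetical names, and the real criterion for "when needed" is not specified on this page.

```typescript
// Hypothetical types for illustration; not part of the Chunkr API.
type OcrStrategy = "All" | "Auto";

interface PageInfo {
  pageNumber: number;
  hasTextLayer: boolean; // assumed signal: the page already carries embedded text
}

// Decide whether a page is sent through OCR under the given strategy.
function shouldRunOcr(page: PageInfo, strategy: OcrStrategy): boolean {
  if (strategy === "All") {
    return true; // every page is processed with OCR
  }
  // "Auto": reuse the existing text layer, fall back to OCR otherwise
  return !page.hasTextLayer;
}

const samplePages: PageInfo[] = [
  { pageNumber: 1, hasTextLayer: true },
  { pageNumber: 2, hasTextLayer: false },
];

// Page numbers that would be OCR'd under each strategy.
const ocrPagesAuto = samplePages
  .filter((p) => shouldRunOcr(p, "Auto"))
  .map((p) => p.pageNumber);
const ocrPagesAll = samplePages
  .filter((p) => shouldRunOcr(p, "All"))
  .map((p) => p.pageNumber);
```

With the sample pages, only page 2 needs OCR under Auto, while All processes both.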

4. Layout Analysis

The segmentation engine detects and classifies layout elements:

{
  "segment_type": "Table",
  "other_types": [
    "Title",
    "SectionHeader",
    "Text",
    "ListItem",
    "Picture",
    "Caption",
    "Formula",
    "Footnote",
    "PageHeader",
    "PageFooter",
    "Page"
  ]
}
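On the client side, the type names listed above can be captured as a string union, which gives the `SegmentType` used in the data model a concrete shape. This is a sketch that assumes the API returns exactly these strings; `SEGMENT_TYPES` and `isSegmentType` are hypothetical helper names.

```typescript
// Segment type names copied from the list above.
const SEGMENT_TYPES = [
  "Table",
  "Title",
  "SectionHeader",
  "Text",
  "ListItem",
  "Picture",
  "Caption",
  "Formula",
  "Footnote",
  "PageHeader",
  "PageFooter",
  "Page",
] as const;

// Union of the literal strings in SEGMENT_TYPES.
type SegmentType = (typeof SEGMENT_TYPES)[number];

// Type guard for values that arrive as plain strings (for example, parsed JSON).
function isSegmentType(value: string): value is SegmentType {
  return (SEGMENT_TYPES as readonly string[]).indexOf(value) !== -1;
}
```

The guard is case-sensitive, so `"table"` is rejected while `"Table"` is accepted.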

Each segment contains:

Bounding box: Position and dimensions on the page
OCR results: Extracted text with confidence scores
Type classification: Element type (table, image, text, etc.)
Content: Processed HTML, Markdown, or LLM-generated output

5. Segment Processing

Segments are post-processed to generate structured content:

Auto generation: Heuristic-based HTML/Markdown conversion
LLM generation: Fine-tuned models for tables, formulas, and complex elements
Image cropping: Extract segment images for visual elements

6. Chunking

Segments are combined into chunks based on semantic boundaries and target length. The chunking algorithm:

Respects document hierarchy (titles, sections)
Keeps related elements together (captions with images)
Honors target token/word count limits
Optionally ignores headers and footers
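The four rules above can be sketched as a greedy pass over the segments. This is a simplified illustration, not Chunkr's actual algorithm: `SketchSegment`, `SketchOptions`, and `chunkSegments` are hypothetical names, length is counted in whitespace-separated words, and a caption is kept with a picture only when it follows the picture directly.

```typescript
// Simplified segment shape for this sketch (a subset of the Segment fields).
interface SketchSegment {
  segment_type: string;
  text: string;
}

interface SketchOptions {
  targetLength: number; // maximum words per chunk
  ignoreHeadersAndFooters: boolean;
}

// Count words by splitting on whitespace.
function wordCount(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

function chunkSegments(segments: SketchSegment[], options: SketchOptions): SketchSegment[][] {
  const chunks: SketchSegment[][] = [];
  let current: SketchSegment[] = [];
  let currentLength = 0;

  // Close the chunk being built, if it has any segments.
  const flush = () => {
    if (current.length > 0) {
      chunks.push(current);
      current = [];
      currentLength = 0;
    }
  };

  for (const segment of segments) {
    const type = segment.segment_type;
    if (options.ignoreHeadersAndFooters && (type === "PageHeader" || type === "PageFooter")) {
      continue; // drop page furniture entirely
    }
    const length = wordCount(segment.text);
    const previous = current[current.length - 1];
    // Keep a caption attached to the picture right before it, even past the target.
    const staysWithPrevious =
      type === "Caption" && previous !== undefined && previous.segment_type === "Picture";
    // Titles and section headers open a new chunk (document hierarchy).
    const startsSection = type === "Title" || type === "SectionHeader";
    const overflows = currentLength + length > options.targetLength;
    if (!staysWithPrevious && (startsSection || overflows)) {
      flush();
    }
    current.push(segment);
    currentLength += length;
  }
  flush();
  return chunks;
}

const exampleSegments: SketchSegment[] = [
  { segment_type: "Title", text: "Annual Report" },
  { segment_type: "Text", text: "Revenue grew steadily across all regions" },
  { segment_type: "PageFooter", text: "Page 1" },
  { segment_type: "Picture", text: "" },
  { segment_type: "Caption", text: "Figure 1: revenue by region" },
  { segment_type: "SectionHeader", text: "Outlook" },
  { segment_type: "Text", text: "We expect growth to continue" },
];

const exampleChunks = chunkSegments(exampleSegments, {
  targetLength: 8,
  ignoreHeadersAndFooters: true,
});
```

In this example the footer is dropped, the caption stays with its picture even though the chunk then exceeds the target, and the section header opens a second chunk.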

7. Output Generation

The final output includes:

{
  "chunks": [
    {
      "chunk_id": "uuid",
      "chunk_length": 245,
      "segments": [...],
      "embed": "Combined text for embedding"
    }
  ],
  "page_count": 10,
  "pdf_url": "https://..."
}
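A common next step is to pull the `embed` text out of each chunk and hand it to an embedding model. The sketch below assumes the output has the shape shown above; `OutputSketch`, `OutputChunk`, and `collectEmbedInputs` are hypothetical names, and the sample values (including the URL) are placeholders.

```typescript
// Shapes mirroring the example output above; segments are left opaque here.
interface OutputChunk {
  chunk_id: string;
  chunk_length: number;
  segments: unknown[];
  embed?: string;
}

interface OutputSketch {
  chunks: OutputChunk[];
  page_count: number;
  pdf_url: string;
}

// Collect (id, text) pairs for every chunk that has non-empty embed text.
function collectEmbedInputs(output: OutputSketch): { id: string; text: string }[] {
  return output.chunks
    .filter((chunk) => typeof chunk.embed === "string" && chunk.embed.trim() !== "")
    .map((chunk) => ({ id: chunk.chunk_id, text: chunk.embed as string }));
}

// Placeholder data for illustration only.
const sampleOutput: OutputSketch = {
  chunks: [
    { chunk_id: "chunk-a", chunk_length: 245, segments: [], embed: "Combined text for embedding" },
    { chunk_id: "chunk-b", chunk_length: 0, segments: [] },
  ],
  page_count: 10,
  pdf_url: "https://example.com/placeholder.pdf",
};

const embedInputs = collectEmbedInputs(sampleOutput);
```

Chunks without embed text are skipped, so downstream code never sends empty strings to the embedding step.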

Data Model

Segment

A segment represents a single layout element:

interface Segment {
  segment_id: string;
  segment_type: SegmentType;
  bbox: BoundingBox;
  confidence?: number;
  
  // Text content
  text: string;          // OCR-extracted text
  content: string;       // Formatted content (HTML or Markdown)
  html: string;          // HTML representation
  markdown: string;      // Markdown representation
  llm?: string;          // LLM-generated content
  
  // Metadata
  page_number: number;
  page_width: number;
  page_height: number;
  image?: string;        // URL to cropped segment image
  ocr?: OCRResult[];     // Detailed OCR results
}

Chunk

A chunk contains one or more segments:

interface Chunk {
  chunk_id: string;
  chunk_length: number;     // Token/word count
  segments: Segment[];
  embed?: string;           // Combined content for embedding
}

Bounding Box

All spatial information uses normalized coordinates:

interface BoundingBox {
  left: number;    // X coordinate of the left edge
  top: number;     // Y coordinate of the top edge
  width: number;   // Box width
  height: number;  // Box height
}
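Because a box is stored as an origin plus a size, the opposite edges and overlap tests have to be derived. The helpers below are a sketch that works in whatever unit the coordinates use; `rightEdge`, `bottomEdge`, `overlaps`, and `containsPoint` are hypothetical names, and the sample values are made up.

```typescript
interface BoundingBox {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Derived edges: left + width and top + height.
function rightEdge(box: BoundingBox): number {
  return box.left + box.width;
}

function bottomEdge(box: BoundingBox): number {
  return box.top + box.height;
}

// True when the two boxes share some interior area (touching edges do not count).
function overlaps(a: BoundingBox, b: BoundingBox): boolean {
  return (
    a.left < rightEdge(b) &&
    b.left < rightEdge(a) &&
    a.top < bottomEdge(b) &&
    b.top < bottomEdge(a)
  );
}

// True when the point lies inside the box or on its border.
function containsPoint(box: BoundingBox, x: number, y: number): boolean {
  return x >= box.left && x <= rightEdge(box) && y >= box.top && y <= bottomEdge(box);
}

const tableBox: BoundingBox = { left: 0.1, top: 0.1, width: 0.4, height: 0.2 };
const captionBox: BoundingBox = { left: 0.3, top: 0.2, width: 0.4, height: 0.3 };
const footerBox: BoundingBox = { left: 0.6, top: 0.6, width: 0.2, height: 0.2 };

const tableOverlapsCaption = overlaps(tableBox, captionBox);
const tableOverlapsFooter = overlaps(tableBox, footerBox);
```

Helpers like these are useful for checks such as finding which segment a highlighted point falls in, or detecting segments whose regions collide.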

Configuration

Chunkr’s behavior is controlled through the Configuration object:

{
  "segmentation_strategy": "LayoutAnalysis",
  "ocr_strategy": "Auto",
  "high_resolution": true,
  "chunk_processing": {
    "target_length": 512,
    "tokenizer": "Word",
    "ignore_headers_and_footers": true
  },
  "segment_processing": {
    "Table": {
      "strategy": "LLM",
      "format": "Html"
    }
  }
}
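When building this object in code, a type that mirrors the example helps catch misspelled keys before a request is sent. The sketch below covers only the fields shown above. `ConfigSketch` and `checkConfig` are hypothetical names, the fields typed as `string` may accept other values not listed on this page, and the positive-integer check is a local sanity check rather than a documented API rule.

```typescript
// Hypothetical type covering only the fields in the example above.
interface ConfigSketch {
  segmentation_strategy: string; // e.g. "LayoutAnalysis"
  ocr_strategy: "All" | "Auto";
  high_resolution: boolean;
  chunk_processing: {
    target_length: number;
    tokenizer: string; // e.g. "Word"
    ignore_headers_and_footers: boolean;
  };
  segment_processing: {
    [segmentType: string]: { strategy: string; format: string };
  };
}

const exampleConfig: ConfigSketch = {
  segmentation_strategy: "LayoutAnalysis",
  ocr_strategy: "Auto",
  high_resolution: true,
  chunk_processing: {
    target_length: 512,
    tokenizer: "Word",
    ignore_headers_and_footers: true,
  },
  segment_processing: {
    Table: { strategy: "LLM", format: "Html" },
  },
};

// Local sanity check (an assumption of this sketch, not an API requirement).
function checkConfig(config: ConfigSketch): string[] {
  const problems: string[] = [];
  const target = config.chunk_processing.target_length;
  if (Math.floor(target) !== target || target <= 0) {
    problems.push("chunk_processing.target_length should be a positive integer");
  }
  return problems;
}

const configProblems = checkConfig(exampleConfig);

// Serialized form, ready to be sent as part of a request body.
const configJson = JSON.stringify(exampleConfig);
```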

See Pipelines for detailed configuration options.

Next Steps

Learn about Pipelines

Understand the processing pipeline and configuration

Explore Segmentation

Deep dive into layout analysis strategies

OCR Strategies

Learn about text extraction methods

Chunking Algorithm

Understand how segments are combined into chunks
