Segmentation is the process of detecting and classifying layout elements in documents. Chunkr’s segmentation engine identifies regions like tables, images, text blocks, and other document components.

Segmentation Strategies

Chunkr supports two segmentation strategies: LayoutAnalysis and Page. The following configuration selects LayoutAnalysis:
{
  "segmentation_strategy": "LayoutAnalysis"
}
Detect and classify layout elements
The LayoutAnalysis strategy uses computer vision models to detect document layout elements and classify them into specific types.
Features:
  • Detects 11 different segment types
  • Provides bounding boxes for each element
  • Assigns confidence scores
  • Enables fine-grained chunking
  • Supports complex document layouts
Use cases:
  • Academic papers with tables and formulas
  • Reports with mixed content
  • Documents requiring precise element extraction
  • When you need to process tables and images separately
Performance: Adds minimal latency (batched processing)

Segment Types

When using LayoutAnalysis, Chunkr detects these segment types:
Title: Document or section titles
Characteristics:
  • Large, prominent text
  • Usually at document/section start
  • High hierarchy level (3)
  • Often triggers new chunks
Example: Main document title, chapter headings
SectionHeader: Section and subsection headers
Characteristics:
  • Medium hierarchy level (2)
  • Introduces new sections
  • Triggers chunk boundaries
  • Smaller than Title, larger than body text
Example: Section headings, subsection titles
Text: Regular body text paragraphs
Characteristics:
  • Most common segment type
  • Hierarchy level 1
  • Combines with adjacent text in chunks
Example: Paragraphs, body content
ListItem: Bullet points and numbered list items
Characteristics:
  • Individual list entries
  • Can be chunked together
  • Preserves list structure
Example: Bullet points, enumerated items
Table: Tabular data and structured content
Characteristics:
  • Complex structure
  • Usually processed with LLM strategy
  • Can be paired with Caption
  • Cropped image available
Example: Data tables, comparison charts
Default processing: LLM-based HTML generation
Picture: Images, figures, and diagrams
Characteristics:
  • Visual content
  • Always cropped
  • Can be paired with Caption
  • May contain OCR results if text present
Example: Photos, diagrams, charts
Default processing: Image URL with optional description
Caption: Image and table captions
Characteristics:
  • Describes associated visual element
  • Kept with paired Picture/Table in chunks
  • Usually smaller text below/above element
Example: “Figure 1: Architecture diagram”
Formula: Mathematical formulas and equations
Characteristics:
  • Mathematical notation
  • Processed with LLM for LaTeX generation
  • May be inline or block-level
Example: Equations, mathematical expressions
Default processing: LLM-based LaTeX extraction
Footnote: Footnotes and references
Characteristics:
  • Small text at page bottom
  • References to main content
  • Usually numbered or marked
Example: Citations, additional notes
Page: Full page segment (only with Page strategy)
Characteristics:
  • Entire page as one segment
  • Used when no layout analysis performed
  • Contains all page OCR results
When used: Only with segmentation_strategy: "Page"
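
The hierarchy levels listed above (Title 3, SectionHeader 2, Text 1) and the note that headings trigger chunk boundaries can be illustrated with a small sketch. This is not Chunkr's chunking implementation: `hierarchyLevel` and `splitAtHeadings` are hypothetical helpers, and treating every other segment type as level 0 is an assumption made only for this example.

```typescript
// Segment type names from the class mapping in this document, plus "Page".
type SegmentKind =
  | "Caption" | "Footnote" | "Formula" | "ListItem" | "PageFooter" | "PageHeader"
  | "Picture" | "SectionHeader" | "Table" | "Text" | "Title" | "Page";

// Levels for Title, SectionHeader, and Text come from the descriptions above.
// Level 0 for everything else is an ASSUMPTION, not documented behavior.
function hierarchyLevel(kind: SegmentKind): number {
  switch (kind) {
    case "Title": return 3;
    case "SectionHeader": return 2;
    case "Text": return 1;
    default: return 0;
  }
}

// Hypothetical helper: start a new group whenever a heading (level >= 2) appears.
function splitAtHeadings(kinds: SegmentKind[]): SegmentKind[][] {
  const groups: SegmentKind[][] = [];
  let current: SegmentKind[] = [];
  for (const kind of kinds) {
    if (hierarchyLevel(kind) >= 2 && current.length > 0) {
      groups.push(current);
      current = [];
    }
    current.push(kind);
  }
  if (current.length > 0) groups.push(current);
  return groups;
}

const exampleGroups = splitAtHeadings([
  "Title", "Text", "SectionHeader", "Text", "Table", "SectionHeader", "ListItem",
]);
// exampleGroups holds three groups, each starting at a heading.
```

Real chunking also accounts for target chunk length and caption pairing, which this sketch leaves out.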

Segment Structure

Each segment contains rich metadata:
interface Segment {
  // Identification
  segment_id: string;
  segment_type: SegmentType;
  
  // Position
  bbox: {
    left: number;
    top: number;
    width: number;
    height: number;
  };
  
  // Quality
  confidence?: number;  // 0.0 to 1.0
  
  // Content (generated in segment processing step)
  text: string;         // OCR-extracted text
  content: string;      // Formatted (HTML or Markdown)
  html: string;         // HTML representation
  markdown: string;     // Markdown representation
  llm?: string;         // LLM-generated content
  
  // Page context
  page_number: number;
  page_width: number;
  page_height: number;
  
  // Additional data
  image?: string;       // URL to cropped image
  ocr?: OCRResult[];    // Detailed OCR results
}
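
Because `bbox` is stored as left/top/width/height alongside the page dimensions, two small conversions are often handy when drawing or comparing segments. The sketch below is illustrative only; `BoxRect`, `toCorners`, and `normalizeBox` are hypothetical names and not part of the Chunkr API.

```typescript
interface BoxRect { left: number; top: number; width: number; height: number; }

// Convert a left/top/width/height box to corner coordinates.
function toCorners(b: BoxRect): { x1: number; y1: number; x2: number; y2: number } {
  return { x1: b.left, y1: b.top, x2: b.left + b.width, y2: b.top + b.height };
}

// Express a box as fractions of the page size so it no longer depends on resolution.
function normalizeBox(b: BoxRect, pageWidth: number, pageHeight: number): BoxRect {
  return {
    left: b.left / pageWidth,
    top: b.top / pageHeight,
    width: b.width / pageWidth,
    height: b.height / pageHeight,
  };
}

const sampleBox: BoxRect = { left: 100, top: 200, width: 400, height: 50 };
const sampleCorners = toCorners(sampleBox);              // x2 = 500, y2 = 250
const sampleNormalized = normalizeBox(sampleBox, 1000, 2000);
```

Normalized boxes are convenient when the same segment has to be overlaid on page images rendered at different sizes.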

How Segmentation Works

1. Object Detection

Chunkr uses object detection models to identify layout elements. The model outputs:
  • Bounding boxes: [left, top, width, height] coordinates
  • Class predictions: Integer class IDs (0-10)
  • Confidence scores: Detection confidence (0.0-1.0)

2. Class Mapping

Class IDs are mapped to segment types:
0 => Caption
1 => Footnote
2 => Formula
3 => ListItem
4 => PageFooter
5 => PageHeader
6 => Picture
7 => SectionHeader
8 => Table
9 => Text
10 => Title
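
Steps 1 and 2 can be sketched together: a raw detection (box, class ID, score) is turned into a typed segment by looking the class ID up in the table above. The shape of `RawDetection` and the function names are assumptions for illustration; only the ID-to-name order comes from this document.

```typescript
// Index = class ID, in the order of the mapping listed above.
const CLASS_NAMES = [
  "Caption", "Footnote", "Formula", "ListItem", "PageFooter", "PageHeader",
  "Picture", "SectionHeader", "Table", "Text", "Title",
] as const;

type LayoutClass = (typeof CLASS_NAMES)[number];

// Hypothetical raw model output: [left, top, width, height], class ID, score.
interface RawDetection {
  bbox: [number, number, number, number];
  classId: number;
  score: number;
}

// Returns undefined for anything outside 0-10 instead of guessing a type.
function classIdToType(classId: number): LayoutClass | undefined {
  return CLASS_NAMES[classId];
}

function detectionToSegment(d: RawDetection) {
  const segmentType = classIdToType(d.classId);
  if (segmentType === undefined) return undefined;
  const [left, top, width, height] = d.bbox;
  return { segment_type: segmentType, bbox: { left, top, width, height }, confidence: d.score };
}

const tableSegment = detectionToSegment({ bbox: [40, 120, 500, 300], classId: 8, score: 0.93 });
const unknownSegment = detectionToSegment({ bbox: [0, 0, 10, 10], classId: 11, score: 0.5 });
```

Here `tableSegment` has `segment_type` "Table", while `unknownSegment` is `undefined` because class ID 11 is not in the mapping.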

3. OCR Assignment

OCR results are assigned to segments based on spatial overlap:
  1. Add padding to segment bounding boxes
  2. Calculate intersection area with each OCR result
  3. Assign OCR result to segment with maximum overlap
  4. Adjust OCR coordinates relative to segment
Segmentation padding is configurable via segmentation_padding in the worker config. The default padding ensures that OCR results near segment edges are captured.
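
The four steps above can be sketched as follows. This is a minimal illustration, not Chunkr's actual code: the type and function names are hypothetical, ties go to the first segment, OCR results with no overlap are left unassigned, and coordinates are made relative to the unpadded segment origin. Each of those choices is an assumption.

```typescript
interface Box { left: number; top: number; width: number; height: number; }
interface OcrWord { text: string; bbox: Box; }

// Step 1: grow a box by `padding` on every side.
function padBox(b: Box, padding: number): Box {
  return {
    left: b.left - padding,
    top: b.top - padding,
    width: b.width + 2 * padding,
    height: b.height + 2 * padding,
  };
}

// Step 2: area of the overlap between two boxes (0 when they do not intersect).
function overlapArea(a: Box, b: Box): number {
  const w = Math.min(a.left + a.width, b.left + b.width) - Math.max(a.left, b.left);
  const h = Math.min(a.top + a.height, b.top + b.height) - Math.max(a.top, b.top);
  return w > 0 && h > 0 ? w * h : 0;
}

// Steps 3 and 4: assign each OCR result to the segment with the largest overlap,
// then shift its coordinates so they are relative to that segment.
function assignOcr(segmentBoxes: Box[], words: OcrWord[], padding: number): OcrWord[][] {
  const assigned: OcrWord[][] = segmentBoxes.map(() => []);
  for (const word of words) {
    let bestIndex = -1;
    let bestArea = 0;
    segmentBoxes.forEach((seg, i) => {
      const area = overlapArea(padBox(seg, padding), word.bbox);
      if (area > bestArea) { bestArea = area; bestIndex = i; }
    });
    const seg = segmentBoxes[bestIndex];
    const bucket = assigned[bestIndex];
    if (!seg || !bucket) continue; // no overlap with any segment
    bucket.push({
      text: word.text,
      bbox: { ...word.bbox, left: word.bbox.left - seg.left, top: word.bbox.top - seg.top },
    });
  }
  return assigned;
}
```

With padding 0, a word sitting in the gap between two segments is dropped in this sketch; a small padding lets the nearer segment claim it, which is the effect the default padding is described as providing.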

4. Fallback Handling

If no segments are detected:
// Creates a full-page segment
{
  segment_type: "Page",
  bbox: { left: 0, top: 0, width: page_width, height: page_height },
  confidence: 1.0,
  // ... all page OCR results
}
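
A sketch of the fallback as a function, assuming the page dimensions and OCR results are already available. `OcrLine`, `PageSegmentSketch`, `createPageSegment`, and `orPageFallback` are hypothetical names used only for this illustration, and the sketch covers just a subset of the Segment fields.

```typescript
interface OcrLine { text: string; }

interface PageSegmentSketch {
  segment_type: "Page";
  bbox: { left: number; top: number; width: number; height: number };
  confidence: number;
  page_number: number;
  page_width: number;
  page_height: number;
  ocr: OcrLine[];
}

// Build one segment covering the whole page and carrying all of its OCR results.
function createPageSegment(
  pageNumber: number,
  pageWidth: number,
  pageHeight: number,
  ocr: OcrLine[],
): PageSegmentSketch {
  return {
    segment_type: "Page",
    bbox: { left: 0, top: 0, width: pageWidth, height: pageHeight },
    confidence: 1.0,
    page_number: pageNumber,
    page_width: pageWidth,
    page_height: pageHeight,
    ocr,
  };
}

// Use the detected segments when there are any, otherwise a single fallback segment.
function orPageFallback<T>(detected: T[], makeFallback: () => T): T[] {
  return detected.length > 0 ? detected : [makeFallback()];
}

const fallbackSegments = orPageFallback<PageSegmentSketch>(
  [],
  () => createPageSegment(1, 612, 792, [{ text: "hello" }]),
);
```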

Segmentation Quality

Confidence Scores

Each segment includes a confidence score from the detection model:
  • > 0.9: High confidence, very reliable
  • 0.7 - 0.9: Good confidence, usually accurate
  • 0.5 - 0.7: Medium confidence, may need review
  • < 0.5: Low confidence, likely false positive
Filter segments by confidence threshold if you need high-precision extraction:
// confidence is optional, so a missing score is treated as 0 here
const highConfidenceSegments = segments.filter(s => (s.confidence ?? 0) > 0.8);
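
The ranges above can also be turned into labels, for example to route medium-confidence segments to manual review. This is a sketch: the band names and the handling of the exact boundary values (0.9, 0.7, 0.5) are assumptions, since the listed ranges do not say which side each boundary belongs to.

```typescript
type ConfidenceBand = "high" | "good" | "medium" | "low" | "unknown";

// Maps a detection confidence to one of the bands described above.
// Boundary handling (0.9 counts as "good", 0.5 as "medium") is an assumption.
function confidenceBand(confidence: number | undefined): ConfidenceBand {
  if (confidence === undefined) return "unknown"; // confidence is optional on Segment
  if (confidence > 0.9) return "high";
  if (confidence >= 0.7) return "good";
  if (confidence >= 0.5) return "medium";
  return "low";
}

const exampleBands = [0.95, 0.9, 0.6, 0.2].map(confidenceBand);
```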

Accuracy Factors

Improves accuracy:
  • ✅ High-resolution images (high_resolution: true)
  • ✅ Clear, well-formatted documents
  • ✅ Standard layouts (papers, reports)
  • ✅ Good contrast and quality scans
May reduce accuracy:
  • ❌ Low-resolution or blurry images
  • ❌ Unusual layouts or designs
  • ❌ Heavily stylized documents
  • ❌ Poor scan quality

Batched Processing

Segmentation uses batched processing for efficiency:
// Pages are processed in batches
const batchSize = config.segmentation_batch_size;  // e.g., 10 pages

// Reduces API calls and improves throughput.
// for...of awaits each batch; forEach would not wait for async callbacks.
const segments = [];
for (const batch of batches) {
  segments.push(...(await performSegmentationBatch(batch)));
}
Benefits:
  • Faster processing for multi-page documents
  • Efficient resource utilization
  • Reduced network overhead

Configuration Examples

Academic Papers

{
  "segmentation_strategy": "LayoutAnalysis",
  "high_resolution": true,
  "segment_processing": {
    "Table": { "strategy": "LLM", "format": "Html" },
    "Formula": { "strategy": "LLM", "format": "Markdown" },
    "Picture": { "crop_image": "All" }
  }
}

Simple Documents

{
  "segmentation_strategy": "Page",
  "high_resolution": false,
  "segment_processing": {
    "Page": { "strategy": "Auto", "format": "Markdown" }
  }
}

Reports with Tables

{
  "segmentation_strategy": "LayoutAnalysis",
  "segment_processing": {
    "Table": { 
      "strategy": "LLM", 
      "format": "Html",
      "crop_image": "Auto"
    },
    "Text": { "strategy": "Auto", "format": "Markdown" }
  },
  "chunk_processing": {
    "ignore_headers_and_footers": true
  }
}

Error Handling

With error_handling: "Continue", segmentation failures fall back gracefully:
// If layout analysis fails
try {
  segments = await layoutAnalysis(page);
} catch (error) {
  console.log('Layout analysis failed, using page segmentation');
  // Falls back to full-page segment
  segments = [createPageSegment(page, ocrResults)];
}

Next Steps

OCR Strategies

Learn about text extraction methods

Segment Processing

Configure content generation per segment type

Chunking

Understand how segments are combined

API Reference

See complete API documentation
