Segments

Segment Types

Chunkr identifies the following segment types during layout analysis:

Title

SegmentType

Main title or heading of the document.

SectionHeader

SegmentType

Section headers and subheadings within the document.

Text

SegmentType

Regular paragraph text.

ListItem

SegmentType

Individual items in bulleted or numbered lists.

Table

SegmentType

Tabular data.

Picture

SegmentType

Images, diagrams, and figures.

Caption

SegmentType

Captions for images, tables, or other elements.

Formula

SegmentType

Mathematical formulas and equations.

Footnote

SegmentType

Footnotes and endnotes.

PageHeader

SegmentType

Headers that appear at the top of pages.

PageFooter

SegmentType

Footers that appear at the bottom of pages.

Page

SegmentType

An entire page treated as a single segment (when using the Page segmentation strategy).

Segment Processing

The SegmentProcessing configuration allows you to control how each segment type is processed and which content representations are generated.

Configuration Structure

Each segment type can have its own processing configuration:

{
  "segment_processing": {
    "Title": { /* AutoGenerationConfig */ },
    "SectionHeader": { /* AutoGenerationConfig */ },
    "Text": { /* AutoGenerationConfig */ },
    "ListItem": { /* AutoGenerationConfig */ },
    "Table": { /* TableGenerationConfig */ },
    "Picture": { /* PictureGenerationConfig */ },
    "Caption": { /* AutoGenerationConfig */ },
    "Formula": { /* LlmGenerationConfig */ },
    "Footnote": { /* AutoGenerationConfig */ },
    "PageHeader": { /* AutoGenerationConfig */ },
    "PageFooter": { /* AutoGenerationConfig */ },
    "Page": { /* LlmGenerationConfig */ }
  }
}
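As a minimal sketch, the structure above can be assembled programmatically. The `CONFIG_KIND` table mirrors the mapping shown in the structure; the helper `build_segment_processing` is hypothetical (it is not part of any Chunkr SDK), and supplying only some segment types is an assumption about how the configuration is consumed.

```python
import json

# Config schema used by each segment type, as listed in the structure above.
CONFIG_KIND = {
    "Title": "AutoGenerationConfig",
    "SectionHeader": "AutoGenerationConfig",
    "Text": "AutoGenerationConfig",
    "ListItem": "AutoGenerationConfig",
    "Table": "TableGenerationConfig",
    "Picture": "PictureGenerationConfig",
    "Caption": "AutoGenerationConfig",
    "Formula": "LlmGenerationConfig",
    "Footnote": "AutoGenerationConfig",
    "PageHeader": "AutoGenerationConfig",
    "PageFooter": "AutoGenerationConfig",
    "Page": "LlmGenerationConfig",
}


def build_segment_processing(overrides):
    """Hypothetical helper: wrap per-type configs in the top-level key."""
    unknown = sorted(set(overrides) - set(CONFIG_KIND))
    if unknown:
        raise ValueError(f"Unknown segment types: {unknown}")
    return {"segment_processing": dict(overrides)}


payload = build_segment_processing({
    "Table": {"format": "Html", "strategy": "LLM"},
    "Formula": {"llm": "Convert this formula to LaTeX notation"},
})
print(json.dumps(payload, indent=2))
```

Rejecting unknown keys early catches typos such as `Header` instead of `PageHeader` before a request is sent.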

AutoGenerationConfig

Used for most segment types (Title, SectionHeader, Text, ListItem, Caption, Footnote, PageHeader, PageFooter).

format

SegmentFormat

default:"Markdown"

Specifies the output format.

SegmentFormat options:

Html - Generate HTML output
Markdown - Generate Markdown output

strategy

GenerationStrategy

default:"Auto"

Determines how the content is generated.

GenerationStrategy options:

Auto - Use heuristics and rule-based generation
LLM - Use Chunkr’s fine-tuned models for generation

crop_image

CroppingStrategy

default:"Auto"

Controls whether to crop the page image to the segment’s bounding box.

CroppingStrategy options:

All - Always crop images for this segment type
Auto - Only crop when needed for post-processing

llm

string

Custom prompt for LLM-based processing of this segment. Only used when LLM processing is enabled for the segment.

embed_sources

EmbedSource[]

default:"[Content]"

Defines which content sources will be included in the chunk’s embed field and counted towards the chunk length. The array’s order determines the sequence in which content appears.

EmbedSource options:

Content - Use the primary content (HTML or Markdown based on format)
LLM - Use LLM-generated content
HTML - DEPRECATED: Use HTML representation
Markdown - DEPRECATED: Use Markdown representation
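To illustrate how the order of embed_sources affects the embed field, here is a small model of the documented behavior. It is not Chunkr's implementation: the function `build_embed`, the newline separator, and the choice to skip empty sources are all assumptions made for the example.

```python
def build_embed(representations, embed_sources):
    """Join the requested representations in the order given by embed_sources.

    `representations` maps an EmbedSource name to the text produced for it.
    The newline separator and skipping of empty sources are assumptions.
    """
    parts = []
    for source in embed_sources:
        text = representations.get(source)
        if text:  # skip sources that produced no content
            parts.append(text)
    return "\n".join(parts)


representations = {
    "Content": "| Q1 | Q2 |",
    "LLM": "Quarterly revenue table.",
}

content_first = build_embed(representations, ["Content", "LLM"])
llm_first = build_embed(representations, ["LLM", "Content"])
print(content_first)
print(llm_first)
```

Because every listed source counts towards the chunk length, adding `LLM` alongside `Content` makes each segment occupy more of a chunk's length budget.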

extended_context

boolean

default:false

Use the full page image as context for LLM generation.

Deprecated Fields

html

GenerationStrategy

deprecated

DEPRECATED: Use format: Html and strategy instead.

markdown

GenerationStrategy

deprecated

DEPRECATED: Use format: Markdown and strategy instead.
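A configuration that still uses the deprecated fields could be migrated along these lines. The helper `migrate_deprecated` is hypothetical, and two of its behaviors are assumptions rather than documented rules: it raises an error when both legacy fields are present, and it lets explicit `format` and `strategy` values take precedence over values derived from the legacy fields.

```python
def migrate_deprecated(config):
    """Return a copy of `config` with legacy html/markdown keys replaced
    by the equivalent format + strategy pair."""
    legacy_keys = (("html", "Html"), ("markdown", "Markdown"))
    migrated = {k: v for k, v in config.items() if k not in ("html", "markdown")}
    legacy = [(fmt, config[key]) for key, fmt in legacy_keys if key in config]

    if len(legacy) > 1:
        # Assumption: a segment has a single format, so this is ambiguous.
        raise ValueError("Both html and markdown are set; choose one format")
    if legacy:
        fmt, strategy = legacy[0]
        # Assumption: explicit new-style keys win over legacy-derived ones.
        migrated.setdefault("format", fmt)
        migrated.setdefault("strategy", strategy)
    return migrated


print(migrate_deprecated({"html": "LLM", "crop_image": "All"}))
```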

LlmGenerationConfig

Used for Formula and Page segment types. Has the same fields as AutoGenerationConfig but with strategy defaulting to LLM.

format

SegmentFormat

default:"Markdown"

Output format (Html or Markdown).

strategy

GenerationStrategy

default:"LLM"

Generation strategy (Auto or LLM).

crop_image

CroppingStrategy

default:"Auto"

Image cropping strategy.

llm

string

Custom LLM prompt.

embed_sources

EmbedSource[]

default:"[Content]"

Content sources for embedding.

extended_context

boolean

default:false

Use full page image as context.

TableGenerationConfig

Used specifically for Table segments. Has the same fields as AutoGenerationConfig, but format defaults to Html and strategy defaults to LLM.

format

SegmentFormat

default:"Html"

Output format (Html or Markdown). Tables default to HTML for better structure preservation.

strategy

GenerationStrategy

default:"LLM"

Generation strategy. Tables default to LLM for higher accuracy.

crop_image

CroppingStrategy

default:"Auto"

Image cropping strategy.

llm

string

Custom LLM prompt.

embed_sources

EmbedSource[]

default:"[Content]"

Content sources for embedding.

extended_context

boolean

default:false

Use full page image as context.

PictureGenerationConfig

Used specifically for Picture segments.

format

SegmentFormat

default:"Markdown"

Output format (Html or Markdown).

strategy

GenerationStrategy

default:"Auto"

Generation strategy. When set to Auto, an image tag is generated in the selected format:

HTML format: <img src="{url}" />
Markdown format: ![Image]({url})
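The two tag templates above can be reproduced with a short function. This is only an illustration of the documented output shapes: `render_picture` is a hypothetical name, and the sketch does no URL escaping.

```python
def render_picture(url, fmt="Markdown"):
    """Build the image tag shown above for the given SegmentFormat value."""
    if fmt == "Html":
        return f'<img src="{url}" />'
    if fmt == "Markdown":
        return f"![Image]({url})"
    raise ValueError(f"Unsupported format: {fmt!r}")


print(render_picture("https://example.com/figure-1.png"))
print(render_picture("https://example.com/figure-1.png", fmt="Html"))
```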

crop_image

PictureCroppingStrategy

default:"All"

Controls image cropping for pictures.

PictureCroppingStrategy options:

All - Always crop picture images (default for pictures)
Auto - Only crop when needed for post-processing

llm

string

Custom LLM prompt for describing or processing the image.

embed_sources

EmbedSource[]

default:"[Content]"

Content sources for embedding.

extended_context

boolean

default:false

Use full page image as context.
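The four configuration types differ only in their defaults, which the following sketch summarizes. The default values are taken from the field listings above; the `resolve` helper and the dictionary layout are hypothetical. The llm field is omitted because no default is documented for it.

```python
# Defaults shared by all config types (those of AutoGenerationConfig).
BASE_DEFAULTS = {
    "format": "Markdown",
    "strategy": "Auto",
    "crop_image": "Auto",
    "embed_sources": ["Content"],
    "extended_context": False,
}

# Only the defaults that differ from the base, per config type.
DEFAULT_OVERRIDES = {
    "AutoGenerationConfig": {},
    "LlmGenerationConfig": {"strategy": "LLM"},
    "TableGenerationConfig": {"format": "Html", "strategy": "LLM"},
    "PictureGenerationConfig": {"crop_image": "All"},
}


def resolve(config_kind, overrides=None):
    """Hypothetical helper: merge base defaults, type defaults, and user values."""
    resolved = {**BASE_DEFAULTS, **DEFAULT_OVERRIDES[config_kind], **(overrides or {})}
    # Copy the list so callers cannot mutate the shared default.
    resolved["embed_sources"] = list(resolved["embed_sources"])
    return resolved


print(resolve("TableGenerationConfig"))
print(resolve("PictureGenerationConfig", {"format": "Html"}))
```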

Example Configuration

{
  "segment_processing": {
    "Text": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "Table": {
      "format": "Html",
      "strategy": "LLM",
      "crop_image": "All",
      "embed_sources": ["Content", "LLM"]
    },
    "Picture": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "All",
      "embed_sources": ["Content"]
    },
    "Formula": {
      "format": "Markdown",
      "strategy": "LLM",
      "crop_image": "Auto",
      "embed_sources": ["LLM"],
      "llm": "Convert this formula to LaTeX notation"
    }
  }
}
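Before sending a configuration like the one above, its values can be checked against the options listed on this page. The sketch below is a client-side convenience under that assumption: `validate` is a hypothetical function, and any server-side validation may differ.

```python
# Allowed values per field, taken from the option lists on this page.
ALLOWED = {
    "format": {"Html", "Markdown"},
    "strategy": {"Auto", "LLM"},
    "crop_image": {"All", "Auto"},
}
EMBED_SOURCES = {"Content", "LLM", "HTML", "Markdown"}


def validate(config):
    """Return a list of problems; an empty list means every value is a listed option."""
    problems = []
    for field, allowed in ALLOWED.items():
        if field in config and config[field] not in allowed:
            problems.append(f"{field}: {config[field]!r} is not one of {sorted(allowed)}")
    for source in config.get("embed_sources", []):
        if source not in EMBED_SOURCES:
            problems.append(f"embed_sources: unknown source {source!r}")
    if "extended_context" in config and not isinstance(config["extended_context"], bool):
        problems.append("extended_context: expected a boolean")
    return problems


formula = {
    "format": "Markdown",
    "strategy": "LLM",
    "crop_image": "Auto",
    "embed_sources": ["LLM"],
    "llm": "Convert this formula to LaTeX notation",
}
print(validate(formula))
print(validate({"format": "PDF", "embed_sources": ["Summary"]}))
```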
