Segment Types

Chunkr identifies the following segment types during layout analysis:
Title
SegmentType
Main title or heading of the document.
SectionHeader
SegmentType
Section headers and subheadings within the document.
Text
SegmentType
Regular paragraph text.
ListItem
SegmentType
Individual items in bulleted or numbered lists.
Table
SegmentType
Tabular data.
Picture
SegmentType
Images, diagrams, and figures.
Caption
SegmentType
Captions for images, tables, or other elements.
Formula
SegmentType
Mathematical formulas and equations.
Footnote
SegmentType
Footnotes and endnotes.
PageHeader
SegmentType
Headers that appear at the top of pages.
PageFooter
SegmentType
Footers that appear at the bottom of pages.
Page
SegmentType
An entire page treated as a single segment (when using the Page segmentation strategy).

Segment Processing

The SegmentProcessing configuration allows you to control how each segment type is processed and which content representations are generated.

Configuration Structure

Each segment type can have its own processing configuration:
{
  "segment_processing": {
    "Title": { /* AutoGenerationConfig */ },
    "SectionHeader": { /* AutoGenerationConfig */ },
    "Text": { /* AutoGenerationConfig */ },
    "ListItem": { /* AutoGenerationConfig */ },
    "Table": { /* TableGenerationConfig */ },
    "Picture": { /* PictureGenerationConfig */ },
    "Caption": { /* AutoGenerationConfig */ },
    "Formula": { /* LlmGenerationConfig */ },
    "Footnote": { /* AutoGenerationConfig */ },
    "PageHeader": { /* AutoGenerationConfig */ },
    "PageFooter": { /* AutoGenerationConfig */ },
    "Page": { /* LlmGenerationConfig */ }
  }
}
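
Because the keys under segment_processing are segment type names, a misspelled or wrongly cased key is easy to miss. The following is a minimal sketch, not part of Chunkr itself, that checks a draft configuration against the mapping shown above. The names CONFIG_KIND_BY_SEGMENT and unknown_segment_types are hypothetical helpers introduced for illustration.

```python
# Segment type -> configuration kind, copied from the structure above.
CONFIG_KIND_BY_SEGMENT = {
    "Title": "AutoGenerationConfig",
    "SectionHeader": "AutoGenerationConfig",
    "Text": "AutoGenerationConfig",
    "ListItem": "AutoGenerationConfig",
    "Table": "TableGenerationConfig",
    "Picture": "PictureGenerationConfig",
    "Caption": "AutoGenerationConfig",
    "Formula": "LlmGenerationConfig",
    "Footnote": "AutoGenerationConfig",
    "PageHeader": "AutoGenerationConfig",
    "PageFooter": "AutoGenerationConfig",
    "Page": "LlmGenerationConfig",
}


def unknown_segment_types(config: dict) -> list[str]:
    """Return keys under "segment_processing" that are not known segment types."""
    section = config.get("segment_processing", {})
    return sorted(key for key in section if key not in CONFIG_KIND_BY_SEGMENT)


# "Figure" is not a segment type and "table" has the wrong case.
draft = {"segment_processing": {"Text": {}, "Figure": {}, "table": {}}}
print(unknown_segment_types(draft))  # ['Figure', 'table']
```

Running a check like this before submitting a configuration catches typos locally; how the service itself treats unknown keys is not covered here.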

AutoGenerationConfig

Used for most segment types (Title, SectionHeader, Text, ListItem, Caption, Footnote, PageHeader, PageFooter).
format
SegmentFormat
default:"Markdown"
Specifies the output format.
strategy
GenerationStrategy
default:"Auto"
Determines how the content is generated.
crop_image
CroppingStrategy
default:"Auto"
Controls whether to crop the page image to the segment’s bounding box.
llm
string
Custom prompt for LLM-based processing of this segment. Only used when LLM processing is enabled for the segment.
embed_sources
EmbedSource[]
default:"[Content]"
Defines which content sources will be included in the chunk’s embed field and counted towards the chunk length. The array’s order determines the sequence in which content appears.
extended_context
boolean
default:false
Use the full page image as context for LLM generation.
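
The ordering rule for embed_sources can be illustrated with a small sketch. This is not Chunkr's implementation: the build_embed helper is hypothetical, and joining the pieces with a newline is an assumption made only for the example. It uses the two source names that appear on this page, Content and LLM.

```python
def build_embed(sources: list[str], available: dict[str, str], sep: str = "\n") -> str:
    """Join the available representations in the order given by `sources`.

    `sep` is an assumption for illustration; the real separator may differ.
    Sources that are missing or empty are skipped.
    """
    parts = [available[name] for name in sources if available.get(name)]
    return sep.join(parts)


segment = {
    "Content": "Revenue grew 12%.",
    "LLM": "Summary: revenue is up.",
}

# Same sources, different order, different embed text.
print(build_embed(["Content", "LLM"], segment))
print(build_embed(["LLM", "Content"], segment))
```

Since everything listed in embed_sources counts towards the chunk length, adding a second source makes each segment consume more of the chunk budget.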

Deprecated Fields

html
GenerationStrategy
deprecated
DEPRECATED: Use format: Html and strategy instead.
markdown
GenerationStrategy
deprecated
DEPRECATED: Use format: Markdown and strategy instead.
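
Configurations that still use the deprecated fields can be rewritten mechanically. The sketch below assumes that the value of the old html or markdown field is carried over unchanged as strategy, and that html takes precedence if both old fields are present. Both points are assumptions, and migrate_deprecated is a hypothetical helper, not part of any Chunkr SDK.

```python
def migrate_deprecated(config: dict) -> dict:
    """Return a copy that uses `format` + `strategy` instead of `html` / `markdown`."""
    # Start from every field except the deprecated ones.
    migrated = {k: v for k, v in config.items() if k not in ("html", "markdown")}
    for old_field, fmt in (("html", "Html"), ("markdown", "Markdown")):
        # An explicit `format` in the input always wins over a deprecated field.
        if old_field in config and "format" not in migrated:
            migrated["format"] = fmt
            # Assumption: the old field's value is the generation strategy.
            migrated.setdefault("strategy", config[old_field])
    return migrated


old = {"html": "LLM", "crop_image": "Auto"}
print(migrate_deprecated(old))
# {'crop_image': 'Auto', 'format': 'Html', 'strategy': 'LLM'}
```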

LlmGenerationConfig

Used for Formula and Page segment types. Has the same fields as AutoGenerationConfig but with strategy defaulting to LLM.
format
SegmentFormat
default:"Markdown"
Output format (Html or Markdown).
strategy
GenerationStrategy
default:"LLM"
Generation strategy (Auto or LLM).
crop_image
CroppingStrategy
default:"Auto"
Image cropping strategy.
llm
string
Custom LLM prompt.
embed_sources
EmbedSource[]
default:"[Content]"
Content sources for embedding.
extended_context
boolean
default:false
Use full page image as context.

TableGenerationConfig

Used specifically for Table segments. Has the same fields as AutoGenerationConfig but with different defaults.
format
SegmentFormat
default:"Html"
Output format (Html or Markdown). Tables default to HTML for better structure preservation.
strategy
GenerationStrategy
default:"LLM"
Generation strategy. Tables default to LLM for higher accuracy.
crop_image
CroppingStrategy
default:"Auto"
Image cropping strategy.
llm
string
Custom LLM prompt.
embed_sources
EmbedSource[]
default:"[Content]"
Content sources for embedding.
extended_context
boolean
default:false
Use full page image as context.
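
AutoGenerationConfig, LlmGenerationConfig, and TableGenerationConfig share the same fields and differ only in their defaults, so the differences can be written down as plain dictionaries. This is a sketch that mirrors the defaults listed on this page. The constant names and the with_defaults helper are hypothetical, and llm is left out because no default is listed for it.

```python
import copy

AUTO_DEFAULTS = {
    "format": "Markdown",
    "strategy": "Auto",
    "crop_image": "Auto",
    "embed_sources": ["Content"],
    "extended_context": False,
}
# Same fields; only `strategy` differs.
LLM_DEFAULTS = {**AUTO_DEFAULTS, "strategy": "LLM"}
# Same fields; `format` and `strategy` differ.
TABLE_DEFAULTS = {**AUTO_DEFAULTS, "format": "Html", "strategy": "LLM"}


def with_defaults(defaults: dict, overrides: dict) -> dict:
    """Overlay user-supplied fields on a deep copy of the defaults."""
    merged = copy.deepcopy(defaults)
    merged.update(overrides)
    return merged


# A table config that only changes the output format.
table = with_defaults(TABLE_DEFAULTS, {"format": "Markdown"})
print(table["format"], table["strategy"])  # Markdown LLM
```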

PictureGenerationConfig

Used specifically for Picture segments.
format
SegmentFormat
default:"Markdown"
Output format (Html or Markdown).
strategy
GenerationStrategy
default:"Auto"
Generation strategy. When set to Auto, generates image tags:
  • HTML format: <img src="{url}" />
  • Markdown format: ![Image]({url})
crop_image
PictureCroppingStrategy
default:"All"
Controls image cropping for pictures.
llm
string
Custom LLM prompt for describing or processing the image.
embed_sources
EmbedSource[]
default:"[Content]"
Content sources for embedding.
extended_context
boolean
default:false
Use full page image as context.
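
The two image tag shapes produced by the Auto strategy for Picture segments can be reproduced with a short helper. This is an illustrative sketch of the templates listed above, not Chunkr's code; auto_image_tag is a hypothetical name, and the URL is inserted verbatim without any escaping.

```python
def auto_image_tag(url: str, fmt: str = "Markdown") -> str:
    """Render the image tag template for the given output format."""
    if fmt == "Html":
        return f'<img src="{url}" />'
    if fmt == "Markdown":
        return f"![Image]({url})"
    raise ValueError(f"unsupported format: {fmt!r}")


# The URL below is a placeholder for illustration.
print(auto_image_tag("https://example.com/fig1.png", "Html"))
print(auto_image_tag("https://example.com/fig1.png"))
```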

Example Configuration

{
  "segment_processing": {
    "Text": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "Auto",
      "embed_sources": ["Content"]
    },
    "Table": {
      "format": "Html",
      "strategy": "LLM",
      "crop_image": "All",
      "embed_sources": ["Content", "LLM"]
    },
    "Picture": {
      "format": "Markdown",
      "strategy": "Auto",
      "crop_image": "All",
      "embed_sources": ["Content"]
    },
    "Formula": {
      "format": "Markdown",
      "strategy": "LLM",
      "crop_image": "Auto",
      "embed_sources": ["LLM"],
      "llm": "Convert this formula to LaTeX notation"
    }
  }
}
