

OutputResponse

The OutputResponse object contains the processed results of a document analysis task.
chunks
Chunk[]
required
Collection of document chunks, where each chunk contains one or more segments. See Chunk below.
file_name
string
The name of the file.
page_count
integer
The number of pages in the file.
pdf_url
string
The presigned URL of the PDF file.
extracted_json
object
deprecated
DEPRECATED: The extracted JSON from the document.
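As a minimal sketch of reading these top-level fields, the Python below treats an OutputResponse as a plain dict decoded from JSON. The helper name summarize_output is hypothetical and not part of any SDK. Only chunks is treated as required, matching the field list above; the other fields are read defensively.

```python
from typing import Any


def summarize_output(output: dict[str, Any]) -> dict[str, Any]:
    """Collect the top-level OutputResponse fields into a small summary.

    Hypothetical helper: `output` is assumed to be the JSON body already
    decoded into a dict.
    """
    chunks = output["chunks"]  # marked required, so a KeyError here is a real problem
    return {
        "file_name": output.get("file_name"),    # not marked required
        "page_count": output.get("page_count"),  # not marked required
        "has_pdf_url": output.get("pdf_url") is not None,
        "chunk_count": len(chunks),
    }


summary = summarize_output({"chunks": [], "file_name": "example.pdf", "page_count": 10})
```

The deprecated extracted_json field is deliberately ignored here, since new code should not rely on it.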

Chunk

A Chunk represents a logical grouping of segments from the document. Chunks are created based on the target_length configuration.
chunk_id
string
required
The unique identifier for the chunk.
chunk_length
integer
required
The total number of tokens in the chunk. Calculated by the configured tokenizer.
segments
Segment[]
required
Collection of document segments that form this chunk. When target_chunk_length > 0, contains the maximum number of segments that fit within that length (segments remain intact). Otherwise, contains exactly one segment. See Segment below.
embed
string
Suggested text to be embedded for the chunk. This text is generated by combining the embed content from each segment according to the configured embed sources (HTML, Markdown, LLM, or Content). Can be configured using embed_sources in the SegmentProcessing configuration.
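The grouping rule described for segments can be illustrated with a small greedy packer. This is only a sketch under stated assumptions, not the service's actual chunking algorithm: the function name pack_segments is hypothetical, segments are represented just by their token counts, and a segment longer than the target is assumed to get a chunk of its own because segments remain intact.

```python
def pack_segments(token_counts: list[int], target_length: int) -> list[list[int]]:
    """Group per-segment token counts into chunks without splitting a segment.

    Hypothetical illustration only. With a non-positive target, every
    segment becomes its own chunk.
    """
    if target_length <= 0:
        return [[n] for n in token_counts]

    chunks: list[list[int]] = []
    current: list[int] = []
    current_total = 0
    for n in token_counts:
        # Close the current chunk if adding this segment would exceed the target.
        if current and current_total + n > target_length:
            chunks.append(current)
            current, current_total = [], 0
        current.append(n)
        current_total += n
    if current:
        chunks.append(current)
    return chunks


packed = pack_segments([100, 200, 300, 50], target_length=320)
unpacked = pack_segments([100, 200, 300, 50], target_length=0)
```

In this sketch the sum of each inner list plays the role of chunk_length, although the real value comes from the configured tokenizer.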

Segment

A Segment represents a logical element within a document page (e.g., title, paragraph, table, image).
segment_id
string
required
Unique identifier for the segment.
segment_type
SegmentType
required
The type of the segment. See Segment Types for all possible values.
bbox
BoundingBox
required
Bounding box coordinates for the segment.
page_number
integer
required
Page number of the segment (1-indexed).
page_width
number
required
Width of the page containing the segment.
page_height
number
required
Height of the page containing the segment.
content
string
required
Content of the segment. This is either HTML or Markdown, depending on the format chosen in the segment processing configuration.
html
string
required
HTML representation of the segment.
markdown
string
required
Markdown representation of the segment.
text
string
required
Text content of the segment. Calculated from the OCR results.
llm
string
LLM-generated representation of the segment. Only present if LLM processing is configured for this segment type.
image
string
Presigned URL to the cropped image of the segment. Only present if cropping is enabled for this segment type.
confidence
number
Confidence score of the layout analysis model for this segment (0.0 to 1.0).
ocr
OCRResult[]
OCR results for the segment.
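Because each segment carries its bounding box together with the page dimensions, a box can be expressed as page-relative fractions, which is convenient when drawing overlays on a rendered page of a different size. The sketch below assumes that bbox uses left, top, width, and height keys (as in the example response) and that they share units with page_width and page_height. The function name bbox_as_fractions is hypothetical.

```python
from typing import Any


def bbox_as_fractions(segment: dict[str, Any]) -> tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) as fractions of the page size.

    Assumption: bbox values and page_width/page_height are in the same units.
    """
    bbox = segment["bbox"]
    page_w = segment["page_width"]
    page_h = segment["page_height"]
    x0 = bbox["left"] / page_w
    y0 = bbox["top"] / page_h
    x1 = (bbox["left"] + bbox["width"]) / page_w
    y1 = (bbox["top"] + bbox["height"]) / page_h
    return (x0, y0, x1, y1)


# Illustrative values chosen so the fractions are exact.
demo_segment = {
    "bbox": {"left": 50.0, "top": 100.0, "width": 100.0, "height": 200.0},
    "page_width": 200.0,
    "page_height": 400.0,
}
fractions = bbox_as_fractions(demo_segment)
```

Multiplying the fractions by the pixel size of a rendered page image gives the rectangle to draw.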

Example Response

{
  "chunks": [
    {
      "chunk_id": "550e8400-e29b-41d4-a716-446655440000",
      "chunk_length": 256,
      "segments": [
        {
          "segment_id": "660e8400-e29b-41d4-a716-446655440001",
          "segment_type": "Title",
          "bbox": {
            "left": 72.0,
            "top": 100.0,
            "width": 450.0,
            "height": 36.0
          },
          "page_number": 1,
          "page_width": 612.0,
          "page_height": 792.0,
          "content": "# Document Title",
          "html": "<h1>Document Title</h1>",
          "markdown": "# Document Title",
          "text": "Document Title",
          "confidence": 0.95
        }
      ],
      "embed": "# Document Title"
    }
  ],
  "file_name": "example.pdf",
  "page_count": 10,
  "pdf_url": "https://s3.amazonaws.com/..."
}
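A response shaped like the example above can be walked with the standard json module. The snippet is a sketch that uses a trimmed copy of the example (only a few segment fields are kept), and list_segments is a hypothetical helper name.

```python
import json

# Trimmed copy of the example response; values are taken from the example above.
EXAMPLE = """
{
  "chunks": [
    {
      "chunk_id": "550e8400-e29b-41d4-a716-446655440000",
      "chunk_length": 256,
      "segments": [
        {
          "segment_id": "660e8400-e29b-41d4-a716-446655440001",
          "segment_type": "Title",
          "page_number": 1,
          "text": "Document Title",
          "confidence": 0.95
        }
      ],
      "embed": "# Document Title"
    }
  ],
  "file_name": "example.pdf",
  "page_count": 10
}
"""


def list_segments(output: dict) -> list[tuple[str, str, int, str]]:
    """Flatten chunks into (chunk_id, segment_type, page_number, text) rows."""
    rows = []
    for chunk in output["chunks"]:
        for segment in chunk["segments"]:
            rows.append(
                (
                    chunk["chunk_id"],
                    segment["segment_type"],
                    segment["page_number"],  # 1-indexed
                    segment["text"],
                )
            )
    return rows


output = json.loads(EXAMPLE)
rows = list_segments(output)
embed_texts = [chunk["embed"] for chunk in output["chunks"] if "embed" in chunk]
```

The embed lookup is guarded because embed is not marked required on Chunk.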
