

Overview

Chunkr outputs processed documents in a structured JSON format containing chunks, segments, and metadata. Each segment can be formatted as HTML, Markdown, or plain text based on your configuration.

Output Structure

The output response follows this structure:
{
  "chunks": [
    {
      "chunk_id": "uuid",
      "chunk_length": 245,
      "segments": [...],
      "embed": "Combined embed text from all segments"
    }
  ],
  "file_name": "document.pdf",
  "page_count": 10,
  "pdf_url": "https://presigned-url-to-pdf"
}

Output Response Fields

chunks
array
Array of chunks, where segments are grouped according to chunk_processing.target_length.
file_name
string
Original filename of the processed document.
page_count
integer
Total number of pages in the document.
pdf_url
string
Presigned URL to download the processed PDF file.
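The fields above can be read directly from the parsed JSON response. A minimal sketch, using an illustrative sample that mirrors the structure shown earlier (not real API output):

```python
import json

# Illustrative sample mirroring the output structure above (not real API output)
raw = '''
{
  "chunks": [
    {"chunk_id": "a1b2c3d4", "chunk_length": 245, "segments": [], "embed": "..."}
  ],
  "file_name": "document.pdf",
  "page_count": 10,
  "pdf_url": "https://presigned-url-to-pdf"
}
'''

output = json.loads(raw)
print(output["file_name"])    # original filename
print(output["page_count"])   # total pages
print(len(output["chunks"]))  # number of chunks
```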

Chunk Structure

Each chunk contains:
chunk_id
string
Unique identifier for the chunk (UUID).
chunk_length
integer
Total number of tokens in the chunk, calculated using the configured tokenizer.
segments
array
Array of segments contained in this chunk. Segments are never split across chunks; each segment always remains intact.
embed
string
Suggested text for embedding this chunk. Generated by combining content from all segments according to the configured embed_sources.

Example Chunk

{
  "chunk_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "chunk_length": 245,
  "segments": [
    {
      "segment_id": "seg-1",
      "segment_type": "Title",
      "content": "# Introduction",
      "text": "Introduction"
    },
    {
      "segment_id": "seg-2",
      "segment_type": "Text",
      "content": "This document discusses...",
      "text": "This document discusses..."
    }
  ],
  "embed": "# Introduction\n\nThis document discusses..."
}

Segment Structure

Each segment contains detailed information about a detected document element:
segment_id
string
Unique identifier for the segment (UUID).
segment_type
enum
Type of segment: Title, SectionHeader, Text, ListItem, Table, Picture, Caption, Formula, Footnote, PageHeader, PageFooter, or Page.
content
string
Primary formatted content based on the format setting (HTML or Markdown).
text
string
Raw OCR text content without formatting.
html
string
HTML formatted content (deprecated - use content with format: "Html").
markdown
string
Markdown formatted content (deprecated - use content with format: "Markdown").
llm
string
LLM-generated content if using strategy: "LLM" or custom prompts.
image
string
Presigned URL to the cropped image if crop_image is enabled.
bbox
object
Bounding box coordinates: {left, top, width, height}.
page_number
integer
Page number where the segment appears (1-indexed).
page_width
float
Width of the page containing this segment.
page_height
float
Height of the page containing this segment.
confidence
float
Confidence score from the layout analysis model (0-1).
ocr
array
Array of OCR results with word-level bounding boxes and confidence scores.

Example Segment

{
  "segment_id": "123e4567-e89b-12d3-a456-426614174000",
  "segment_type": "Text",
  "content": "This is a paragraph of text with **bold** formatting.",
  "text": "This is a paragraph of text with bold formatting.",
  "html": "<p>This is a paragraph of text with <strong>bold</strong> formatting.</p>",
  "markdown": "This is a paragraph of text with **bold** formatting.",
  "llm": null,
  "image": null,
  "bbox": {
    "left": 72.0,
    "top": 144.0,
    "width": 468.0,
    "height": 24.0
  },
  "page_number": 1,
  "page_width": 612.0,
  "page_height": 792.0,
  "confidence": 0.98,
  "ocr": [
    {
      "text": "This",
      "bbox": {"left": 72.0, "top": 144.0, "width": 20.0, "height": 12.0},
      "confidence": 0.99
    }
  ]
}

Content Format Options

HTML Format

When configured with format: "Html":
{
  "segment_processing": {
    "table": {
      "format": "Html"
    }
  }
}
Output:
  • content field contains HTML
  • Preserves structure with proper tags
  • Best for tables and complex layouts
  • Can be rendered directly in browsers
Example:
<table>
  <tr>
    <th>Header 1</th>
    <th>Header 2</th>
  </tr>
  <tr>
    <td>Cell 1</td>
    <td>Cell 2</td>
  </tr>
</table>

Markdown Format

When configured with format: "Markdown":
{
  "segment_processing": {
    "text": {
      "format": "Markdown"
    }
  }
}
Output:
  • content field contains Markdown
  • Human-readable
  • Easy to convert to other formats
  • Ideal for text-heavy content
Example:
# Heading

This is a paragraph with **bold** and *italic* text.

- List item 1
- List item 2

| Column 1 | Column 2 |
| --- | --- |
| Value 1 | Value 2 |

Plain Text

The text field always contains plain text extracted via OCR:
{
  "text": "This is plain text without any formatting."
}
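Because the text field is always plain, it is a convenient source for search indexing. A sketch of joining the text fields across all segments (the output dict below is illustrative data, not real API output):

```python
# Extract raw OCR text for search indexing; `output` stands in for a
# parsed task response (illustrative data, not real API output).
output = {
    "chunks": [
        {"segments": [
            {"text": "Introduction"},
            {"text": "This document discusses..."},
        ]}
    ]
}

# Join the plain-text content of every segment, skipping empty ones
plain_text = "\n".join(
    segment["text"]
    for chunk in output["chunks"]
    for segment in chunk["segments"]
    if segment.get("text")
)
```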

Accessing Output

Getting Task Results

curl -X GET "https://api.chunkr.ai/api/v1/task/{task_id}" \
  -H "Authorization: Bearer YOUR_API_KEY"
Query Parameters:
  • include_chunks=true - Include full chunk data in the response (default: true)
  • base64_urls=true - Return base64-encoded content instead of presigned URLs (default: false)
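The same request can be built in Python. A sketch using only the standard library to construct the URL and headers (the task ID below is a hypothetical placeholder; send the request with any HTTP client, e.g. requests):

```python
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"  # placeholder, as in the curl example
task_id = "abc123"        # hypothetical task ID

# Build the task-retrieval URL with the query parameters documented above
params = {"include_chunks": "true", "base64_urls": "false"}
url = f"https://api.chunkr.ai/api/v1/task/{task_id}?{urlencode(params)}"
headers = {"Authorization": f"Bearer {API_KEY}"}
# e.g. requests.get(url, headers=headers).json()
```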

Downloading Files

All file URLs are presigned and expire after a limited time, so download any files you need to keep:
import requests

# Download PDF
response = requests.get(task.output.pdf_url)
response.raise_for_status()
with open("output.pdf", "wb") as f:
    f.write(response.content)

# Download segment image
for chunk in task.output.chunks:
    for segment in chunk.segments:
        if segment.image:
            img_response = requests.get(segment.image)
            with open(f"{segment.segment_id}.jpg", "wb") as f:
                f.write(img_response.content)

Export Formats

JSON Export

The default output format:
import json

# Save full output
with open("output.json", "w") as f:
    json.dump(task.output.model_dump(), f, indent=2)

Markdown Export

Combine all Markdown segments:
markdown_content = []

for chunk in task.output.chunks:
    for segment in chunk.segments:
        if segment.content:
            markdown_content.append(segment.content)

with open("output.md", "w") as f:
    f.write("\n\n".join(markdown_content))

HTML Export

Combine all HTML segments:
html_template = """
<!DOCTYPE html>
<html>
<head>
    <title>{file_name}</title>
    <style>
        body {{ font-family: Arial, sans-serif; max-width: 800px; margin: 0 auto; }}
        table {{ border-collapse: collapse; width: 100%; }}
        th, td {{ border: 1px solid #ddd; padding: 8px; }}
    </style>
</head>
<body>
    {content}
</body>
</html>
"""

html_segments = []
for chunk in task.output.chunks:
    for segment in chunk.segments:
        if segment.html:  # deprecated field; prefer `content` with format: "Html"
            html_segments.append(segment.html)

html_output = html_template.format(
    file_name=task.output.file_name,
    content="\n".join(html_segments)
)

with open("output.html", "w") as f:
    f.write(html_output)

Embedding-Ready Export

Extract just the embed fields for vector database ingestion:
embeddings = []

for chunk in task.output.chunks:
    embeddings.append({
        "id": chunk.chunk_id,
        "text": chunk.embed,
        "metadata": {
            "file_name": task.output.file_name,
            "chunk_length": chunk.chunk_length,
            "segment_types": [s.segment_type for s in chunk.segments]
        }
    })

# Ready for vector database
import json
with open("embeddings.jsonl", "w") as f:
    for embedding in embeddings:
        f.write(json.dumps(embedding) + "\n")

Best Practices

  1. Choose the right format for your use case
    • HTML for tables, structured content, and web rendering
    • Markdown for documentation, text processing, and readability
    • Plain text for search indexing and simple analysis
  2. Use embed fields for RAG
    • Pre-configured based on embed_sources
    • Optimized for vector embeddings
    • Includes only relevant content
  3. Handle presigned URLs properly
    • URLs expire after a set time
    • Download and cache files you need to keep
    • Don’t store presigned URLs long-term
  4. Process chunks efficiently
    • Iterate through chunks for large documents
    • Use chunk metadata for filtering
    • Chunk IDs are stable across requests
  5. Leverage segment metadata
    • Use segment_type for filtering
    • Check confidence for quality control
    • Use bbox for spatial analysis
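The filtering practices above can be sketched together: drop page furniture by segment_type and low-confidence detections before downstream processing. The thresholds and the output dict below are illustrative assumptions, not values from the API:

```python
# Filter segments by type and confidence; `output` is an illustrative
# parsed response (not real API data), and the thresholds are examples.
output = {
    "chunks": [
        {"segments": [
            {"segment_type": "Title", "confidence": 0.99, "text": "Intro"},
            {"segment_type": "PageFooter", "confidence": 0.60, "text": "p. 1"},
            {"segment_type": "Text", "confidence": 0.97, "text": "Body"},
        ]}
    ]
}

MIN_CONFIDENCE = 0.9                       # example quality threshold
SKIP_TYPES = {"PageHeader", "PageFooter"}  # page furniture to drop

kept = [
    seg
    for chunk in output["chunks"]
    for seg in chunk["segments"]
    if seg["segment_type"] not in SKIP_TYPES
    and seg["confidence"] >= MIN_CONFIDENCE
]
```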

Complete Output Example

{
  "chunks": [
    {
      "chunk_id": "chunk-1",
      "chunk_length": 156,
      "segments": [
        {
          "segment_id": "seg-1",
          "segment_type": "Title",
          "content": "# Document Intelligence API",
          "text": "Document Intelligence API",
          "bbox": {"left": 72, "top": 72, "width": 468, "height": 36},
          "page_number": 1,
          "confidence": 0.99
        },
        {
          "segment_id": "seg-2",
          "segment_type": "Text",
          "content": "Process documents with state-of-the-art AI models.",
          "text": "Process documents with state-of-the-art AI models.",
          "bbox": {"left": 72, "top": 120, "width": 468, "height": 24},
          "page_number": 1,
          "confidence": 0.97
        }
      ],
      "embed": "# Document Intelligence API\n\nProcess documents with state-of-the-art AI models."
    }
  ],
  "file_name": "document.pdf",
  "page_count": 5,
  "pdf_url": "https://storage.example.com/processed/document.pdf?expires=..."
}
