## Overview
Chunkr outputs processed documents in a structured JSON format containing chunks, segments, and metadata. Each segment can be formatted as HTML, Markdown, or plain text based on your configuration.
## Output Structure

The output response follows this structure:

```json
{
  "chunks": [
    {
      "chunk_id": "uuid",
      "chunk_length": 245,
      "segments": [...],
      "embed": "Combined embed text from all segments"
    }
  ],
  "file_name": "document.pdf",
  "page_count": 10,
  "pdf_url": "https://presigned-url-to-pdf"
}
```
## Output Response Fields

- `chunks`: Array of chunks, where segments are grouped according to `chunk_processing.target_length`.
- `file_name`: Original filename of the processed document.
- `page_count`: Total number of pages in the document.
- `pdf_url`: Presigned URL to download the processed PDF file.
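As a quick sanity check, the top-level fields can be read straight off the parsed response. A minimal sketch using a hard-coded response in the shape shown above:

```python
import json

# Hypothetical raw task response, hard-coded in the documented shape
response = json.loads("""
{
  "chunks": [
    {"chunk_id": "uuid", "chunk_length": 245, "segments": [], "embed": "..."}
  ],
  "file_name": "document.pdf",
  "page_count": 10,
  "pdf_url": "https://presigned-url-to-pdf"
}
""")

print(response["file_name"])    # document.pdf
print(response["page_count"])   # 10
print(len(response["chunks"]))  # 1
```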
## Chunk Structure

Each chunk contains:

- `chunk_id`: Unique identifier for the chunk (UUID).
- `chunk_length`: Total number of tokens in the chunk, calculated using the configured tokenizer.
- `segments`: Array of segments contained in this chunk. Segments are never split; they always remain intact.
- `embed`: Suggested text for embedding this chunk, generated by combining content from all segments according to the configured `embed_sources`.
### Example Chunk

```json
{
  "chunk_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "chunk_length": 245,
  "segments": [
    {
      "segment_id": "seg-1",
      "segment_type": "Title",
      "content": "# Introduction",
      "text": "Introduction"
    },
    {
      "segment_id": "seg-2",
      "segment_type": "Text",
      "content": "This document discusses...",
      "text": "This document discusses..."
    }
  ],
  "embed": "# Introduction\n\nThis document discusses..."
}
```
## Segment Structure

Each segment contains detailed information about a detected document element:

- `segment_id`: Unique identifier for the segment (UUID).
- `segment_type`: Type of segment: `Title`, `SectionHeader`, `Text`, `ListItem`, `Table`, `Picture`, `Caption`, `Formula`, `Footnote`, `PageHeader`, `PageFooter`, or `Page`.
- `content`: Primary formatted content based on the `format` setting (HTML or Markdown).
- `text`: Raw OCR text content without formatting.
- `html`: HTML formatted content (deprecated; use `content` with `format: "Html"`).
- `markdown`: Markdown formatted content (deprecated; use `content` with `format: "Markdown"`).
- `llm`: LLM-generated content if using `strategy: "LLM"` or custom prompts.
- `image`: Presigned URL to the cropped image if `crop_image` is enabled.
- `bbox`: Bounding box coordinates: `{left, top, width, height}`.
- `page_number`: Page number where the segment appears (1-indexed).
- `page_width`: Width of the page containing this segment.
- `page_height`: Height of the page containing this segment.
- `confidence`: Confidence score from the layout analysis model (0 to 1).
- `ocr`: Array of OCR results with word-level bounding boxes and confidence scores.
### Example Segments

#### Text Segment

```json
{
  "segment_id": "123e4567-e89b-12d3-a456-426614174000",
  "segment_type": "Text",
  "content": "This is a paragraph of text with **bold** formatting.",
  "text": "This is a paragraph of text with bold formatting.",
  "html": "<p>This is a paragraph of text with <strong>bold</strong> formatting.</p>",
  "markdown": "This is a paragraph of text with **bold** formatting.",
  "llm": null,
  "image": null,
  "bbox": {
    "left": 72.0,
    "top": 144.0,
    "width": 468.0,
    "height": 24.0
  },
  "page_number": 1,
  "page_width": 612.0,
  "page_height": 792.0,
  "confidence": 0.98,
  "ocr": [
    {
      "text": "This",
      "bbox": {"left": 72.0, "top": 144.0, "width": 20.0, "height": 12.0},
      "confidence": 0.99
    }
  ]
}
```

#### Table Segment

```json
{
  "segment_id": "table-123",
  "segment_type": "Table",
  "content": "<table><tr><th>Name</th><th>Value</th></tr><tr><td>A</td><td>1</td></tr></table>",
  "text": "Name Value A 1",
  "html": "<table><tr><th>Name</th><th>Value</th></tr><tr><td>A</td><td>1</td></tr></table>",
  "markdown": "| Name | Value |\n| --- | --- |\n| A | 1 |",
  "llm": null,
  "image": "https://presigned-url/table-image.jpg",
  "bbox": {
    "left": 72.0,
    "top": 200.0,
    "width": 300.0,
    "height": 100.0
  },
  "page_number": 2,
  "page_width": 612.0,
  "page_height": 792.0,
  "confidence": 0.95,
  "ocr": [...]
}
```

#### Picture Segment

```json
{
  "segment_id": "pic-456",
  "segment_type": "Picture",
  "content": "",
  "text": "",
  "html": "<img src='https://presigned-url/image.jpg' />",
  "markdown": "",
  "llm": "A bar chart showing quarterly sales data with an upward trend from Q1 to Q4.",
  "image": "https://presigned-url/image.jpg",
  "bbox": {
    "left": 100.0,
    "top": 300.0,
    "width": 400.0,
    "height": 300.0
  },
  "page_number": 3,
  "page_width": 612.0,
  "page_height": 792.0,
  "confidence": 0.97,
  "ocr": null
}
```

#### Formula Segment

```json
{
  "segment_id": "formula-789",
  "segment_type": "Formula",
  "content": "$E = mc^2$",
  "text": "E = mc2",
  "html": "<span class='formula'>$E = mc^2$</span>",
  "markdown": "$E = mc^2$",
  "llm": "$E = mc^2$",
  "image": "https://presigned-url/formula.jpg",
  "bbox": {
    "left": 150.0,
    "top": 400.0,
    "width": 100.0,
    "height": 30.0
  },
  "page_number": 4,
  "page_width": 612.0,
  "page_height": 792.0,
  "confidence": 0.93,
  "ocr": [...]
}
```
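Because every segment carries `bbox`, `page_width`, and `page_height`, spatial post-processing is straightforward. A minimal sketch (the helper name is mine, not part of the API) that normalizes absolute coordinates to page-relative fractions:

```python
def bbox_to_relative(bbox: dict, page_width: float, page_height: float) -> dict:
    """Convert absolute bbox coordinates to page-relative fractions in [0, 1]."""
    return {
        "left": bbox["left"] / page_width,
        "top": bbox["top"] / page_height,
        "width": bbox["width"] / page_width,
        "height": bbox["height"] / page_height,
    }

# Values taken from the Text segment example above (612 x 792 point page)
rel = bbox_to_relative(
    {"left": 72.0, "top": 144.0, "width": 468.0, "height": 24.0}, 612.0, 792.0
)
```

Relative coordinates stay comparable across pages with different sizes, which is useful for layout heuristics such as detecting margins or column positions.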
## Content Format Options

### HTML

When configured with `format: "Html"`:

```json
{
  "segment_processing": {
    "table": {
      "format": "Html"
    }
  }
}
```
Output: the `content` field contains HTML.

- Preserves structure with proper tags
- Best for tables and complex layouts
- Can be rendered directly in browsers
Example:

```html
<table>
  <tr>
    <th>Header 1</th>
    <th>Header 2</th>
  </tr>
  <tr>
    <td>Cell 1</td>
    <td>Cell 2</td>
  </tr>
</table>
```
### Markdown

When configured with `format: "Markdown"`:

```json
{
  "segment_processing": {
    "text": {
      "format": "Markdown"
    }
  }
}
```
Output: the `content` field contains Markdown.

- Human-readable
- Easy to convert to other formats
- Ideal for text-heavy content
Example:

```markdown
# Heading

This is a paragraph with **bold** and *italic* text.

- List item 1
- List item 2

| Column 1 | Column 2 |
| --- | --- |
| Value 1 | Value 2 |
```
### Plain Text

The `text` field always contains plain text extracted via OCR:

```json
{
  "text": "This is plain text without any formatting."
}
```
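Because `text` is always plain, it is the simplest field to aggregate, for example when building a search index. A small sketch over a hard-coded output dict in the documented shape:

```python
# Hypothetical parsed output in the documented shape
output = {
    "chunks": [
        {"segments": [
            {"segment_type": "Title", "text": "Introduction"},
            {"segment_type": "Text", "text": "This document discusses..."},
        ]}
    ]
}

# Concatenate the raw OCR text of every non-empty segment
plain_text = "\n".join(
    segment["text"]
    for chunk in output["chunks"]
    for segment in chunk["segments"]
    if segment["text"]
)
```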
## Accessing Output

### Getting Task Results

```bash
curl -X GET "https://api.chunkr.ai/api/v1/task/{task_id}" \
  -H "Authorization: Bearer YOUR_API_KEY"
```
Query Parameters:

- `include_chunks=true`: include full chunk data (default: `true`)
- `base64_urls=false`: when `true`, return base64-encoded data instead of presigned URLs (default: `false`)
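These query parameters are appended to the request URL. A sketch using `requests` that only builds the request rather than sending it (`task_id` and the API key are placeholders):

```python
import requests

task_id = "your-task-id"  # placeholder
req = requests.Request(
    "GET",
    f"https://api.chunkr.ai/api/v1/task/{task_id}",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params={"include_chunks": "true", "base64_urls": "false"},
).prepare()

# req.url now carries the encoded query string
print(req.url)
```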
```python
from chunkr_ai import Chunkr

chunkr = Chunkr(api_key="YOUR_API_KEY")

# Get task with chunks
task = chunkr.get_task(task_id)

# Access output
for chunk in task.output.chunks:
    print(f"Chunk {chunk.chunk_id}:")
    print(f"  Length: {chunk.chunk_length} tokens")
    print(f"  Embed: {chunk.embed[:100]}...")
    for segment in chunk.segments:
        print(f"  - {segment.segment_type}: {segment.content[:50]}...")
```
### Downloading Files

All file URLs are presigned and expire after a certain time:

```python
import requests

# Download PDF
response = requests.get(task.output.pdf_url)
with open("output.pdf", "wb") as f:
    f.write(response.content)

# Download segment images
for chunk in task.output.chunks:
    for segment in chunk.segments:
        if segment.image:
            img_response = requests.get(segment.image)
            with open(f"{segment.segment_id}.jpg", "wb") as f:
                f.write(img_response.content)
```
## JSON Export

The default output format:

```python
import json

# Save full output
with open("output.json", "w") as f:
    json.dump(task.output.model_dump(), f, indent=2)
```
## Markdown Export

Combine all Markdown segments:

```python
markdown_content = []
for chunk in task.output.chunks:
    for segment in chunk.segments:
        if segment.content:
            markdown_content.append(segment.content)

with open("output.md", "w") as f:
    f.write("\n\n".join(markdown_content))
```
## HTML Export

Combine all HTML segments:

```python
html_template = """
<!DOCTYPE html>
<html>
<head>
    <title>{file_name}</title>
    <style>
        body {{ font-family: Arial, sans-serif; max-width: 800px; margin: 0 auto; }}
        table {{ border-collapse: collapse; width: 100%; }}
        th, td {{ border: 1px solid #ddd; padding: 8px; }}
    </style>
</head>
<body>
{content}
</body>
</html>
"""

html_segments = []
for chunk in task.output.chunks:
    for segment in chunk.segments:
        if segment.html:
            html_segments.append(segment.html)

html_output = html_template.format(
    file_name=task.output.file_name,
    content="\n".join(html_segments),
)

with open("output.html", "w") as f:
    f.write(html_output)
```
## Embedding-Ready Export

Extract just the `embed` fields for vector database ingestion:

```python
import json

embeddings = []
for chunk in task.output.chunks:
    embeddings.append({
        "id": chunk.chunk_id,
        "text": chunk.embed,
        "metadata": {
            "file_name": task.output.file_name,
            "chunk_length": chunk.chunk_length,
            "segment_types": [s.segment_type for s in chunk.segments],
        },
    })

# Ready for vector database ingestion: one JSON object per line
with open("embeddings.jsonl", "w") as f:
    for embedding in embeddings:
        f.write(json.dumps(embedding) + "\n")
```
## Best Practices

- **Choose the right format for your use case**
  - HTML for tables, structured content, and web rendering
  - Markdown for documentation, text processing, and readability
  - Plain text for search indexing and simple analysis
- **Use `embed` fields for RAG**
  - Pre-configured based on `embed_sources`
  - Optimized for vector embeddings
  - Includes only relevant content
- **Handle presigned URLs properly**
  - URLs expire after a set time
  - Download and cache files you need to keep
  - Don't store presigned URLs long-term
- **Process chunks efficiently**
  - Iterate through chunks for large documents
  - Use chunk metadata for filtering
  - Chunk IDs are stable across requests
- **Leverage segment metadata**
  - Use `segment_type` for filtering
  - Check `confidence` for quality control
  - Use `bbox` for spatial analysis
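The `segment_type` and `confidence` filters mentioned above can be sketched in a few lines. The 0.9 threshold and the hard-coded output dict are my assumptions for illustration, not API defaults:

```python
# Hypothetical parsed output in the documented shape
output = {
    "chunks": [
        {"segments": [
            {"segment_type": "Table", "confidence": 0.95, "content": "<table>...</table>"},
            {"segment_type": "Text", "confidence": 0.55, "content": "low-confidence text"},
            {"segment_type": "Text", "confidence": 0.98, "content": "high-confidence text"},
        ]}
    ]
}

MIN_CONFIDENCE = 0.9  # assumed threshold; tune per corpus

# Keep only high-confidence non-table segments
filtered = [
    seg
    for chunk in output["chunks"]
    for seg in chunk["segments"]
    if seg["segment_type"] != "Table" and seg["confidence"] >= MIN_CONFIDENCE
]
```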
## Complete Output Example

```json
{
  "chunks": [
    {
      "chunk_id": "chunk-1",
      "chunk_length": 156,
      "segments": [
        {
          "segment_id": "seg-1",
          "segment_type": "Title",
          "content": "# Document Intelligence API",
          "text": "Document Intelligence API",
          "bbox": {"left": 72, "top": 72, "width": 468, "height": 36},
          "page_number": 1,
          "confidence": 0.99
        },
        {
          "segment_id": "seg-2",
          "segment_type": "Text",
          "content": "Process documents with state-of-the-art AI models.",
          "text": "Process documents with state-of-the-art AI models.",
          "bbox": {"left": 72, "top": 120, "width": 468, "height": 24},
          "page_number": 1,
          "confidence": 0.97
        }
      ],
      "embed": "# Document Intelligence API\n\nProcess documents with state-of-the-art AI models."
    }
  ],
  "file_name": "document.pdf",
  "page_count": 5,
  "pdf_url": "https://storage.example.com/processed/document.pdf?expires=..."
}
```