## Overview
Chunkr outputs processed documents in a structured JSON format containing chunks, segments, and metadata. Each segment can be formatted as HTML, Markdown, or plain text based on your configuration.
## Output Structure

The output response follows this structure:

```json
{
  "chunks": [
    {
      "chunk_id": "uuid",
      "chunk_length": 245,
      "segments": [...],
      "embed": "Combined embed text from all segments"
    }
  ],
  "file_name": "document.pdf",
  "page_count": 10,
  "pdf_url": "https://presigned-url-to-pdf"
}
```
## Output Response Fields

- `chunks`: Array of chunks, where segments are grouped according to `chunk_processing.target_length`.
- `file_name`: Original filename of the processed document.
- `page_count`: Total number of pages in the document.
- `pdf_url`: Presigned URL to download the processed PDF file.
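As a quick sanity check, the top-level fields can be read straight off the parsed response. A minimal sketch using a hard-coded response in the shape shown above:

```python
import json

# Hypothetical raw task response, hard-coded in the documented shape
response = json.loads("""
{
  "chunks": [
    {"chunk_id": "uuid", "chunk_length": 245, "segments": [], "embed": "..."}
  ],
  "file_name": "document.pdf",
  "page_count": 10,
  "pdf_url": "https://presigned-url-to-pdf"
}
""")

print(response["file_name"])    # document.pdf
print(response["page_count"])   # 10
print(len(response["chunks"]))  # 1
```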
## Chunk Structure

Each chunk contains:

- `chunk_id`: Unique identifier for the chunk (UUID).
- `chunk_length`: Total number of tokens in the chunk, calculated using the configured tokenizer.
- `segments`: Array of segments contained in this chunk. Segments are never split; they always remain intact.
- `embed`: Suggested text for embedding this chunk, generated by combining content from all segments according to the configured `embed_sources`.
### Example Chunk

```json
{
  "chunk_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "chunk_length": 245,
  "segments": [
    {
      "segment_id": "seg-1",
      "segment_type": "Title",
      "content": "# Introduction",
      "text": "Introduction"
    },
    {
      "segment_id": "seg-2",
      "segment_type": "Text",
      "content": "This document discusses...",
      "text": "This document discusses..."
    }
  ],
  "embed": "# Introduction\n\nThis document discusses..."
}
```
## Segment Structure

Each segment contains detailed information about a detected document element:

- `segment_id`: Unique identifier for the segment (UUID).
- `segment_type`: Type of segment: `Title`, `SectionHeader`, `Text`, `ListItem`, `Table`, `Picture`, `Caption`, `Formula`, `Footnote`, `PageHeader`, `PageFooter`, or `Page`.
- `content`: Primary formatted content based on the `format` setting (HTML or Markdown).
- `text`: Raw OCR text content without formatting.
- `html`: HTML formatted content (deprecated; use `content` with `format: "Html"`).
- `markdown`: Markdown formatted content (deprecated; use `content` with `format: "Markdown"`).
- `llm`: LLM-generated content if using `strategy: "LLM"` or custom prompts.
- `image`: Presigned URL to the cropped image if `crop_image` is enabled.
- `bbox`: Bounding box coordinates: `{left, top, width, height}`.
- `page_number`: Page number where the segment appears (1-indexed).
- `page_width`: Width of the page containing this segment.
- `page_height`: Height of the page containing this segment.
- `confidence`: Confidence score from the layout analysis model (0 to 1).
- `ocr`: Array of OCR results with word-level bounding boxes and confidence scores.
### Example Segments

#### Text Segment

```json
{
  "segment_id": "123e4567-e89b-12d3-a456-426614174000",
  "segment_type": "Text",
  "content": "This is a paragraph of text with **bold** formatting.",
  "text": "This is a paragraph of text with bold formatting.",
  "html": "<p>This is a paragraph of text with <strong>bold</strong> formatting.</p>",
  "markdown": "This is a paragraph of text with **bold** formatting.",
  "llm": null,
  "image": null,
  "bbox": {
    "left": 72.0,
    "top": 144.0,
    "width": 468.0,
    "height": 24.0
  },
  "page_number": 1,
  "page_width": 612.0,
  "page_height": 792.0,
  "confidence": 0.98,
  "ocr": [
    {
      "text": "This",
      "bbox": {"left": 72.0, "top": 144.0, "width": 20.0, "height": 12.0},
      "confidence": 0.99
    }
  ]
}
```

#### Table Segment

```json
{
  "segment_id": "table-123",
  "segment_type": "Table",
  "content": "<table><tr><th>Name</th><th>Value</th></tr><tr><td>A</td><td>1</td></tr></table>",
  "text": "Name Value A 1",
  "html": "<table><tr><th>Name</th><th>Value</th></tr><tr><td>A</td><td>1</td></tr></table>",
  "markdown": "| Name | Value |\n| --- | --- |\n| A | 1 |",
  "llm": null,
  "image": "https://presigned-url/table-image.jpg",
  "bbox": {
    "left": 72.0,
    "top": 200.0,
    "width": 300.0,
    "height": 100.0
  },
  "page_number": 2,
  "page_width": 612.0,
  "page_height": 792.0,
  "confidence": 0.95,
  "ocr": [...]
}
```

#### Picture Segment

```json
{
  "segment_id": "pic-456",
  "segment_type": "Picture",
  "content": "",
  "text": "",
  "html": "<img src='https://presigned-url/image.jpg' />",
  "markdown": "",
  "llm": "A bar chart showing quarterly sales data with an upward trend from Q1 to Q4.",
  "image": "https://presigned-url/image.jpg",
  "bbox": {
    "left": 100.0,
    "top": 300.0,
    "width": 400.0,
    "height": 300.0
  },
  "page_number": 3,
  "page_width": 612.0,
  "page_height": 792.0,
  "confidence": 0.97,
  "ocr": null
}
```

#### Formula Segment

```json
{
  "segment_id": "formula-789",
  "segment_type": "Formula",
  "content": "$E = mc^2$",
  "text": "E = mc2",
  "html": "<span class='formula'>$E = mc^2$</span>",
  "markdown": "$E = mc^2$",
  "llm": "$E = mc^2$",
  "image": "https://presigned-url/formula.jpg",
  "bbox": {
    "left": 150.0,
    "top": 400.0,
    "width": 100.0,
    "height": 30.0
  },
  "page_number": 4,
  "page_width": 612.0,
  "page_height": 792.0,
  "confidence": 0.93,
  "ocr": [...]
}
```
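Because every segment carries `bbox`, `page_width`, and `page_height`, spatial post-processing is straightforward. A minimal sketch (the helper name is mine, not part of the API) that normalizes absolute coordinates to page-relative fractions:

```python
def bbox_to_relative(bbox: dict, page_width: float, page_height: float) -> dict:
    """Convert absolute bbox coordinates to page-relative fractions in [0, 1]."""
    return {
        "left": bbox["left"] / page_width,
        "top": bbox["top"] / page_height,
        "width": bbox["width"] / page_width,
        "height": bbox["height"] / page_height,
    }

# Values taken from the Text segment example above (612 x 792 point page)
rel = bbox_to_relative(
    {"left": 72.0, "top": 144.0, "width": 468.0, "height": 24.0}, 612.0, 792.0
)
```

Relative coordinates stay comparable across pages with different sizes, which is useful for layout heuristics such as detecting margins or column positions.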
## Content Format Options

### HTML

When configured with `format: "Html"`:

```json
{
  "segment_processing": {
    "table": {
      "format": "Html"
    }
  }
}
```
Output: the `content` field contains HTML.

- Preserves structure with proper tags
- Best for tables and complex layouts
- Can be rendered directly in browsers
Example:

```html
<table>
  <tr>
    <th>Header 1</th>
    <th>Header 2</th>
  </tr>
  <tr>
    <td>Cell 1</td>
    <td>Cell 2</td>
  </tr>
</table>
```
### Markdown

When configured with `format: "Markdown"`:

```json
{
  "segment_processing": {
    "text": {
      "format": "Markdown"
    }
  }
}
```
Output: the `content` field contains Markdown.

- Human-readable
- Easy to convert to other formats
- Ideal for text-heavy content
Example:

```markdown
# Heading

This is a paragraph with **bold** and *italic* text.

- List item 1
- List item 2

| Column 1 | Column 2 |
| --- | --- |
| Value 1 | Value 2 |
```
### Plain Text

The `text` field always contains plain text extracted via OCR:

```json
{
  "text": "This is plain text without any formatting."
}
```
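Because `text` is always plain, it is the simplest field to aggregate, for example when building a search index. A small sketch over a hard-coded output dict in the documented shape:

```python
# Hypothetical parsed output in the documented shape
output = {
    "chunks": [
        {"segments": [
            {"segment_type": "Title", "text": "Introduction"},
            {"segment_type": "Text", "text": "This document discusses..."},
        ]}
    ]
}

# Concatenate the raw OCR text of every non-empty segment
plain_text = "\n".join(
    segment["text"]
    for chunk in output["chunks"]
    for segment in chunk["segments"]
    if segment["text"]
)
```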
## Accessing Output

### Getting Task Results

```bash
curl -X GET "https://api.chunkr.ai/api/v1/task/{task_id}" \
  -H "Authorization: Bearer YOUR_API_KEY"
```
Query Parameters:

- `include_chunks=true`: include full chunk data (default: `true`)
- `base64_urls=false`: when `true`, return base64-encoded data instead of presigned URLs (default: `false`)
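These query parameters are appended to the request URL. A sketch using `requests` that only builds the request rather than sending it (`task_id` and the API key are placeholders):

```python
import requests

task_id = "your-task-id"  # placeholder
req = requests.Request(
    "GET",
    f"https://api.chunkr.ai/api/v1/task/{task_id}",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params={"include_chunks": "true", "base64_urls": "false"},
).prepare()

# req.url now carries the encoded query string
print(req.url)
```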
```python
from chunkr_ai import Chunkr

chunkr = Chunkr(api_key="YOUR_API_KEY")

# Get task with chunks
task = chunkr.get_task(task_id)

# Access output
for chunk in task.output.chunks:
    print(f"Chunk {chunk.chunk_id}:")
    print(f"  Length: {chunk.chunk_length} tokens")
    print(f"  Embed: {chunk.embed[:100]}...")
    for segment in chunk.segments:
        print(f"  - {segment.segment_type}: {segment.content[:50]}...")
```
### Downloading Files

All file URLs are presigned and expire after a certain time:

```python
import requests

# Download PDF
response = requests.get(task.output.pdf_url)
with open("output.pdf", "wb") as f:
    f.write(response.content)

# Download segment images
for chunk in task.output.chunks:
    for segment in chunk.segments:
        if segment.image:
            img_response = requests.get(segment.image)
            with open(f"{segment.segment_id}.jpg", "wb") as f:
                f.write(img_response.content)
```
## JSON Export

The default output format:

```python
import json

# Save full output
with open("output.json", "w") as f:
    json.dump(task.output.model_dump(), f, indent=2)
```
## Markdown Export

Combine all Markdown segments:

```python
markdown_content = []
for chunk in task.output.chunks:
    for segment in chunk.segments:
        if segment.content:
            markdown_content.append(segment.content)

with open("output.md", "w") as f:
    f.write("\n\n".join(markdown_content))
```
## HTML Export

Combine all HTML segments:

```python
html_template = """
<!DOCTYPE html>
<html>
<head>
    <title>{file_name}</title>
    <style>
        body {{ font-family: Arial, sans-serif; max-width: 800px; margin: 0 auto; }}
        table {{ border-collapse: collapse; width: 100%; }}
        th, td {{ border: 1px solid #ddd; padding: 8px; }}
    </style>
</head>
<body>
{content}
</body>
</html>
"""

html_segments = []
for chunk in task.output.chunks:
    for segment in chunk.segments:
        if segment.html:
            html_segments.append(segment.html)

html_output = html_template.format(
    file_name=task.output.file_name,
    content="\n".join(html_segments),
)

with open("output.html", "w") as f:
    f.write(html_output)
```
## Embedding-Ready Export

Extract just the `embed` fields for vector database ingestion:

```python
import json

embeddings = []
for chunk in task.output.chunks:
    embeddings.append({
        "id": chunk.chunk_id,
        "text": chunk.embed,
        "metadata": {
            "file_name": task.output.file_name,
            "chunk_length": chunk.chunk_length,
            "segment_types": [s.segment_type for s in chunk.segments],
        },
    })

# Ready for vector database ingestion: one JSON object per line
with open("embeddings.jsonl", "w") as f:
    for embedding in embeddings:
        f.write(json.dumps(embedding) + "\n")
```
## Best Practices

- **Choose the right format for your use case**
  - HTML for tables, structured content, and web rendering
  - Markdown for documentation, text processing, and readability
  - Plain text for search indexing and simple analysis
- **Use `embed` fields for RAG**
  - Pre-configured based on `embed_sources`
  - Optimized for vector embeddings
  - Includes only relevant content
- **Handle presigned URLs properly**
  - URLs expire after a set time
  - Download and cache files you need to keep
  - Don't store presigned URLs long-term
- **Process chunks efficiently**
  - Iterate through chunks for large documents
  - Use chunk metadata for filtering
  - Chunk IDs are stable across requests
- **Leverage segment metadata**
  - Use `segment_type` for filtering
  - Check `confidence` for quality control
  - Use `bbox` for spatial analysis
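The `segment_type` and `confidence` filters mentioned above can be sketched in a few lines. The 0.9 threshold and the hard-coded output dict are my assumptions for illustration, not API defaults:

```python
# Hypothetical parsed output in the documented shape
output = {
    "chunks": [
        {"segments": [
            {"segment_type": "Table", "confidence": 0.95, "content": "<table>...</table>"},
            {"segment_type": "Text", "confidence": 0.55, "content": "low-confidence text"},
            {"segment_type": "Text", "confidence": 0.98, "content": "high-confidence text"},
        ]}
    ]
}

MIN_CONFIDENCE = 0.9  # assumed threshold; tune per corpus

# Keep only high-confidence non-table segments
filtered = [
    seg
    for chunk in output["chunks"]
    for seg in chunk["segments"]
    if seg["segment_type"] != "Table" and seg["confidence"] >= MIN_CONFIDENCE
]
```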
## Complete Output Example

```json
{
  "chunks": [
    {
      "chunk_id": "chunk-1",
      "chunk_length": 156,
      "segments": [
        {
          "segment_id": "seg-1",
          "segment_type": "Title",
          "content": "# Document Intelligence API",
          "text": "Document Intelligence API",
          "bbox": {"left": 72, "top": 72, "width": 468, "height": 36},
          "page_number": 1,
          "confidence": 0.99
        },
        {
          "segment_id": "seg-2",
          "segment_type": "Text",
          "content": "Process documents with state-of-the-art AI models.",
          "text": "Process documents with state-of-the-art AI models.",
          "bbox": {"left": 72, "top": 120, "width": 468, "height": 24},
          "page_number": 1,
          "confidence": 0.97
        }
      ],
      "embed": "# Document Intelligence API\n\nProcess documents with state-of-the-art AI models."
    }
  ],
  "file_name": "document.pdf",
  "page_count": 5,
  "pdf_url": "https://storage.example.com/processed/document.pdf?expires=..."
}
```