Overview

Docling’s document processing pipeline consists of multiple stages, each using specialized models and inference engines. This catalog provides:
  • Processing stages and their purposes
  • Model families and specific models
  • Inference engine compatibility
  • Usage examples and configuration
Source: ~/workspace/source/docs/usage/model_catalog.md:1

Processing Stages

Docling pipelines are composed of these processing stages:

  • Layout: Document structure detection
  • OCR: Optical character recognition
  • Table Structure: Table cell recognition
  • Picture Classifier: Image type classification
  • VLM Convert: Full page conversion with VLMs
  • Picture Description: Image captioning
  • Code & Formula: Code/math extraction

Layout Detection

Overview

Source: ~/workspace/source/docs/usage/model_catalog.md:26

Detects document elements (paragraphs, tables, figures, headers, etc.) using RT-DETR-based object detection.

Model Family: Object Detection (RT-DETR based)
Inference Engine: docling-ibm-models
Supported Devices: CPU, CUDA, MPS, XPU

Available Models

Source: ~/workspace/source/docs/usage/model_catalog.md:30
| Model | Status | Description |
|-------|--------|-------------|
| docling-layout-heron | ⭐ Default | Recommended for most use cases |
| docling-layout-heron-101 | - | Enhanced variant of Heron |
| docling-layout-egret-medium | - | Medium-sized Egret model |
| docling-layout-egret-large | - | Larger Egret model |
| docling-layout-egret-xlarge | - | Extra-large Egret model |
| docling-layout-v2 | Legacy | Previous generation model |

Usage

Source: ~/workspace/source/docs/usage/model_catalog.md:252
```python
from docling.datamodel.pipeline_options import LayoutOptions
from docling.datamodel.layout_model_specs import DOCLING_LAYOUT_HERON

# Use Heron layout model (default)
layout_options = LayoutOptions(
    model_spec=DOCLING_LAYOUT_HERON
)
```

Output

Bounding boxes with element labels:
  • TEXT - Body text paragraphs
  • SECTION_HEADER - Section headings
  • TABLE - Tables
  • PICTURE - Images and figures
  • LIST_ITEM - List items
  • FORMULA - Mathematical formulas
  • PAGE_HEADER / PAGE_FOOTER - Headers/footers
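
As a rough illustration of how output of this shape might be consumed, the sketch below groups detected elements by label. `LayoutElement` and `group_by_label` are hypothetical names invented for this example, not part of Docling's API, and the bounding-box values are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutElement:
    """Hypothetical stand-in for one detected element (not a Docling class)."""

    label: str
    bbox: tuple  # (left, top, right, bottom); coordinates are illustrative


def group_by_label(elements):
    """Collect bounding boxes under their element label."""
    grouped = defaultdict(list)
    for element in elements:
        grouped[element.label].append(element.bbox)
    return dict(grouped)


# Made-up detections for a single page.
detections = [
    LayoutElement("SECTION_HEADER", (50.0, 40.0, 400.0, 70.0)),
    LayoutElement("TEXT", (50.0, 80.0, 550.0, 200.0)),
    LayoutElement("TABLE", (50.0, 220.0, 550.0, 400.0)),
    LayoutElement("TEXT", (50.0, 420.0, 550.0, 500.0)),
]
by_label = group_by_label(detections)
```

Grouping by label like this makes it easy to hand, say, only the `TABLE` regions to a later stage.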

OCR (Optical Character Recognition)

Overview

Source: ~/workspace/source/docs/usage/model_catalog.md:51

Extracts text from images and scanned documents using various OCR engines.

Model Family: Multiple OCR Engines
Inference Engines: Engine-specific
Supported Devices: Varies by engine

Available Engines

Source: ~/workspace/source/docs/usage/model_catalog.md:206
| OCR Engine | Backend | Languages | GPU Support | Notes |
|------------|---------|-----------|-------------|-------|
| Auto | Automatic | Varies | Varies | Automatically selects best available |
| Tesseract | CLI or Python | 100+ | No | Most widely used, good accuracy |
| EasyOCR | PyTorch | 80+ | Yes | GPU-accelerated, good for Asian languages |
| RapidOCR | ONNX/OpenVINO/Paddle | Multiple | Yes (torch) | Fast, multiple backend options |
| macOS Vision | Native macOS | 20+ | Yes | macOS only, excellent quality |
| SuryaOCR | PyTorch | 90+ | Yes | Modern, good for complex layouts |
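
The table can be read as a simple filter over two properties: GPU support and platform restriction. The sketch below encodes just those two columns and narrows the engine list; `OCR_ENGINES` and `candidate_ocr_engines` are hypothetical helpers for illustration, not Docling functions. The `Auto` entry is left out because its row says "Varies".

```python
# Two columns from the table above, encoded by hand for illustration.
OCR_ENGINES = {
    "Tesseract": {"gpu": False, "macos_only": False},
    "EasyOCR": {"gpu": True, "macos_only": False},
    "RapidOCR": {"gpu": True, "macos_only": False},  # GPU via the torch backend
    "macOS Vision": {"gpu": True, "macos_only": True},
    "SuryaOCR": {"gpu": True, "macos_only": False},
}


def candidate_ocr_engines(on_macos, needs_gpu):
    """Return engine names compatible with the platform and GPU requirement."""
    candidates = []
    for name, traits in OCR_ENGINES.items():
        if traits["macos_only"] and not on_macos:
            continue  # engine is unavailable off macOS
        if needs_gpu and not traits["gpu"]:
            continue  # engine cannot use a GPU
        candidates.append(name)
    return candidates


linux_gpu_engines = candidate_ocr_engines(on_macos=False, needs_gpu=True)
```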

Usage

Source: ~/workspace/source/docs/usage/model_catalog.md:286
```python
from docling.datamodel.pipeline_options import (
    TesseractOcrOptions,
    RapidOcrOptions
)

# Tesseract with multiple languages
tesseract_options = TesseractOcrOptions(
    lang=["eng", "deu"]  # English and German
)

# RapidOCR with GPU acceleration
rapidocr_options = RapidOcrOptions(
    backend="torch",  # GPU-accelerated
    lang=["en"]
)
```

Table Structure Recognition

TableFormer Models

Source: ~/workspace/source/docs/usage/model_catalog.md:70

Recognizes table structure (rows, columns, cells) and relationships.

Model Family: TableFormer
Inference Engine: docling-ibm-models
Supported Devices: CPU, CUDA, XPU (MPS currently disabled)

Available Modes

Source: ~/workspace/source/docs/usage/model_catalog.md:74
| Mode | Status | Speed | Accuracy |
|------|--------|-------|----------|
| Accurate | ⭐ Default | Slower | Higher quality |
| Fast | - | Faster | Good quality |

Usage

Source: ~/workspace/source/docs/usage/model_catalog.md:263
```python
from docling.datamodel.pipeline_options import (
    TableStructureOptions,
    TableFormerMode
)

# Use accurate mode for best quality
table_options = TableStructureOptions(
    mode=TableFormerMode.ACCURATE,
    do_cell_matching=True  # Align cells with content
)
```

Object Detection (WIP)

Source: ~/workspace/source/docs/usage/model_catalog.md:86

Alternative approach for table structure recognition using object detection.

Object detection-based table structure is work in progress.

Picture Classification

Overview

Source: ~/workspace/source/docs/usage/model_catalog.md:101

Classifies pictures into semantic categories (charts, diagrams, logos, etc.).

Model Family: Image Classifier (Vision Transformer)
Inference Engine: Transformers (ViT)
Supported Devices: CPU, CUDA, MPS, XPU

Available Models

Source: ~/workspace/source/docs/usage/model_catalog.md:104
| Model | Status | Description |
|-------|--------|-------------|
| DocumentFigureClassifier-v2.0 | ⭐ Default | Specialized for document imagery |

Model Card: ds4sd/DocumentFigureClassifier

Supported Classes

  • Chart types (bar, line, pie, scatter)
  • Diagrams and flowcharts
  • Natural images
  • Logos and branding
  • Signatures
  • Technical illustrations

Usage

Source: ~/workspace/source/docs/usage/model_catalog.md:275
```python
from docling.models.stages.picture_classifier.document_picture_classifier import (
    DocumentPictureClassifierOptions
)

# Use default picture classifier
classifier_options = DocumentPictureClassifierOptions.from_preset(
    "document_figure_classifier_v2"
)
```

VLM Convert (Full Page)

Overview

Source: ~/workspace/source/docs/usage/model_catalog.md:116

Converts entire document pages to structured formats using vision-language models.

Model Family: Vision-Language Models
Output Formats: DocTags (structured), Markdown (human-readable)
Inference Engines: Transformers, MLX, API (Ollama, LM Studio, OpenAI), vLLM, AUTO_INLINE

Available Models

Source: ~/workspace/source/docs/usage/model_catalog.md:220
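| Preset ID | Model | Size | Transformers | MLX | API | vLLM | Output |
|-----------|-------|------|--------------|-----|-----|------|--------|
| granite_docling | Granite-Docling-258M | 258M | | | Ollama | | DocTags |
| smoldocling | SmolDocling-256M | 256M | | | | | DocTags |
| deepseek_ocr | DeepSeek-OCR-3B | 3B | | | Ollama, LM Studio | | Markdown |
| granite_vision | Granite-Vision-3.3-2B | 2B | | | Ollama, LM Studio | | Markdown |
| pixtral | Pixtral-12B | 12B | | | | | Markdown |
| got_ocr | GOT-OCR-2.0 | - | | | | | Markdown |
| phi4 | Phi-4-Multimodal | - | | | | | Markdown |
| qwen | Qwen2.5-VL-3B | 3B | | | | | Markdown |
| gemma_12b | Gemma-3-12B | 12B | | | | | Markdown |
| gemma_27b | Gemma-3-27B | 27B | | | | | Markdown |
| dolphin | Dolphin | - | | | | | Markdown |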
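
One practical use of this table is choosing a preset by the output format a downstream step expects. The sketch below copies only the Preset ID and Output columns into a plain dictionary; `VLM_PRESET_OUTPUT` and `presets_for_output` are hypothetical helpers for illustration, not something Docling exposes.

```python
# Preset ID -> output format, copied from the table above.
VLM_PRESET_OUTPUT = {
    "granite_docling": "DocTags",
    "smoldocling": "DocTags",
    "deepseek_ocr": "Markdown",
    "granite_vision": "Markdown",
    "pixtral": "Markdown",
    "got_ocr": "Markdown",
    "phi4": "Markdown",
    "qwen": "Markdown",
    "gemma_12b": "Markdown",
    "gemma_27b": "Markdown",
    "dolphin": "Markdown",
}


def presets_for_output(output_format):
    """List the preset IDs whose output format matches, in table order."""
    return [
        preset
        for preset, fmt in VLM_PRESET_OUTPUT.items()
        if fmt == output_format
    ]


doctags_presets = presets_for_output("DocTags")
```

Any ID returned this way is a string of the kind passed to `from_preset` in the usage example for this stage.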

Usage

Source: ~/workspace/source/docs/usage/model_catalog.md:294
```python
from docling.datamodel.pipeline_options import VlmConvertOptions
from docling.datamodel.vlm_engine_options import MlxVlmEngineOptions

# Use SmolDocling with auto-selected engine
options = VlmConvertOptions.from_preset("smoldocling")

# Force a specific engine (here, MLX)
options = VlmConvertOptions.from_preset(
    "smoldocling",
    engine_options=MlxVlmEngineOptions()
)
```

Output Formats

DocTags: Structured XML-like format optimized for document understanding

```xml
<document>
  <section_header>Introduction</section_header>
  <text>This is a paragraph...</text>
  <table>
    <row><cell>Data</cell></row>
  </table>
</document>
```

Markdown: Human-readable format for general-purpose conversion

```markdown
# Introduction

This is a paragraph...

| Column 1 | Column 2 |
|----------|----------|
| Data     | Data     |
```
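
To show what "structured" buys you in practice, the sketch below walks the simplified DocTags sample with Python's standard `xml.etree.ElementTree`. This assumes the sample is well-formed XML, which holds for the illustration above; real DocTags output may use different tags and may need a dedicated parser, so treat this as a toy example only.

```python
import xml.etree.ElementTree as ET

# The simplified DocTags sample from this section, reproduced verbatim.
SAMPLE = """<document>
  <section_header>Introduction</section_header>
  <text>This is a paragraph...</text>
  <table>
    <row><cell>Data</cell></row>
  </table>
</document>"""

root = ET.fromstring(SAMPLE)

# Pull out each element type by tag name.
headers = [node.text for node in root.iter("section_header")]
paragraphs = [node.text for node in root.iter("text")]
table_rows = [
    [cell.text for cell in row.iter("cell")]
    for row in root.iter("row")
]
```

With Markdown output the same information would have to be recovered from heading markers and pipe-delimited rows rather than from named tags.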

Picture Description

Overview

Source: ~/workspace/source/docs/usage/model_catalog.md:143

Generates natural language descriptions (captions) of images and figures.

Model Family: Vision-Language Models
Inference Engines: Transformers, MLX, API (Ollama, LM Studio), vLLM, AUTO_INLINE

Available Models

Source: ~/workspace/source/docs/usage/model_catalog.md:236
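| Preset ID | Model | Size | Transformers | MLX | API | vLLM |
|-----------|-------|------|--------------|-----|-----|------|
| smolvlm | SmolVLM-256M | 256M | | | LM Studio | |
| granite_vision | Granite-Vision-3.3-2B | 2B | | | Ollama, LM Studio | |
| pixtral | Pixtral-12B | 12B | | | | |
| qwen | Qwen2.5-VL-3B | 3B | | | | |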

Usage

Source: ~/workspace/source/docs/usage/model_catalog.md:310
```python
from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions

# Use Granite Vision for detailed descriptions
options = PictureDescriptionVlmOptions.from_preset("granite_vision")
```

Code & Formula Extraction

Overview

Source: ~/workspace/source/docs/usage/model_catalog.md:161

Extracts and recognizes code blocks and mathematical formulas.

Model Family: Vision-Language Models
Inference Engines: Transformers, MLX, AUTO_INLINE

Available Models

Source: ~/workspace/source/docs/usage/model_catalog.md:244
| Preset ID | Model | Transformers | MLX |
|-----------|-------|--------------|-----|
| codeformulav2 | CodeFormulaV2 | | |
| granite_docling | Granite-Docling-258M | | |

Model Card: ds4sd/CodeFormula

Usage

Source: ~/workspace/source/docs/usage/model_catalog.md:318
```python
from docling.datamodel.pipeline_options import CodeFormulaVlmOptions

# Use specialized CodeFormulaV2 model
options = CodeFormulaVlmOptions.from_preset("codeformulav2")
```

Inference Engine Compatibility

Object Detection Models

Source: ~/workspace/source/docs/usage/model_catalog.md:182
| Stage | Engine | Devices |
|-------|--------|---------|
| Layout | docling-ibm-models | CPU, CUDA, MPS, XPU |
| Table Structure | docling-ibm-models | CPU, CUDA, XPU |
MPS is currently disabled for TableFormer due to performance issues.
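
A minimal sketch of how this device matrix could be applied when choosing a device per stage. `SUPPORTED_DEVICES` and `resolve_device` are hypothetical names, and falling back to CPU for an unsupported device is an assumption made for this example; the source does not say what Docling itself does in that case.

```python
# Device support per stage, copied from the compatibility table above.
SUPPORTED_DEVICES = {
    "layout": ("cpu", "cuda", "mps", "xpu"),
    "table_structure": ("cpu", "cuda", "xpu"),  # MPS currently disabled
}


def resolve_device(stage, preferred):
    """Return the preferred device if the stage supports it, else 'cpu'.

    The CPU fallback is an illustrative assumption, not documented behavior.
    """
    supported = SUPPORTED_DEVICES[stage]
    return preferred if preferred in supported else "cpu"


# On Apple Silicon, layout can stay on MPS while table structure cannot.
layout_device = resolve_device("layout", "mps")
table_device = resolve_device("table_structure", "mps")
```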

Vision-Language Models

Source: ~/workspace/source/docs/usage/model_catalog.md:220

Inference engine support for VLMs varies by model:
  • Transformers: Direct HuggingFace transformers integration
  • MLX: Apple Silicon optimized (macOS only)
  • API: OpenAI-compatible endpoints (Ollama, LM Studio, vLLM)
  • vLLM: Linux-only high-performance server
  • AUTO_INLINE: Automatic engine selection
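
The platform constraints in this list (MLX on Apple Silicon macOS, vLLM on Linux) can be expressed as a small check. The sketch below is not AUTO_INLINE's actual selection logic, which the source does not describe; `candidate_vlm_engines` and the lowercase engine labels are hypothetical, and treating Transformers and API as available on every platform is an assumption.

```python
import platform


def candidate_vlm_engines(system, machine):
    """List engines whose stated platform constraints are met.

    `system` and `machine` follow the values returned by
    platform.system() and platform.machine().
    """
    # Assumption: no platform restriction is stated for these two.
    engines = ["transformers", "api"]
    if system == "Darwin" and machine == "arm64":
        engines.append("mlx")  # Apple Silicon optimized, macOS only
    if system == "Linux":
        engines.append("vllm")  # Linux-only
    return engines


# Candidates for the machine running this snippet.
local_candidates = candidate_vlm_engines(platform.system(), platform.machine())
```

A model still has to support the chosen engine, so this narrows the options by platform only.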

Model Selection Guide

Layout

Recommended: docling-layout-heron
  • Good balance of speed and accuracy
  • Suitable for most document types
  • Use Egret models for specialized needs

OCR

Recommended: Auto or Tesseract
  • Auto: Automatic engine selection
  • Tesseract: Reliable, widely supported
  • RapidOCR (torch): Use when GPU acceleration is needed
  • macOS Vision: Best quality on macOS

Table Structure

Recommended: Accurate mode
  • Use Accurate for production (better quality)
  • Use Fast for quick prototyping
  • Enable do_cell_matching for best results

VLM Convert

Recommended: granite_docling or smoldocling
  • Granite Docling: Best for structured output (DocTags)
  • SmolDocling: Lightweight alternative
  • DeepSeek OCR: High-quality Markdown (API-only)
  • Larger models (Pixtral, Qwen) for complex documents

Picture Description

Recommended: smolvlm
  • SmolVLM: Fast, good quality, small size
  • Granite Vision: More detailed descriptions
  • Larger models for specialized captioning

Performance Characteristics

Model Sizes and Speed

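| Model Type | Size Range | Typical Speed | GPU Benefit |
|------------|------------|---------------|-------------|
| Layout Detection | ~100-500MB | Fast | High |
| OCR Engines | Varies | Fast-Medium | Varies |
| Table Structure | ~100MB | Medium | High |
| Picture Classifier | ~100MB | Fast | Medium |
| Small VLMs (256M) | ~500MB-1GB | Fast | High |
| Medium VLMs (2-3B) | 2-6GB | Medium | Very High |
| Large VLMs (12B+) | 12GB+ | Slow | Critical |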
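
The VLM size ranges are broadly consistent with a simple rule of thumb: weight size is roughly parameter count times bytes per parameter. The sketch below works that arithmetic, assuming 2 bytes per parameter (16-bit weights) by default and decimal gigabytes. It covers weights only and ignores activations, caches, and runtime overhead, so these are back-of-the-envelope figures rather than measurements; `approx_weight_size_gb` is a hypothetical helper.

```python
def approx_weight_size_gb(num_params, bytes_per_param=2):
    """Estimate weight storage in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# 256M parameters at 16-bit: about half a gigabyte.
small_vlm_gb = approx_weight_size_gb(256e6)

# 3B parameters at 16-bit: the upper end of the "2-6GB" row.
medium_vlm_gb = approx_weight_size_gb(3e9)

# 12B parameters at 8-bit (1 byte each): the "12GB+" floor.
large_vlm_8bit_gb = approx_weight_size_gb(12e9, bytes_per_param=1)
```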

Device Recommendations

CPU Only

  • Layout: Heron
  • OCR: Tesseract/Auto
  • VLM: SmolVLM/SmolDocling (small models only)
  • Expect slower processing

NVIDIA GPU

  • All models supported
  • Use batch processing
  • Consider Flash Attention 2
  • Ideal for VLM pipelines with inference servers

Apple Silicon

  • Layout: All models via MPS
  • VLM: MLX-optimized models (Granite, SmolDocling)
  • Good performance for small-medium models
  • Use MLX engine when available

Intel GPU

  • Layout: All models via XPU
  • Table Structure: Supported
  • Limited VLM support
  • Check compatibility for specific models

Additional Resources

Source: ~/workspace/source/docs/usage/model_catalog.md:328

  • Vision Models Guide: VLM-specific documentation
  • GPU Acceleration: GPU acceleration setup
  • Pipeline Options: Advanced configuration
  • Supported Formats: Input format support

Notes

Source: ~/workspace/source/docs/usage/model_catalog.md:335
  • DocTags Format: Structured XML-like format optimized for document understanding
  • Markdown Format: Human-readable format for general-purpose conversion
  • Model Updates: New models are added regularly; check the codebase for the latest additions
  • Engine Compatibility: Not all engines work on all platforms; AUTO_INLINE handles this automatically
  • Performance: Actual performance varies by hardware, document complexity, and model size

Use the AUTO_INLINE engine for VLMs to automatically select the best available inference engine for your platform.
