Overview
Docling’s document processing pipeline consists of multiple stages, each using specialized models and inference engines. This catalog provides:
- Processing stages and their purposes
- Model families and specific models
- Inference engine compatibility
- Usage examples and configuration
Processing Stages
Docling pipelines are composed of these processing stages:
- Layout: Document structure detection
- OCR: Optical character recognition
- Table Structure: Table cell recognition
- Picture Classifier: Image type classification
- VLM Convert: Full page conversion with VLMs
- Picture Description: Image captioning
- Code & Formula: Code/math extraction
Layout Detection
Overview
Source: ~/workspace/source/docs/usage/model_catalog.md:26
Detects document elements (paragraphs, tables, figures, headers, etc.) using RT-DETR-based object detection.
Model Family: Object Detection (RT-DETR based)
Inference Engine: docling-ibm-models
Supported Devices: CPU, CUDA, MPS, XPU
Available Models
Source: ~/workspace/source/docs/usage/model_catalog.md:30
| Model | Status | Description |
|---|---|---|
| `docling-layout-heron` | ⭐ Default | Recommended for most use cases |
| `docling-layout-heron-101` | - | Enhanced variant of Heron |
| `docling-layout-egret-medium` | - | Medium-sized Egret model |
| `docling-layout-egret-large` | - | Larger Egret model |
| `docling-layout-egret-xlarge` | - | Extra-large Egret model |
| `docling-layout-v2` | Legacy | Previous generation model |
Usage
Source: ~/workspace/source/docs/usage/model_catalog.md:252
Output
Bounding boxes with element labels:
- `TEXT`: Body text paragraphs
- `SECTION_HEADER`: Section headings
- `TABLE`: Tables
- `PICTURE`: Images and figures
- `LIST_ITEM`: List items
- `FORMULA`: Mathematical formulas
- `PAGE_HEADER` / `PAGE_FOOTER`: Headers/footers
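To illustrate how the labels above might be consumed downstream, here is a minimal sketch that groups detections by label and drops page headers and footers. The `Detection` dataclass and the `group_by_label` helper are hypothetical and are not part of Docling's API; only the label names come from the list above.

```python
from collections import defaultdict
from dataclasses import dataclass

# Label names come from the list above; everything else is illustrative.
PAGE_FURNITURE = {"PAGE_HEADER", "PAGE_FOOTER"}


@dataclass(frozen=True)
class Detection:
    """Hypothetical stand-in for one layout prediction (not a Docling class)."""
    label: str
    bbox: tuple  # (left, top, right, bottom) in page coordinates


def group_by_label(detections, skip=PAGE_FURNITURE):
    """Group detections by label, skipping any label listed in `skip`."""
    groups = defaultdict(list)
    for det in detections:
        if det.label not in skip:
            groups[det.label].append(det)
    return dict(groups)


# Made-up detections for a single page.
page = [
    Detection("PAGE_HEADER", (50, 20, 550, 40)),
    Detection("SECTION_HEADER", (50, 60, 400, 85)),
    Detection("TEXT", (50, 100, 550, 300)),
    Detection("TABLE", (50, 320, 550, 600)),
    Detection("TEXT", (50, 620, 550, 700)),
]

groups = group_by_label(page)
print(sorted(groups))       # ['SECTION_HEADER', 'TABLE', 'TEXT']
print(len(groups["TEXT"]))  # 2
```

Filtering page headers and footers before further processing is a design choice for this sketch, not something the layout model does on its own.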
OCR (Optical Character Recognition)
Overview
Source: ~/workspace/source/docs/usage/model_catalog.md:51
Extracts text from images and scanned documents using various OCR engines.
Model Family: Multiple OCR Engines
Inference Engines: Engine-specific
Supported Devices: Varies by engine
Available Engines
Source: ~/workspace/source/docs/usage/model_catalog.md:206
| OCR Engine | Backend | Languages | GPU Support | Notes |
|---|---|---|---|---|
| Auto ⭐ | Automatic | Varies | Varies | Automatically selects best available |
| Tesseract | CLI or Python | 100+ | No | Most widely used, good accuracy |
| EasyOCR | PyTorch | 80+ | Yes | GPU-accelerated, good for Asian languages |
| RapidOCR | ONNX/OpenVINO/Paddle | Multiple | Yes (torch) | Fast, multiple backend options |
| macOS Vision | Native macOS | 20+ | Yes | macOS only, excellent quality |
| SuryaOCR | PyTorch | 90+ | Yes | Modern, good for complex layouts |
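As a rough illustration of reading the table above, the sketch below encodes its "GPU Support" column and the macOS-only note as plain data and filters engines by two constraints. The `OcrEngine` dataclass and `candidates` function are hypothetical helpers, not Docling classes, and the "Auto" row is left out because its properties vary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrEngine:
    """One row of the table above (illustrative; not a Docling class)."""
    name: str
    gpu: bool         # "GPU Support" column
    macos_only: bool  # taken from the "Notes" column


ENGINES = [
    OcrEngine("Tesseract", gpu=False, macos_only=False),
    OcrEngine("EasyOCR", gpu=True, macos_only=False),
    OcrEngine("RapidOCR", gpu=True, macos_only=False),  # GPU via the torch backend
    OcrEngine("macOS Vision", gpu=True, macos_only=True),
    OcrEngine("SuryaOCR", gpu=True, macos_only=False),
]


def candidates(need_gpu, on_macos):
    """Return engine names from the table that satisfy both constraints."""
    return [
        e.name
        for e in ENGINES
        if (e.gpu or not need_gpu) and (on_macos or not e.macos_only)
    ]


print(candidates(need_gpu=True, on_macos=False))
# ['EasyOCR', 'RapidOCR', 'SuryaOCR']
```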
Usage
Source: ~/workspace/source/docs/usage/model_catalog.md:286
Table Structure Recognition
TableFormer Models
Source: ~/workspace/source/docs/usage/model_catalog.md:70
Recognizes table structure (rows, columns, cells) and relationships.
Model Family: TableFormer
Inference Engine: docling-ibm-models
Supported Devices: CPU, CUDA, XPU (MPS currently disabled)
Available Modes
Source: ~/workspace/source/docs/usage/model_catalog.md:74
| Mode | Status | Speed | Accuracy |
|---|---|---|---|
| Accurate | ⭐ Default | Slower | Higher quality |
| Fast | - | Faster | Good quality |
Usage
Source: ~/workspace/source/docs/usage/model_catalog.md:263
Object Detection (WIP)
Source: ~/workspace/source/docs/usage/model_catalog.md:86
Alternative approach for table structure recognition using object detection.
Object detection-based table structure is work in progress.
Picture Classification
Overview
Source: ~/workspace/source/docs/usage/model_catalog.md:101
Classifies pictures into semantic categories (charts, diagrams, logos, etc.).
Model Family: Image Classifier (Vision Transformer)
Inference Engine: Transformers (ViT)
Supported Devices: CPU, CUDA, MPS, XPU
Available Models
Source: ~/workspace/source/docs/usage/model_catalog.md:104
| Model | Status | Description |
|---|---|---|
| `DocumentFigureClassifier-v2.0` | ⭐ Default | Specialized for document imagery |
Supported Classes
- Chart types (bar, line, pie, scatter)
- Diagrams and flowcharts
- Natural images
- Logos and branding
- Signatures
- Technical illustrations
Usage
Source: ~/workspace/source/docs/usage/model_catalog.md:275
VLM Convert (Full Page)
Overview
Source: ~/workspace/source/docs/usage/model_catalog.md:116
Converts entire document pages to structured formats using vision-language models.
Model Family: Vision-Language Models
Output Formats: DocTags (structured), Markdown (human-readable)
Inference Engines: Transformers, MLX, API (Ollama, LM Studio, OpenAI), vLLM, AUTO_INLINE
Available Models
Source: ~/workspace/source/docs/usage/model_catalog.md:220
| Preset ID | Model | Size | Transformers | MLX | API | vLLM | Output |
|---|---|---|---|---|---|---|---|
| `granite_docling` ⭐ | Granite-Docling-258M | 258M | ✅ | ✅ | Ollama | ❌ | DocTags |
| `smoldocling` | SmolDocling-256M | 256M | ✅ | ✅ | ❌ | ❌ | DocTags |
| `deepseek_ocr` | DeepSeek-OCR-3B | 3B | ❌ | ❌ | Ollama, LM Studio | ❌ | Markdown |
| `granite_vision` | Granite-Vision-3.3-2B | 2B | ✅ | ❌ | Ollama, LM Studio | ✅ | Markdown |
| `pixtral` | Pixtral-12B | 12B | ✅ | ✅ | ❌ | ❌ | Markdown |
| `got_ocr` | GOT-OCR-2.0 | - | ✅ | ❌ | ❌ | ❌ | Markdown |
| `phi4` | Phi-4-Multimodal | - | ✅ | ❌ | ❌ | ✅ | Markdown |
| `qwen` | Qwen2.5-VL-3B | 3B | ✅ | ✅ | ❌ | ❌ | Markdown |
| `gemma_12b` | Gemma-3-12B | 12B | ❌ | ✅ | ❌ | ❌ | Markdown |
| `gemma_27b` | Gemma-3-27B | 27B | ❌ | ✅ | ❌ | ❌ | Markdown |
| `dolphin` | Dolphin | - | ✅ | ❌ | ❌ | ❌ | Markdown |
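The compatibility matrix above can be easier to query as data. The following sketch transcribes the engine and output columns into a plain dictionary and looks up presets by engine and output format. The `VLM_PRESETS` structure and the `presets_for` helper are illustrative assumptions and are not how Docling stores or exposes its presets.

```python
from typing import Optional

# Transcribed from the table above. "api" stands for any entry in the API
# column (Ollama and/or LM Studio); consult the table for the exact servers.
VLM_PRESETS = {
    "granite_docling": {"engines": {"transformers", "mlx", "api"}, "output": "DocTags"},
    "smoldocling": {"engines": {"transformers", "mlx"}, "output": "DocTags"},
    "deepseek_ocr": {"engines": {"api"}, "output": "Markdown"},
    "granite_vision": {"engines": {"transformers", "api", "vllm"}, "output": "Markdown"},
    "pixtral": {"engines": {"transformers", "mlx"}, "output": "Markdown"},
    "got_ocr": {"engines": {"transformers"}, "output": "Markdown"},
    "phi4": {"engines": {"transformers", "vllm"}, "output": "Markdown"},
    "qwen": {"engines": {"transformers", "mlx"}, "output": "Markdown"},
    "gemma_12b": {"engines": {"mlx"}, "output": "Markdown"},
    "gemma_27b": {"engines": {"mlx"}, "output": "Markdown"},
    "dolphin": {"engines": {"transformers"}, "output": "Markdown"},
}


def presets_for(engine: str, output: Optional[str] = None) -> list:
    """List preset IDs that run on `engine`, optionally filtered by output format."""
    return [
        preset_id
        for preset_id, spec in VLM_PRESETS.items()
        if engine in spec["engines"] and (output is None or spec["output"] == output)
    ]


print(presets_for("mlx", output="DocTags"))  # ['granite_docling', 'smoldocling']
print(presets_for("vllm"))                   # ['granite_vision', 'phi4']
```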
Usage
Source: ~/workspace/source/docs/usage/model_catalog.md:294
Output Formats
DocTags: Structured XML-like format optimized for document understanding
Picture Description
Overview
Source: ~/workspace/source/docs/usage/model_catalog.md:143
Generates natural language descriptions (captions) of images and figures.
Model Family: Vision-Language Models
Inference Engines: Transformers, MLX, API (Ollama, LM Studio), vLLM, AUTO_INLINE
Available Models
Source: ~/workspace/source/docs/usage/model_catalog.md:236
| Preset ID | Model | Size | Transformers | MLX | API | vLLM |
|---|---|---|---|---|---|---|
| `smolvlm` ⭐ | SmolVLM-256M | 256M | ✅ | ✅ | LM Studio | ❌ |
| `granite_vision` | Granite-Vision-3.3-2B | 2B | ✅ | ❌ | Ollama, LM Studio | ✅ |
| `pixtral` | Pixtral-12B | 12B | ✅ | ✅ | ❌ | ❌ |
| `qwen` | Qwen2.5-VL-3B | 3B | ✅ | ✅ | ❌ | ❌ |
Usage
Source: ~/workspace/source/docs/usage/model_catalog.md:310
Code & Formula Extraction
Overview
Source: ~/workspace/source/docs/usage/model_catalog.md:161
Extracts and recognizes code blocks and mathematical formulas.
Model Family: Vision-Language Models
Inference Engines: Transformers, MLX, AUTO_INLINE
Available Models
Source: ~/workspace/source/docs/usage/model_catalog.md:244
| Preset ID | Model | Transformers | MLX |
|---|---|---|---|
| `codeformulav2` ⭐ | CodeFormulaV2 | ✅ | ❌ |
| `granite_docling` | Granite-Docling-258M | ✅ | ✅ |
Usage
Source: ~/workspace/source/docs/usage/model_catalog.md:318
Inference Engine Compatibility
Object Detection Models
Source: ~/workspace/source/docs/usage/model_catalog.md:182
| Stage | Engine | Devices |
|---|---|---|
| Layout | docling-ibm-models | CPU, CUDA, MPS, XPU |
| Table Structure | docling-ibm-models | CPU, CUDA, XPU |
MPS is currently disabled for TableFormer due to performance issues.
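One practical consequence of the table above is that a device valid for layout may not be valid for table structure. The sketch below shows one way to check a requested device per stage. The `SUPPORTED_DEVICES` mapping mirrors the table, while `resolve_device` and its CPU fallback are assumptions made for illustration; this page does not describe how Docling itself handles an unsupported device.

```python
# Device support per stage, copied from the table above.
SUPPORTED_DEVICES = {
    "layout": {"cpu", "cuda", "mps", "xpu"},
    "table_structure": {"cpu", "cuda", "xpu"},  # MPS is currently disabled
}


def resolve_device(stage: str, requested: str) -> str:
    """Return `requested` if the stage supports it, otherwise fall back to CPU.

    The CPU fallback is an assumption of this sketch, not documented behavior.
    """
    requested = requested.lower()
    if requested in SUPPORTED_DEVICES[stage]:
        return requested
    return "cpu"


print(resolve_device("layout", "MPS"))           # mps
print(resolve_device("table_structure", "mps"))  # cpu
```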
Vision-Language Models
Source: ~/workspace/source/docs/usage/model_catalog.md:220
VLM inference engine support varies by model:
- Transformers: Direct HuggingFace transformers integration
- MLX: Apple Silicon optimized (macOS only)
- API: OpenAI-compatible endpoints (Ollama, LM Studio, vLLM)
- vLLM: Linux-only high-performance server
- AUTO_INLINE: Automatic engine selection
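The platform constraints in the list above (MLX on Apple Silicon macOS, vLLM on Linux) can be expressed as a small check. This is only a sketch of those stated constraints; it is not the selection logic used by AUTO_INLINE, which this page does not describe. The `usable_engines` function is hypothetical.

```python
import platform


def usable_engines(system=None, machine=None):
    """List engines whose platform constraints (from the list above) are met.

    `system` and `machine` default to the current host, using the values
    reported by the standard `platform` module.
    """
    system = system or platform.system()
    machine = machine or platform.machine()

    # Transformers and API list no platform restriction above.
    engines = ["transformers", "api"]
    if system == "Darwin" and machine == "arm64":
        engines.append("mlx")   # Apple Silicon, macOS only
    if system == "Linux":
        engines.append("vllm")  # Linux only
    return engines


print(usable_engines("Darwin", "arm64"))  # ['transformers', 'api', 'mlx']
print(usable_engines("Linux", "x86_64"))  # ['transformers', 'api', 'vllm']
```

A model must also support the chosen engine, so this check would be combined with the per-model compatibility tables.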
Model Selection Guide
Layout Detection
Recommended: `docling-layout-heron`
- Good balance of speed and accuracy
- Suitable for most document types
- Use Egret models for specialized needs
OCR Engine
Recommended: Auto or Tesseract
- Auto: Automatic engine selection
- Tesseract: Reliable, widely supported
- RapidOCR (torch): Use when GPU acceleration is needed
- macOS Vision: Best quality on macOS
Table Structure
Recommended: Accurate mode
- Use Accurate for production (better quality)
- Use Fast for quick prototyping
- Enable `do_cell_matching` for best results
VLM Convert
Recommended: `granite_docling` or `smoldocling`
- Granite Docling: Best for structured output (DocTags)
- SmolDocling: Lightweight alternative
- DeepSeek OCR: High-quality Markdown (API-only)
- Larger models (Pixtral, Qwen) for complex documents
Picture Description
Recommended: `smolvlm`
- SmolVLM: Fast, good quality, small size
- Granite Vision: More detailed descriptions
- Larger models for specialized captioning
Performance Characteristics
Model Sizes and Speed
| Model Type | Size Range | Typical Speed | GPU Benefit |
|---|---|---|---|
| Layout Detection | ~100-500MB | Fast | High |
| OCR Engines | Varies | Fast-Medium | Varies |
| Table Structure | ~100MB | Medium | High |
| Picture Classifier | ~100MB | Fast | Medium |
| Small VLMs (256M) | ~500MB-1GB | Fast | High |
| Medium VLMs (2-3B) | 2-6GB | Medium | Very High |
| Large VLMs (12B+) | 12GB+ | Slow | Critical |
Device Recommendations
CPU Only
- Layout: Heron
- OCR: Tesseract/Auto
- VLM: SmolVLM/SmolDocling (small models only)
- Expect slower processing
NVIDIA GPU
- All models supported
- Use batch processing
- Consider Flash Attention 2
- Ideal for VLM pipelines with inference servers
Apple Silicon
- Layout: All models via MPS
- VLM: MLX-optimized models (Granite, SmolDocling)
- Good performance for small-medium models
- Use MLX engine when available
Intel GPU
- Layout: All models via XPU
- Table Structure: Supported
- Limited VLM support
- Check compatibility for specific models
Additional Resources
Source: ~/workspace/source/docs/usage/model_catalog.md:328
- Vision Models Guide: VLM-specific documentation
- GPU Acceleration: GPU acceleration setup
- Pipeline Options: Advanced configuration
- Supported Formats: Input format support
Notes
Source: ~/workspace/source/docs/usage/model_catalog.md:335
- DocTags Format: Structured XML-like format optimized for document understanding
- Markdown Format: Human-readable format for general-purpose conversion
- Model Updates: New models are added regularly - check the codebase for latest additions
- Engine Compatibility: Not all engines work on all platforms - AUTO_INLINE handles this automatically
- Performance: Actual performance varies by hardware, document complexity, and model size