Docling is available as an official LlamaIndex extension, providing two powerful components: the Docling Reader for document loading and the Docling Node Parser for intelligent chunking.
Overview
The LlamaIndex Docling integration provides:
Docling Reader - Load documents with high-fidelity structural preservation
Docling Node Parser - Parse documents into LlamaIndex nodes with structure awareness
Lossless Serialization - Preserve complete document structure as JSON
Flexible Export - Export to simplified formats like Markdown when needed
Installation
pip install llama-index-readers-docling llama-index-node-parser-docling
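After installing, it can be useful to confirm that both packages are importable before running the examples. The following is a minimal sketch using only the standard library; the helper names `is_installed` and `missing_modules` are illustrative and not part of either package, and the module paths are the ones used by the import statements on this page.

```python
import importlib.util

# Module paths used by the import statements on this page.
REQUIRED_MODULES = (
    "llama_index.readers.docling",
    "llama_index.node_parser.docling",
)


def is_installed(module_name: str) -> bool:
    """Return True if Python can locate the given module."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example llama_index) is missing.
        return False


def missing_modules(names=REQUIRED_MODULES) -> list:
    """Return the names from `names` that cannot be located."""
    return [name for name in names if not is_installed(name)]


missing = missing_modules()
print("All set" if not missing else "Missing: " + ", ".join(missing))
```

If anything is reported as missing, re-running the pip command above in the active environment is the usual fix.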
Components
Docling Reader
The Docling Reader loads document files and populates LlamaIndex Document objects with Docling’s rich data model.
Basic Usage
from llama_index.readers.docling import DoclingReader
# Create reader
reader = DoclingReader()
# Load documents
documents = reader.load_data(file_path="document.pdf")
# Access document content
for doc in documents:
    print(doc.text)
    print(doc.metadata)
The export_type argument selects the output format:
from llama_index.readers.docling import DoclingReader
# Export as Markdown (simplified)
reader = DoclingReader(export_type="markdown")
docs = reader.load_data(file_path="document.pdf")
# Export as JSON (lossless)
reader = DoclingReader(export_type="json")
docs = reader.load_data(file_path="document.pdf")
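To see the practical difference between the two export types, it helps to look at what ends up in `doc.text`. The sketch below is an illustration only: `summarize_export` is a hypothetical helper, and the two sample strings are invented stand-ins that are far smaller than a real export and do not reproduce Docling's actual schema.

```python
import json


def summarize_export(text: str, export_type: str) -> list:
    """Summarize exported document text.

    For "json", return the sorted top-level keys of the serialized document.
    For "markdown", return the heading lines (those starting with '#').
    """
    if export_type == "json":
        return sorted(json.loads(text).keys())
    if export_type == "markdown":
        return [line for line in text.splitlines() if line.startswith("#")]
    raise ValueError(f"unsupported export_type: {export_type!r}")


# Invented stand-ins for doc.text; real exports are much larger.
sample_json = '{"name": "document", "texts": [], "tables": []}'
sample_markdown = "# Title\n\nSome text.\n\n## Section\n\nMore text."

print(summarize_export(sample_json, "json"))
print(summarize_export(sample_markdown, "markdown"))
```

With a real reader, the same helper could be called as `summarize_export(docs[0].text, "json")` to list the sections of the serialized document.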
Docling Node Parser
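The Docling Node Parser uses Docling’s document format to split documents into LlamaIndex Node objects along structural boundaries. It expects documents loaded with export_type="json", since it works from the serialized Docling document rather than from Markdown text.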
Basic Usage
from llama_index.readers.docling import DoclingReader
from llama_index.node_parser.docling import DoclingNodeParser
# Load documents with Docling Reader
reader = DoclingReader(export_type="json")
documents = reader.load_data(file_path="document.pdf")
# Parse into nodes
parser = DoclingNodeParser()
nodes = parser.get_nodes_from_documents(documents)
# Nodes are ready for embedding and retrieval
for node in nodes:
    print(node.text)
    print(node.metadata)
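Because each node carries structural metadata, nodes can be regrouped by the section they came from. The sketch below assumes that the metadata exposes a `headings` key holding a list of heading strings, which is how Docling's chunkers are understood to report section context; verify this against your installed version. `group_by_heading` is a hypothetical helper, and the demo nodes are invented stand-ins for real parser output.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_heading(nodes) -> dict:
    """Map each heading path to the texts of the nodes found under it."""
    groups = defaultdict(list)
    for node in nodes:
        headings = node.metadata.get("headings") or []
        key = " > ".join(headings) if headings else "(no heading)"
        groups[key].append(node.text)
    return dict(groups)


# Stand-in nodes with invented content; real ones come from DoclingNodeParser.
demo_nodes = [
    SimpleNamespace(text="Docling parses PDFs.", metadata={"headings": ["Intro"]}),
    SimpleNamespace(text="It also handles DOCX.", metadata={"headings": ["Intro"]}),
    SimpleNamespace(text="Untitled preamble.", metadata={}),
]
for heading, texts in group_by_heading(demo_nodes).items():
    print(heading, len(texts))
```

The helper only relies on `.text` and `.metadata`, so it can be passed the `nodes` list from the example above unchanged.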
Advanced Parsing Options
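from docling.chunking import HybridChunker
from llama_index.node_parser.docling import DoclingNodeParser
# Configure parser: chunking is delegated to a Docling chunker passed via
# `chunker`; HybridChunker adds tokenizer-aware splitting and merging
parser = DoclingNodeParser(
    include_metadata=True,
    chunker=HybridChunker(),
)
nodes = parser.get_nodes_from_documents(documents)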
Complete RAG Pipeline
Here’s a full example combining both components:
from llama_index.readers.docling import DoclingReader
from llama_index.node_parser.docling import DoclingNodeParser
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
# 1. Load documents
reader = DoclingReader(export_type="json")
documents = reader.load_data(file_path="document.pdf")
# 2. Parse into nodes
parser = DoclingNodeParser()
nodes = parser.get_nodes_from_documents(documents)
# 3. Create embeddings and index
embed_model = OpenAIEmbedding()
index = VectorStoreIndex(nodes, embed_model=embed_model)
# 4. Query the index
query_engine = index.as_query_engine()
response = query_engine.query("What is the main topic of this document?")
print(response)
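Steps 1 to 3 of the pipeline above can be wrapped in a single function that receives its components as arguments, which makes the flow easy to exercise with stand-ins before pointing it at real files. This is a sketch of one possible arrangement, not the integration's own API: `build_index` and its parameter names are hypothetical.

```python
from typing import Any, Callable, Iterable


def build_index(
    paths: Iterable[str],
    reader: Any,
    parser: Any,
    index_factory: Callable[[list], Any],
) -> Any:
    """Load each path, parse the documents into nodes, and index the nodes."""
    documents = []
    for path in paths:
        documents.extend(reader.load_data(file_path=path))
    nodes = parser.get_nodes_from_documents(documents)
    if not nodes:
        raise ValueError("no nodes were produced; check the input files")
    return index_factory(nodes)


# With the real components from the example above (not executed here):
# index = build_index(
#     ["document.pdf"],
#     DoclingReader(export_type="json"),
#     DoclingNodeParser(),
#     lambda nodes: VectorStoreIndex(nodes, embed_model=OpenAIEmbedding()),
# )
```

Passing the index constructor as a callable keeps the embedding model choice outside the function, so the same code works with any embedding backend.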
Features
Structure-Aware - Preserves document hierarchy and relationships
Lossless Export - JSON export maintains complete document structure
Smart Chunking - Node parser respects document structure when chunking
Rich Metadata - Includes page numbers, headings, and structural information
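The page numbers mentioned under Rich Metadata can be pulled out of a node's metadata dictionary. The sketch below assumes a layout in which `doc_items` is a list of items, each with a `prov` list whose entries carry a `page_no` field; this reflects how Docling's chunk metadata is understood to be serialized and should be checked against your installed version. `page_numbers` is a hypothetical helper and the demo metadata is invented.

```python
def page_numbers(metadata: dict) -> list:
    """Collect the sorted, de-duplicated page numbers referenced by a node."""
    pages = set()
    for item in metadata.get("doc_items", []):
        for prov in item.get("prov", []):
            if "page_no" in prov:
                pages.add(prov["page_no"])
    return sorted(pages)


# Invented metadata in the assumed shape.
demo_metadata = {
    "headings": ["Results"],
    "doc_items": [
        {"label": "text", "prov": [{"page_no": 3}]},
        {"label": "table", "prov": [{"page_no": 3}, {"page_no": 4}]},
    ],
}
print(page_numbers(demo_metadata))
```

For a real node, call `page_numbers(node.metadata)`; an empty list means no provenance was recorded for that node.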
Use Cases
Knowledge Base RAG
from llama_index.core import VectorStoreIndex
from llama_index.node_parser.docling import DoclingNodeParser
from llama_index.readers.docling import DoclingReader
# Process multiple documents
reader = DoclingReader(export_type="json")
documents = []
for file in ["doc1.pdf", "doc2.docx", "doc3.pptx"]:
    docs = reader.load_data(file_path=file)
    documents.extend(docs)
parser = DoclingNodeParser()
nodes = parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)
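For a larger knowledge base, the hard-coded file list above can be replaced by a directory scan. This is a minimal sketch with the standard library; `collect_files` is a hypothetical helper, and the default extensions simply mirror the three file types used in the example rather than listing everything Docling supports.

```python
from pathlib import Path

# Extensions taken from the example above; extend as needed.
DEFAULT_EXTENSIONS = (".pdf", ".docx", ".pptx")


def collect_files(root, extensions=DEFAULT_EXTENSIONS) -> list:
    """Return sorted paths under root whose suffix matches, case-insensitively."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )


# Usage with the reader from the example above (directory name is hypothetical):
# for file in collect_files("knowledge_base"):
#     documents.extend(reader.load_data(file_path=file))
```

Sorting the paths keeps the document order, and therefore the node order, stable between runs.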
Table-Aware Retrieval
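from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from llama_index.node_parser.docling import DoclingNodeParser
from llama_index.readers.docling import DoclingReader
# Docling preserves table structure; pipeline options are set on a custom converter
pipeline_options = PdfPipelineOptions(do_table_structure=True)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
# The node parser consumes Docling's JSON export, so use "json" here
reader = DoclingReader(export_type="json", doc_converter=converter)
documents = reader.load_data(file_path="report.pdf")
parser = DoclingNodeParser()
nodes = parser.get_nodes_from_documents(documents)
# Tables are preserved in node content; table chunks carry a "table" label
for node in nodes:
    labels = {item.get("label") for item in node.metadata.get("doc_items", [])}
    if "table" in labels:
        print("Found table:", node.text)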
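Before filtering for tables, it can help to check which element types a set of nodes actually contains. The sketch below tallies labels under the assumption that each node's metadata has a `doc_items` list whose entries carry a `label` string such as "table" or "text"; confirm this layout against your installed version. `label_counts` is a hypothetical helper and the demo nodes are invented.

```python
from collections import Counter
from types import SimpleNamespace


def label_counts(nodes) -> Counter:
    """Count how often each doc_items label occurs across the nodes."""
    counts = Counter()
    for node in nodes:
        for item in node.metadata.get("doc_items", []):
            label = item.get("label")
            if label:
                counts[label] += 1
    return counts


# Stand-in nodes with invented metadata in the assumed shape.
demo_nodes = [
    SimpleNamespace(metadata={"doc_items": [{"label": "table"}]}),
    SimpleNamespace(metadata={"doc_items": [{"label": "text"}, {"label": "table"}]}),
    SimpleNamespace(metadata={}),
]
print(label_counts(demo_nodes).most_common())
```

If "table" never shows up in the tally for a document known to contain tables, the export type or the converter's pipeline options are the first things to check.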
Integration Benefits
Official Components - Maintained as official LlamaIndex integrations
Two-Component System - Reader and Parser work together seamlessly
Format Flexibility - Choose between lossless JSON or simplified Markdown
Production Ready - Used in real-world RAG applications
Resources
Tutorial - Step-by-step guide
Reader Docs - API reference for Docling Reader
Parser Docs - API reference for Node Parser
PyPI Packages
Next Steps