Overview
This guide will walk you through generating your first PageIndex tree structure from a PDF document. You’ll transform a lengthy PDF into a hierarchical, semantic tree index optimized for LLM-powered retrieval.
PageIndex is ideal for financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.
Install Dependencies
Install the required Python packages using pip. This will install:
- openai==1.101.0: OpenAI API client
- pymupdf==1.26.4: PDF parsing
- PyPDF2==3.0.1: Additional PDF utilities
- python-dotenv==1.1.0: Environment variable management
- tiktoken==0.11.0: Token counting
- pyyaml==6.0.2: Configuration file support
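After installing, it can be useful to confirm that the environment matches the pinned versions listed above. The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical and not part of PageIndex, and distribution-name lookup may behave slightly differently across Python versions.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list in this guide.
PINNED = {
    "openai": "1.101.0",
    "pymupdf": "1.26.4",
    "PyPDF2": "3.0.1",
    "python-dotenv": "1.1.0",
    "tiktoken": "0.11.0",
    "pyyaml": "6.0.2",
}


def check_pins(pins, lookup=version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            problems[name] = (expected, installed)
    return problems


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every listed package is installed at the pinned version.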
Run PageIndex on Your PDF
Process your PDF document to generate the tree structure. PageIndex will:
- Parse your PDF and extract text content
- Detect the table of contents (if present)
- Generate a hierarchical tree structure with summaries
- Save the output as a JSON file in ./results/
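Once a run has finished, the JSON file in ./results/ can be inspected from Python. The sketch below prints an indented outline of the tree. It assumes each node is an object with a `title` field and a child list under `nodes`; those key names are assumptions, not taken from this guide, so adjust `title_key` and `child_key` to match your output. If the file wraps the node list in an outer object, pass the inner list to `outline` instead.

```python
import json
from pathlib import Path


def iter_nodes(node, depth=0, child_key="nodes"):
    """Yield (depth, node_dict) pairs in document order."""
    if isinstance(node, list):
        for item in node:
            yield from iter_nodes(item, depth, child_key)
    elif isinstance(node, dict):
        yield depth, node
        yield from iter_nodes(node.get(child_key, []), depth + 1, child_key)


def outline(tree, title_key="title", child_key="nodes"):
    """Return one indented line per node (key names are assumed, see above)."""
    return [
        "  " * depth + str(node.get(title_key, "<untitled>"))
        for depth, node in iter_nodes(tree, child_key=child_key)
    ]


def latest_result(results_dir="./results"):
    """Return the most recently modified JSON file, or None if there is none."""
    files = sorted(Path(results_dir).glob("*.json"), key=lambda p: p.stat().st_mtime)
    return files[-1] if files else None


if __name__ == "__main__":
    path = latest_result()
    if path is None:
        print("No JSON files found in ./results/")
    else:
        tree = json.loads(path.read_text(encoding="utf-8"))
        print("\n".join(outline(tree)))
```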
Customization Options
PageIndex supports optional parameters to customize tree generation.
Available Parameters
| Parameter | Default | Description |
|---|---|---|
| --model | gpt-4o-2024-11-20 | OpenAI model to use for tree generation |
| --toc-check-pages | 20 | Number of pages to check for table of contents |
| --max-pages-per-node | 10 | Maximum pages per tree node |
| --max-tokens-per-node | 20000 | Maximum tokens per tree node |
| --if-add-node-id | yes | Add unique node IDs |
| --if-add-node-summary | yes | Generate AI summaries for each node |
| --if-add-doc-description | no | Add document-level description |
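When runs are scripted, one way to keep option handling tidy is to hold the table's defaults in a dictionary and emit only the flags that differ. The sketch below is illustrative: `build_args` is a hypothetical helper, and the script name `run_pageindex.py` and the `--pdf_path` input flag in the usage lines are assumptions that this guide does not state, so substitute the command you actually run.

```python
import shlex

# Defaults copied from the parameter table above.
DEFAULTS = {
    "model": "gpt-4o-2024-11-20",
    "toc-check-pages": 20,
    "max-pages-per-node": 10,
    "max-tokens-per-node": 20000,
    "if-add-node-id": "yes",
    "if-add-node-summary": "yes",
    "if-add-doc-description": "no",
}


def build_args(script, input_flag, input_path, **overrides):
    """Return an argv list containing only options that differ from the defaults."""
    argv = ["python3", script, input_flag, str(input_path)]
    for key, value in overrides.items():
        name = key.replace("_", "-")  # max_pages_per_node -> max-pages-per-node
        if name not in DEFAULTS:
            raise ValueError(f"unknown option: --{name}")
        if value != DEFAULTS[name]:
            argv += [f"--{name}", str(value)]
    return argv


# Script name and input flag are assumptions (see the note above).
cmd = build_args(
    "run_pageindex.py", "--pdf_path", "report.pdf",
    max_pages_per_node=5, if_add_doc_description="yes",
)
print(shlex.join(cmd))
# python3 run_pageindex.py --pdf_path report.pdf --max-pages-per-node 5 --if-add-doc-description yes
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting issues for paths that contain spaces.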
Markdown Support
PageIndex also supports markdown files. Use the --md_path flag instead.
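Because heading depth is what carries the hierarchy in a markdown file, it can help to preview the outline your headings imply before indexing. The sketch below is not PageIndex's parser; it is a simple stack-based illustration of how heading levels map to a nested tree, and the function name `heading_tree` is hypothetical. It skips lines inside fenced code blocks so that comment lines starting with # are not mistaken for headings.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def heading_tree(markdown_text):
    """Nest headings by level and return the list of top-level nodes."""
    root = {"level": 0, "title": None, "nodes": []}
    stack = [root]  # path from the root to the most recent heading
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        node = {"level": level, "title": match.group(2), "nodes": []}
        # Pop until the top of the stack is a shallower heading: the parent.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["nodes"].append(node)
        stack.append(node)
    return root["nodes"]


doc = "# Guide\nintro text\n## Install\n### Pip\n## Run\n# Appendix\n"
for top in heading_tree(doc):
    print(top["title"], [child["title"] for child in top["nodes"]])
```

A heading that skips a level (for example, ### directly under #) is attached to the nearest shallower heading in this sketch, which is one reason to keep the heading structure consistent.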
PageIndex uses # markers to determine node hierarchy. Ensure your markdown file has proper heading structure (## for level 2, ### for level 3, etc.).
Next Steps
Vectorless RAG Cookbook
Build a complete reasoning-based RAG system with PageIndex
API Reference
Explore the full Python API and configuration options
Tree Search Tutorial
Learn how to perform reasoning-based retrieval over the tree
Example Documents
See sample PDFs and their generated tree structures