Overview
PageIndex can automatically generate hierarchical tree structures from PDF documents by:
- Detecting and extracting table of contents (if present)
- Identifying section boundaries and page numbers
- Recursively building a hierarchical tree structure
- Optionally generating summaries and descriptions
Quick Start
Basic Usage with CLI
Running the CLI on a PDF generates a tree structure and writes it to a JSON file at ./results/document_structure.json.
CLI Parameters
Required Parameters
Path to the PDF file to process. Must have .pdf extension.
Model Configuration
LLM model to use for structure extraction and summary generation.
PDF-Specific Parameters
Number of pages to check for table of contents detection.
Maximum number of pages allowed per node. Larger nodes will be recursively subdivided.
Maximum number of tokens per node. Nodes exceeding this will be subdivided.
Content Enrichment Parameters
Whether to add unique node IDs to each node. Options: yes, no.
Whether to generate AI summaries for each node. Options: yes, no.
Whether to generate an overall document description. Options: yes, no.
Whether to include full text content in each node. Options: yes, no.
Programmatic API
Using page_index() Function
The page_index() function provides a programmatic interface with the same configuration options:
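As a minimal sketch of how a call might be prepared: the helper below validates a PDF path and the yes/no enrichment flags named in this guide before handing them to page_index(). The helper names (build_enrichment_options, generate_tree) are hypothetical, the import path is an assumption that may differ in your installation, and only path strings are handled here (not BytesIO objects).

```python
from pathlib import Path

# Option names taken from this guide; each accepts "yes" or "no".
YES_NO_OPTIONS = (
    "if_add_node_id",
    "if_add_node_summary",
    "if_add_doc_description",
    "if_add_node_text",
)


def build_enrichment_options(pdf_path, **flags):
    """Hypothetical helper: check the path and flags, return them as a dict."""
    if Path(pdf_path).suffix.lower() != ".pdf":
        raise ValueError(f"expected a .pdf file, got: {pdf_path}")
    for name, value in flags.items():
        if name not in YES_NO_OPTIONS:
            raise KeyError(f"unknown option: {name}")
        if value not in ("yes", "no"):
            raise ValueError(f"{name} must be 'yes' or 'no', got: {value!r}")
    return dict(flags)


def generate_tree(pdf_path, **flags):
    """Hypothetical wrapper around page_index()."""
    # Assumption: import path; adjust to match how PageIndex is installed.
    from pageindex import page_index

    options = build_enrichment_options(pdf_path, **flags)
    return page_index(pdf_path, **options)
```

Validating up front keeps a typo such as "Yes" or a non-PDF path from surfacing only after LLM calls have started.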
Function Parameters
Path to PDF file or BytesIO object containing PDF data.
LLM model identifier for structure extraction.
Number of pages to scan for table of contents.
Maximum pages per node before subdivision.
Maximum tokens per node before subdivision.
Add unique identifiers to nodes (yes/no).
Generate AI summaries for nodes (yes/no).
Generate document-level description (yes/no).
Include full text in nodes (yes/no).
Processing Pipeline
PageIndex follows this workflow when processing PDFs:
TOC Detection
Check the first N pages (default: 20) for table of contents:
- If TOC with page numbers is found, use it as the structure base
- If TOC without page numbers is found, match sections to physical pages
- If no TOC is found, extract structure directly from content
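The three-way decision above can be sketched roughly as follows. This is an illustration only, not PageIndex's internal code: the function names, the strategy labels, and the looks_like_toc callback (a stand-in for whatever detector is actually used) are all hypothetical.

```python
from typing import Callable, List, Sequence


def find_toc_pages(pages: Sequence[str],
                   looks_like_toc: Callable[[str], bool],
                   toc_check_pages: int = 20) -> List[int]:
    """Return 0-based indices of pages, among the first N, that look like a TOC."""
    return [i for i, text in enumerate(pages[:toc_check_pages])
            if looks_like_toc(text)]


def choose_strategy(toc_found: bool, toc_has_page_numbers: bool) -> str:
    """Pick one of the three paths described in the TOC Detection step."""
    if not toc_found:
        # No TOC: extract the structure directly from the content.
        return "extract_from_content"
    if toc_has_page_numbers:
        # TOC with page numbers: use it as the structure base.
        return "toc_with_page_numbers"
    # TOC without page numbers: match sections to physical pages.
    return "toc_match_to_pages"
```

Because only the first toc_check_pages pages are scanned, a TOC that starts later in the document is treated the same as a missing one.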
Structure Extraction
Use LLM to identify hierarchical sections and their boundaries:
- Extract section titles and hierarchy levels
- Map sections to physical page indices
- Verify section boundaries are correct
Verification & Correction
Validate the extracted structure:
- Check if section titles appear on their assigned pages
- Fix any incorrect page assignments
- Retry with alternative methods if accuracy is low
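To make the verification idea concrete, here is a simplified sketch that checks each section title against the text of its assigned page and reports an accuracy score. It substitutes plain substring matching for the actual check, so treat it as an approximation; verify_sections and find_correct_page are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return " ".join(text.lower().split())


def verify_sections(sections: List[Dict], pages: List[str]) -> Tuple[float, List[Dict]]:
    """Return (accuracy, misplaced sections); start_index is 1-indexed."""
    misplaced = []
    for section in sections:
        page_idx = section["start_index"] - 1
        on_page = (0 <= page_idx < len(pages)
                   and _normalize(section["title"]) in _normalize(pages[page_idx]))
        if not on_page:
            misplaced.append(section)
    accuracy = 1.0 if not sections else 1 - len(misplaced) / len(sections)
    return accuracy, misplaced


def find_correct_page(title: str, pages: List[str]) -> Optional[int]:
    """Return the first 1-indexed page containing the title, or None."""
    needle = _normalize(title)
    for number, text in enumerate(pages, start=1):
        if needle in _normalize(text):
            return number
    return None
```

A caller could apply find_correct_page to each misplaced section, and fall back to a different extraction method when the accuracy stays below a chosen threshold.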
Recursive Subdivision
For nodes exceeding size limits:
- Recursively extract sub-structure from large nodes
- Apply same verification process to sub-nodes
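The recursion can be sketched as below, assuming a page-count limit only (token limits are omitted for brevity). The split callback stands in for sub-structure extraction; split_in_half is a deliberately naive placeholder, and all names are hypothetical.

```python
from typing import Callable, Dict, List


def subdivide(node: Dict, max_pages: int,
              split: Callable[[Dict], List[Dict]]) -> Dict:
    """Recursively split any node that spans more than max_pages pages."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    n_pages = node["end_index"] - node["start_index"] + 1
    if n_pages <= max_pages:
        return node
    children = split(node)
    if len(children) < 2:
        # No sub-structure found; keep the oversized node as a leaf.
        return node
    node["nodes"] = [subdivide(child, max_pages, split) for child in children]
    return node


def split_in_half(node: Dict) -> List[Dict]:
    """Placeholder splitter: cut the page range in two (not how sections are really found)."""
    start, end = node["start_index"], node["end_index"]
    mid = (start + end) // 2
    return [
        {"title": f"{node['title']} (part 1)", "start_index": start, "end_index": mid},
        {"title": f"{node['title']} (part 2)", "start_index": mid + 1, "end_index": end},
    ]
```

For example, subdivide({"title": "Doc", "start_index": 1, "end_index": 10}, 3, split_in_half) yields two children covering pages 1-5 and 6-10, each of which is split again until every leaf spans at most three pages.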
Output Format
The generated JSON file contains the fields described below.
Field Descriptions
- doc_name: Filename without extension
- doc_description: Overall document summary (if if_add_doc_description=yes)
- structure: Array of top-level sections
- title: Section heading
- node_id: Unique identifier (if if_add_node_id=yes)
- start_index: Starting page number (1-indexed)
- end_index: Ending page number (inclusive)
- summary: AI-generated section summary (if if_add_node_summary=yes)
- prefix_summary: Summary of content before child sections (for parent nodes)
- text: Full text content (if if_add_node_text=yes)
- nodes: Child subsections (recursive structure)
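As an illustration of how these fields nest, the sketch below loads a small hand-written document and walks it depth-first. The titles and page ranges are invented for the example, the optional fields are left out, and iter_nodes is a hypothetical helper rather than part of PageIndex.

```python
import json

# Invented sample using only the field names listed above.
raw = """
{
  "doc_name": "document",
  "structure": [
    {
      "title": "Introduction",
      "start_index": 1,
      "end_index": 4,
      "nodes": [
        {"title": "Background", "start_index": 2, "end_index": 4}
      ]
    },
    {"title": "Methods", "start_index": 5, "end_index": 9}
  ]
}
"""
doc = json.loads(raw)


def iter_nodes(nodes, depth=0):
    """Yield (depth, node) pairs in document order."""
    for node in nodes:
        yield depth, node
        # Leaves may omit "nodes" or carry an empty list; handle both.
        yield from iter_nodes(node.get("nodes", []), depth + 1)


for depth, node in iter_nodes(doc["structure"]):
    indent = "  " * depth
    print(f'{indent}{node["title"]} (pages {node["start_index"]}-{node["end_index"]})')
```

The same traversal works on a file produced by PageIndex by replacing the inline string with json.load() on the results file.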
Advanced Examples
Generate with Full Text and Summaries
Process Large Documents
For large documents, increase the node size limits (maximum pages per node and maximum tokens per node).
Minimal Processing (Structure Only)
Tips & Best Practices
Troubleshooting
No TOC Found
If PageIndex doesn’t detect a TOC when one exists:
- Increase --toc-check-pages to scan more pages
- The TOC might be formatted unusually; PageIndex will fall back to content-based extraction
Incorrect Page Boundaries
If section boundaries are inaccurate:
- The verification system will attempt automatic correction
- Check the processing logs for accuracy metrics
- Consider adjusting node size parameters
Incomplete Structure
If some sections are missing:
- Verify the PDF has readable text (not scanned images)
- Check if the document length exceeds validation thresholds
- Review processing logs for truncation warnings
Next Steps
- Learn about Markdown processing
- Explore configuration options in detail
- Implement tree search strategies for retrieval