Structure Overview
The PageIndex tree structure organizes documents into a hierarchy of nodes, where each node represents a section or subsection of the document. This mirrors how humans naturally organize and navigate complex documents.
Node Properties
Each node in the tree contains the following key properties:
- title: The section heading or title extracted from the document
- node_id: A unique identifier for the node (e.g., “0001”, “0002”)
- start_index: The page number where the section begins
- end_index: The page number where the section ends
- nodes: An array of child nodes (subsections) if the section contains nested content
- summary (optional): An AI-generated summary of the section content
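As an illustration of these properties, the following minimal sketch checks a node represented as a plain Python dict. The `validate_node` helper, the `REQUIRED_KEYS` set, and the sample values are hypothetical and not part of PageIndex. The sketch assumes `nodes` and `summary` may be absent, and that page numbers start at 1.

```python
# Hypothetical helper: checks that a node dict carries the documented properties.
# "nodes" and "summary" are treated as optional (assumption).
REQUIRED_KEYS = {"title", "node_id", "start_index", "end_index"}


def validate_node(node: dict) -> list:
    """Return a list of problems found in a node and all of its descendants."""
    problems = []
    node_id = node.get("node_id", "<missing node_id>")

    # Report every required property that is absent.
    for key in sorted(REQUIRED_KEYS - node.keys()):
        problems.append(f"{node_id}: missing '{key}'")

    # Page numbers are assumed 1-based, and a section cannot end before it begins.
    start, end = node.get("start_index"), node.get("end_index")
    if isinstance(start, int) and isinstance(end, int):
        if start < 1:
            problems.append(f"{node_id}: start_index {start} is less than 1")
        if end < start:
            problems.append(f"{node_id}: end_index {end} is before start_index {start}")

    # Recurse into child nodes, if any.
    for child in node.get("nodes", []):
        problems.extend(validate_node(child))
    return problems


# Illustrative values only, not taken from a real document.
example_node = {
    "title": "Example Section",
    "node_id": "0001",
    "start_index": 1,
    "end_index": 4,
    "nodes": [
        {"title": "Example Subsection", "node_id": "0002", "start_index": 2, "end_index": 4},
    ],
}

print(validate_node(example_node))
```

An empty list means that the node and its descendants carry all required properties with consistent page ranges.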
The start_index and end_index are 1-based page numbers that correspond to the physical page numbers in the PDF document.
Real Example: Federal Reserve Annual Report
Here’s an actual tree structure generated from the Federal Reserve’s 2023 Annual Report:
Hierarchical Organization
The tree structure supports multiple levels of nesting, allowing for complex document hierarchies:
Document Description
When enabled, PageIndex can generate a high-level description of the entire document:
Node Summaries
Each node can include an AI-generated summary that captures the key information in that section:
Summaries are generated by LLMs analyzing the actual content of each section, providing context-rich metadata that aids in retrieval and understanding.
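Because every node carries a title, a page range, and an optional summary, a nested tree can be flattened into a readable outline with a short recursive walk. The sketch below assumes the tree is available as a list of root node dicts using the property names described earlier. The helpers `iter_nodes` and `render_outline` and the sample data are hypothetical.

```python
def iter_nodes(nodes, depth=0):
    """Yield (depth, node) pairs in document order (depth-first, parents first)."""
    for node in nodes:
        yield depth, node
        # "nodes" may be absent on leaf sections, so default to an empty list.
        yield from iter_nodes(node.get("nodes", []), depth + 1)


def render_outline(nodes):
    """Build an indented outline with node IDs, page ranges, and any summaries."""
    lines = []
    for depth, node in iter_nodes(nodes):
        line = (
            f"{'  ' * depth}[{node['node_id']}] {node['title']} "
            f"(pp. {node['start_index']}-{node['end_index']})"
        )
        summary = node.get("summary")
        if summary:
            line += f": {summary}"
        lines.append(line)
    return "\n".join(lines)


# Illustrative tree, not taken from a real document.
sample_tree = [
    {
        "title": "Example Chapter",
        "node_id": "0001",
        "start_index": 1,
        "end_index": 5,
        "summary": "Placeholder summary text.",
        "nodes": [
            {"title": "Example Section", "node_id": "0002", "start_index": 2, "end_index": 3},
        ],
    },
]

print(render_outline(sample_tree))
```

An outline like this is compact enough to place in an LLM prompt, so the model can choose relevant sections by title and summary before any page content is loaded.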
Tree Generation Process
PageIndex generates tree structures through multiple approaches depending on the document:
- With Table of Contents: If the document has a TOC with page numbers, PageIndex extracts and validates it
- Without Page Numbers: If the TOC lacks page numbers, PageIndex matches section titles to page content
- No Table of Contents: PageIndex generates the structure by analyzing document hierarchy directly from content
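The three cases above amount to a simple decision on what the table of contents provides. The following sketch shows that decision only. It is not PageIndex's implementation: the `Strategy` enum, the `choose_strategy` function, and the `(title, page)` input format are all hypothetical. Treating a TOC with only some page numbers as "without page numbers" is likewise an assumption.

```python
from enum import Enum
from typing import Optional, Sequence, Tuple


class Strategy(Enum):
    TOC_WITH_PAGES = "extract and validate the TOC"
    TOC_WITHOUT_PAGES = "match TOC section titles to page content"
    NO_TOC = "infer the hierarchy directly from content"


# Hypothetical input format: one (title, page number or None) pair per TOC entry.
TocEntry = Tuple[str, Optional[int]]


def choose_strategy(toc_entries: Optional[Sequence[TocEntry]]) -> Strategy:
    """Pick a generation approach from what the detected TOC contains."""
    if not toc_entries:
        # No TOC was found at all.
        return Strategy.NO_TOC
    if all(page is not None for _, page in toc_entries):
        # Every entry has a page number that can be checked against the document.
        return Strategy.TOC_WITH_PAGES
    # Assumption: any missing page number falls back to title matching.
    return Strategy.TOC_WITHOUT_PAGES


print(choose_strategy([("Introduction", 1), ("Methods", 7)]).name)
```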
Use Cases
The tree structure is ideal for:
- Financial reports and regulatory filings (10-Ks, annual reports)
- Academic textbooks and research papers
- Legal documents and technical manuals
- Policy documents and government reports
- Any document that exceeds LLM context limits
Benefits Over Chunking
- Natural Boundaries: Sections follow the document’s natural structure, not arbitrary token limits
- Preserved Context: The hierarchical relationship between sections is maintained
- Traceability: Each node maps directly to specific page ranges in the original document
- Reasoning-Friendly: LLMs can reason about section relevance using titles and summaries
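Traceability can be shown in both directions: from a page to the most specific section that covers it, and from a section back to its pages. The sketch below assumes node dicts with the property names described earlier and a list of root nodes. The helpers `find_deepest_node` and `page_slice` are hypothetical. When adjacent sections share a boundary page, this version returns the first match.

```python
def find_deepest_node(nodes, page):
    """Return the most deeply nested node whose page range contains `page`, or None."""
    for node in nodes:
        if node["start_index"] <= page <= node["end_index"]:
            # Prefer a matching child; fall back to this node if no child covers the page.
            child = find_deepest_node(node.get("nodes", []), page)
            return child or node
    return None


def page_slice(node):
    """Convert a node's 1-based inclusive page range to a 0-based Python slice."""
    return slice(node["start_index"] - 1, node["end_index"])


# Illustrative tree and page texts, not taken from a real document.
tree = [
    {
        "title": "Example Chapter",
        "node_id": "0001",
        "start_index": 1,
        "end_index": 5,
        "nodes": [
            {"title": "Example Section", "node_id": "0002", "start_index": 2, "end_index": 3},
        ],
    },
]
pages = ["page 1", "page 2", "page 3", "page 4", "page 5"]

section = find_deepest_node(tree, 3)
print(section["title"], pages[page_slice(section)])
```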
Next Steps
- Reasoning-Based RAG: Learn how PageIndex uses tree structures for intelligent retrieval
- Generate Tree Structure: Start generating tree structures from your documents