PageIndex transforms lengthy PDF documents into a semantic tree structure, similar to a “table of contents” but optimized for use with Large Language Models (LLMs). This hierarchical representation enables reasoning-based retrieval by organizing documents into naturally structured sections.

Structure Overview

The PageIndex tree structure organizes documents into a hierarchy of nodes, where each node represents a section or subsection of the document. This mirrors how humans naturally organize and navigate complex documents.

Node Properties

Each node in the tree contains the following key properties:
  • title: The section heading or title extracted from the document
  • node_id: A unique identifier for the node (e.g., “0001”, “0002”)
  • start_index: The page number where the section begins
  • end_index: The page number where the section ends
  • nodes: An array of child nodes (subsections) if the section contains nested content
  • summary (optional): An AI-generated summary of the section content
The start_index and end_index values are 1-based, inclusive page numbers that correspond to the physical pages of the PDF document. A section that fits on a single page has the same value for both.
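As a rough illustration of the properties above, a node could be modeled in Python as follows. This is a minimal sketch, not part of PageIndex itself: the Node class, from_dict, and page_count are hypothetical names, and the page count assumes that end_index is inclusive.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Node:
    """Hypothetical in-memory model of one PageIndex tree node."""

    title: str
    node_id: str
    start_index: int  # 1-based page where the section begins
    end_index: int    # 1-based page where the section ends (assumed inclusive)
    summary: Optional[str] = None  # optional AI-generated summary
    nodes: list["Node"] = field(default_factory=list)  # child sections

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "Node":
        """Build a Node (and its children, recursively) from parsed JSON."""
        return cls(
            title=raw["title"],
            node_id=raw["node_id"],
            start_index=raw["start_index"],
            end_index=raw["end_index"],
            summary=raw.get("summary"),
            nodes=[cls.from_dict(child) for child in raw.get("nodes", [])],
        )

    def page_count(self) -> int:
        """Number of pages the node's own range covers."""
        return self.end_index - self.start_index + 1
```

Leaf nodes simply end up with an empty nodes list, so code that walks the tree does not need a special case for them.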

Real Example: Federal Reserve Annual Report

Here’s an excerpt from an actual tree structure generated from the Federal Reserve’s 2023 Annual Report:
{
  "doc_name": "2023-annual-report.pdf",
  "structure": [
    {
      "title": "Preface",
      "start_index": 1,
      "end_index": 4,
      "node_id": "0000"
    },
    {
      "title": "Financial Stability",
      "start_index": 21,
      "end_index": 21,
      "nodes": [
        {
          "title": "Monitoring Financial Vulnerabilities",
          "start_index": 22,
          "end_index": 28,
          "node_id": "0007"
        },
        {
          "title": "Domestic and International Cooperation and Coordination",
          "start_index": 28,
          "end_index": 31,
          "node_id": "0008"
        }
      ],
      "node_id": "0006"
    },
    {
      "title": "Supervision and Regulation",
      "start_index": 31,
      "end_index": 31,
      "nodes": [
        {
          "title": "Supervised and Regulated Institutions",
          "start_index": 32,
          "end_index": 35,
          "node_id": "0010"
        },
        {
          "title": "Supervisory Developments",
          "start_index": 35,
          "end_index": 54,
          "node_id": "0011"
        }
      ],
      "node_id": "0009"
    }
  ]
}
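To show how such a tree can be read programmatically, the sketch below walks a trimmed copy of the structure above and renders an indented outline. The helper names iter_nodes and render_outline are illustrative assumptions, not PageIndex functions.

```python
from typing import Any, Iterator

# Trimmed copy of the example tree, already parsed from JSON.
tree = {
    "doc_name": "2023-annual-report.pdf",
    "structure": [
        {
            "title": "Financial Stability",
            "start_index": 21,
            "end_index": 21,
            "node_id": "0006",
            "nodes": [
                {
                    "title": "Monitoring Financial Vulnerabilities",
                    "start_index": 22,
                    "end_index": 28,
                    "node_id": "0007",
                }
            ],
        }
    ],
}


def iter_nodes(
    nodes: list[dict[str, Any]], depth: int = 0
) -> Iterator[tuple[int, dict[str, Any]]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    for node in nodes:
        yield depth, node
        # Leaf nodes have no "nodes" key, so default to an empty list.
        yield from iter_nodes(node.get("nodes", []), depth + 1)


def render_outline(doc: dict[str, Any]) -> str:
    """Render one indented line per node: id, title, and page range."""
    lines = []
    for depth, node in iter_nodes(doc["structure"]):
        pages = f"pp. {node['start_index']}-{node['end_index']}"
        lines.append(f"{'  ' * depth}[{node['node_id']}] {node['title']} ({pages})")
    return "\n".join(lines)


print(render_outline(tree))
```

Note that in this example a parent's own page range (page 21) does not enclose its children's ranges (pages 22 onward), so code should not assume that a parent range always contains its subsections.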

Hierarchical Organization

The tree structure supports multiple levels of nesting, allowing for complex document hierarchies:
{
  "title": "Parent Section",
  "start_index": 1,
  "end_index": 10,
  "node_id": "0001",
  "nodes": [
    {
      "title": "Child Section 1",
      "start_index": 1,
      "end_index": 5,
      "node_id": "0002",
      "nodes": [
        {
          "title": "Grandchild Section",
          "start_index": 2,
          "end_index": 3,
          "node_id": "0003"
        }
      ]
    },
    {
      "title": "Child Section 2",
      "start_index": 6,
      "end_index": 10,
      "node_id": "0004"
    }
  ]
}
The hierarchical structure lets tree search algorithms navigate a document efficiently, much as a human expert scans a table of contents to find relevant information.
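A top-down search over this structure might look like the sketch below. It is an illustration under stated assumptions, not PageIndex's retrieval algorithm: search_tree is a hypothetical helper, and the is_relevant callback stands in for whatever judgment (for example, an LLM reading titles and summaries) decides whether a section is worth opening.

```python
from typing import Any, Callable

Node = dict[str, Any]

# The nested example from above, already parsed from JSON.
parent = {
    "title": "Parent Section", "start_index": 1, "end_index": 10, "node_id": "0001",
    "nodes": [
        {
            "title": "Child Section 1", "start_index": 1, "end_index": 5, "node_id": "0002",
            "nodes": [
                {"title": "Grandchild Section", "start_index": 2, "end_index": 3,
                 "node_id": "0003"},
            ],
        },
        {"title": "Child Section 2", "start_index": 6, "end_index": 10, "node_id": "0004"},
    ],
}


def search_tree(nodes: list[Node], is_relevant: Callable[[Node], bool]) -> list[Node]:
    """Return the deepest relevant nodes, skipping subtrees judged irrelevant."""
    hits: list[Node] = []
    for node in nodes:
        if not is_relevant(node):
            continue  # prune: never look inside a section judged irrelevant
        deeper = search_tree(node.get("nodes", []), is_relevant)
        # Prefer the more specific subsections; fall back to the node itself.
        hits.extend(deeper if deeper else [node])
    return hits


# Stand-in judge: pretend these node_ids were judged relevant to a question.
relevant_ids = {"0001", "0002"}
found = search_tree([parent], lambda node: node["node_id"] in relevant_ids)
print([node["node_id"] for node in found])
```

Because irrelevant sections are pruned before their children are examined, the number of relevance checks grows with the size of the branches actually explored rather than with the whole document.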

Document Description

When this option is enabled, PageIndex generates a high-level description of the entire document and stores it in the doc_description field:
{
  "doc_name": "q1-fy25-earnings.pdf",
  "doc_description": "A comprehensive financial report detailing The Walt Disney Company's first-quarter fiscal 2025 performance, including revenue growth, segment highlights, guidance for fiscal 2025, and key financial metrics such as adjusted EPS, operating income, and cash flow.",
  "structure": [...]
}

Node Summaries

Each node can include an AI-generated summary that captures the key information in that section:
{
  "title": "Financial Results for the Quarter",
  "start_index": 1,
  "end_index": 1,
  "node_id": "0001",
  "summary": "The Walt Disney Company's financial performance for Q1 fiscal 2025: Revenue increased 5% to $24.7 billion, income before taxes rose 27% to $3.7 billion, and diluted EPS grew 35% to $1.40. Total segment operating income increased 31% to $5.1 billion."
}
Summaries are generated by LLMs analyzing the actual content of each section, providing context-rich metadata that aids in retrieval and understanding.
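One way titles and summaries could feed a retrieval step is to flatten them into a compact outline for a prompt. The sketch below only builds the prompt text and calls no model; summary_lines, build_selection_prompt, the prompt wording, and the max_chars cutoff are all assumptions made for illustration.

```python
from typing import Any


def summary_lines(
    nodes: list[dict[str, Any]], depth: int = 0, max_chars: int = 200
) -> list[str]:
    """One line per node: indented node_id and title, plus a shortened summary."""
    lines = []
    for node in nodes:
        summary = node.get("summary", "")  # summary is optional
        if len(summary) > max_chars:
            summary = summary[:max_chars].rstrip() + "..."
        line = f"{'  ' * depth}- [{node['node_id']}] {node['title']}"
        if summary:
            line += f": {summary}"
        lines.append(line)
        lines.extend(summary_lines(node.get("nodes", []), depth + 1, max_chars))
    return lines


def build_selection_prompt(question: str, nodes: list[dict[str, Any]]) -> str:
    """Assemble a prompt asking a model to pick node_ids worth reading."""
    outline = "\n".join(summary_lines(nodes))
    return (
        "Select the node_ids of the sections most likely to answer the question.\n\n"
        f"Question: {question}\n\n"
        f"Sections:\n{outline}\n"
    )


nodes = [
    {
        "title": "Financial Results for the Quarter",
        "start_index": 1,
        "end_index": 1,
        "node_id": "0001",
        "summary": "Revenue increased 5% to $24.7 billion.",
    }
]
print(build_selection_prompt("How did revenue change?", nodes))
```

Keeping node_id on every line lets the model's answer be mapped straight back to nodes, and from there to page ranges.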

Tree Generation Process

PageIndex chooses one of three approaches, depending on what the document provides:
  1. TOC with page numbers: PageIndex extracts the table of contents and validates it
  2. TOC without page numbers: PageIndex matches the section titles to page content to locate each section
  3. No TOC: PageIndex infers the hierarchy directly from the document content
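The choice between these three approaches can be summarized as a small decision function. This is only a sketch of the branching described above: the Strategy enum, choose_strategy, and its two boolean inputs are hypothetical, and how PageIndex actually detects a table of contents is not covered here.

```python
from enum import Enum


class Strategy(Enum):
    """Hypothetical labels for the three generation approaches."""

    TOC_WITH_PAGES = "extract and validate the table of contents"
    TOC_WITHOUT_PAGES = "match section titles to page content"
    NO_TOC = "infer the hierarchy from document content"


def choose_strategy(has_toc: bool, toc_has_page_numbers: bool) -> Strategy:
    """Pick an approach from two facts assumed to be known about the document."""
    if not has_toc:
        return Strategy.NO_TOC
    if toc_has_page_numbers:
        return Strategy.TOC_WITH_PAGES
    return Strategy.TOC_WITHOUT_PAGES


print(choose_strategy(has_toc=True, toc_has_page_numbers=False).value)
```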

Use Cases

The tree structure is ideal for:
  • Financial reports and regulatory filings (10-Ks, annual reports)
  • Academic textbooks and research papers
  • Legal documents and technical manuals
  • Policy documents and government reports
  • Any document that exceeds LLM context limits

Benefits Over Chunking

  • Natural Boundaries: Sections follow the document’s natural structure, not arbitrary token limits
  • Preserved Context: The hierarchical relationship between sections is maintained
  • Traceability: Each node maps directly to a specific page range in the original document
  • Reasoning-Friendly: LLMs can reason about section relevance using titles and summaries
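Traceability in particular is easy to demonstrate: given the text of each page, a node's page range selects exactly the pages to hand to a model or cite in an answer. The sketch below assumes the page texts are already available as a list in page order and that end_index is inclusive; node_text is a hypothetical helper, and PDF text extraction is out of scope.

```python
from typing import Any


def node_text(node: dict[str, Any], pages: list[str]) -> str:
    """Return the text of the pages a node covers.

    `pages` holds one string per PDF page, in order, so page 1 is pages[0].
    """
    start, end = node["start_index"], node["end_index"]
    if not 1 <= start <= end <= len(pages):
        raise ValueError(f"page range {start}-{end} is outside the document")
    # Convert the 1-based inclusive range to a 0-based half-open slice.
    return "\n".join(pages[start - 1 : end])


# Placeholder page texts standing in for extracted PDF content.
pages = [f"text of page {number}" for number in range(1, 11)]
node = {"title": "Child Section 2", "start_index": 6, "end_index": 10, "node_id": "0004"}
print(node_text(node, pages))
```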

Next Steps

  • Reasoning-Based RAG: Learn how PageIndex uses tree structures for intelligent retrieval
  • Generate Tree Structure: Start generating tree structures from your documents
