Tree Structure

PageIndex transforms lengthy PDF documents into a semantic tree structure, similar to a “table of contents” but optimized for use with Large Language Models (LLMs). This hierarchical representation enables reasoning-based retrieval by organizing documents into naturally structured sections.

Structure Overview

The PageIndex tree structure organizes documents into a hierarchy of nodes, where each node represents a section or subsection of the document. This mirrors how humans naturally organize and navigate complex documents.

Node Properties

Each node in the tree contains the following key properties:

title: The section heading or title extracted from the document
node_id: A unique identifier for the node (e.g., “0001”, “0002”)
start_index: The page number where the section begins
end_index: The page number where the section ends
nodes (optional): An array of child nodes (subsections), present when the section contains nested content
summary (optional): An AI-generated summary of the section content

The start_index and end_index are 1-based page numbers that correspond to the physical page numbers in the PDF document.
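The properties listed above can be written down as a small Python type with a sanity check. This is a sketch only: the names `TreeNode` and `validate_node` are hypothetical and not part of PageIndex, and the checks cover just what this page states (required keys, 1-based page numbers, an end page that does not precede the start page).

```python
from typing import List, TypedDict


class _RequiredNodeFields(TypedDict):
    title: str
    node_id: str
    start_index: int  # 1-based page number
    end_index: int    # 1-based page number


class TreeNode(_RequiredNodeFields, total=False):
    nodes: List["TreeNode"]  # omitted on leaf sections
    summary: str             # only present when summaries are generated


def validate_node(node: dict) -> List[str]:
    """Return human-readable problems found in a node and its descendants."""
    problems: List[str] = []
    label = node.get("node_id", "<no node_id>")
    for key, expected in (("title", str), ("node_id", str),
                          ("start_index", int), ("end_index", int)):
        if not isinstance(node.get(key), expected):
            problems.append(f"{label}: '{key}' missing or not {expected.__name__}")
    start, end = node.get("start_index"), node.get("end_index")
    if isinstance(start, int) and isinstance(end, int):
        if start < 1:
            problems.append(f"{label}: start_index must be >= 1")
        if end < start:
            problems.append(f"{label}: end_index is before start_index")
    # Parent ranges are deliberately not compared with child ranges: in the
    # Federal Reserve example on this page a parent covers page 21 only,
    # while its subsections run from page 22 to page 31.
    for child in node.get("nodes", []):
        problems.extend(validate_node(child))
    return problems
```

An empty list means the node and all of its descendants passed these basic checks.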

Real Example: Federal Reserve Annual Report

Here’s an excerpt of an actual tree structure generated from the Federal Reserve’s 2023 Annual Report (some nodes are omitted, which is why the node IDs are not consecutive):

{
  "doc_name": "2023-annual-report.pdf",
  "structure": [
    {
      "title": "Preface",
      "start_index": 1,
      "end_index": 4,
      "node_id": "0000"
    },
    {
      "title": "Financial Stability",
      "start_index": 21,
      "end_index": 21,
      "nodes": [
        {
          "title": "Monitoring Financial Vulnerabilities",
          "start_index": 22,
          "end_index": 28,
          "node_id": "0007"
        },
        {
          "title": "Domestic and International Cooperation and Coordination",
          "start_index": 28,
          "end_index": 31,
          "node_id": "0008"
        }
      ],
      "node_id": "0006"
    },
    {
      "title": "Supervision and Regulation",
      "start_index": 31,
      "end_index": 31,
      "nodes": [
        {
          "title": "Supervised and Regulated Institutions",
          "start_index": 32,
          "end_index": 35,
          "node_id": "0010"
        },
        {
          "title": "Supervisory Developments",
          "start_index": 35,
          "end_index": 54,
          "node_id": "0011"
        }
      ],
      "node_id": "0009"
    }
  ]
}
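To get a feel for this output, it helps to load it and print it as an indented outline. The sketch below uses only the standard library and a trimmed copy of the example above; the helper name `outline` is an assumption made for illustration, not a PageIndex function.

```python
import json
from typing import List

# Trimmed copy of the example above (two top-level sections only).
TREE_JSON = """
{
  "doc_name": "2023-annual-report.pdf",
  "structure": [
    {"title": "Preface", "start_index": 1, "end_index": 4, "node_id": "0000"},
    {"title": "Financial Stability", "start_index": 21, "end_index": 21,
     "node_id": "0006",
     "nodes": [
       {"title": "Monitoring Financial Vulnerabilities",
        "start_index": 22, "end_index": 28, "node_id": "0007"},
       {"title": "Domestic and International Cooperation and Coordination",
        "start_index": 28, "end_index": 31, "node_id": "0008"}
     ]}
  ]
}
"""


def outline(nodes: List[dict], depth: int = 0) -> List[str]:
    """Render nodes depth-first, one line per node, indented by nesting level."""
    lines = []
    for node in nodes:
        indent = "  " * depth
        lines.append(f"{indent}[{node['node_id']}] {node['title']} "
                     f"(pp. {node['start_index']}-{node['end_index']})")
        # Leaf nodes have no "nodes" key, so fall back to an empty list.
        lines.extend(outline(node.get("nodes", []), depth + 1))
    return lines


tree = json.loads(TREE_JSON)
print("\n".join(outline(tree["structure"])))
```

The outline has four lines, with the two subsections indented under "Financial Stability".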

Hierarchical Organization

The tree structure supports multiple levels of nesting, allowing for complex document hierarchies:

{
  "title": "Parent Section",
  "start_index": 1,
  "end_index": 10,
  "node_id": "0001",
  "nodes": [
    {
      "title": "Child Section 1",
      "start_index": 1,
      "end_index": 5,
      "node_id": "0002",
      "nodes": [
        {
          "title": "Grandchild Section",
          "start_index": 2,
          "end_index": 3,
          "node_id": "0003"
        }
      ]
    },
    {
      "title": "Child Section 2",
      "start_index": 6,
      "end_index": 10,
      "node_id": "0004"
    }
  ]
}

The hierarchical structure enables efficient tree search algorithms to navigate documents, similar to how human experts would scan a table of contents to find relevant information.
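As a simple illustration of navigating the hierarchy, the sketch below performs a depth-first search for a node_id and returns the chain of titles leading to it. It is a plain lookup, not PageIndex's retrieval algorithm, and `find_path` is a hypothetical helper name.

```python
from typing import List, Optional, Tuple

# The nested example above, as a Python dict.
TREE = {
    "title": "Parent Section", "start_index": 1, "end_index": 10, "node_id": "0001",
    "nodes": [
        {"title": "Child Section 1", "start_index": 1, "end_index": 5, "node_id": "0002",
         "nodes": [
             {"title": "Grandchild Section", "start_index": 2, "end_index": 3,
              "node_id": "0003"},
         ]},
        {"title": "Child Section 2", "start_index": 6, "end_index": 10, "node_id": "0004"},
    ],
}


def find_path(nodes: List[dict], node_id: str,
              trail: Tuple[str, ...] = ()) -> Optional[List[str]]:
    """Return the titles from the root down to the node with node_id, or None."""
    for node in nodes:
        here = trail + (node["title"],)
        if node["node_id"] == node_id:
            return list(here)
        found = find_path(node.get("nodes", []), node_id, here)
        if found is not None:
            return found
    return None


print(find_path([TREE], "0003"))
```

The returned path gives each hit its surrounding context (which parent sections it belongs to), something a flat list of chunks does not record.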

Document Description

When enabled, PageIndex can generate a high-level description of the entire document:

{
  "doc_name": "q1-fy25-earnings.pdf",
  "doc_description": "A comprehensive financial report detailing The Walt Disney Company's first-quarter fiscal 2025 performance, including revenue growth, segment highlights, guidance for fiscal 2025, and key financial metrics such as adjusted EPS, operating income, and cash flow.",
  "structure": [...]
}

Node Summaries

Each node can include an AI-generated summary that captures the key information in that section:

{
  "title": "Financial Results for the Quarter",
  "start_index": 1,
  "end_index": 1,
  "node_id": "0001",
  "summary": "The Walt Disney Company's financial performance for Q1 fiscal 2025: Revenue increased 5% to $24.7 billion, income before taxes rose 27% to $3.7 billion, and diluted EPS grew 35% to $1.40. Total segment operating income increased 31% to $5.1 billion."
}

Summaries are generated by LLMs analyzing the actual content of each section, providing context-rich metadata that aids in retrieval and understanding.
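One way titles and summaries might be put in front of an LLM is as a compact listing followed by a question. The sketch below only builds the prompt text and does not call any model. The function names and the prompt wording are assumptions for illustration; the source does not specify how PageIndex phrases its prompts.

```python
from typing import List


def render_nodes(nodes: List[dict], depth: int = 0) -> List[str]:
    """One line per node: node_id, title, and the summary when present."""
    lines = []
    for node in nodes:
        line = f"{'  ' * depth}- {node['node_id']}: {node['title']}"
        if node.get("summary"):
            line += f" | {node['summary']}"
        lines.append(line)
        lines.extend(render_nodes(node.get("nodes", []), depth + 1))
    return lines


def build_selection_prompt(question: str, nodes: List[dict]) -> str:
    """Assemble a prompt asking a model to pick relevant sections (hypothetical wording)."""
    listing = "\n".join(render_nodes(nodes))
    return (
        "Sections of the document:\n"
        f"{listing}\n\n"
        f"Question: {question}\n"
        "Reply with the node_id values of the sections most likely to contain the answer."
    )


# Node taken from the example above, with the summary cut short for brevity.
NODES = [{
    "title": "Financial Results for the Quarter",
    "start_index": 1,
    "end_index": 1,
    "node_id": "0001",
    "summary": "The Walt Disney Company's financial performance for Q1 fiscal 2025: "
               "Revenue increased 5% to $24.7 billion",
}]

print(build_selection_prompt("How much did revenue grow?", NODES))
```

Because the listing carries node_id values, a model's answer can be mapped straight back to nodes and their page ranges.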

Tree Generation Process

PageIndex builds the tree in one of three ways, depending on whether the document has a table of contents (TOC) and whether that TOC includes page numbers:

TOC with page numbers: PageIndex extracts the TOC and validates it.
TOC without page numbers: PageIndex matches the section titles to page content.
No TOC: PageIndex generates the structure by analyzing the document hierarchy directly from the content.
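The three cases above amount to a small decision rule, sketched here in Python. The enum, the function name, and the two boolean inputs are hypothetical; how PageIndex actually detects a TOC or its page numbers is not described on this page.

```python
from enum import Enum


class Strategy(Enum):
    TOC_WITH_PAGES = "extract and validate the TOC"
    TOC_WITHOUT_PAGES = "match TOC section titles to page content"
    NO_TOC = "infer the hierarchy directly from the content"


def choose_strategy(has_toc: bool, toc_has_page_numbers: bool) -> Strategy:
    """Pick a generation approach from two facts about the document (illustrative only)."""
    if not has_toc:
        return Strategy.NO_TOC
    if toc_has_page_numbers:
        return Strategy.TOC_WITH_PAGES
    return Strategy.TOC_WITHOUT_PAGES


print(choose_strategy(has_toc=True, toc_has_page_numbers=False).value)
```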

Use Cases

The tree structure is ideal for:

Financial reports and regulatory filings (10-Ks, annual reports)
Academic textbooks and research papers
Legal documents and technical manuals
Policy documents and government reports
Any document that exceeds LLM context limits

Benefits Over Chunking

Natural Boundaries

Sections follow the document’s natural structure, not arbitrary token limits

Preserved Context

The hierarchical relationship between sections is maintained

Traceability

Each node maps directly to specific page ranges in the original document

Reasoning-Friendly

LLMs can reason about section relevance using titles and summaries
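Traceability is easy to exploit in code: once a set of nodes has been selected, their start_index and end_index values give the exact pages to fetch. A minimal sketch follows, assuming inclusive 1-based ranges as described under Node Properties; `pages_for` is a hypothetical helper name.

```python
from typing import Iterable, List


def pages_for(nodes: Iterable[dict]) -> List[int]:
    """Return the sorted, de-duplicated page numbers covered by the given nodes."""
    pages = set()
    for node in nodes:
        # end_index is inclusive, so add 1 for range().
        pages.update(range(node["start_index"], node["end_index"] + 1))
    return sorted(pages)


# Two adjacent sections from the Federal Reserve example; both include page 28.
selected = [
    {"title": "Monitoring Financial Vulnerabilities",
     "start_index": 22, "end_index": 28, "node_id": "0007"},
    {"title": "Domestic and International Cooperation and Coordination",
     "start_index": 28, "end_index": 31, "node_id": "0008"},
]

print(pages_for(selected))
```

Using a set matters here because neighbouring sections can share a boundary page, as these two do.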

Next Steps

Reasoning-Based RAG

Learn how PageIndex uses tree structures for intelligent retrieval

Generate Tree Structure

Start generating tree structures from your documents
