Quick Start

Overview

This guide walks you through generating your first PageIndex tree structure from a PDF document. You will turn a long PDF into a hierarchical, semantic tree index optimized for LLM-powered retrieval.

PageIndex is ideal for financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.

Install Dependencies

Install the required Python packages using pip:

pip3 install --upgrade -r requirements.txt

This will install:

openai==1.101.0 - OpenAI API client
pymupdf==1.26.4 - PDF parsing
PyPDF2==3.0.1 - Additional PDF utilities
python-dotenv==1.1.0 - Environment variable management
tiktoken==0.11.0 - Token counting
pyyaml==6.0.2 - Configuration file support

Configure Your OpenAI API Key

Create a .env file in the root directory of the PageIndex repository (the directory that contains run_pageindex.py) and add your OpenAI API key:

CHATGPT_API_KEY=your_openai_key_here

PageIndex uses GPT-4o by default for high-quality tree generation. Make sure your API key has access to the required models.
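Before the first run, it can help to confirm that the key is actually defined. The following is a minimal sketch using only the standard library, assuming the plain KEY=value format shown above. `read_env_value` is a hypothetical helper written for illustration and is not part of PageIndex, which presumably reads the file through python-dotenv.

```python
from pathlib import Path
from typing import Optional


def read_env_value(env_path: str, name: str) -> Optional[str]:
    """Return the value assigned to `name` in a simple KEY=value .env file.

    Hypothetical helper for a quick sanity check. It handles only plain
    KEY=value lines, blank lines, and # comments.
    """
    path = Path(env_path)
    if not path.is_file():
        return None
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            # Drop optional surrounding quotes; treat an empty value as missing.
            return value.strip().strip("'\"") or None
    return None


if __name__ == "__main__":
    key = read_env_value(".env", "CHATGPT_API_KEY")
    if key is None or key == "your_openai_key_here":
        print("CHATGPT_API_KEY is missing or still set to the placeholder")
    else:
        print("CHATGPT_API_KEY found")
```

Run it from the same directory as the .env file. It only checks that a value is present, not that the key is valid or has access to the model.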

Run PageIndex on Your PDF

Process your PDF document to generate the tree structure:

python3 run_pageindex.py --pdf_path /path/to/your/document.pdf

PageIndex will:

Parse your PDF and extract text content
Detect the table of contents (if present)
Generate a hierarchical tree structure with summaries
Save the output as a JSON file in ./results/

You should see output like:

Parsing done, saving to file...
Tree structure saved to: ./results/document_structure.json
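
The sample output above suggests that the result file is named after the input PDF with a `_structure.json` suffix. Here is a small sketch for locating and loading that file from Python, assuming this naming pattern holds (it is inferred from the example, not confirmed). `expected_output_path` and `load_tree` are hypothetical helper names.

```python
import json
from pathlib import Path


def expected_output_path(pdf_path: str, results_dir: str = "./results") -> Path:
    # Assumption: the output is <results_dir>/<PDF file stem>_structure.json,
    # as in the sample log line for document.pdf.
    return Path(results_dir) / f"{Path(pdf_path).stem}_structure.json"


def load_tree(pdf_path: str, results_dir: str = "./results"):
    """Load the generated JSON for `pdf_path`, or fail with a clear message."""
    out = expected_output_path(pdf_path, results_dir)
    if not out.is_file():
        raise FileNotFoundError(f"No tree found at {out}; did the run finish?")
    with out.open(encoding="utf-8") as f:
        return json.load(f)
```

If the file is not where this sketch expects it, rely on the path printed in the "Tree structure saved to:" line instead.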

Explore the Generated Tree

Open the generated JSON file to explore your document’s tree structure:

{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}

Each node contains:

title - Section heading
node_id - Unique identifier
start_index / end_index - Page range
summary - AI-generated section summary
nodes - Nested subsections (if any)
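
The fields above are enough to walk the tree recursively. The following sketch prints an indented outline for a single node shaped like the example. It assumes only the keys listed here, and it does not assume anything about how the saved file wraps its top-level nodes. `iter_nodes` and `format_outline` are illustrative names, not PageIndex functions.

```python
from typing import Dict, Iterator, Tuple


def iter_nodes(node: Dict, depth: int = 0) -> Iterator[Tuple[int, Dict]]:
    """Yield (depth, node) pairs in document order (depth-first, pre-order)."""
    yield depth, node
    for child in node.get("nodes", []):  # "nodes" is absent on leaf sections
        yield from iter_nodes(child, depth + 1)


def format_outline(node: Dict) -> str:
    """Render one line per node, indented by nesting depth."""
    lines = []
    for depth, n in iter_nodes(node):
        pages = f"{n['start_index']}-{n['end_index']}"
        node_id = n.get("node_id", "?")  # IDs are optional (--if-add-node-id)
        lines.append(f"{'  ' * depth}[{node_id}] {n['title']} (pages {pages})")
    return "\n".join(lines)


# Trimmed copy of the example node shown above.
sample = {
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007",
         "start_index": 22, "end_index": 28},
        {"title": "Domestic and International Cooperation", "node_id": "0008",
         "start_index": 28, "end_index": 31},
    ],
}

print(format_outline(sample))
# [0006] Financial Stability (pages 21-22)
#   [0007] Monitoring Financial Vulnerabilities (pages 22-28)
#   [0008] Domestic and International Cooperation (pages 28-31)
```

The same generator can drive other tasks, such as collecting every summary or finding the nodes whose page range contains a given page.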

Customization Options

PageIndex supports various optional parameters to customize tree generation:

python3 run_pageindex.py \
  --pdf_path document.pdf \
  --model gpt-4o-2024-11-20

Available Parameters

Parameter	Default	Description
`--model`	`gpt-4o-2024-11-20`	OpenAI model to use for tree generation
`--toc-check-pages`	`20`	Number of pages to check for table of contents
`--max-pages-per-node`	`10`	Maximum pages per tree node
`--max-tokens-per-node`	`20000`	Maximum tokens per tree node
`--if-add-node-id`	`yes`	Add unique node IDs
`--if-add-node-summary`	`yes`	Generate AI summaries for each node
`--if-add-doc-description`	`no`	Add document-level description
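
When several documents need the same settings, the command line can be assembled from Python. This is a sketch under stated assumptions: it uses only the script name and flags listed in this guide, and `build_command` and `run_pageindex` are hypothetical wrappers, not part of PageIndex.

```python
import subprocess
import sys
from typing import Dict, List, Optional


def build_command(pdf_path: str, options: Optional[Dict[str, object]] = None) -> List[str]:
    """Build the argument list for run_pageindex.py.

    `options` maps flag names from the table above (without the leading
    dashes) to their values.
    """
    cmd = [sys.executable, "run_pageindex.py", "--pdf_path", pdf_path]
    for flag, value in (options or {}).items():
        cmd += [f"--{flag}", str(value)]
    return cmd


def run_pageindex(pdf_path: str, options: Optional[Dict[str, object]] = None) -> None:
    # Must be called from the repository root, where run_pageindex.py lives.
    subprocess.run(build_command(pdf_path, options), check=True)


cmd = build_command(
    "document.pdf",
    {"max-pages-per-node": 5, "if-add-doc-description": "yes"},
)
print(" ".join(cmd[1:]))
# run_pageindex.py --pdf_path document.pdf --max-pages-per-node 5 --if-add-doc-description yes
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.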

Markdown Support

PageIndex also supports markdown files. Use the --md_path flag instead:

python3 run_pageindex.py --md_path /path/to/your/document.md

PageIndex uses # markers to determine node hierarchy. Ensure your markdown file has proper heading structure (## for level 2, ### for level 3, etc.).
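
To preview the hierarchy that the # markers imply, you can nest the headings yourself before running PageIndex. The sketch below is an illustration of the general idea, not PageIndex's actual parser. `heading_tree` is a hypothetical helper, and it recognizes only ATX-style headings (lines beginning with one to six # characters followed by a space).

```python
import re
from typing import Dict, List

HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def heading_tree(markdown_text: str) -> List[Dict]:
    """Nest headings by their # count and return the top-level nodes."""
    roots: List[Dict] = []
    stack: List[Dict] = []  # currently open ancestors, shallowest first
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # skip # lines inside code fences
            continue
        match = None if in_fence else HEADING_RE.match(line)
        if not match:
            continue
        node = {"level": len(match.group(1)), "title": match.group(2), "nodes": []}
        # Close every open heading at the same or a deeper level.
        while stack and stack[-1]["level"] >= node["level"]:
            stack.pop()
        (stack[-1]["nodes"] if stack else roots).append(node)
        stack.append(node)
    return roots
```

If a heading ends up under an unexpected parent, or many sections appear as siblings at the top level, the heading levels in the file probably need fixing first.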

If your markdown was converted from PDF or HTML, its heading hierarchy may be unreliable, because most conversion tools don’t preserve hierarchy correctly. Consider using PageIndex OCR for better conversion quality.

Next Steps

Vectorless RAG Cookbook

Build a complete reasoning-based RAG system with PageIndex

API Reference

Explore the full Python API and configuration options

Tree Search Tutorial

Learn how to perform reasoning-based retrieval over the tree

Example Documents

See sample PDFs and their generated tree structures
