Overview

This guide will walk you through generating your first PageIndex tree structure from a PDF document. You’ll transform a lengthy PDF into a hierarchical, semantic tree index optimized for LLM-powered retrieval.
PageIndex is ideal for financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.

Step 1: Install Dependencies

Install the required Python packages using pip:
pip3 install --upgrade -r requirements.txt
This will install:
  • openai==1.101.0 - OpenAI API client
  • pymupdf==1.26.4 - PDF parsing
  • PyPDF2==3.0.1 - Additional PDF utilities
  • python-dotenv==1.1.0 - Environment variable management
  • tiktoken==0.11.0 - Token counting
  • pyyaml==6.0.2 - Configuration file support
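
To confirm that the pinned versions listed above were actually installed, a small check along these lines may help. This is only a sketch: the PINNED mapping copies the versions from the list above, and check_versions is a hypothetical helper name, not part of PageIndex.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the dependency list above.
PINNED = {
    "openai": "1.101.0",
    "pymupdf": "1.26.4",
    "PyPDF2": "3.0.1",
    "python-dotenv": "1.1.0",
    "tiktoken": "0.11.0",
    "pyyaml": "6.0.2",
}


def check_versions(pinned):
    """Return {package: (expected, installed)}; installed is None if missing."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (expected, installed)
    return report


if __name__ == "__main__":
    for name, (expected, installed) in check_versions(PINNED).items():
        status = "ok" if installed == expected else "MISMATCH"
        print(f"{name}: expected {expected}, found {installed} [{status}]")
```

A mismatch is not necessarily fatal, but it is a useful first thing to rule out if later steps fail.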

Step 2: Configure Your OpenAI API Key

Create a .env file in the project's root directory and add your OpenAI API key:
CHATGPT_API_KEY=your_openai_key_here
PageIndex uses GPT-4o by default for high-quality tree generation. Make sure your API key has access to the required models.
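
Before running PageIndex, it can be worth checking that the key is actually readable. The sketch below parses a .env file with the standard library only, as a simplified stand-in for what python-dotenv does (it ignores `export` prefixes, variable expansion, and quoting edge cases). read_env_file is a hypothetical helper, not part of PageIndex.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines; skip blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


if __name__ == "__main__":
    key = read_env_file().get("CHATGPT_API_KEY", "")
    if not key or key == "your_openai_key_here":
        print("CHATGPT_API_KEY is missing or still the placeholder value")
    else:
        print("CHATGPT_API_KEY found")
```

This only confirms that the variable is set; it does not verify that the key is valid or has access to a given model.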

Step 3: Run PageIndex on Your PDF

Process your PDF document to generate the tree structure:
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
PageIndex will:
  1. Parse your PDF and extract text content
  2. Detect the table of contents (if present)
  3. Generate a hierarchical tree structure with summaries
  4. Save the output as a JSON file in ./results/
You should see output like:
Parsing done, saving to file...
Tree structure saved to: ./results/document_structure.json
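
To pick up the result from another script, something like the following could locate and load the saved file. It is a sketch under one assumption: that the output is named after the PDF's file name with a `_structure.json` suffix, which is inferred from the sample output above and not confirmed here. expected_output_path and load_tree are hypothetical helpers, not part of PageIndex.

```python
import json
from pathlib import Path


def expected_output_path(pdf_path, results_dir="results"):
    """Guess where the tree for pdf_path is saved.

    Assumption: <PDF name without extension>_structure.json, inferred from
    the sample output (document.pdf -> document_structure.json).
    """
    return Path(results_dir) / f"{Path(pdf_path).stem}_structure.json"


def load_tree(pdf_path, results_dir="results"):
    """Return the parsed JSON output, or None if the file does not exist."""
    path = expected_output_path(pdf_path, results_dir)
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    tree = load_tree("/path/to/your/document.pdf")
    print("no output found" if tree is None else f"loaded {type(tree).__name__}")
```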

Step 4: Explore the Generated Tree

Open the generated JSON file to explore your document’s tree structure:
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
Each node contains:
  • title - Section heading
  • node_id - Unique identifier
  • start_index / end_index - Page range
  • summary - AI-generated section summary
  • nodes - Nested subsections (if any)
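
The node fields above lend themselves to a simple recursive walk. The sketch below reuses the sample node from this step, prints an indented outline, and looks up which nodes cover a given page. It assumes that start_index and end_index are inclusive page numbers, which this guide does not state explicitly; iter_nodes, outline, and nodes_covering are hypothetical helper names.

```python
# Sample node copied from the JSON above (summaries left truncated as shown).
SAMPLE = {
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "summary": "The Federal Reserve ...",
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "The Federal Reserve's monitoring ...",
        },
        {
            "title": "Domestic and International Cooperation",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve collaborated ...",
        },
    ],
}


def iter_nodes(node, depth=0):
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.get("nodes", []):  # "nodes" is absent on leaf nodes
        yield from iter_nodes(child, depth + 1)


def outline(node):
    """Return one indented line per node with its ID, title, and pages."""
    return [
        f"{'  ' * depth}[{n['node_id']}] {n['title']} "
        f"(pages {n['start_index']}-{n['end_index']})"
        for depth, n in iter_nodes(node)
    ]


def nodes_covering(node, page):
    """Return IDs of nodes whose range contains page (assumed inclusive)."""
    return [
        n["node_id"]
        for _, n in iter_nodes(node)
        if n["start_index"] <= page <= n["end_index"]
    ]


if __name__ == "__main__":
    print("\n".join(outline(SAMPLE)))
    print(nodes_covering(SAMPLE, 22))
```

Because adjacent sections can share a boundary page in the sample, a page lookup may return more than one node.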

Customization Options

PageIndex accepts optional parameters to customize tree generation. For example, to set the model explicitly:
python3 run_pageindex.py \
  --pdf_path document.pdf \
  --model gpt-4o-2024-11-20

Available Parameters

Parameter                  Default            Description
--model                    gpt-4o-2024-11-20  OpenAI model to use for tree generation
--toc-check-pages          20                 Number of pages to check for table of contents
--max-pages-per-node       10                 Maximum pages per tree node
--max-tokens-per-node      20000              Maximum tokens per tree node
--if-add-node-id           yes                Add unique node IDs
--if-add-node-summary      yes                Generate AI summaries for each node
--if-add-doc-description   no                 Add document-level description
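
When running PageIndex over several documents, it may be convenient to assemble the command line programmatically. The sketch below builds an argument list from the flags and defaults in the table above and rejects names that are not in it. build_command and the DEFAULTS mapping are hypothetical conveniences, not part of PageIndex.

```python
import shlex

# Flags and defaults copied from the parameter table above.
DEFAULTS = {
    "model": "gpt-4o-2024-11-20",
    "toc-check-pages": 20,
    "max-pages-per-node": 10,
    "max-tokens-per-node": 20000,
    "if-add-node-id": "yes",
    "if-add-node-summary": "yes",
    "if-add-doc-description": "no",
}


def build_command(pdf_path, **overrides):
    """Build an argv list; keyword names use underscores for hyphens."""
    argv = ["python3", "run_pageindex.py", "--pdf_path", str(pdf_path)]
    for name, value in overrides.items():
        flag = name.replace("_", "-")
        if flag not in DEFAULTS:
            raise ValueError(f"unknown option: {name}")
        argv += [f"--{flag}", str(value)]
    return argv


if __name__ == "__main__":
    argv = build_command("document.pdf", max_pages_per_node=5,
                         if_add_doc_description="yes")
    # Print the command; pass argv to subprocess.run(argv, check=True) to run it.
    print(shlex.join(argv))
```

Passing a list to subprocess.run avoids shell quoting problems with paths that contain spaces.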

Markdown Support

PageIndex also supports markdown files. Use the --md_path flag instead:
python3 run_pageindex.py --md_path /path/to/your/document.md
PageIndex uses # heading markers to determine node hierarchy. Ensure your markdown file has a proper heading structure (# for level 1, ## for level 2, ### for level 3, etc.).
Markdown converted from PDF or HTML often loses the original heading hierarchy, because most conversion tools don't preserve it correctly. Consider using PageIndex OCR for better conversion quality.
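
A quick way to spot a broken heading structure before running PageIndex is to scan for headings that jump more than one level deeper than the one before them. The sketch below does this with a regular expression over # markers and skips fenced code blocks. How PageIndex itself handles skipped levels is not covered in this guide, so treat the result as a hint only; heading_levels and skipped_levels are hypothetical helpers.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def heading_levels(markdown_text):
    """Return (level, title) for each # heading outside fenced code blocks."""
    headings = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


def skipped_levels(headings):
    """Return (previous_level, level, title) where a level is skipped.

    A document whose first heading is deeper than level 1 is also flagged,
    with previous_level reported as 0.
    """
    problems = []
    previous = 0
    for level, title in headings:
        if level > previous + 1:
            problems.append((previous, level, title))
        previous = level
    return problems


if __name__ == "__main__":
    doc = "# Report\n\n### Deep section\n\n## Overview\n\n### Details\n"
    for previous, level, title in skipped_levels(heading_levels(doc)):
        print(f"'{title}' jumps from level {previous} to level {level}")
```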

Next Steps

  • Vectorless RAG Cookbook - Build a complete reasoning-based RAG system with PageIndex
  • API Reference - Explore the full Python API and configuration options
  • Tree Search Tutorial - Learn how to perform reasoning-based retrieval over the tree
  • Example Documents - See sample PDFs and their generated tree structures
