Generating Tree from PDF

Overview

PageIndex can automatically generate hierarchical tree structures from PDF documents by:

Detecting and extracting the table of contents (if present)
Identifying section boundaries and page numbers
Recursively building a hierarchical tree structure
Optionally generating summaries and descriptions

Quick Start

Install PageIndex

pip install pageindex

Basic Usage with CLI

Generate a tree structure from a PDF:

python run_pageindex.py --pdf_path document.pdf

This will create a JSON file at ./results/document_structure.json.
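The output file can then be located and loaded from Python. The following is a minimal sketch, assuming the `<pdf stem>_structure.json` naming seen in the example above also applies to other filenames; `structure_path` and `load_structure` are hypothetical helper names, not part of PageIndex.

```python
import json
from pathlib import Path


def structure_path(pdf_path, results_dir="results"):
    """Guess where the CLI writes the tree for a given PDF.

    Assumption: the output is named <pdf stem>_structure.json, as in the
    document.pdf -> ./results/document_structure.json example.
    """
    return Path(results_dir) / f"{Path(pdf_path).stem}_structure.json"


def load_structure(pdf_path, results_dir="results"):
    """Load the generated tree as a Python dict."""
    with open(structure_path(pdf_path, results_dir), encoding="utf-8") as f:
        return json.load(f)
```

For example, `load_structure("document.pdf")["structure"]` would return the list of top-level sections once the CLI run has finished.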

Programmatic Usage

Use the page_index function in your Python code:

from pageindex import page_index

result = page_index('document.pdf')
print(result['doc_name'])
print(result['structure'])

CLI Parameters

Required Parameters

--pdf_path (string, required)

Path to the PDF file to process. The file must have a .pdf extension.

python run_pageindex.py --pdf_path /path/to/document.pdf

Model Configuration

--model (string, default: "gpt-4o-2024-11-20")

LLM model to use for structure extraction and summary generation.

python run_pageindex.py --pdf_path document.pdf --model gpt-4o-2024-11-20

PDF-Specific Parameters

--toc-check-pages (integer, default: 20)

Number of pages at the start of the document to scan for a table of contents.

python run_pageindex.py --pdf_path document.pdf --toc-check-pages 30

--max-pages-per-node (integer, default: 10)

Maximum number of pages allowed per node. Larger nodes will be recursively subdivided.

python run_pageindex.py --pdf_path document.pdf --max-pages-per-node 15

--max-tokens-per-node (integer, default: 20000)

Maximum number of tokens per node. Nodes exceeding this will be subdivided.

python run_pageindex.py --pdf_path document.pdf --max-tokens-per-node 25000

Content Enrichment Parameters

--if-add-node-id (string, default: "yes")

Whether to add a unique node ID to each node. Options: yes, no.

python run_pageindex.py --pdf_path document.pdf --if-add-node-id yes

--if-add-node-summary (string, default: "yes")

Whether to generate AI summaries for each node. Options: yes, no.

python run_pageindex.py --pdf_path document.pdf --if-add-node-summary yes

--if-add-doc-description (string, default: "no")

Whether to generate an overall document description. Options: yes, no.

python run_pageindex.py --pdf_path document.pdf --if-add-doc-description yes

--if-add-node-text (string, default: "no")

Whether to include full text content in each node. Options: yes, no.

python run_pageindex.py --pdf_path document.pdf --if-add-node-text yes
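When the CLI is driven from another script, the flags above can be assembled programmatically. This is a rough sketch: `build_command` is a hypothetical helper, it uses only the flags listed in this section, and the case-insensitive extension check is an assumption about how strictly the .pdf requirement is enforced.

```python
import sys


def build_command(pdf_path, **options):
    """Build an argv list for run_pageindex.py.

    Keyword names use underscores and are turned into the hyphenated
    flags documented above, e.g. toc_check_pages=30 -> --toc-check-pages 30.
    Booleans are converted to the yes/no strings the CLI expects.
    """
    # Assumption: the extension check ignores case.
    if not str(pdf_path).lower().endswith(".pdf"):
        raise ValueError("pdf_path must have a .pdf extension")
    # --pdf_path is the one flag spelled with an underscore.
    cmd = [sys.executable, "run_pageindex.py", "--pdf_path", str(pdf_path)]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "yes" if value else "no"
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains run_pageindex.py.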

Programmatic API

Using `page_index()` Function

The page_index() function provides a programmatic interface with the same configuration options:

from pageindex import page_index

result = page_index(
    doc='document.pdf',
    model='gpt-4o-2024-11-20',
    toc_check_page_num=20,
    max_page_num_each_node=10,
    max_token_num_each_node=20000,
    if_add_node_id='yes',
    if_add_node_summary='yes',
    if_add_doc_description='no',
    if_add_node_text='no'
)

print(f"Document: {result['doc_name']}")
if 'doc_description' in result:
    print(f"Description: {result['doc_description']}")
print(f"Structure: {result['structure']}")

Function Parameters

doc (string | BytesIO, required): Path to a PDF file, or a BytesIO object containing PDF data.
model (string, default: "gpt-4o-2024-11-20"): LLM model identifier for structure extraction.
toc_check_page_num (integer, default: 20): Number of pages to scan for a table of contents.
max_page_num_each_node (integer, default: 10): Maximum pages per node before subdivision.
max_token_num_each_node (integer, default: 20000): Maximum tokens per node before subdivision.
if_add_node_id (string, default: "yes"): Add unique identifiers to nodes (yes/no).
if_add_node_summary (string, default: "yes"): Generate AI summaries for nodes (yes/no).
if_add_doc_description (string, default: "no"): Generate a document-level description (yes/no).
if_add_node_text (string, default: "no"): Include full text in nodes (yes/no).
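Because the enrichment switches are yes/no strings rather than booleans, it can help to validate options before starting a long run. The sketch below mirrors the defaults listed above; `DEFAULTS` and `make_options` are hypothetical names introduced for illustration, and whether page_index() performs similar checks itself is not stated here.

```python
# Defaults copied from the parameter list above.
DEFAULTS = {
    "model": "gpt-4o-2024-11-20",
    "toc_check_page_num": 20,
    "max_page_num_each_node": 10,
    "max_token_num_each_node": 20000,
    "if_add_node_id": "yes",
    "if_add_node_summary": "yes",
    "if_add_doc_description": "no",
    "if_add_node_text": "no",
}


def make_options(**overrides):
    """Merge overrides into the documented defaults and validate them."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise TypeError(f"unknown option(s): {sorted(unknown)}")
    options = {**DEFAULTS, **overrides}
    for name, value in options.items():
        if name.startswith("if_") and value not in ("yes", "no"):
            raise ValueError(f"{name} must be 'yes' or 'no', got {value!r}")
        if isinstance(DEFAULTS[name], int) and (
            not isinstance(value, int) or value <= 0
        ):
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return options
```

The validated dictionary can then be unpacked into the call, for example `page_index('document.pdf', **make_options(if_add_node_text='yes'))`.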

Processing Pipeline

PageIndex follows this workflow when processing PDFs:

PDF Parsing

Extract text and token counts from each page of the PDF.

TOC Detection

Check the first N pages (default: 20) for a table of contents:

If a TOC with page numbers is found, use it as the base structure
If a TOC without page numbers is found, match its sections to physical pages
If no TOC is found, extract the structure directly from the content
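The three cases amount to a simple branch. The sketch below only illustrates that decision; the function name and the strategy labels are hypothetical and do not correspond to identifiers in PageIndex.

```python
def choose_strategy(toc_found, toc_has_page_numbers=False):
    """Pick how the base structure is built, following the three cases above."""
    if not toc_found:
        # No TOC: derive sections from the page content itself.
        return "extract_from_content"
    if toc_has_page_numbers:
        # TOC entries already carry page numbers.
        return "use_toc_page_numbers"
    # TOC lists titles only: locate each title on a physical page.
    return "match_toc_to_pages"
```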

Structure Extraction

Use the LLM to identify hierarchical sections and their boundaries:

Extract section titles and hierarchy levels
Map sections to physical page indices
Verify section boundaries are correct

Verification & Correction

Validate the extracted structure:

Check if section titles appear on their assigned pages
Fix any incorrect page assignments
Retry with alternative methods if accuracy is low
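The first check can be approximated without an LLM. The following sketch swaps in plain normalized substring matching, which is an assumption made for illustration and not necessarily how PageIndex compares titles; `check_title_pages` is a hypothetical name.

```python
def _normalize(text):
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return " ".join(text.lower().split())


def check_title_pages(sections, pages):
    """Return (accuracy, misplaced) for sections against page texts.

    sections: dicts with "title" and a 1-indexed "start_index".
    pages: list of page texts, where pages[0] is page 1.
    """
    misplaced = []
    for section in sections:
        index = section["start_index"] - 1
        on_page = 0 <= index < len(pages) and (
            _normalize(section["title"]) in _normalize(pages[index])
        )
        if not on_page:
            misplaced.append(section)
    accuracy = 1 - len(misplaced) / len(sections) if sections else 1.0
    return accuracy, misplaced
```

The sections returned in `misplaced` are the candidates for page reassignment, and a low accuracy value would be the signal to retry with another extraction method.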

Recursive Subdivision

For nodes exceeding size limits:

Recursively extract sub-structure from large nodes
Apply the same verification process to sub-nodes

Enrichment (Optional)

Add optional metadata:

Generate node IDs (e.g., “0001”, “0002”)
Extract full text for each node
Generate AI summaries for each section
Generate overall document description
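Node ID generation can be pictured as a numbering pass over the tree. This sketch assumes zero-padded four-digit IDs assigned parent-first and starting at 1, which is inferred from the “0001”, “0002” example rather than confirmed; `assign_node_ids` is a hypothetical helper.

```python
def assign_node_ids(structure, start=1):
    """Number nodes in pre-order (each parent before its children).

    Modifies the nodes in place and returns the next unused number.
    Assumption: IDs are 4 digits wide and numbering starts at 1.
    """
    next_id = start
    for node in structure:
        node["node_id"] = str(next_id).zfill(4)
        # Children take the numbers immediately after their parent.
        next_id = assign_node_ids(node.get("nodes", []), next_id + 1)
    return next_id
```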

Output Format

The generated JSON structure contains:

{
  "doc_name": "document",
  "doc_description": "High-level description of the document (optional)",
  "structure": [
    {
      "title": "Section Title",
      "node_id": "0001",
      "start_index": 5,
      "end_index": 12,
      "summary": "AI-generated summary (optional)",
      "text": "Full section text (optional)",
      "nodes": [
        {
          "title": "Subsection Title",
          "node_id": "0002",
          "start_index": 5,
          "end_index": 8,
          "summary": "Subsection summary"
        }
      ]
    }
  ]
}
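Since every node may carry its own `nodes` list, the structure is naturally consumed with a recursive walk. The sketch below prints an indented outline from a result shaped like the JSON above; `iter_nodes` and `format_outline` are hypothetical helpers, not PageIndex functions.

```python
def iter_nodes(structure, depth=0):
    """Yield (depth, node) for every node, parents before children."""
    for node in structure:
        yield depth, node
        yield from iter_nodes(node.get("nodes", []), depth + 1)


def format_outline(result):
    """Render the tree as an indented outline with page ranges."""
    lines = []
    for depth, node in iter_nodes(result["structure"]):
        pages = f"pp. {node['start_index']}-{node['end_index']}"
        lines.append(f"{'  ' * depth}{node['title']} ({pages})")
    return "\n".join(lines)
```

Calling `print(format_outline(result))` on the value returned by `page_index()` gives a quick visual check of the extracted hierarchy.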

Field Descriptions

doc_name: Filename without extension
doc_description: Overall document summary (if if_add_doc_description=yes)
structure: Array of top-level sections
title: Section heading
node_id: Unique identifier (if if_add_node_id=yes)
start_index: Starting page number (1-indexed)
end_index: Ending page number (inclusive)
summary: AI-generated section summary (if if_add_node_summary=yes)
prefix_summary: Summary of content before child sections (for parent nodes)
text: Full text content (if if_add_node_text=yes)
nodes: Child subsections (recursive structure)
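Because start_index is 1-indexed and end_index is inclusive, a node covers end_index - start_index + 1 pages, and a page belongs to a node when it falls between the two values. The following sketch uses those semantics to find the chain of sections containing a given page; `page_count` and `path_to_page` are hypothetical helpers.

```python
def page_count(node):
    """Number of pages a node covers (both ends are inclusive)."""
    return node["end_index"] - node["start_index"] + 1


def path_to_page(structure, page):
    """Return section titles from the top level down to the deepest node
    whose page range contains `page`, or [] if no node covers it.

    If adjacent sections share a boundary page, the first match wins.
    """
    for node in structure:
        if node["start_index"] <= page <= node["end_index"]:
            return [node["title"]] + path_to_page(node.get("nodes", []), page)
    return []
```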

Advanced Examples

Generate with Full Text and Summaries

python run_pageindex.py \
  --pdf_path document.pdf \
  --if-add-node-text yes \
  --if-add-node-summary yes \
  --if-add-doc-description yes

Process Large Documents

For large documents, increase the node size limits:

python run_pageindex.py \
  --pdf_path large_document.pdf \
  --max-pages-per-node 20 \
  --max-tokens-per-node 30000 \
  --toc-check-pages 50

Minimal Processing (Structure Only)

python run_pageindex.py \
  --pdf_path document.pdf \
  --if-add-node-id no \
  --if-add-node-summary no

Tips & Best Practices

TOC Detection Range: If your PDF has a long table of contents spanning many pages, increase --toc-check-pages to ensure complete detection.

Node Size Tuning: Adjust --max-pages-per-node and --max-tokens-per-node based on your use case:

Smaller values: More granular structure, better for precise retrieval
Larger values: Faster processing, better for high-level navigation

Summary Cost: Node summaries (--if-add-node-summary yes, the default) significantly increase processing time and API costs because they require LLM calls for each node. Pass --if-add-node-summary no to skip them.

BytesIO Support: You can process PDFs from memory:

from io import BytesIO
from pageindex import page_index

with open('document.pdf', 'rb') as f:
    pdf_bytes = BytesIO(f.read())

result = page_index(pdf_bytes)

Troubleshooting

No TOC Found

If PageIndex doesn’t detect a TOC when one exists:

Increase --toc-check-pages to scan more pages
The TOC might be formatted unusually; PageIndex will fall back to content-based extraction

Incorrect Page Boundaries

If section boundaries are inaccurate:

The verification system will attempt automatic correction
Check the processing logs for accuracy metrics
Consider adjusting node size parameters

Incomplete Structure

If some sections are missing:

Verify the PDF has readable text (not scanned images)
Check if the document length exceeds validation thresholds
Review processing logs for truncation warnings

Next Steps

Learn about Markdown processing
Explore configuration options in detail
Implement tree search strategies for retrieval
