CLI Reference - PageIndex

Overview

The PageIndex CLI processes PDF and Markdown documents from the command line. The main entry point is run_pageindex.py. Source: run_pageindex.py

Basic Usage

Process PDF Document

python3 run_pageindex.py --pdf_path /path/to/document.pdf

Process Markdown Document

python3 run_pageindex.py --md_path /path/to/document.md

You must specify either --pdf_path or --md_path, but not both.
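
For scripts that wrap the CLI, the either/or rule above can be applied up front by choosing the flag from the file extension. The sketch below is a hypothetical helper (build_command is not part of PageIndex). It only assembles the argument list, and it assumes run_pageindex.py sits in the current directory and that extensions are matched case-insensitively.

```python
from pathlib import Path


def build_command(doc_path: str) -> list[str]:
    """Return an argv for run_pageindex.py with exactly one input flag."""
    suffix = Path(doc_path).suffix.lower()
    if suffix == ".pdf":
        flag = "--pdf_path"
    elif suffix in {".md", ".markdown"}:
        flag = "--md_path"
    else:
        # Hypothetical error message; the CLI's own wording may differ.
        raise ValueError(f"Unsupported file type: {doc_path}")
    return ["python3", "run_pageindex.py", flag, doc_path]
```

The returned list can be passed to subprocess.run(..., check=True) so that a failing CLI run raises an exception in the wrapper.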

Global Options

These options apply to both PDF and Markdown processing:

--model

str

default:"gpt-4o-2024-11-20"

OpenAI model to use for processing.

Example:

python3 run_pageindex.py --pdf_path doc.pdf --model gpt-4o

--if-add-node-id

str

default:"yes"

Whether to add sequential node IDs. Values: yes or no.

Example:

python3 run_pageindex.py --pdf_path doc.pdf --if-add-node-id no

--if-add-node-summary

str

default:"yes"

Whether to generate AI summaries for each node. Values: yes or no.

Example:

python3 run_pageindex.py --pdf_path doc.pdf --if-add-node-summary yes

--if-add-doc-description

str

default:"no"

Whether to generate a document description. Values: yes or no. Only takes effect when --if-add-node-summary is yes.

Example:

python3 run_pageindex.py --pdf_path doc.pdf --if-add-doc-description yes
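
Because the document description depends on node summaries, a wrapper script may want to flag the ineffective combination before starting a run. This is a minimal sketch. The function name and the warning text are hypothetical, and the defaults mirror the ones listed in this section.

```python
def check_summary_options(add_node_summary: str = "yes",
                          add_doc_description: str = "no") -> list[str]:
    """Return warnings for option combinations that would have no effect."""
    warnings = []
    if add_doc_description == "yes" and add_node_summary != "yes":
        warnings.append(
            "--if-add-doc-description yes has no effect unless "
            "--if-add-node-summary is yes"
        )
    return warnings
```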

--if-add-node-text

str

default:"no"

Whether to include full text in nodes. Values: yes or no.

Example:

python3 run_pageindex.py --pdf_path doc.pdf --if-add-node-text yes

PDF-Specific Options

These options only apply when using --pdf_path:

--toc-check-pages

int

default:"20"

Number of pages to check for a table of contents.

Example:

python3 run_pageindex.py --pdf_path doc.pdf --toc-check-pages 30

--max-pages-per-node

int

default:"10"

Maximum number of pages per node.

Example:

python3 run_pageindex.py --pdf_path doc.pdf --max-pages-per-node 15

--max-tokens-per-node

int

default:"20000"

Maximum number of tokens per node.

Example:

python3 run_pageindex.py --pdf_path doc.pdf --max-tokens-per-node 25000

Markdown-Specific Options

These options only apply when using --md_path:

--if-thinning

str

default:"no"

Whether to apply tree thinning, which merges small nodes (below the threshold) into their parents. Values: yes or no.

Example:

python3 run_pageindex.py --md_path doc.md --if-thinning yes

--thinning-threshold

int

default:"5000"

Minimum token threshold for nodes when thinning is enabled.

Example:

python3 run_pageindex.py --md_path doc.md --if-thinning yes --thinning-threshold 3000
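
To make the effect of thinning concrete, here is a toy illustration of the idea on a small tree. It is not PageIndex's implementation: the node layout (title, tokens, children) and the exact merge rule are assumptions chosen for clarity. In this sketch a child subtree whose total token count is below the threshold is folded into its parent.

```python
def subtree_tokens(node: dict) -> int:
    """Total tokens in a node and all of its descendants."""
    return node["tokens"] + sum(subtree_tokens(c) for c in node["children"])


def thin(node: dict, threshold: int) -> dict:
    """Return a copy of the tree with small child subtrees merged upward."""
    tokens = node["tokens"]
    kept = []
    for child in node["children"]:
        child = thin(child, threshold)  # thin the deepest levels first
        if subtree_tokens(child) < threshold:
            tokens += subtree_tokens(child)  # parent absorbs the small child
        else:
            kept.append(child)
    return {"title": node["title"], "tokens": tokens, "children": kept}


doc = {"title": "Guide", "tokens": 100, "children": [
    {"title": "Install", "tokens": 400, "children": []},
    {"title": "Usage", "tokens": 6000, "children": [
        {"title": "Flags", "tokens": 1200, "children": []},
    ]},
]}
thinned = thin(doc, threshold=3000)
```

With a threshold of 3000, "Install" and "Flags" are absorbed by their parents while "Usage" survives as a node of its own. The total token count of the tree is unchanged.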

--summary-token-threshold

int

default:"200"

Token threshold for generating summaries vs. using full text.

Example:

python3 run_pageindex.py --md_path doc.md --summary-token-threshold 300
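
The decision this threshold controls can be sketched as follows. The sketch assumes that nodes shorter than the threshold keep their full text in place of a summary, and that longer nodes are summarized. count_tokens is a crude whitespace stand-in for a real tokenizer, and summarize is a caller-supplied placeholder for the model call. None of these names come from PageIndex.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Stand-in tokenizer: counts whitespace-separated words, not model tokens.
    return len(text.split())


def node_summary(text: str, threshold: int,
                 summarize: Callable[[str], str]) -> str:
    """Use the text itself for short nodes; call `summarize` for longer ones."""
    if count_tokens(text) < threshold:
        return text
    return summarize(text)
```

Raising the threshold therefore means fewer model calls, at the cost of longer "summaries" for mid-sized nodes.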

Complete Examples

PDF: Fast Processing (No Summaries)

python3 run_pageindex.py \
  --pdf_path document.pdf \
  --if-add-node-summary no \
  --if-add-doc-description no

PDF: Full-Featured Processing

python3 run_pageindex.py \
  --pdf_path financial_report.pdf \
  --model gpt-4o-2024-11-20 \
  --toc-check-pages 30 \
  --max-pages-per-node 15 \
  --max-tokens-per-node 25000 \
  --if-add-node-id yes \
  --if-add-node-summary yes \
  --if-add-doc-description yes \
  --if-add-node-text yes

PDF: Minimal (Structure Only)

python3 run_pageindex.py \
  --pdf_path document.pdf \
  --if-add-node-id no \
  --if-add-node-summary no

Markdown: Basic Processing

python3 run_pageindex.py \
  --md_path documentation.md \
  --model gpt-4o

Markdown: With Thinning and Summaries

python3 run_pageindex.py \
  --md_path large_doc.md \
  --if-thinning yes \
  --thinning-threshold 5000 \
  --if-add-node-summary yes \
  --summary-token-threshold 200 \
  --if-add-doc-description yes

Markdown: Full-Featured

python3 run_pageindex.py \
  --md_path manual.md \
  --model gpt-4o-2024-11-20 \
  --if-thinning yes \
  --thinning-threshold 3000 \
  --if-add-node-summary yes \
  --summary-token-threshold 200 \
  --if-add-doc-description yes \
  --if-add-node-text yes \
  --if-add-node-id yes

Output

The CLI saves results to the ./results/ directory:

PDF: ./results/{pdf_name}_structure.json
Markdown: ./results/{md_name}_structure.json

Example output:

$ python3 run_pageindex.py --pdf_path report.pdf
Parsing PDF...
Processing...
Parsing done, saving to file...
Tree structure saved to: ./results/report_structure.json
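
A downstream script can locate and load the saved file using the naming pattern above. This is a small sketch with hypothetical helper names. It assumes the name placeholder is the input file name without its extension, as in the report.pdf example, and it makes no assumption about the fields inside the JSON.

```python
import json
from pathlib import Path


def structure_path(input_path: str, results_dir: str = "results") -> Path:
    """Expected output location: <results_dir>/<input stem>_structure.json."""
    return Path(results_dir) / f"{Path(input_path).stem}_structure.json"


def load_structure(input_path: str, results_dir: str = "results"):
    """Load the saved tree; raises FileNotFoundError if no output exists yet."""
    with structure_path(input_path, results_dir).open(encoding="utf-8") as f:
        return json.load(f)
```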

File Format Validation

The CLI validates input files:

PDF Files

Must have .pdf extension
File must exist at specified path
Must be a valid PDF

Markdown Files

Must have .md or .markdown extension
File must exist at specified path

Error Examples:

# Invalid: wrong extension
$ python3 run_pageindex.py --pdf_path document.txt
ValueError: PDF file must have .pdf extension

# Invalid: file not found
$ python3 run_pageindex.py --pdf_path missing.pdf
ValueError: PDF file not found: missing.pdf

# Invalid: both file types specified
$ python3 run_pageindex.py --pdf_path doc.pdf --md_path doc.md
ValueError: Only one of --pdf_path or --md_path can be specified
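
The same checks can be run ahead of time, for example before queuing a large batch. The function below is a hypothetical pre-flight check, not the CLI's own code. Its error messages are illustrative, and the "valid PDF" test is only a cheap look at the %PDF- header, which catches obviously wrong files but does not prove that a PDF can be parsed.

```python
from pathlib import Path

ALLOWED = {"pdf": {".pdf"}, "md": {".md", ".markdown"}}


def validate_input(path: str, kind: str) -> None:
    """Raise ValueError if `path` fails the listed checks. kind: 'pdf' or 'md'."""
    p = Path(path)
    if p.suffix.lower() not in ALLOWED[kind]:
        expected = " or ".join(sorted(ALLOWED[kind]))
        raise ValueError(f"{kind} file must have {expected} extension")
    if not p.is_file():
        raise ValueError(f"{kind} file not found: {path}")
    if kind == "pdf":
        with p.open("rb") as f:
            # PDF files start with the bytes "%PDF-" followed by a version.
            if f.read(5) != b"%PDF-":
                raise ValueError(f"not a valid PDF: {path}")
```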

Environment Variables

The CLI requires an OpenAI API key, supplied through the CHATGPT_API_KEY variable:

Using .env File

Create a .env file in the project root:

CHATGPT_API_KEY=your_openai_key_here

Using Shell Export

export CHATGPT_API_KEY=your_openai_key_here
python3 run_pageindex.py --pdf_path document.pdf
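
Before a long run it can help to confirm that the key is visible from either source. The helper below is a sketch with a hypothetical name. It checks the environment first and then falls back to a plain KEY=VALUE scan of the .env file. That precedence is an assumption about how the two sources combine, and the scan is far simpler than a real .env parser.

```python
import os
from pathlib import Path


def find_api_key(env=None, dotenv_path: str = ".env"):
    """Return CHATGPT_API_KEY from the environment or a .env file, else None."""
    env = os.environ if env is None else env
    if env.get("CHATGPT_API_KEY"):
        return env["CHATGPT_API_KEY"]
    dotenv = Path(dotenv_path)
    if dotenv.is_file():
        for line in dotenv.read_text(encoding="utf-8").splitlines():
            key, sep, value = line.strip().partition("=")
            if sep and key.strip() == "CHATGPT_API_KEY":
                # Drop surrounding whitespace and optional quotes.
                return value.strip().strip("\"'") or None
    return None
```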

Performance Tips

Fast Processing (Seconds)

python3 run_pageindex.py \
  --pdf_path doc.pdf \
  --if-add-node-summary no

Balanced (1-3 Minutes)

python3 run_pageindex.py \
  --pdf_path doc.pdf \
  --if-add-node-summary yes \
  --if-add-node-text no

Full-Featured (3-10 Minutes)

python3 run_pageindex.py \
  --pdf_path doc.pdf \
  --if-add-node-summary yes \
  --if-add-doc-description yes \
  --if-add-node-text yes

Common Use Cases

Research Papers

python3 run_pageindex.py \
  --pdf_path paper.pdf \
  --toc-check-pages 15 \
  --max-pages-per-node 8 \
  --if-add-node-summary yes

Financial Reports

python3 run_pageindex.py \
  --pdf_path 10k_filing.pdf \
  --toc-check-pages 30 \
  --max-pages-per-node 20 \
  --max-tokens-per-node 30000 \
  --if-add-node-summary yes

Technical Manuals

python3 run_pageindex.py \
  --pdf_path manual.pdf \
  --toc-check-pages 25 \
  --if-add-node-text yes \
  --if-add-node-summary yes

Documentation Sites (Markdown)

python3 run_pageindex.py \
  --md_path README.md \
  --if-thinning yes \
  --thinning-threshold 4000 \
  --if-add-node-summary yes

Batch Processing

Process multiple files with a shell script:

#!/bin/bash
# process_all.sh

for file in pdfs/*.pdf; do
  echo "Processing $file..."
  python3 run_pageindex.py --pdf_path "$file" --if-add-node-summary yes
done
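
For larger batches, a Python driver can skip files that already have results and keep going when one document fails. This is a sketch under assumptions: the function names are hypothetical, the pdfs/ and results/ locations follow the examples in this document, and a non-zero exit code is taken to mean the run failed.

```python
import subprocess
from pathlib import Path


def pending_pdfs(pdf_dir: str, results_dir: str = "results") -> list[Path]:
    """PDFs in pdf_dir that have no <stem>_structure.json in results_dir yet."""
    results = Path(results_dir)
    return [
        pdf for pdf in sorted(Path(pdf_dir).glob("*.pdf"))
        if not (results / f"{pdf.stem}_structure.json").exists()
    ]


def process_all(pdf_dir: str = "pdfs") -> list[Path]:
    """Run the CLI on each pending PDF and return the files that failed."""
    failed = []
    for pdf in pending_pdfs(pdf_dir):
        print(f"Processing {pdf}...")
        result = subprocess.run(
            ["python3", "run_pageindex.py", "--pdf_path", str(pdf),
             "--if-add-node-summary", "yes"]
        )
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Because finished files are skipped, the driver can be re-run after an interruption without repeating completed work.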

Troubleshooting

"API key not found"

# Set API key in .env file (note: > overwrites any existing .env)
echo "CHATGPT_API_KEY=sk-..." > .env

"Module not found"

# Install dependencies
pip3 install --upgrade -r requirements.txt

"Out of memory"

# Reduce token limits
python3 run_pageindex.py \
  --pdf_path large.pdf \
  --max-tokens-per-node 15000 \
  --if-add-node-text no

"Processing too slow"

# Disable summaries
python3 run_pageindex.py \
  --pdf_path doc.pdf \
  --if-add-node-summary no
