
Overview

The PageIndex CLI provides a command-line interface for processing PDF and Markdown documents. The main entry point is run_pageindex.py. Source: run_pageindex.py

Basic Usage

Process PDF Document

python3 run_pageindex.py --pdf_path /path/to/document.pdf

Process Markdown Document

python3 run_pageindex.py --md_path /path/to/document.md

You must specify exactly one of --pdf_path or --md_path; passing both raises an error.

Global Options

These options apply to both PDF and Markdown processing:
--model
str
default:"gpt-4o-2024-11-20"
OpenAI model to use for processing.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --model gpt-4o
--if-add-node-id
str
default:"yes"
Whether to add sequential node IDs. Values: yes or no.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --if-add-node-id no
--if-add-node-summary
str
default:"yes"
Whether to generate AI summaries for each node. Values: yes or no.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --if-add-node-summary yes
--if-add-doc-description
str
default:"no"
Whether to generate a document description. Values: yes or no. Only takes effect when --if-add-node-summary is yes.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --if-add-doc-description yes
--if-add-node-text
str
default:"no"
Whether to include full text in nodes. Values: yes or no.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --if-add-node-text yes
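
When the CLI is driven from another script, it can help to assemble the argument list programmatically so that yes/no values are checked before the process starts. The following is a minimal sketch, not part of PageIndex: `build_command` and `YES_NO_FLAGS` are hypothetical names, and the only facts it relies on are the flag names, the yes/no values, and the rule that a document description needs node summaries, all taken from the option list above.

```python
YES_NO_FLAGS = (
    "--if-add-node-id",
    "--if-add-node-summary",
    "--if-add-doc-description",
    "--if-add-node-text",
)


def build_command(input_flag, input_path, model=None, **yes_no):
    """Return an argv list for run_pageindex.py (hypothetical helper)."""
    if input_flag not in ("--pdf_path", "--md_path"):
        raise ValueError(f"unknown input flag: {input_flag}")

    # The documented default for --if-add-node-summary is "yes".
    summaries = yes_no.get("if_add_node_summary", "yes")
    if yes_no.get("if_add_doc_description") == "yes" and summaries == "no":
        raise ValueError(
            "--if-add-doc-description yes needs --if-add-node-summary yes"
        )

    cmd = ["python3", "run_pageindex.py", input_flag, str(input_path)]
    if model is not None:
        cmd += ["--model", model]

    for name, value in yes_no.items():
        # Keyword if_add_node_text becomes the flag --if-add-node-text.
        flag = "--" + name.replace("_", "-")
        if flag not in YES_NO_FLAGS:
            raise ValueError(f"unknown yes/no flag: {flag}")
        if value not in ("yes", "no"):
            raise ValueError(f"{flag} expects yes or no, got {value!r}")
        cmd += [flag, value]
    return cmd
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with paths that contain spaces.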

PDF-Specific Options

These options only apply when using --pdf_path:
--toc-check-pages
int
default:"20"
Number of pages to check for a table of contents.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --toc-check-pages 30
--max-pages-per-node
int
default:"10"
Maximum number of pages per node.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --max-pages-per-node 15
--max-tokens-per-node
int
default:"20000"
Maximum number of tokens per node.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --max-tokens-per-node 25000

Markdown-Specific Options

These options only apply when using --md_path:
--if-thinning
str
default:"no"
Whether to apply tree thinning, which merges small nodes (below the threshold) into their parents. Values: yes or no.
Example:
python3 run_pageindex.py --md_path doc.md --if-thinning yes
--thinning-threshold
int
default:"5000"
Minimum token count per node when thinning is enabled; nodes below it are merged into their parents.
Example:
python3 run_pageindex.py --md_path doc.md --if-thinning yes --thinning-threshold 3000
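
To make the effect of thinning concrete, here is an illustrative sketch of merging small nodes into their parents. It is based only on the one-line description above and may differ from what PageIndex does internally: the `Node` class, `subtree_tokens`, and `thin` are hypothetical, and token counts are supplied by hand rather than computed by a tokenizer.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    tokens: int                      # token count of this node's own text
    children: list = field(default_factory=list)


def subtree_tokens(node):
    """Tokens in this node plus all of its descendants."""
    return node.tokens + sum(subtree_tokens(c) for c in node.children)


def thin(node, threshold):
    """Fold every child subtree smaller than `threshold` into its parent."""
    kept = []
    for child in node.children:
        thin(child, threshold)       # thin bottom-up first
        if subtree_tokens(child) < threshold:
            # Assumption: a merged child contributes its text to the parent.
            node.tokens += subtree_tokens(child)
        else:
            kept.append(child)
    node.children = kept
    return node
```

With a threshold of 5000, a 3000-token section would be absorbed into its parent while a 7000-token section would stay as its own node. Lower thresholds keep more of the original heading structure.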
--summary-token-threshold
int
default:"200"
Token threshold that decides whether a node gets a generated summary or keeps its full text.
Example:
python3 run_pageindex.py --md_path doc.md --summary-token-threshold 300
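
One plausible reading of this option is sketched below: nodes shorter than the threshold keep their own text, and longer nodes are summarized. The direction of the comparison is an assumption, as are the names `count_tokens` and `node_summary`; the word-splitting token counter is a stand-in for a real tokenizer, and `summarize` stands for whatever model call produces the summary.

```python
def count_tokens(text):
    # Hypothetical stand-in: whitespace-separated words, not a real tokenizer.
    return len(text.split())


def node_summary(text, threshold, summarize):
    """Return the text itself for short nodes, otherwise a generated summary."""
    if count_tokens(text) < threshold:
        return text              # short node: a summary would save nothing
    return summarize(text)       # long node: delegate to the summarizer
```

Under this reading, raising the threshold reduces the number of model calls, at the cost of longer entries for mid-sized nodes.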

Complete Examples

PDF: Fast Processing (No Summaries)

python3 run_pageindex.py \
  --pdf_path document.pdf \
  --if-add-node-summary no \
  --if-add-doc-description no

PDF: Full Processing (All Options)

python3 run_pageindex.py \
  --pdf_path financial_report.pdf \
  --model gpt-4o-2024-11-20 \
  --toc-check-pages 30 \
  --max-pages-per-node 15 \
  --max-tokens-per-node 25000 \
  --if-add-node-id yes \
  --if-add-node-summary yes \
  --if-add-doc-description yes \
  --if-add-node-text yes

PDF: Minimal (Structure Only)

python3 run_pageindex.py \
  --pdf_path document.pdf \
  --if-add-node-id no \
  --if-add-node-summary no

Markdown: Basic Processing

python3 run_pageindex.py \
  --md_path documentation.md \
  --model gpt-4o

Markdown: With Thinning and Summaries

python3 run_pageindex.py \
  --md_path large_doc.md \
  --if-thinning yes \
  --thinning-threshold 5000 \
  --if-add-node-summary yes \
  --summary-token-threshold 200 \
  --if-add-doc-description yes

Markdown: Full Processing (All Options)

python3 run_pageindex.py \
  --md_path manual.md \
  --model gpt-4o-2024-11-20 \
  --if-thinning yes \
  --thinning-threshold 3000 \
  --if-add-node-summary yes \
  --summary-token-threshold 200 \
  --if-add-doc-description yes \
  --if-add-node-text yes \
  --if-add-node-id yes

Output

The CLI saves results to the ./results/ directory:
  • PDF: ./results/{pdf_name}_structure.json
  • Markdown: ./results/{md_name}_structure.json
Example output:
$ python3 run_pageindex.py --pdf_path report.pdf
Parsing PDF...
Processing...
Parsing done, saving to file...
Tree structure saved to: ./results/report_structure.json
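
Because the output path follows directly from the input file name, a downstream script can locate and load the result without parsing the CLI's log output. The sketch below assumes only the naming pattern shown above; `structure_path` and `load_structure` are hypothetical helper names, and nothing is assumed about the keys inside the JSON file.

```python
import json
from pathlib import Path


def structure_path(input_path, results_dir="./results"):
    """Map report.pdf or manual.md to results/<name>_structure.json."""
    return Path(results_dir) / f"{Path(input_path).stem}_structure.json"


def load_structure(input_path, results_dir="./results"):
    """Load the saved structure for an input that was already processed."""
    with open(structure_path(input_path, results_dir), encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `load_structure("report.pdf")` reads `./results/report_structure.json` once the CLI run above has finished.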

File Format Validation

The CLI validates input files:

PDF Files

  • Must have .pdf extension
  • File must exist at specified path
  • Must be a valid PDF

Markdown Files

  • Must have .md or .markdown extension
  • File must exist at specified path

Error Examples:
# Invalid: wrong extension
$ python3 run_pageindex.py --pdf_path document.txt
ValueError: PDF file must have .pdf extension

# Invalid: file not found
$ python3 run_pageindex.py --pdf_path missing.pdf
ValueError: PDF file not found: missing.pdf

# Invalid: both file types specified
$ python3 run_pageindex.py --pdf_path doc.pdf --md_path doc.md
ValueError: Only one of --pdf_path or --md_path can be specified
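
The same checks can be run up front in a wrapper script, for instance before queuing many files. This is a sketch of the documented rules rather than the CLI's actual code: `validate_inputs` is a hypothetical name, the three PDF-related messages are copied from the examples above, and the remaining messages are assumptions. It does not check whether the file is a valid PDF.

```python
import os


def validate_inputs(pdf_path=None, md_path=None):
    """Return "pdf" or "md" for a valid argument pair, else raise ValueError."""
    if pdf_path and md_path:
        raise ValueError("Only one of --pdf_path or --md_path can be specified")
    if not pdf_path and not md_path:
        # Message text is an assumption; the docs do not show this case.
        raise ValueError("Either --pdf_path or --md_path must be specified")

    if pdf_path:
        if not pdf_path.lower().endswith(".pdf"):
            raise ValueError("PDF file must have .pdf extension")
        if not os.path.isfile(pdf_path):
            raise ValueError(f"PDF file not found: {pdf_path}")
        return "pdf"

    # Markdown messages below are assumptions modeled on the PDF ones.
    if not md_path.lower().endswith((".md", ".markdown")):
        raise ValueError("Markdown file must have .md or .markdown extension")
    if not os.path.isfile(md_path):
        raise ValueError(f"Markdown file not found: {md_path}")
    return "md"
```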

Environment Variables

The CLI requires an OpenAI API key, read from the CHATGPT_API_KEY environment variable:

Using .env File

Create a .env file in the project root:
CHATGPT_API_KEY=your_openai_key_here

Using Shell Export

export CHATGPT_API_KEY=your_openai_key_here
python3 run_pageindex.py --pdf_path document.pdf
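
A wrapper script can fail early with a clear message when the key is missing, instead of waiting for the first API call to fail. The sketch below checks only the process environment; it does not read a .env file, so a key stored there is visible only after something has loaded it. `get_api_key` is a hypothetical helper, and the error text is an assumption.

```python
import os


def get_api_key(env=None):
    """Return the configured key or raise if CHATGPT_API_KEY is unset or blank."""
    env = os.environ if env is None else env
    key = env.get("CHATGPT_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "CHATGPT_API_KEY is not set; export it or add it to a .env file"
        )
    return key
```

Passing a plain dictionary as `env` makes the check easy to exercise in tests without touching the real environment.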

Performance Tips

Fast Processing (Seconds)

python3 run_pageindex.py \
  --pdf_path doc.pdf \
  --if-add-node-summary no

Balanced (1-3 Minutes)

python3 run_pageindex.py \
  --pdf_path doc.pdf \
  --if-add-node-summary yes \
  --if-add-node-text no

Full Processing (Slowest)

python3 run_pageindex.py \
  --pdf_path doc.pdf \
  --if-add-node-summary yes \
  --if-add-doc-description yes \
  --if-add-node-text yes

Common Use Cases

Research Papers

python3 run_pageindex.py \
  --pdf_path paper.pdf \
  --toc-check-pages 15 \
  --max-pages-per-node 8 \
  --if-add-node-summary yes

Financial Reports

python3 run_pageindex.py \
  --pdf_path 10k_filing.pdf \
  --toc-check-pages 30 \
  --max-pages-per-node 20 \
  --max-tokens-per-node 30000 \
  --if-add-node-summary yes

Technical Manuals

python3 run_pageindex.py \
  --pdf_path manual.pdf \
  --toc-check-pages 25 \
  --if-add-node-text yes \
  --if-add-node-summary yes

Documentation Sites (Markdown)

python3 run_pageindex.py \
  --md_path README.md \
  --if-thinning yes \
  --thinning-threshold 4000 \
  --if-add-node-summary yes

Batch Processing

Process multiple files with a shell script:
#!/bin/bash
# process_all.sh

for file in pdfs/*.pdf; do
  echo "Processing $file..."
  python3 run_pageindex.py --pdf_path "$file" --if-add-node-summary yes
done
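
The same loop can be written in Python when a batch mixes PDF and Markdown inputs or when failures need to be collected rather than ignored. This is a sketch under stated assumptions: `plan_batch` and `run_batch` are hypothetical names, a non-zero exit code is treated as failure, and the extension rules come from the File Format Validation section.

```python
import subprocess


def plan_batch(paths):
    """Build one run_pageindex.py command per supported file, in sorted order."""
    commands = []
    for path in sorted(str(p) for p in paths):
        lower = path.lower()
        if lower.endswith(".pdf"):
            flag = "--pdf_path"
        elif lower.endswith((".md", ".markdown")):
            flag = "--md_path"
        else:
            continue                 # skip unsupported file types
        commands.append(
            ["python3", "run_pageindex.py", flag, path,
             "--if-add-node-summary", "yes"]
        )
    return commands


def run_batch(commands, runner=subprocess.run):
    """Run each command in turn and return the ones that exited non-zero."""
    failed = []
    for cmd in commands:
        print("Processing", cmd[3], "...")
        if runner(cmd).returncode != 0:
            failed.append(cmd)
    return failed
```

A typical call would be `run_batch(plan_batch(Path("pdfs").iterdir()))` after `from pathlib import Path`; the `runner` parameter exists so the loop can be tested without launching real processes.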

Troubleshooting

"API key not found"

# Set API key in .env file (this overwrites any existing .env)
echo "CHATGPT_API_KEY=sk-..." > .env

"Module not found"

# Install dependencies
pip3 install --upgrade -r requirements.txt

"Out of memory"

# Reduce token limits
python3 run_pageindex.py \
  --pdf_path large.pdf \
  --max-tokens-per-node 15000 \
  --if-add-node-text no

"Processing too slow"

# Disable summaries
python3 run_pageindex.py \
  --pdf_path doc.pdf \
  --if-add-node-summary no
