Overview
The PageIndex CLI processes PDF and Markdown documents from the command line. The main entry point is run_pageindex.py.
Source: run_pageindex.py
Basic Usage
Process PDF Document
python3 run_pageindex.py --pdf_path /path/to/document.pdf
Process Markdown Document
python3 run_pageindex.py --md_path /path/to/document.md
You must specify either --pdf_path or --md_path, but not both.
Global Options
These options apply to both PDF and Markdown processing:
--model
Type: str. Default: "gpt-4o-2024-11-20"
OpenAI model to use for processing.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --model gpt-4o
--if-add-node-id
Whether to add sequential node IDs. Values: yes or no.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --if-add-node-id no
--if-add-node-summary
Whether to generate AI summaries for each node. Values: yes or no.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --if-add-node-summary yes
--if-add-doc-description
Whether to generate a document description. Values: yes or no. Only takes effect when --if-add-node-summary is yes.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --if-add-doc-description yes
--if-add-node-text
Whether to include full text in nodes. Values: yes or no.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --if-add-node-text yes
PDF-Specific Options
These options only apply when using --pdf_path:
--toc-check-pages
Number of pages to check for a table of contents.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --toc-check-pages 30
--max-pages-per-node
Maximum number of pages per node.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --max-pages-per-node 15
--max-tokens-per-node
Maximum number of tokens per node.
Example:
python3 run_pageindex.py --pdf_path doc.pdf --max-tokens-per-node 25000
Markdown-Specific Options
These options only apply when using --md_path:
--if-thinning
Whether to apply tree thinning. Values: yes or no. Thinning merges small nodes (those below the threshold) into their parents.
Example:
python3 run_pageindex.py --md_path doc.md --if-thinning yes
--thinning-threshold
Minimum token threshold for nodes when thinning is enabled.
Example:
python3 run_pageindex.py --md_path doc.md --if-thinning yes --thinning-threshold 3000
--summary-token-threshold
Token threshold for generating summaries vs. using full text.
Example:
python3 run_pageindex.py --md_path doc.md --summary-token-threshold 300
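To make --if-thinning and --thinning-threshold more concrete, here is a minimal sketch of one way such merging could work. It is not PageIndex's actual implementation: the Node class, its fields, and the thin function are hypothetical names introduced for illustration, and token counts are supplied by hand rather than computed by a tokenizer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node; not PageIndex's real data structure."""
    title: str
    text: str
    tokens: int  # token count of this node's own text (assumed precomputed)
    children: List["Node"] = field(default_factory=list)


def thin(node: Node, threshold: int) -> Node:
    """Merge leaf children whose token count is below `threshold` into `node`."""
    kept = []
    for child in node.children:
        thin(child, threshold)  # thin bottom-up so deeper levels are handled first
        if child.tokens < threshold and not child.children:
            # Fold the small leaf's text and token count into its parent.
            node.text += "\n" + child.text
            node.tokens += child.tokens
        else:
            kept.append(child)
    node.children = kept
    return node


root = Node("Guide", "intro", 100, [
    Node("Tiny section", "short", 50),
    Node("Big section", "long body", 5000),
])
thin(root, threshold=3000)
print([child.title for child in root.children])  # ['Big section']
print(root.tokens)  # 150
```

In this sketch a higher threshold produces a flatter tree with fewer, larger nodes, which is the trade-off to weigh when picking a value for --thinning-threshold.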
Complete Examples
PDF: Fast Processing (No Summaries)
python3 run_pageindex.py \
  --pdf_path document.pdf \
  --if-add-node-summary no \
  --if-add-doc-description no
PDF: Full-Featured Processing
python3 run_pageindex.py \
  --pdf_path financial_report.pdf \
  --model gpt-4o-2024-11-20 \
  --toc-check-pages 30 \
  --max-pages-per-node 15 \
  --max-tokens-per-node 25000 \
  --if-add-node-id yes \
  --if-add-node-summary yes \
  --if-add-doc-description yes \
  --if-add-node-text yes
PDF: Minimal (Structure Only)
python3 run_pageindex.py \
  --pdf_path document.pdf \
  --if-add-node-id no \
  --if-add-node-summary no
Markdown: Basic Processing
python3 run_pageindex.py \
  --md_path documentation.md \
  --model gpt-4o
Markdown: With Thinning and Summaries
python3 run_pageindex.py \
  --md_path large_doc.md \
  --if-thinning yes \
  --thinning-threshold 5000 \
  --if-add-node-summary yes \
  --summary-token-threshold 200 \
  --if-add-doc-description yes
Markdown: Full-Featured
python3 run_pageindex.py \
  --md_path manual.md \
  --model gpt-4o-2024-11-20 \
  --if-thinning yes \
  --thinning-threshold 3000 \
  --if-add-node-summary yes \
  --summary-token-threshold 200 \
  --if-add-doc-description yes \
  --if-add-node-text yes \
  --if-add-node-id yes
Output
The CLI saves results to the ./results/ directory:
- PDF: ./results/{pdf_name}_structure.json
- Markdown: ./results/{md_name}_structure.json
Example output:
$ python3 run_pageindex.py --pdf_path report.pdf
Parsing PDF...
Processing...
Parsing done, saving to file...
Tree structure saved to: ./results/report_structure.json
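The naming convention above can be expressed as a small helper, which is handy when a script needs to locate the result file afterwards. This is a sketch under one assumption: that {pdf_name} and {md_name} are the input file name without its extension, which matches the report.pdf example. The function name structure_output_path is hypothetical and not part of PageIndex.

```python
from pathlib import Path


def structure_output_path(input_path: str, results_dir: str = "./results") -> Path:
    """Return the expected results path for an input file (hypothetical helper)."""
    stem = Path(input_path).stem  # "report.pdf" -> "report"
    return Path(results_dir) / f"{stem}_structure.json"


print(structure_output_path("/path/to/report.pdf").name)  # report_structure.json
```

If the assumption holds, doc.pdf and doc.md would map to the same output file, so it may be worth giving inputs distinct base names.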
Input Validation
The CLI validates input files:
PDF Files
- Must have a .pdf extension
- File must exist at specified path
- Must be a valid PDF
Markdown Files
- Must have a .md or .markdown extension
- File must exist at specified path
Error Examples:
# Invalid: wrong extension
$ python3 run_pageindex.py --pdf_path document.txt
ValueError: PDF file must have .pdf extension
# Invalid: file not found
$ python3 run_pageindex.py --pdf_path missing.pdf
ValueError: PDF file not found: missing.pdf
# Invalid: both file types specified
$ python3 run_pageindex.py --pdf_path doc.pdf --md_path doc.md
ValueError: Only one of --pdf_path or --md_path can be specified
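The rules and error messages above can be sketched as a single check, which may be useful for pre-validating paths before invoking the CLI. This is an illustration, not the code in run_pageindex.py: the function name validate_inputs is hypothetical, the messages for the Markdown and "neither given" cases are assumed wording, and the "must be a valid PDF" check is omitted.

```python
import os
from typing import Optional


def validate_inputs(pdf_path: Optional[str], md_path: Optional[str]) -> str:
    """Return "pdf" or "md" after applying the documented rules (illustrative only)."""
    if pdf_path and md_path:
        raise ValueError("Only one of --pdf_path or --md_path can be specified")
    if not pdf_path and not md_path:
        # Wording assumed; the documentation does not show this message.
        raise ValueError("Either --pdf_path or --md_path must be specified")
    if pdf_path:
        if not pdf_path.lower().endswith(".pdf"):
            raise ValueError("PDF file must have .pdf extension")
        if not os.path.isfile(pdf_path):
            raise ValueError(f"PDF file not found: {pdf_path}")
        return "pdf"
    if not md_path.lower().endswith((".md", ".markdown")):
        # Wording assumed; only the PDF messages are documented.
        raise ValueError("Markdown file must have .md or .markdown extension")
    if not os.path.isfile(md_path):
        raise ValueError(f"Markdown file not found: {md_path}")
    return "md"


try:
    validate_inputs("doc.pdf", "doc.md")
except ValueError as err:
    print(f"ValueError: {err}")
```

The extension is checked before existence here because the documented example reports the extension error for document.txt without mentioning whether the file exists.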
Environment Variables
The CLI requires an OpenAI API key, supplied through the CHATGPT_API_KEY variable:
Using .env File
Create a .env file in the project root:
CHATGPT_API_KEY=your_openai_key_here
Using Shell Export
export CHATGPT_API_KEY=your_openai_key_here
python3 run_pageindex.py --pdf_path document.pdf
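When wrapping the CLI in a script, it can help to fail early if the key is missing. The sketch below only inspects the process environment and does not parse a .env file; the helper name get_api_key and its error message are hypothetical and not part of PageIndex.

```python
import os
from typing import Mapping, Optional


def get_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return CHATGPT_API_KEY or raise a clear error (hypothetical helper)."""
    env = os.environ if env is None else env
    key = env.get("CHATGPT_API_KEY", "").strip()
    if not key:
        raise RuntimeError("CHATGPT_API_KEY is not set; export it or add it to .env")
    return key


# Demonstration with an explicit mapping instead of the real environment.
print(get_api_key({"CHATGPT_API_KEY": "your_openai_key_here"}))
```

Passing a mapping makes the check easy to test; calling get_api_key() with no argument reads os.environ.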
Performance Tips
Fast Processing (Seconds)
python3 run_pageindex.py \
  --pdf_path doc.pdf \
  --if-add-node-summary no
Balanced (1-3 Minutes)
python3 run_pageindex.py \
  --pdf_path doc.pdf \
  --if-add-node-summary yes \
  --if-add-node-text no
Full-Featured (3-10 Minutes)
python3 run_pageindex.py \
  --pdf_path doc.pdf \
  --if-add-node-summary yes \
  --if-add-doc-description yes \
  --if-add-node-text yes
Common Use Cases
Research Papers
python3 run_pageindex.py \
  --pdf_path paper.pdf \
  --toc-check-pages 15 \
  --max-pages-per-node 8 \
  --if-add-node-summary yes
Financial Reports
python3 run_pageindex.py \
  --pdf_path 10k_filing.pdf \
  --toc-check-pages 30 \
  --max-pages-per-node 20 \
  --max-tokens-per-node 30000 \
  --if-add-node-summary yes
Technical Manuals
python3 run_pageindex.py \
  --pdf_path manual.pdf \
  --toc-check-pages 25 \
  --if-add-node-text yes \
  --if-add-node-summary yes
Documentation Sites (Markdown)
python3 run_pageindex.py \
  --md_path README.md \
  --if-thinning yes \
  --thinning-threshold 4000 \
  --if-add-node-summary yes
Batch Processing
Process multiple files with a shell script:
#!/bin/bash
# process_all.sh
for file in pdfs/*.pdf; do
  # Skip the literal pattern when pdfs/ contains no PDF files
  [ -e "$file" ] || continue
  echo "Processing $file..."
  python3 run_pageindex.py --pdf_path "$file" --if-add-node-summary yes
done
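The same loop can be written in Python when you want to collect failures instead of only printing progress. This is a sketch, not part of PageIndex: build_command and process_all are hypothetical names, the pdfs folder mirrors the shell script above, and a non-zero exit code is treated as a failure on the assumption that the script exits that way when an error is raised.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_command(pdf: Path) -> List[str]:
    """Build the CLI invocation for one PDF, using the flags from the shell script."""
    return [sys.executable, "run_pageindex.py",
            "--pdf_path", str(pdf),
            "--if-add-node-summary", "yes"]


def process_all(folder: str = "pdfs") -> List[Path]:
    """Run the CLI on every PDF in `folder` and return the files that failed."""
    failed = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        print(f"Processing {pdf}...")
        result = subprocess.run(build_command(pdf))
        if result.returncode != 0:  # assumed to signal a failed run
            failed.append(pdf)
    return failed


if __name__ == "__main__":
    for pdf in process_all():
        print(f"Failed: {pdf}")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.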
Troubleshooting
"API key not found"
# Set API key in .env file
echo "CHATGPT_API_KEY=sk-..." > .env
"Module not found"
# Install dependencies
pip3 install --upgrade -r requirements.txt
"Out of memory"
# Reduce token limits
python3 run_pageindex.py \
  --pdf_path large.pdf \
  --max-tokens-per-node 15000 \
  --if-add-node-text no
"Processing too slow"
# Disable summaries
python3 run_pageindex.py \
  --pdf_path doc.pdf \
  --if-add-node-summary no