Prerequisites

Before installing PageIndex, ensure you have:
  • Python 3.8+ installed on your system
  • pip or pip3 package manager
  • An OpenAI API key with access to GPT-4o models
PageIndex is designed for self-hosting and local deployment. For cloud-based solutions, see the Chat Platform or API.
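The Python requirement listed above can be checked from the interpreter itself. The following is a minimal sketch; the helper name `python_is_supported` is made up for illustration and is not part of PageIndex.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the prerequisites


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the stated minimum."""
    return tuple(version_info[:2]) >= minimum


# Report on the interpreter that is running this snippet.
print("Python", sys.version.split()[0], "supported:", python_is_supported())
```

Run it with the same `python3` you intend to use for PageIndex, since several interpreters may be installed side by side.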

Install from Source

1. Clone the Repository

Clone the PageIndex repository from GitHub:
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex

2. Install Dependencies

Install the required Python packages:
pip3 install --upgrade -r requirements.txt
This installs the following pinned dependencies, as listed in requirements.txt:
openai==1.101.0
pymupdf==1.26.4
PyPDF2==3.0.1
python-dotenv==1.1.0
tiktoken==0.11.0
pyyaml==6.0.2

3. Configure Environment Variables

Create a .env file in the root directory:
touch .env
Add your OpenAI API key to the .env file:
CHATGPT_API_KEY=your_openai_key_here
Never commit your .env file to version control. Add it to .gitignore to keep your API key secure.
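One way to follow the advice above is to add the entry programmatically. This is a sketch using only the standard library; `ensure_ignored` is a hypothetical helper, not something PageIndex ships.

```python
from pathlib import Path


def ensure_ignored(repo_root=".", entry=".env"):
    """Add `entry` to .gitignore under repo_root unless it is already listed.

    Returns True if the file was changed, False if the entry was present.
    """
    gitignore = Path(repo_root) / ".gitignore"
    text = gitignore.read_text() if gitignore.exists() else ""
    if entry in (line.strip() for line in text.splitlines()):
        return False
    # Start the new entry on its own line if the file lacks a final newline.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    gitignore.write_text(text + separator + entry + "\n")
    return True
```

Calling `ensure_ignored()` from the repository root is safe to repeat: a second call leaves the file unchanged.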

4. Verify Installation

Test your installation by running PageIndex on a sample PDF:
python3 run_pageindex.py --pdf_path tests/pdfs/sample.pdf
If successful, you should see:
Parsing done, saving to file...
Tree structure saved to: ./results/sample_structure.json
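To confirm that the saved file is valid JSON, it can be loaded with the standard library. The sketch below makes no assumption about the tree schema, which this page does not describe. It only reports what sits at the top level, and `summarize_result` is an illustrative name.

```python
import json
from pathlib import Path


def summarize_result(path="./results/sample_structure.json"):
    """Load the saved JSON and describe its top-level value."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return "object with keys: " + ", ".join(sorted(data))
    if isinstance(data, list):
        return "array with {} item(s)".format(len(data))
    return "scalar of type " + type(data).__name__
```

A `json.JSONDecodeError` here would suggest the run was interrupted before the file was fully written.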

Package Dependencies

PageIndex relies on several key Python packages:

OpenAI

Version: 1.101.0
Official OpenAI Python client for GPT-4o API access. Used for LLM-powered reasoning and tree generation.

PyMuPDF

Version: 1.26.4
High-performance PDF parsing library. Extracts text content and page structure from PDF documents.

PyPDF2

Version: 3.0.1
Additional PDF utilities for metadata extraction and document manipulation.

python-dotenv

Version: 1.1.0
Loads environment variables from .env files for secure API key management.

tiktoken

Version: 0.11.0
OpenAI’s token counting library. Ensures nodes stay within token limits for optimal LLM processing.

PyYAML

Version: 6.0.2
Configuration file parser for loading user settings and default parameters.
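Whether the active environment matches these pins can be checked with `importlib.metadata` from the standard library. This is a sketch, not part of PageIndex: the `PINNED` table is copied by hand from the versions listed in this guide, and `check_pins` is an illustrative helper.

```python
from importlib import metadata

# Pins copied from requirements.txt as listed in this guide.
PINNED = {
    "openai": "1.101.0",
    "pymupdf": "1.26.4",
    "PyPDF2": "3.0.1",
    "python-dotenv": "1.1.0",
    "tiktoken": "0.11.0",
    "pyyaml": "6.0.2",
}


def check_pins(pins=PINNED):
    """Return {package: (expected, installed_or_None)} for each mismatch."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            problems[name] = (expected, installed)
    return problems


# Print one line per package that is missing or at a different version.
for name, (expected, installed) in check_pins().items():
    print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

No output means every pinned package is installed at the listed version.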

Python Module Structure

After installation, you can import PageIndex in your Python code:
from pageindex import *
from pageindex.page_index_md import md_to_tree
The main entry point is run_pageindex.py, which provides:
  • PDF Processing: page_index_main() - Generate tree from PDF
  • Markdown Processing: md_to_tree() - Generate tree from markdown
  • Configuration: config() - Customize tree generation parameters

Alternative Installation Methods

For isolated package management, use a virtual environment:
# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate  # On Linux/Mac
venv\Scripts\activate     # On Windows

# Install dependencies
pip install --upgrade -r requirements.txt
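To confirm that the environment is active before installing, compare `sys.prefix` with `sys.base_prefix`. Inside an environment created by `python3 -m venv` the two differ. The helper name `in_virtualenv` is illustrative only.

```python
import sys


def in_virtualenv(prefix=sys.prefix, base_prefix=sys.base_prefix):
    """Return True when running inside a venv.

    venv points sys.prefix at the environment directory, while
    sys.base_prefix keeps pointing at the base interpreter.
    """
    return prefix != base_prefix


print("Virtual environment active:", in_virtualenv())
```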

Docker (Coming Soon)

Dockerized deployment will be available in a future release. For now, use the source installation method.

Troubleshooting

If imports fail or a module cannot be found, make sure you’re running commands from the PageIndex root directory and that all dependencies are installed:
pip3 install --upgrade -r requirements.txt
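Both conditions can be checked quickly from Python. The sketch below assumes that the repository root is recognizable by the two files this guide mentions, `run_pageindex.py` and `requirements.txt`. The function names are illustrative.

```python
import importlib.util
from pathlib import Path

# Files this guide refers to at the repository root.
EXPECTED_FILES = ("run_pageindex.py", "requirements.txt")


def missing_root_files(directory="."):
    """Return the expected files that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def module_importable(name):
    """Return True if Python can locate a top-level module called `name`."""
    return importlib.util.find_spec(name) is not None


print("Missing from current directory:", missing_root_files() or "nothing")
print("pageindex importable:", module_importable("pageindex"))
```

If files are reported missing, change into the cloned PageIndex directory and run the check again.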
If the API key is not being picked up, verify that it is correctly set in the .env file:
cat .env
# Should show: CHATGPT_API_KEY=sk-...
Ensure there are no extra spaces or quotes around the key.
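The same check can be made without printing the key to the terminal. This sketch follows the guidance above (no spaces, no quotes) and does not reproduce python-dotenv's own parsing rules. `find_env_problems` is a hypothetical helper.

```python
def find_env_problems(text, key="CHATGPT_API_KEY"):
    """Return a list of problems with `key` in the given .env text."""
    for raw in text.splitlines():
        # Skip comments and lines that are not NAME=value assignments.
        if raw.strip().startswith("#") or "=" not in raw:
            continue
        name, value = raw.split("=", 1)
        if name.strip() != key:
            continue
        problems = []
        if name != name.strip() or value != value.strip():
            problems.append("extra whitespace around the name or value")
        stripped = value.strip()
        if stripped[:1] in ("'", '"') or stripped[-1:] in ("'", '"'):
            problems.append("value is wrapped in quotes")
        if not stripped:
            problems.append("value is empty")
        return problems
    return ["{} is not defined".format(key)]
```

For example, `find_env_problems(open(".env").read())` returns an empty list when the entry has the plain `CHATGPT_API_KEY=sk-...` form shown above.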
Some complex PDFs may have parsing issues. Try:
  1. Ensure the PDF is not password-protected
  2. Check that the PDF contains extractable text (not just images)
  3. For scanned documents, consider using PageIndex OCR
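The first two checks in the list above can be sketched with PyMuPDF, which is already pinned in requirements.txt. The 20-character threshold is an arbitrary assumption for illustration, and both function names are hypothetical. PyMuPDF is imported lazily so that the pure helper works without it.

```python
from typing import List


def pages_without_text(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based numbers of pages with fewer than min_chars of text."""
    return [i for i, text in enumerate(page_texts, start=1)
            if len(text.strip()) < min_chars]


def read_page_texts(pdf_path: str) -> List[str]:
    """Extract plain text from each page using PyMuPDF."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency

    with fitz.open(pdf_path) as doc:
        if doc.needs_pass:
            raise ValueError("PDF is password-protected")
        return [page.get_text() for page in doc]
```

Calling `pages_without_text(read_page_texts("document.pdf"))` lists the pages that are likely scanned images. If most pages appear, the document probably needs OCR first.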
If nodes exceed token limits, adjust the parameters:
python3 run_pageindex.py \
  --pdf_path document.pdf \
  --max-pages-per-node 5 \
  --max-tokens-per-node 10000
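When trying several limits, it can help to assemble the command in Python. This sketch uses only the flags shown above. `build_command` is an illustrative helper, and the script is assumed to be run from the PageIndex root directory.

```python
import sys


def build_command(pdf_path, max_pages_per_node=None, max_tokens_per_node=None):
    """Build the argument list for run_pageindex.py from optional limits."""
    cmd = [sys.executable, "run_pageindex.py", "--pdf_path", str(pdf_path)]
    if max_pages_per_node is not None:
        cmd += ["--max-pages-per-node", str(max_pages_per_node)]
    if max_tokens_per_node is not None:
        cmd += ["--max-tokens-per-node", str(max_tokens_per_node)]
    return cmd
```

Passing the result to `subprocess.run(build_command("document.pdf", 5, 10000), check=True)` runs the same invocation as the shell command above, using the current interpreter.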

Next Steps

Quick Start Guide

Generate your first PageIndex tree structure

API Reference

Explore configuration options and Python API