Installation

Prerequisites

Before installing PageIndex, ensure you have:

Python 3.8+ installed on your system
pip or pip3 package manager
An OpenAI API key with access to GPT-4o models
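The first two prerequisites can be checked from Python itself. The following is a minimal sketch using only the standard library; the helper names `python_ok` and `pip_available` are illustrative and not part of PageIndex.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the prerequisites


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the interpreter meets the minimum major.minor version."""
    return tuple(version_info[:2]) >= minimum


def pip_available():
    """Return True when a pip3 or pip executable is found on PATH."""
    return shutil.which("pip3") is not None or shutil.which("pip") is not None


if __name__ == "__main__":
    print("Python >= 3.8:", python_ok())
    print("pip on PATH:", pip_available())
```

This does not check the OpenAI API key, which is configured later in the .env file.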

PageIndex is designed for self-hosting and local deployment. For cloud-based solutions, see the Chat Platform or API.

Install from Source

Clone the Repository

Clone the PageIndex repository from GitHub:

git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex

Install Dependencies

Install the required Python packages:

pip3 install --upgrade -r requirements.txt

This will install the following dependencies:

requirements.txt

openai==1.101.0
pymupdf==1.26.4
PyPDF2==3.0.1
python-dotenv==1.1.0
tiktoken==0.11.0
pyyaml==6.0.2
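To confirm that the installed versions match these pins, one option is to compare them with the standard library's `importlib.metadata`. This is a sketch, not a PageIndex utility; the `check_pins` helper and its report strings are assumptions made for illustration.

```python
from importlib import metadata

# Pins copied from the requirements.txt listing above.
PINNED = {
    "openai": "1.101.0",
    "pymupdf": "1.26.4",
    "PyPDF2": "3.0.1",
    "python-dotenv": "1.1.0",
    "tiktoken": "0.11.0",
    "pyyaml": "6.0.2",
}


def check_pins(pinned, get_version=metadata.version):
    """Map each package name to 'ok', 'missing', or a mismatch message."""
    report = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if found == expected:
            report[name] = "ok"
        else:
            report[name] = f"found {found}, expected {expected}"
    return report


if __name__ == "__main__":
    for name, status in check_pins(PINNED).items():
        print(f"{name}: {status}")
```

The `get_version` parameter exists only so the lookup can be swapped out when testing; in normal use the default is enough.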

Configure Environment Variables

Create a .env file in the root directory:

touch .env

Add your OpenAI API key:

.env

CHATGPT_API_KEY=your_openai_key_here

Never commit your .env file to version control. Add it to .gitignore to keep your API key secure.
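Adding the entry to .gitignore can be scripted. The sketch below appends `.env` only if it is not already listed; the function name `ensure_ignored` is hypothetical, and it assumes the .gitignore file sits in the current directory.

```python
from pathlib import Path


def ensure_ignored(gitignore=".gitignore", entry=".env"):
    """Append entry to the gitignore file unless already listed.

    Returns True if the file was changed, False otherwise.
    """
    path = Path(gitignore)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    if entry in (line.strip() for line in text.splitlines()):
        return False
    # Start on a fresh line if the file does not end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    path.write_text(text + prefix + entry + "\n", encoding="utf-8")
    return True
```

Calling `ensure_ignored()` from the repository root is enough; a second call leaves the file unchanged.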

Verify Installation

Test your installation by running PageIndex on a sample PDF:

python3 run_pageindex.py --pdf_path tests/pdfs/sample.pdf

If successful, you should see:

Parsing done, saving to file...
Tree structure saved to: ./results/sample_structure.json
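Beyond the console message, you can confirm that the saved file is valid JSON. The output schema is not described on this page, so the sketch below makes no assumptions about field names: it only reports the top-level type and counts nested JSON objects. The helper names `count_nodes` and `summarize` are illustrative.

```python
import json
from pathlib import Path


def count_nodes(obj):
    """Count every JSON object (dict) nested anywhere inside obj."""
    if isinstance(obj, dict):
        return 1 + sum(count_nodes(v) for v in obj.values())
    if isinstance(obj, list):
        return sum(count_nodes(v) for v in obj)
    return 0


def summarize(path):
    """Load a JSON file and return (top-level type name, nested object count)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return type(data).__name__, count_nodes(data)


if __name__ == "__main__":
    result = Path("./results/sample_structure.json")
    if result.exists():
        print(summarize(result))
    else:
        print("No output file found at", result)
```

A `json.JSONDecodeError` from `summarize` would indicate a truncated or corrupted output file.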

Package Dependencies

PageIndex relies on several key Python packages:

OpenAI

Version: 1.101.0

Official OpenAI Python client for GPT-4o API access. Used for LLM-powered reasoning and tree generation.

PyMuPDF

Version: 1.26.4

High-performance PDF parsing library. Extracts text content and page structure from PDF documents.

PyPDF2

Version: 3.0.1

Additional PDF utilities for metadata extraction and document manipulation.

python-dotenv

Version: 1.1.0

Loads environment variables from .env files for secure API key management.

tiktoken

Version: 0.11.0

OpenAI’s token counting library. Ensures nodes stay within token limits for optimal LLM processing.

PyYAML

Version: 6.0.2

Configuration file parser for loading user settings and default parameters.

Python Module Structure

After installation, you can import PageIndex in your Python code:

from pageindex import *
from pageindex.page_index_md import md_to_tree

The main entry point is run_pageindex.py, which provides:

PDF Processing: page_index_main() - Generate tree from PDF
Markdown Processing: md_to_tree() - Generate tree from markdown
Configuration: config() - Customize tree generation parameters

Alternative Installation Methods

Virtual Environment (Recommended)

For isolated package management, use a virtual environment:

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate  # On Linux/Mac
venv\Scripts\activate     # On Windows

# Install dependencies
pip install --upgrade -r requirements.txt

Docker (Coming Soon)

Dockerized deployment will be available in a future release. For now, use the source installation method.

Troubleshooting

ImportError: No module named 'pageindex'

Make sure you’re running commands from the PageIndex root directory and that all dependencies are installed:

pip3 install --upgrade -r requirements.txt

OpenAI API Error: Authentication Failed

Verify your API key is correctly set in the .env file:

cat .env
# Should show: CHATGPT_API_KEY=sk-...

Ensure there are no extra spaces or quotes around the key.
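The same check can be done without printing the key to the terminal. The following sketch inspects the text of a .env file for the problems mentioned above (missing key, stray spaces, quotes); `env_key_problems` is a hypothetical helper, not part of PageIndex or python-dotenv.

```python
def env_key_problems(text, key="CHATGPT_API_KEY"):
    """Return a list of problems found on the key's line in .env text."""
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            problems = []
            if name != key or value != value.strip():
                problems.append("extra spaces around the name or value")
            stripped = value.strip()
            if not stripped:
                problems.append("value is empty")
            elif stripped[0] in "'\"" or stripped[-1] in "'\"":
                problems.append("quotes around the value")
            return problems
    return [f"{key} is not set"]


if __name__ == "__main__":
    from pathlib import Path

    env_file = Path(".env")
    if env_file.exists():
        # Prints only the problems, never the key itself.
        print(env_key_problems(env_file.read_text(encoding="utf-8")) or "looks fine")
    else:
        print(".env not found in the current directory")
```

An empty list means the line is well-formed; it does not prove that the key itself is valid or has GPT-4o access.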

PDF Parsing Errors

Some complex PDFs may fail to parse. Check the following:

Ensure the PDF is not password-protected
Check that the PDF contains extractable text (not just images)
For scanned documents, consider using PageIndex OCR
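The first two checks can be run with PyMuPDF, which is already installed from requirements.txt. This is a rough sketch: the helper names are illustrative, and treating a page with only whitespace as "no extractable text" is an assumption that may misclassify intentionally blank pages.

```python
def pages_without_text(page_texts):
    """Return 1-based page numbers whose extracted text is empty or whitespace."""
    return [i for i, text in enumerate(page_texts, start=1) if not text.strip()]


def extract_page_texts(pdf_path):
    """Extract plain text per page with PyMuPDF.

    Raises ValueError if the file is password-protected.
    """
    import fitz  # PyMuPDF; imported here so the helper above works without it

    with fitz.open(pdf_path) as doc:
        if doc.needs_pass:
            raise ValueError("PDF is password-protected")
        return [page.get_text() for page in doc]
```

For example, `pages_without_text(extract_page_texts("document.pdf"))` lists the pages that are likely image-only. If most pages appear in the result, the document is probably scanned.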

Token Limit Exceeded

If nodes exceed token limits, adjust the parameters:

python3 run_pageindex.py \
  --pdf_path document.pdf \
  --max-pages-per-node 5 \
  --max-tokens-per-node 10000


Next Steps

Quick Start Guide

API Reference
