
Introduction

In modern document question answering (QA) systems, Optical Character Recognition (OCR) plays an important role by converting PDF pages into text that can be processed by Large Language Models (LLMs). Traditional OCR systems typically use a two-stage process: first detect the layout of a PDF page, dividing it into text, tables, and images, then recognize these elements and convert them into plain text. With the rise of vision-language models (VLMs) such as Qwen-VL and GPT-4.1, new end-to-end OCR models like DeepSeek-OCR have emerged. This raises an important question:
If a VLM can already process both the document images and the query to produce an answer directly, do we still need the intermediate OCR step?
In this cookbook, we present a practical implementation of a vision-based question-answering system for long documents that does not rely on OCR. Specifically, we use PageIndex as a reasoning-based retrieval layer and OpenAI’s multimodal GPT-4.1 as the VLM for visual reasoning and answer generation. See the original blog post for a more detailed discussion.

What You’ll Learn

This cookbook demonstrates a minimal, vision-based vectorless RAG pipeline for long documents with PageIndex, using only visual context from PDF pages. You will learn how to:
  • Build a PageIndex tree structure of a document
  • Perform reasoning-based retrieval with tree search
  • Extract PDF page images of retrieved tree nodes for visual context
  • Generate answers using VLM with PDF image inputs only (no OCR required)
This example uses PageIndex’s reasoning-based retrieval with OpenAI’s multimodal GPT-4.1 model for both tree search and visual context reasoning.

Setup

1. Install Dependencies

Install the required packages:
pip install --upgrade pageindex requests openai PyMuPDF
2. Setup PageIndex Client

Initialize the PageIndex client:
from pageindex import PageIndexClient
import pageindex.utils as utils

# Get your PageIndex API key from https://dash.pageindex.ai/api-keys
PAGEINDEX_API_KEY = "YOUR_PAGEINDEX_API_KEY"
pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)
3. Setup VLM

Configure OpenAI’s multimodal GPT-4.1 as the VLM:
import openai, fitz, base64, os

OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"

async def call_vlm(prompt, image_paths=None, model="gpt-4.1"):
    client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)
    # Default to a text-only message; replace it with multimodal content if images are given
    messages = [{"role": "user", "content": prompt}]
    if image_paths:
        content = [{"type": "text", "text": prompt}]
        for image in image_paths:
            if os.path.exists(image):
                with open(image, "rb") as image_file:
                    # Base64-encode each page image and attach it as an image_url part
                    image_data = base64.b64encode(image_file.read()).decode('utf-8')
                    content.append({
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_data}"
                        }
                    })
        messages[0]["content"] = content
    response = await client.chat.completions.create(model=model, messages=messages, temperature=0)
    return response.choices[0].message.content.strip()
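
As a quick sanity check, you can call call_vlm directly once page images exist on disk. The path below is hypothetical; it is only created after running the extraction step later in this cookbook:
# Hypothetical smoke test; "pdf_images/page_1.jpg" only exists after running
# extract_pdf_page_images below.
reply = await call_vlm("Describe this page in one sentence.", ["pdf_images/page_1.jpg"])
print(reply)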
4. PDF Image Extraction Helper Functions

Create helper functions to extract PDF page images:
def extract_pdf_page_images(pdf_path, output_dir="pdf_images"):
    os.makedirs(output_dir, exist_ok=True)
    pdf_document = fitz.open(pdf_path)
    page_images = {}
    total_pages = len(pdf_document)
    for page_number in range(len(pdf_document)):
        page = pdf_document.load_page(page_number)
        # Convert page to image
        mat = fitz.Matrix(2.0, 2.0)  # 2x zoom for better quality
        pix = page.get_pixmap(matrix=mat)
        img_data = pix.tobytes("jpeg")
        image_path = os.path.join(output_dir, f"page_{page_number + 1}.jpg")
        with open(image_path, "wb") as image_file:
            image_file.write(img_data)
        page_images[page_number + 1] = image_path
        print(f"Saved page {page_number + 1} image: {image_path}")
    pdf_document.close()
    return page_images, total_pages

def get_page_images_for_nodes(node_list, node_map, page_images):
    # Get PDF page images for retrieved nodes
    image_paths = []
    seen_pages = set()
    for node_id in node_list:
        node_info = node_map[node_id]
        for page_num in range(node_info['start_index'], node_info['end_index'] + 1):
            if page_num not in seen_pages:
                image_paths.append(page_images[page_num])
                seen_pages.add(page_num)
    return image_paths
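As a reference for the expected input shapes, here is a toy illustration with hand-made data; the node ID and page numbers are made up, and real values come from utils.create_node_mapping and extract_pdf_page_images in the steps below:
# Illustrative only: fake node_map and page_images showing the expected shapes
fake_node_map = {"0007": {"start_index": 4, "end_index": 4}}
fake_page_images = {4: "pdf_images/page_4.jpg"}
print(get_page_images_for_nodes(["0007"], fake_node_map, fake_page_images))
# -> ['pdf_images/page_4.jpg']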

Step 1: PageIndex Tree Generation

Submit Document and Extract Images

Download the document, extract page images, and submit for PageIndex tree generation:
import os, requests

pdf_url = "https://arxiv.org/pdf/1706.03762.pdf"  # "Attention Is All You Need" paper
pdf_path = os.path.join("../data", pdf_url.split('/')[-1])
os.makedirs(os.path.dirname(pdf_path), exist_ok=True)

response = requests.get(pdf_url)
with open(pdf_path, "wb") as f:
    f.write(response.content)
print(f"Downloaded {pdf_url}\n")

# Extract page images from PDF
print("Extracting page images...")
page_images, total_pages = extract_pdf_page_images(pdf_path)
print(f"Extracted {len(page_images)} page images from {total_pages} total pages.\n")

doc_id = pi_client.submit_document(pdf_path)["doc_id"]
print('Document Submitted:', doc_id)

Get the Tree Structure

Retrieve the generated tree:
if pi_client.is_retrieval_ready(doc_id):
    tree = pi_client.get_tree(doc_id, node_summary=True)['result']
    print('Simplified Tree Structure of the Document:')
    utils.print_tree(tree, exclude_fields=['text'])
else:
    print("Processing document, please try again later...")
Example Output:
[{'title': 'Attention Is All You Need',
  'node_id': '0000',
  'page_index': 1,
  'nodes': [{'title': 'Abstract', 'node_id': '0001', 'page_index': 1, ...},
            {'title': '3 Model Architecture',
             'node_id': '0004',
             'page_index': 2,
             'nodes': [{'title': '3.2 Attention',
                        'node_id': '0006',
                        'nodes': [{'title': '3.2.1 Scaled Dot-Product Attention',
                                   'node_id': '0007',
                                   'page_index': 4, ...}]}]}]}
]
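
If the document is still being processed, a simple polling loop can wait until retrieval is ready. This is a minimal sketch; the 5-second interval is arbitrary and you may want to add a timeout for large documents:
import time

# Poll until the PageIndex tree is ready, then fetch it
# (simple sketch; consider adding a timeout for large documents)
while not pi_client.is_retrieval_ready(doc_id):
    print("Processing document, waiting...")
    time.sleep(5)
tree = pi_client.get_tree(doc_id, node_summary=True)['result']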

Step 2: Reasoning-Based Retrieval

Identify Relevant Nodes

Use the VLM for tree search to identify nodes that might contain relevant context:
import json

query = "What is the last operation in the Scaled Dot-Product Attention figure?"

tree_without_text = utils.remove_fields(tree.copy(), fields=['text'])

search_prompt = f"""
You are given a question and a tree structure of a document.
Each node contains a node id, node title, and a corresponding summary.
Your task is to find all tree nodes that are likely to contain the answer to the question.

Question: {query}

Document tree structure:
{json.dumps(tree_without_text, indent=2)}

Please reply in the following JSON format:
{{
    "thinking": "<Your thinking process on which nodes are relevant to the question>",
    "node_list": ["node_id_1", "node_id_2", ..., "node_id_n"]
}}
Directly return the final JSON structure. Do not output anything else.
"""

tree_search_result = await call_vlm(search_prompt)
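
VLMs occasionally wrap their JSON reply in a Markdown code fence, which would make json.loads fail in the next step. A small optional cleanup helper (hypothetical, not part of the PageIndex SDK) guards against this:
def strip_code_fence(text):
    # Remove a surrounding ```json ... ``` fence if the model added one
    text = text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        if text.rstrip().endswith("```"):
            text = text.rstrip()[:-3]
    return text.strip()

# Optional: tree_search_result = strip_code_fence(tree_search_result)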

View Retrieved Nodes

Print the retrieved nodes and reasoning process:
node_map = utils.create_node_mapping(tree, include_page_ranges=True, max_page=total_pages)
tree_search_result_json = json.loads(tree_search_result)

print('Reasoning Process:\n')
utils.print_wrapped(tree_search_result_json['thinking'])

print('\nRetrieved Nodes:\n')
for node_id in tree_search_result_json["node_list"]:
    node_info = node_map[node_id]
    node = node_info['node']
    start_page = node_info['start_index']
    end_page = node_info['end_index']
    page_range = start_page if start_page == end_page else f"{start_page}-{end_page}"
    print(f"Node ID: {node['node_id']}\t Pages: {page_range}\t Title: {node['title']}")
Output:
Reasoning Process:

The question asks about the last operation in the Scaled Dot-Product Attention figure. The most
relevant section is the one that describes Scaled Dot-Product Attention in detail, including its
computation and the figure itself...

Retrieved Nodes:

Node ID: 0006	 Pages: 3-4	 Title: 3.2 Attention
Node ID: 0007	 Pages: 4	 Title: 3.2.1 Scaled Dot-Product Attention

Get PDF Page Images

Retrieve the corresponding PDF page images:
retrieved_nodes = tree_search_result_json["node_list"]
retrieved_page_images = get_page_images_for_nodes(retrieved_nodes, node_map, page_images)
print(f'\nRetrieved {len(retrieved_page_images)} PDF page image(s) for visual context.')

Step 3: Answer Generation with Vision

Generate Answer Using VLM

Generate an answer using the VLM with only PDF page images as visual context:
# Generate answer using VLM with only PDF page images as visual context
answer_prompt = f"""
Answer the question based on the images of the document pages as context.

Question: {query}

Provide a clear, concise answer based only on the context provided.
"""

print('Generated answer using VLM with retrieved PDF page images as visual context:\n')
answer = await call_vlm(answer_prompt, retrieved_page_images)
utils.print_wrapped(answer)
Output:
Generated answer using VLM with retrieved PDF page images as visual context:

The last operation in the **Scaled Dot-Product Attention** figure is a **MatMul** (matrix
multiplication). This operation multiplies the attention weights (after softmax) by the value matrix
\( V \).
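
For convenience, the steps above can be folded into a single function. Below is a minimal end-to-end sketch, assuming tree_without_text, node_map, page_images, call_vlm, and get_page_images_for_nodes defined earlier are already in scope:
# Minimal end-to-end sketch (not an official PageIndex API): reuses the
# objects and helpers defined earlier in this cookbook.
async def vision_qa(query):
    # 1. Reasoning-based retrieval: ask the VLM which tree nodes are relevant
    search_prompt = f"""
You are given a question and a tree structure of a document.
Your task is to find all tree nodes that are likely to contain the answer to the question.

Question: {query}

Document tree structure:
{json.dumps(tree_without_text, indent=2)}

Please reply in the following JSON format:
{{"thinking": "<your reasoning>", "node_list": ["node_id_1", ..., "node_id_n"]}}
Directly return the final JSON structure. Do not output anything else.
"""
    search_result = json.loads(await call_vlm(search_prompt))

    # 2. Collect the PDF page images covered by the retrieved nodes
    images = get_page_images_for_nodes(search_result["node_list"], node_map, page_images)

    # 3. Answer generation from visual context only
    answer_prompt = f"""
Answer the question based on the images of the document pages as context.

Question: {query}

Provide a clear, concise answer based only on the context provided.
"""
    return await call_vlm(answer_prompt, images)

# Example usage in a notebook cell:
# print(await vision_qa(query))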

Conclusion

In this cookbook, we demonstrated a minimal vision-based, vectorless RAG pipeline using PageIndex and a VLM. The system retrieves relevant pages by reasoning over the document’s hierarchical tree index and answers questions directly from PDF images — no OCR required.
If you’re interested in building your own reasoning-based document QA system, try PageIndex Chat, or integrate via PageIndex MCP and the Cloud API.
