page_index()

Function Signature

def page_index(
    doc,
    model=None,
    toc_check_page_num=None,
    max_page_num_each_node=None,
    max_token_num_each_node=None,
    if_add_node_id=None,
    if_add_node_summary=None,
    if_add_doc_description=None,
    if_add_node_text=None
)

Source: pageindex/page_index.py:1103

Description

The page_index() function is the main entry point for generating a PageIndex tree structure from a PDF document. It loads the default configuration from config.yaml, overrides those defaults with any parameters the caller provides, and then delegates to page_index_main() for processing.
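
The merge step can be pictured with a minimal sketch. It assumes that an argument left as None means "fall back to the default". DEFAULTS and merge_options are hypothetical stand-ins for config.yaml and the library's loader, not the actual implementation.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the values stored in config.yaml
# (copied from the defaults listed in this reference).
DEFAULTS = {
    "model": "gpt-4o-2024-11-20",
    "toc_check_page_num": 20,
    "max_page_num_each_node": 10,
    "max_token_num_each_node": 20000,
    "if_add_node_id": "yes",
    "if_add_node_summary": "yes",
    "if_add_doc_description": "no",
    "if_add_node_text": "no",
}


def merge_options(**user_opts):
    """Overlay non-None user options on the defaults (illustrative helper)."""
    unknown = set(user_opts) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown options: {sorted(unknown)}")
    # Assumption: None means "not provided", so the default is kept.
    overrides = {k: v for k, v in user_opts.items() if v is not None}
    return SimpleNamespace(**{**DEFAULTS, **overrides})


opt = merge_options(model=None, toc_check_page_num=30)
print(opt.model, opt.toc_check_page_num)
```

Here model keeps its default because it was passed as None, while toc_check_page_num takes the caller's value.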

Parameters

doc

str or BytesIO

required

Path to a PDF file (the path must end in .pdf) or a BytesIO object containing the PDF data.

model

str

default:"gpt-4o-2024-11-20"

OpenAI model to use for processing. Examples: "gpt-4o-2024-11-20", "gpt-4o", "gpt-4.1"

toc_check_page_num

int

default:20

Number of pages to check for table of contents detection. The function will scan this many pages from the beginning to find TOC pages.

max_page_num_each_node

int

default:10

Maximum number of pages allowed in each node. Nodes exceeding this limit will be recursively subdivided.

max_token_num_each_node

int

default:20000

Maximum token count per node. Used in conjunction with max_page_num_each_node to determine when to subdivide large nodes.
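
One way the two limits might combine is sketched below. It assumes a node is subdivided only when it exceeds both the page limit and the token limit. That reading and the needs_subdivision helper are hypothetical; check the pageindex source for the exact condition.

```python
def needs_subdivision(page_count, token_count,
                      max_page_num_each_node=10,
                      max_token_num_each_node=20000):
    """Hypothetical check: split a node only when both limits are exceeded."""
    too_many_pages = page_count > max_page_num_each_node
    too_many_tokens = token_count > max_token_num_each_node
    return too_many_pages and too_many_tokens


# A long, dense node is split; a long but sparse one is not (under this assumption).
print(needs_subdivision(page_count=12, token_count=25000))
print(needs_subdivision(page_count=12, token_count=5000))
```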

if_add_node_id

str

default:"yes"

Whether to add sequential node IDs to the tree structure. Valid values: "yes" or "no"

if_add_node_summary

str

default:"yes"

Whether to generate AI summaries for each node. Valid values: "yes" or "no"

if_add_doc_description

str

default:"no"

Whether to generate a one-sentence description for the entire document. Valid values: "yes" or "no"

if_add_node_text

str

default:"no"

Whether to include full text content in each node. Valid values: "yes" or "no"

Return Value

result

dict

A dictionary containing the document structure:


doc_name

str

The name of the PDF document (filename without extension)

doc_description

str

One-sentence description of the document (only if if_add_doc_description="yes")

structure

list[dict]

Hierarchical tree structure with nodes containing:

title: Section title
node_id: Sequential identifier (if enabled)
start_index: Starting page number
end_index: Ending page number
summary: AI-generated summary (if enabled)
text: Full text content (if enabled)
nodes: Child nodes (if any)
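
Because nodes can nest to any depth, reading the result usually calls for a recursive walk. The sketch below uses a hand-written sample dictionary shaped like the fields listed above (the titles and page numbers are made up) and a hypothetical iter_nodes helper; it assumes the nodes key may be missing on leaf nodes.

```python
def iter_nodes(nodes, depth=0):
    """Yield (depth, node) pairs in document order (illustrative helper)."""
    for node in nodes:
        yield depth, node
        # Assumption: leaf nodes may omit "nodes" or leave it empty.
        yield from iter_nodes(node.get("nodes") or [], depth + 1)


# Made-up sample shaped like the documented return value.
sample = {
    "doc_name": "document",
    "structure": [
        {
            "title": "Introduction",
            "start_index": 1,
            "end_index": 3,
            "nodes": [
                {"title": "Background", "start_index": 1, "end_index": 2},
            ],
        },
        {"title": "Methods", "start_index": 4, "end_index": 9},
    ],
}

outline = [(depth, node["title"]) for depth, node in iter_nodes(sample["structure"])]
for depth, title in outline:
    print("  " * depth + title)
```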

Example Usage

Basic Usage with Defaults

from pageindex import page_index

# Process PDF with default settings
result = page_index("document.pdf")

print(result['doc_name'])
print(f"Found {len(result['structure'])} top-level sections")

Custom Configuration

from pageindex import page_index

# Process with custom parameters
result = page_index(
    doc="financial_report.pdf",
    model="gpt-4o-2024-11-20",
    toc_check_page_num=30,
    max_page_num_each_node=15,
    max_token_num_each_node=25000,
    if_add_node_id="yes",
    if_add_node_summary="yes",
    if_add_doc_description="yes",
    if_add_node_text="no"
)

print(f"Document: {result['doc_name']}")
print(f"Description: {result['doc_description']}")

Processing BytesIO Objects

from io import BytesIO
from pageindex import page_index

# Read PDF from bytes
with open("document.pdf", "rb") as f:
    pdf_bytes = BytesIO(f.read())

result = page_index(pdf_bytes)

With Full Text Content

from pageindex import page_index

# Include full text in each node
result = page_index(
    doc="manual.pdf",
    if_add_node_text="yes"
)

# Access text from nodes
for node in result['structure']:
    print(f"Section: {node['title']}")
    print(f"Pages: {node['start_index']}-{node['end_index']}")
    print(f"Text preview: {node['text'][:200]}...")

Notes

All parameters except doc are optional; their defaults are loaded from config.yaml
Only parameters that differ from defaults need to be specified
The function uses ConfigLoader internally to merge user options with defaults
Processing time varies based on document length and enabled features
API costs depend on document size and whether summaries are generated
