Function Signature
pageindex/page_index.py:1058
Description
The page_index_main() function is the core processing function that generates a hierarchical tree structure from PDF documents. Unlike page_index(), it takes its settings as a single configuration object (opt) rather than as individual parameters.
This function orchestrates the entire PageIndex generation pipeline:
- Validates input (PDF file path or BytesIO object)
- Extracts text and tokens from PDF pages
- Detects and processes table of contents (if present)
- Generates hierarchical structure
- Optionally adds node IDs, summaries, and descriptions
Parameters
Path to PDF file or BytesIO object containing PDF data. Must be:
- A valid file path ending in .pdf, OR
- A BytesIO object containing PDF data
Raises ValueError if the input type is unsupported.
Configuration object (opt) containing processing options. If None, the default configuration is loaded from config.yaml. The config object should contain:
- model: OpenAI model name
- toc_check_page_num: Pages to check for TOC
- max_page_num_each_node: Max pages per node
- max_token_num_each_node: Max tokens per node
- if_add_node_id: Whether to add node IDs ("yes"/"no")
- if_add_node_summary: Whether to generate summaries ("yes"/"no")
- if_add_doc_description: Whether to generate doc description ("yes"/"no")
- if_add_node_text: Whether to include text ("yes"/"no")
Return Value
A dictionary containing the PageIndex structure.
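The exact shape of the returned dictionary is not reproduced on this page. As a rough sketch of how a hierarchical result could be traversed, the helper below assumes each node is a dict with a title and an optional list of children under nodes. Both field names are assumptions made for illustration, so check them against the actual output before relying on them.

```python
from typing import Any, Dict, Iterator, List, Tuple


def walk(nodes: List[Dict[str, Any]], depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, title) for every node, parents before children."""
    for node in nodes:
        # "title" and "nodes" are assumed field names, not confirmed by the docs.
        yield depth, node.get("title", "")
        yield from walk(node.get("nodes", []), depth + 1)


# Hand-written sample in the assumed shape; this is not real PageIndex output.
sample = [
    {"title": "Introduction", "nodes": []},
    {"title": "Methods", "nodes": [{"title": "Data"}, {"title": "Model"}]},
]

outline = list(walk(sample))
for depth, title in outline:
    print("  " * depth + title)
```

A depth-first walk like this is usually enough to print an outline or to collect nodes for further processing.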
Example Usage
Using ConfigLoader
Custom Configuration
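One way to build a configuration object by hand is sketched below. It assumes the function reads options as attributes (for example opt.model), which this page does not confirm. The keys are the ones listed under Parameters; the values are placeholders rather than the defaults shipped in config.yaml, and build_opt is a hypothetical helper.

```python
from types import SimpleNamespace

# Keys come from the Parameters section; values are illustrative placeholders.
BASE_OPTIONS = {
    "model": "your-model-name",
    "toc_check_page_num": 20,
    "max_page_num_each_node": 10,
    "max_token_num_each_node": 20000,
    "if_add_node_id": "yes",
    "if_add_node_summary": "no",
    "if_add_doc_description": "no",
    "if_add_node_text": "no",
}


def build_opt(**overrides):
    """Return an attribute-style options object, rejecting unknown keys."""
    unknown = set(overrides) - set(BASE_OPTIONS)
    if unknown:
        raise KeyError(f"Unknown option(s): {sorted(unknown)}")
    return SimpleNamespace(**{**BASE_OPTIONS, **overrides})


# Override only what differs from the base values.
opt = build_opt(if_add_node_summary="yes", max_page_num_each_node=5)
```

Rejecting unknown keys early catches misspelled option names before any PDF processing or API calls start.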
Processing BytesIO
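When the PDF is already in memory (for example, downloaded or uploaded bytes), it can be wrapped in a BytesIO object instead of being written to disk. The sketch below makes two assumptions: that the function is importable from pageindex.page_index (inferred from the file path above) and that the document and opt are passed positionally in that order. index_pdf_bytes and looks_like_pdf are hypothetical helper names.

```python
from io import BytesIO


def looks_like_pdf(pdf_bytes: bytes) -> bool:
    """Cheap pre-check: PDF files normally begin with the '%PDF-' marker."""
    return pdf_bytes.startswith(b"%PDF-")


def index_pdf_bytes(pdf_bytes: bytes, opt):
    """Wrap raw PDF bytes in BytesIO and pass them to page_index_main()."""
    if not looks_like_pdf(pdf_bytes):
        raise ValueError("Data does not look like a PDF")
    # Imported lazily so this sketch loads even without the package installed.
    # Import path and argument order are assumptions, not confirmed by this page.
    from pageindex.page_index import page_index_main
    return page_index_main(BytesIO(pdf_bytes), opt)
```

The pre-check is optional; it only avoids starting the pipeline on data that is clearly not a PDF.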
Minimal Configuration
With Full Features
Processing Pipeline
The function executes these steps internally:
- Validation: Checks if input is a valid PDF file or BytesIO object
- Text Extraction: Parses PDF and extracts text with token counts per page
- TOC Detection: Searches for table of contents in first N pages
- Structure Generation: Creates hierarchical tree using:
  - Detected TOC with page numbers, OR
  - Detected TOC without page numbers, OR
  - AI-generated structure (no TOC found)
- Verification: Validates generated structure accuracy
- Recursive Subdivision: Splits large nodes exceeding thresholds
- Enrichment: Adds node IDs, summaries, and descriptions if requested
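The recursive subdivision step can be illustrated with a small sketch. It is not the library's implementation: plain halving stands in for the real structure-generation step, and it assumes a node is split only when it exceeds both the page limit and the token limit, which this page does not state explicitly. The function name and signature are hypothetical.

```python
from typing import List, Tuple


def subdivide(start: int, end: int, page_tokens: List[int],
              max_pages: int = 10, max_tokens: int = 20000) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based page ranges that no longer exceed the limits.

    Assumption: a range is split only when it exceeds BOTH limits.
    Halving is a placeholder for the real structure-generation step.
    """
    pages = end - start + 1
    tokens = sum(page_tokens[start - 1:end])
    too_large = pages > max_pages and tokens > max_tokens
    if pages <= 1 or not too_large:
        return [(start, end)]
    mid = start + pages // 2 - 1  # last page of the left half
    return (subdivide(start, mid, page_tokens, max_pages, max_tokens)
            + subdivide(mid + 1, end, page_tokens, max_pages, max_tokens))


# 24 pages with 2,500 tokens each: 60,000 tokens in total.
page_tokens = [2500] * 24
ranges = subdivide(1, 24, page_tokens)
print(ranges)
```

With these inputs the 24-page range is halved twice, leaving four 6-page ranges that fall under the page limit.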
Error Handling
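The one error this page documents is a ValueError raised for unsupported input types. The sketch below mirrors that rule in a standalone check and shows how a caller might catch it. check_pdf_input is a hypothetical helper, and the file-existence test is an assumption about what "valid file path" means here.

```python
import os
from io import BytesIO


def check_pdf_input(doc):
    """Accept a BytesIO object or an existing .pdf path; otherwise raise ValueError."""
    if isinstance(doc, BytesIO):
        return "bytes"
    # Treating "valid" as "the file exists" is an assumption.
    if isinstance(doc, str) and doc.endswith(".pdf") and os.path.isfile(doc):
        return "path"
    raise ValueError("Unsupported input type. Expected a PDF file path or BytesIO object.")


try:
    check_pdf_input(12345)  # neither a path nor a BytesIO object
except ValueError as exc:
    message = str(exc)
    print(f"Rejected input: {message}")
```

Validating input before calling the function keeps bad arguments from reaching the slower extraction and API stages.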
Performance Considerations
- Processing time: ~1-5 minutes for typical documents (50-200 pages)
- Memory usage: Proportional to document size and enabled features
- API costs: Higher with summaries enabled; scales with document length
- Large nodes (>10 pages, >20k tokens) are recursively subdivided
See Also
- page_index() - Simplified function with individual parameters
- ConfigLoader - Configuration management
- CLI Reference - Command-line interface