Function Signature
pageindex/page_index.py:1103
Description
The page_index() function is the main entry point for generating a PageIndex tree structure from PDF documents. It automatically loads default configuration from config.yaml and merges user-provided parameters, then delegates to page_index_main() for processing.
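The merge step described above can be illustrated with a small sketch. It is not the library's ConfigLoader: the helper name merge_options, the DEFAULTS dictionary, and its keys and values are hypothetical stand-ins for whatever config.yaml actually contains, and the sketch assumes an option left as None means "use the default".

```python
# Hypothetical defaults standing in for the contents of config.yaml.
# The keys and values here are illustrative, not the real file.
DEFAULTS = {
    "model": "gpt-4o-2024-11-20",
    "max_page_num_each_node": 10,
    "if_add_node_summary": "yes",
}


def merge_options(defaults, **user_options):
    """Return a copy of defaults overridden by every user option that is not None."""
    unknown = set(user_options) - set(defaults)
    if unknown:
        # Reject misspelled option names instead of silently ignoring them.
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")
    merged = dict(defaults)
    merged.update({k: v for k, v in user_options.items() if v is not None})
    return merged


# Only "model" is overridden; the None value falls back to the default.
opts = merge_options(DEFAULTS, model="gpt-4o", max_page_num_each_node=None)
print(opts)
```

Copying the defaults before updating keeps the shared DEFAULTS dictionary unchanged between calls.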
Parameters
- Path to a PDF file or a BytesIO object containing the PDF data. Must be a valid PDF file path (ending in .pdf) or a BytesIO object.
- OpenAI model to use for processing. Examples: "gpt-4o-2024-11-20", "gpt-4o", "gpt-4.1".
- Number of pages to check for table of contents detection. The function scans this many pages from the beginning of the document to find TOC pages.
- Maximum number of pages allowed in each node. Nodes exceeding this limit are recursively subdivided.
- Maximum token count per node. Used in conjunction with max_page_num_each_node to determine when to subdivide large nodes.
- Whether to add sequential node IDs to the tree structure. Valid values: "yes" or "no".
- Whether to generate AI summaries for each node. Valid values: "yes" or "no".
- Whether to generate a one-sentence description for the entire document. Valid values: "yes" or "no".
- Whether to include full text content in each node. Valid values: "yes" or "no".
Return Value
A dictionary containing the document structure:
Example Usage
Basic Usage with Defaults
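A minimal sketch of the default call, assuming the function is importable from pageindex.page_index (the module path listed under Function Signature). The helper name build_tree is hypothetical, and the import sits inside it so the snippet loads even where the package is not installed.

```python
def build_tree(pdf_path):
    """Index a PDF with every option left at its config.yaml default."""
    # A file-path input must end in .pdf (the alternative is a BytesIO object).
    if not pdf_path.endswith(".pdf"):
        raise ValueError(f"expected a path ending in .pdf, got {pdf_path!r}")
    # Assumed import path, derived from pageindex/page_index.py.
    from pageindex.page_index import page_index
    return page_index(pdf_path)
```

Calling build_tree("report.pdf"), where report.pdf is a placeholder file name, would return the dictionary described under Return Value. The call sends requests to the configured model, so it incurs API cost.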
Custom Configuration
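A sketch of overriding a single default, assuming the function is importable from pageindex.page_index and accepts the override as a keyword argument. max_page_num_each_node is the one option name that appears in the Parameters text; the value 5, the CUSTOM_OPTIONS name, and the helper name are illustrative.

```python
# Only options that differ from config.yaml need to be listed.
# The value 5 is an arbitrary example, not a recommended setting.
CUSTOM_OPTIONS = {"max_page_num_each_node": 5}


def build_custom_tree(pdf_path, options=CUSTOM_OPTIONS):
    """Index a PDF, overriding only the options given in `options`."""
    # Assumed import path, derived from pageindex/page_index.py.
    from pageindex.page_index import page_index
    return page_index(pdf_path, **options)
```

Keeping the overrides in a dictionary makes it easy to see at a glance which settings depart from the defaults.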
Processing BytesIO Objects
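A sketch for PDF data that is already in memory, for example bytes received from an upload. The helper names are hypothetical, the import path is assumed from pageindex/page_index.py, and the header check is a simple sanity test added for illustration rather than something the library is known to perform.

```python
from io import BytesIO


def pdf_bytes_to_stream(data: bytes) -> BytesIO:
    """Wrap raw PDF bytes in a BytesIO, rejecting data without a PDF header."""
    # PDF files begin with the marker "%PDF-" followed by a version number.
    if not data.startswith(b"%PDF-"):
        raise ValueError("data does not start with a PDF header")
    return BytesIO(data)


def build_tree_from_bytes(data: bytes):
    """Index an in-memory PDF without writing it to disk first."""
    # Assumed import path, derived from pageindex/page_index.py.
    from pageindex.page_index import page_index
    return page_index(pdf_bytes_to_stream(data))
```

Failing early on non-PDF bytes avoids spending model calls on input that cannot be parsed.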
With Full Text Content
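A sketch of requesting full text in each node. The keyword name if_add_node_text is an assumption, since the option's name does not appear in the Parameters text; check it against the real signature before use. The "yes" value follows the valid values listed for that option, and the import path is assumed from pageindex/page_index.py.

```python
# ASSUMPTION: "if_add_node_text" is a guessed keyword name for the full-text
# switch; confirm it against the actual function signature.
FULL_TEXT_OPTIONS = {"if_add_node_text": "yes"}  # valid values: "yes" or "no"


def build_tree_with_text(pdf_path, options=FULL_TEXT_OPTIONS):
    """Index a PDF and ask for the full text content of every node."""
    # Assumed import path, derived from pageindex/page_index.py.
    from pageindex.page_index import page_index
    return page_index(pdf_path, **options)
```

Including full text makes the returned dictionary considerably larger for long documents, so it is worth enabling only when downstream code needs the text itself.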
Notes
- All parameters are optional; defaults are loaded from config.yaml
- Only parameters that differ from defaults need to be specified
- The function uses ConfigLoader internally to merge user options with defaults
- Processing time varies based on document length and enabled features
- API costs depend on document size and whether summaries are generated
See Also
- page_index_main() - Lower-level function with explicit options object
- ConfigLoader - Configuration management system
- CLI Reference - Command-line interface options