Overview
PageIndex can be configured through:

- CLI arguments when using run_pageindex.py
- Function parameters when using the Python API
- Config file (config.yaml) for default values
Configuration Methods
CLI Configuration
Programmatic Configuration (PDF)
Programmatic Configuration (Markdown)
Default Configuration File
PageIndex reads defaults from pageindex/config.yaml:
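As a sketch only, a defaults file along these lines matches the parameters described on this page. The field names are assumptions inferred from the Python parameter names mentioned below (only max_page_num_each_node and if_add_node_summary appear verbatim on this page), and only the toc_check_pages default of 20 is stated here; the other values are illustrative. Verify against the actual pageindex/config.yaml in your checkout:

```yaml
# Hypothetical sketch — field names and most values are assumptions;
# check the real pageindex/config.yaml before relying on them.
model: gpt-4o-2024-11-20
toc_check_page_num: 20          # only this default (20) is stated on this page
max_page_num_each_node: 10
max_token_num_each_node: 20000
if_add_node_id: "yes"
if_add_node_summary: "no"
if_add_doc_description: "no"
if_add_node_text: "no"
```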
Universal Parameters
These parameters apply to both PDF and Markdown processing.

Model Configuration

The LLM model to use for structure extraction (PDF) and summary generation.

Supported Models:

- gpt-4o-2024-11-20 (recommended)
- gpt-4o
- gpt-4-turbo
- Any OpenAI-compatible model endpoint
Content Enrichment
Whether to add unique node IDs to each section in the tree. Options: yes, no

Whether to generate AI-powered summaries for each node. Options: yes, no

Whether to generate an overall description of the entire document. Options: yes, no. Requires if_add_node_summary=yes to function.

Whether to include the full text content of each section in the output. Options: yes, no

PDF-Only Parameters
These parameters only apply when processing PDF documents.

TOC Detection

Number of pages to scan from the beginning for table of contents detection.

CLI Name: --toc-check-pages

Use Cases:

- Default (20): Works for most documents
- 30-50: Documents with long, multi-page TOCs
- 5-10: Short documents or those without TOCs
Node Size Limits
Maximum number of pages allowed per node before recursive subdivision.

CLI Name: --max-pages-per-node

When a node exceeds this page count AND the token limit, PageIndex will:

- Extract sub-structure from that node
- Recursively subdivide until all nodes are within limits

Guidelines:
- Small (5-8 pages): Fine-grained retrieval, more nodes
- Medium (10-15 pages): Balanced approach (recommended)
- Large (20+ pages): Faster processing, coarser structure
Maximum number of tokens allowed per node before recursive subdivision.

CLI Name: --max-tokens-per-node

This works in conjunction with max_page_num_each_node. A node is subdivided only if it exceeds BOTH limits.

Guidelines:

- Small (10000-15000 tokens): Detailed structure
- Medium (20000-30000 tokens): Balanced (recommended)
- Large (40000+ tokens): Minimal subdivision
Tokens are counted using the specified model’s tokenizer.
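The two node-size limits combine with AND semantics: a node over the page limit but under the token limit (or vice versa) is kept whole. A minimal sketch of that decision rule (illustrative only; PageIndex's internal names will differ):

```python
# Illustrative AND-combination of the two node-size limits:
# a node is subdivided only when it exceeds BOTH limits.

def needs_subdivision(page_count: int, token_count: int,
                      max_pages: int, max_tokens: int) -> bool:
    return page_count > max_pages and token_count > max_tokens

# Over the page limit but under the token limit: kept as one node.
print(needs_subdivision(15, 5000, max_pages=10, max_tokens=20000))   # False
# Over both limits: subdivided recursively.
print(needs_subdivision(15, 30000, max_pages=10, max_tokens=20000))  # True
```

This is why raising either limit alone is enough to stop subdivision of a borderline node.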
Markdown-Only Parameters
These parameters only apply when processing Markdown documents.

Tree Thinning

Whether to apply tree thinning to merge small sections with their parents.

CLI Options: yes, no
Python Options: True, False

How It Works:

- Calculate total tokens for each node (including all descendants)
- If total < threshold, merge children into parent
- Remove child nodes from tree structure
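The three steps above amount to a post-order pass over the tree. A self-contained sketch of the idea (node layout and function names are made up for this illustration, not PageIndex's actual structures):

```python
# Illustrative tree-thinning pass. Each node is a dict with its own
# token count and a list of children; the layout is invented for this sketch.

def total_tokens(node: dict) -> int:
    """Token count of a node including all descendants."""
    return node["tokens"] + sum(total_tokens(c) for c in node["children"])

def thin(node: dict, threshold: int) -> dict:
    """Merge a node's children into it when the whole subtree is small."""
    node["children"] = [thin(c, threshold) for c in node["children"]]
    if node["children"] and total_tokens(node) < threshold:
        # Fold descendant tokens into the parent and drop the children.
        node["tokens"] = total_tokens(node)
        node["children"] = []
    return node

tree = {"tokens": 500,
        "children": [{"tokens": 800, "children": []},
                     {"tokens": 700, "children": []}]}
print(total_tokens(tree))        # 2000
thinned = thin(tree, threshold=5000)
print(thinned["children"])       # [] — small subtree merged into its root
print(thinned["tokens"])         # 2000
```

Note that the threshold compares against the subtree total, not a single node's own tokens, so a parent with many tiny children can still survive thinning if the children add up.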
Minimum token count for keeping a section and its children separate (used with tree thinning).

CLI Name: --thinning-threshold

Sections with total tokens (including descendants) below this threshold will be merged into their parent.

Guidelines:

- 2000-3000: Aggressive merging, flatter tree
- 5000-8000: Balanced approach (recommended)
- 10000+: Minimal merging, preserve structure
Token threshold below which full text is used instead of generating a summary.

CLI Name: --summary-token-threshold

For short sections, the full text is often more useful than a summary. This parameter defines that threshold.

Guidelines:

- 100-150: Only very short sections use full text
- 200-300: Balanced approach (recommended)
- 500+: Most sections get summaries instead of full text

Only applies when if_add_node_summary=yes.

Configuration Precedence
When parameters are specified in multiple places, PageIndex uses this priority order:

- CLI arguments / Function parameters (highest priority)
- Environment variables (if applicable)
- config.yaml file (lowest priority)
Example
If config.yaml sets the model and a CLI argument overrides the summary flag, the effective configuration is:

- model: gpt-4o-2024-11-20 (from config.yaml)
- if_add_node_summary: no (from CLI argument, overrides config)
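Precedence can be pictured as layered dict merging, with later (higher-priority) layers overriding earlier ones. The mechanism below is illustrative, not PageIndex's actual code; the values are the ones from the example above:

```python
# Illustrative precedence resolution: config.yaml < env vars < CLI args.

def resolve(config_yaml: dict, env: dict, cli: dict) -> dict:
    effective = {}
    for layer in (config_yaml, env, cli):  # lowest priority first
        effective.update({k: v for k, v in layer.items() if v is not None})
    return effective

effective = resolve(
    config_yaml={"model": "gpt-4o-2024-11-20", "if_add_node_summary": "yes"},
    env={},
    cli={"if_add_node_summary": "no"},  # CLI wins over config.yaml
)
print(effective["model"])                # gpt-4o-2024-11-20 (from config.yaml)
print(effective["if_add_node_summary"])  # no (from CLI)
```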
Common Configuration Patterns
Fast Processing (Minimal Enrichment)
Use Case: Quick structure extraction without summaries.
Speed: Fast
Balanced Approach (Recommended)
Use Case: Good balance of detail and processing time.
Speed: Moderate
Maximum Detail
Use Case: Complete extraction with all metadata.
Speed: Slow
Markdown with Thinning
Use Case: Simplify markdown structure by merging small sections.
Speed: Fast (fewer nodes to summarize)
Environment Variables
PageIndex uses these environment variables:

- Your OpenAI API key for LLM calls.
- Custom API endpoint (for OpenAI-compatible services).
Tips & Best Practices
Next Steps
- Try generating trees from PDFs
- Try generating trees from Markdown
- Learn about tree search strategies