
Overview

PageIndex can be configured through:
  1. CLI arguments when using run_pageindex.py
  2. Function parameters when using the Python API
  3. Config file (config.yaml) for default values

Configuration Methods

CLI Configuration

python run_pageindex.py --pdf_path document.pdf --model gpt-4o-2024-11-20 --if-add-node-summary yes

Programmatic Configuration (PDF)

from pageindex import page_index

result = page_index(
    doc='document.pdf',
    model='gpt-4o-2024-11-20',
    if_add_node_summary='yes'
)

Programmatic Configuration (Markdown)

import asyncio
from pageindex.page_index_md import md_to_tree

result = asyncio.run(md_to_tree(
    md_path='document.md',
    model='gpt-4o-2024-11-20',
    if_add_node_summary='yes'
))
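Both entry points return a nested tree. As a quick orientation, here is a minimal sketch of walking such a tree, assuming the output shape shown later in this guide (a top-level `structure` list of nodes with `title` fields) and assuming child sections live under a `nodes` key:

```python
def walk(nodes, depth=0):
    """Recursively yield (depth, title) pairs from a PageIndex-style tree."""
    for node in nodes:
        yield depth, node.get("title", "")
        # child sections are assumed to live under a "nodes" key
        yield from walk(node.get("nodes", []), depth + 1)

# Hypothetical output resembling the structures shown in this guide
sample = {
    "doc_name": "document",
    "structure": [
        {"title": "Introduction", "node_id": "0001", "nodes": []},
        {"title": "Methods", "node_id": "0002", "nodes": [
            {"title": "Data", "node_id": "0003", "nodes": []},
        ]},
    ],
}

for depth, title in walk(sample["structure"]):
    print("  " * depth + title)
```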

Default Configuration File

PageIndex reads defaults from pageindex/config.yaml:
model: "gpt-4o-2024-11-20"
toc_check_page_num: 20
max_page_num_each_node: 10
max_token_num_each_node: 20000
if_add_node_id: "yes"
if_add_node_summary: "yes"
if_add_doc_description: "no"
if_add_node_text: "no"

Universal Parameters

These parameters apply to both PDF and Markdown processing.

Model Configuration

model
string
default:"gpt-4o-2024-11-20"
The LLM model to use for structure extraction (PDF) and summary generation.
Supported Models:
  • gpt-4o-2024-11-20 (recommended)
  • gpt-4o
  • gpt-4-turbo
  • Any OpenAI-compatible model endpoint
CLI:
python run_pageindex.py --pdf_path doc.pdf --model gpt-4o-2024-11-20
Python (PDF):
page_index('doc.pdf', model='gpt-4o-2024-11-20')
Python (Markdown):
md_to_tree('doc.md', model='gpt-4o-2024-11-20')

Content Enrichment

if_add_node_id
string
default:"yes"
Whether to add unique node IDs to each section in the tree.
Options: yes, no
Example Output with yes:
{
  "title": "Introduction",
  "node_id": "0001"
}
CLI:
python run_pageindex.py --pdf_path doc.pdf --if-add-node-id yes
Python:
page_index('doc.pdf', if_add_node_id='yes')
Node IDs are essential for retrieval systems to reference specific sections.
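For example, a retrieval system might build a flat lookup table from IDs to nodes. This is a sketch, assuming children live under a `nodes` key (the `title` and `node_id` fields match the example output above):

```python
def index_by_node_id(nodes, index=None):
    """Build a flat {node_id: node} map from a PageIndex-style tree."""
    if index is None:
        index = {}
    for node in nodes:
        if "node_id" in node:
            index[node["node_id"]] = node
        # recurse into children, assumed to live under "nodes"
        index_by_node_id(node.get("nodes", []), index)
    return index

structure = [
    {"title": "Introduction", "node_id": "0001", "nodes": [
        {"title": "Background", "node_id": "0002", "nodes": []},
    ]},
]

lookup = index_by_node_id(structure)
print(lookup["0002"]["title"])  # Background
```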
if_add_node_summary
string
default:"yes"
Whether to generate AI-powered summaries for each node.
Options: yes, no
Example Output:
{
  "title": "Methodology",
  "summary": "This section describes the research methodology...",
  "prefix_summary": "Overview content before subsections..." // for parent nodes
}
CLI:
python run_pageindex.py --pdf_path doc.pdf --if-add-node-summary yes
Python:
page_index('doc.pdf', if_add_node_summary='yes')
Enabling summaries increases processing time and API costs significantly (1 LLM call per node).
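Since summarization costs roughly one LLM call per node, counting nodes in a tree gives a cost estimate before enabling the option. A small sketch, again assuming children live under a `nodes` key:

```python
def count_nodes(nodes):
    """Count every node in a PageIndex-style tree (children assumed under 'nodes')."""
    return sum(1 + count_nodes(n.get("nodes", [])) for n in nodes)

structure = [
    {"title": "A", "nodes": [{"title": "A.1", "nodes": []},
                             {"title": "A.2", "nodes": []}]},
    {"title": "B", "nodes": []},
]

# With if_add_node_summary=yes, expect roughly one summary call per node
print(count_nodes(structure))  # 4
```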
if_add_doc_description
string
default:"no"
Whether to generate an overall description of the entire document.
Options: yes, no
Example Output:
{
  "doc_name": "research_paper",
  "doc_description": "This document presents a comprehensive study on...",
  "structure": [...]
}
CLI:
python run_pageindex.py --pdf_path doc.pdf --if-add-doc-description yes
Python:
page_index('doc.pdf', if_add_doc_description='yes')
Requires if_add_node_summary=yes to function.
if_add_node_text
string
default:"no"
Whether to include the full text content of each section in the output.
Options: yes, no
Example Output:
{
  "title": "Introduction",
  "text": "This is the full text content of the introduction section..."
}
CLI:
python run_pageindex.py --pdf_path doc.pdf --if-add-node-text yes
Python:
page_index('doc.pdf', if_add_node_text='yes')
Including full text significantly increases output file size.

PDF-Only Parameters

These parameters only apply when processing PDF documents.

TOC Detection

toc_check_page_num
integer
default:"20"
Number of pages to scan from the beginning for table of contents detection.
CLI Name: --toc-check-pages
Use Cases:
  • Default (20): Works for most documents
  • 30-50: Documents with long, multi-page TOCs
  • 5-10: Short documents or those without TOCs
CLI:
python run_pageindex.py --pdf_path doc.pdf --toc-check-pages 30
Python:
page_index('doc.pdf', toc_check_page_num=30)

Node Size Limits

max_page_num_each_node
integer
default:"10"
Maximum number of pages allowed per node before recursive subdivision.
CLI Name: --max-pages-per-node
When a node exceeds this page count AND the token limit, PageIndex will:
  1. Extract sub-structure from that node
  2. Recursively subdivide until all nodes are within limits
Guidelines:
  • Small (5-8 pages): Fine-grained retrieval, more nodes
  • Medium (10-15 pages): Balanced approach (recommended)
  • Large (20+ pages): Faster processing, coarser structure
CLI:
python run_pageindex.py --pdf_path doc.pdf --max-pages-per-node 15
Python:
page_index('doc.pdf', max_page_num_each_node=15)
max_token_num_each_node
integer
default:"20000"
Maximum number of tokens allowed per node before recursive subdivision.
CLI Name: --max-tokens-per-node
This works in conjunction with max_page_num_each_node. A node is subdivided if it exceeds BOTH limits.
Guidelines:
  • Small (10000-15000 tokens): Detailed structure
  • Medium (20000-30000 tokens): Balanced (recommended)
  • Large (40000+ tokens): Minimal subdivision
CLI:
python run_pageindex.py --pdf_path doc.pdf --max-tokens-per-node 25000
Python:
page_index('doc.pdf', max_token_num_each_node=25000)
Tokens are counted using the specified model’s tokenizer.
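The "both limits" rule can be stated as a simple predicate. This is an illustration of the rule as described above, not the library's internal implementation:

```python
def needs_subdivision(page_count, token_count,
                      max_page_num_each_node=10, max_token_num_each_node=20000):
    """A node is subdivided only when it exceeds BOTH the page and token limits."""
    return (page_count > max_page_num_each_node
            and token_count > max_token_num_each_node)

print(needs_subdivision(15, 30000))  # True: over both limits
print(needs_subdivision(15, 12000))  # False: over pages, but under tokens
```

Because both conditions must hold, raising either limit alone is enough to suppress subdivision for a given node.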

Markdown-Only Parameters

These parameters only apply when processing Markdown documents.

Tree Thinning

if_thinning
string | boolean
default:"no"
Whether to apply tree thinning to merge small sections with their parents.
CLI Options: yes, no
Python Options: True, False
How It Works:
  1. Calculate total tokens for each node (including all descendants)
  2. If total < threshold, merge children into parent
  3. Remove child nodes from tree structure
CLI:
python run_pageindex.py --md_path doc.md --if-thinning yes
Python:
md_to_tree('doc.md', if_thinning=True)
Use thinning for documents with many small subsections that should be treated as cohesive units.
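The three thinning steps can be sketched as follows. This is a simplified illustration, not the library's implementation: the per-node `token_count` field is hypothetical, children are assumed under a `nodes` key, and merging is reduced to dropping the subtree (in practice the merged text would be folded into the parent):

```python
def total_tokens(node):
    """Step 1: token count of a node including all descendants."""
    return node.get("token_count", 0) + sum(
        total_tokens(c) for c in node.get("nodes", []))

def thin(nodes, min_token_threshold=5000):
    """Steps 2-3: if a subtree is below the threshold, merge it into the parent."""
    for node in nodes:
        if total_tokens(node) < min_token_threshold:
            node["nodes"] = []  # simplified merge: parent keeps the content
        else:
            thin(node.get("nodes", []), min_token_threshold)
    return nodes

tree = [{"title": "Guide", "token_count": 4000, "nodes": [
    {"title": "Setup", "token_count": 800, "nodes": []},
]}]
thin(tree, min_token_threshold=5000)
print(tree[0]["nodes"])  # [] -- the 4800-token subtree was merged
```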
min_token_threshold
integer
default:"5000"
Minimum token count for keeping a section and its children separate (used with tree thinning).
CLI Name: --thinning-threshold
Sections with total tokens (including descendants) below this threshold will be merged into their parent.
Guidelines:
  • 2000-3000: Aggressive merging, flatter tree
  • 5000-8000: Balanced approach (recommended)
  • 10000+: Minimal merging, preserve structure
CLI:
python run_pageindex.py --md_path doc.md --if-thinning yes --thinning-threshold 3000
Python:
md_to_tree('doc.md', if_thinning=True, min_token_threshold=3000)
summary_token_threshold
integer
default:"200"
Token threshold below which full text is used instead of generating a summary.
CLI Name: --summary-token-threshold
For short sections, the full text is often more useful than a summary. This parameter defines that threshold.
Guidelines:
  • 100-150: Only very short sections use full text
  • 200-300: Balanced approach (recommended)
  • 500+: Most sections get summaries instead of full text
CLI:
python run_pageindex.py --md_path doc.md --summary-token-threshold 250
Python:
md_to_tree('doc.md', summary_token_threshold=250)
Only applies when if_add_node_summary=yes.
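The decision rule can be sketched as below. This is an illustration of the described behavior, not the library's code; the `token_count` field and the `summarize` stub (standing in for the real LLM call) are hypothetical:

```python
def annotate(node, summary_token_threshold=200,
             summarize=lambda text: text[:50] + "..."):
    """Attach either the full text or a (stubbed) summary, per the threshold.

    `summarize` is a placeholder for the real LLM summarization call.
    """
    if node["token_count"] < summary_token_threshold:
        node["summary"] = node["text"]          # short section: keep verbatim
    else:
        node["summary"] = summarize(node["text"])  # long section: summarize
    return node

short = annotate({"text": "Hello world", "token_count": 50})
print(short["summary"])  # Hello world -- full text reused as the summary
```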

Configuration Precedence

When parameters are specified in multiple places, PageIndex uses this priority order:
  1. CLI arguments / Function parameters (highest priority)
  2. Environment variables (if applicable)
  3. config.yaml file (lowest priority)

Example

If config.yaml has:
model: "gpt-4o-2024-11-20"
if_add_node_summary: "yes"
And you run:
python run_pageindex.py --pdf_path doc.pdf --if-add-node-summary no
Result:
  • model: gpt-4o-2024-11-20 (from config.yaml)
  • if_add_node_summary: no (from CLI argument, overrides config)
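The precedence rule amounts to a layered dictionary merge, with later updates overriding earlier ones. A minimal sketch of the idea (not PageIndex's actual resolution code):

```python
def resolve_config(cli_args, env_vars, file_defaults):
    """Merge config sources: CLI > environment > config.yaml."""
    merged = dict(file_defaults)                                  # lowest priority
    merged.update({k: v for k, v in env_vars.items() if v is not None})
    merged.update({k: v for k, v in cli_args.items() if v is not None})  # highest
    return merged

file_defaults = {"model": "gpt-4o-2024-11-20", "if_add_node_summary": "yes"}
cli_args = {"if_add_node_summary": "no"}  # from --if-add-node-summary no

config = resolve_config(cli_args, {}, file_defaults)
print(config["model"])                # gpt-4o-2024-11-20 (from config.yaml)
print(config["if_add_node_summary"])  # no (CLI overrides config.yaml)
```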

Common Configuration Patterns

Fast Processing (Minimal Enrichment)

Use Case: Quick structure extraction without summaries.
python run_pageindex.py \
  --pdf_path doc.pdf \
  --if-add-node-summary no \
  --if-add-node-id yes
Cost: Minimal (structure extraction only)
Speed: Fast

Balanced Processing (Recommended)

Use Case: Good balance of detail and processing time.
python run_pageindex.py \
  --pdf_path doc.pdf \
  --if-add-node-summary yes \
  --if-add-node-id yes \
  --max-pages-per-node 10 \
  --max-tokens-per-node 20000
Cost: Moderate (1 LLM call per node)
Speed: Moderate

Maximum Detail

Use Case: Complete extraction with all metadata.
python run_pageindex.py \
  --pdf_path doc.pdf \
  --if-add-node-summary yes \
  --if-add-doc-description yes \
  --if-add-node-text yes \
  --if-add-node-id yes \
  --max-pages-per-node 5 \
  --max-tokens-per-node 15000
Cost: High (many LLM calls)
Speed: Slow

Markdown with Thinning

Use Case: Simplify markdown structure by merging small sections.
python run_pageindex.py \
  --md_path doc.md \
  --if-thinning yes \
  --thinning-threshold 3000 \
  --if-add-node-summary yes \
  --summary-token-threshold 200
Cost: Moderate
Speed: Fast (fewer nodes to summarize)

Environment Variables

PageIndex uses these environment variables:
OPENAI_API_KEY
string
required
Your OpenAI API key for LLM calls.
export OPENAI_API_KEY="sk-..."
OPENAI_API_BASE
string
Custom API endpoint (for OpenAI-compatible services).
export OPENAI_API_BASE="https://api.custom-llm.com/v1"

Tips & Best Practices

Start Simple: Begin with default settings, then adjust based on results.
Token Limits: Keep max_token_num_each_node under your model’s context window (typically 128K for GPT-4).
Cost Control: Enabling if_add_node_summary=yes can generate hundreds of LLM calls for large documents. Monitor your API usage.
Output Size: If your JSON output is too large, consider:
  • Setting if_add_node_text=no
  • Increasing node size limits to create fewer nodes
  • Using summaries instead of full text
Markdown Thinning: Experiment with thinning-threshold values. Check the output structure after thinning to ensure it meets your needs.
