Overview
PageIndex can generate hierarchical tree structures from markdown documents by parsing the heading structure. Unlike PDFs, markdown has explicit hierarchy through heading levels (#, ##, ###, etc.), making structure extraction faster and more precise.
Quick Start
Basic Usage with CLI
Running the CLI on a markdown file generates its tree structure and writes a JSON file to ./results/document_structure.json.
CLI Parameters
Required Parameters
- Path to the markdown file to process. Must have a .md or .markdown extension.
Model Configuration
- LLM model to use for summary generation (only used if --if-add-node-summary yes).
Markdown-Specific Parameters
- Whether to apply tree thinning to merge small sections. Options: yes, no. Tree thinning merges child sections into their parent if the total token count is below the threshold.
- Minimum token threshold for tree thinning. Nodes with fewer tokens (including all descendants) are merged with their parent.
- Token threshold for generating summaries. Sections shorter than this use their full text instead of a generated summary.
Content Enrichment Parameters
- Whether to add unique node IDs to each node. Options: yes, no.
- Whether to generate AI summaries for each node. Options: yes, no.
- Whether to generate an overall document description. Options: yes, no.
- Whether to include full text content in each node. Options: yes, no.
Programmatic API
Using md_to_tree() Function
The md_to_tree() function is an async function that processes markdown files, so call it with await or run it through an event loop (for example, asyncio.run()).
Function Parameters
- Path to the markdown file.
- Whether to apply tree thinning to merge small sections.
- Minimum token count for keeping sections separate (used with thinning).
- Generate AI summaries for nodes (yes/no).
- Token threshold below which full text is used instead of summary.
- LLM model identifier for summary generation.
- Generate document-level description (yes/no).
- Include full text in nodes (yes/no).
- Add unique identifiers to nodes (yes/no).
Processing Pipeline
PageIndex follows this workflow when processing markdown:
Header Extraction
Parse markdown to identify all headers (#, ##, ###, etc.):
- Skip headers inside code blocks (triple backticks)
- Record header level and line number
- Extract header text
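The header pass above can be sketched in a few lines of Python. This is a minimal illustration, not PageIndex's actual implementation: the function name extract_headers, the regular expression, and the dictionary keys are all assumptions chosen for the example.

```python
import re

FENCE = "`" * 3  # triple backticks, built this way to keep the example easy to embed
HEADER_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")  # 1-6 hashes, a space, then the title


def extract_headers(markdown_text):
    """Return one dict per header: level, title, and 1-indexed line number."""
    headers = []
    in_code_block = False
    for line_num, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith(FENCE):
            # A fence line toggles code-block state and is never a header itself.
            in_code_block = not in_code_block
            continue
        if in_code_block:
            continue
        match = HEADER_RE.match(stripped)
        if match:
            headers.append(
                {
                    "level": len(match.group(1)),
                    "title": match.group(2),
                    "line_num": line_num,
                }
            )
    return headers
```

Lines such as "#Title" (no space after the hashes) do not match this pattern, which mirrors the troubleshooting advice later in this page.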
Content Extraction
For each header, extract its associated content:
- Content starts from the header line
- Content ends at the next header of any level (or end of file)
- Store both content and line number references
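As a rough sketch of the slicing rule above, each section can be cut from its own header line up to the line before the next header. The function name attach_content and the input shape (a list of header dicts with level, title, and line_num) are assumptions made for this example.

```python
def attach_content(lines, headers):
    """Give each header the text from its own line up to the next header (or EOF)."""
    sections = []
    for i, header in enumerate(headers):
        start = header["line_num"] - 1  # line numbers are 1-indexed
        if i + 1 < len(headers):
            end = headers[i + 1]["line_num"] - 1  # stop before the next header
        else:
            end = len(lines)  # last section runs to end of file
        text = "\n".join(lines[start:end]).strip()
        sections.append({**header, "text": text})
    return sections
```

Because a section ends at the next header of any level, a parent's text here covers only the content before its first child, not the children themselves.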
Tree Thinning (Optional)
If if_thinning=True:
- Calculate token counts for each section (including all descendants)
- Merge sections below the threshold into their parent
- Update parent text to include merged child content
Tree Construction
Build hierarchical structure based on heading levels:
- Level 1 (#) becomes root nodes
- Level 2 (##) becomes children of level 1
- And so on for deeper levels
- Assign sequential node IDs
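One common way to turn a flat, ordered list of headers into a nested tree is a stack of open ancestors. The sketch below assumes that approach; the function name build_tree and the zero-padded, four-digit node_id format are illustrative assumptions, not confirmed PageIndex behavior.

```python
def build_tree(sections):
    """Nest a flat, document-ordered list of sections by heading level."""
    roots = []
    stack = []  # (level, node) pairs for the currently open ancestors
    for index, section in enumerate(sections, start=1):
        node = {
            "title": section["title"],
            "node_id": f"{index:04d}",  # sequential ID; the format is an assumption
            "line_num": section["line_num"],
            "nodes": [],
        }
        # Close every open section that is at the same level or deeper.
        while stack and stack[-1][0] >= section["level"]:
            stack.pop()
        if stack:
            stack[-1][1]["nodes"].append(node)  # child of the nearest shallower header
        else:
            roots.append(node)  # no open ancestor, so this is a top-level node
        stack.append((section["level"], node))
    return roots
```

With this scheme a header attaches to the nearest preceding header of a lower level, so a document that skips levels still produces a tree, though perhaps not the one the author intended.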
Output Format
The generated JSON structure contains the following fields.
Field Descriptions
- doc_name: Filename without extension
- doc_description: Overall document summary (if if_add_doc_description=yes)
- structure: Array of top-level sections
- title: Section heading text
- node_id: Unique identifier (if if_add_node_id=yes)
- line_num: Line number where the section starts (1-indexed)
- summary: AI-generated section summary (if if_add_node_summary=yes)
- prefix_summary: Summary of content before child sections (for parent nodes with children)
- text: Full markdown content of the section (if if_add_node_text=yes)
- nodes: Child subsections (recursive structure)
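To show how the recursive nodes field can be consumed, here is a small sketch that prints an indented outline. The result dictionary is a hypothetical example shaped after the fields described above, not real PageIndex output, and the helper name outline is an assumption.

```python
def outline(nodes, depth=0):
    """Yield one indented line per node, walking the tree depth-first."""
    for node in nodes:
        node_id = node.get("node_id", "-")  # node_id is optional in the output
        yield f"{'  ' * depth}[{node_id}] {node['title']} (line {node['line_num']})"
        yield from outline(node.get("nodes", []), depth + 1)


# Hypothetical result, for illustration only.
result = {
    "doc_name": "document",
    "structure": [
        {
            "title": "Overview",
            "node_id": "0001",
            "line_num": 1,
            "nodes": [
                {"title": "Quick Start", "node_id": "0002", "line_num": 5, "nodes": []},
            ],
        },
    ],
}

outline_lines = list(outline(result["structure"]))
```

In practice the dictionary would come from loading the generated JSON file with json.load() rather than from a literal.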
Tree Thinning
Tree thinning is a markdown-specific feature that optimizes the tree structure for retrieval.
How It Works
- Token Counting: Calculate total tokens for each section including all descendants
- Threshold Check: If a section’s total tokens < threshold, it’s a candidate for merging
- Merging: Child sections are merged into the parent’s text
- Removal: Child nodes are removed from the tree structure
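The four steps above can be sketched on a nested tree as follows. This is an illustration under stated assumptions: count_tokens is a whitespace word count standing in for a real model tokenizer, the function names are invented for the example, and PageIndex itself may apply thinning to a flat section list before the tree is built.

```python
def count_tokens(text):
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def gather_text(node):
    """Return the node's text followed by all descendant text, in document order."""
    parts = [node["text"]]
    for child in node["nodes"]:
        parts.append(gather_text(child))
    return "\n\n".join(parts)


def thin(node, threshold):
    """Collapse any subtree whose total token count is below the threshold."""
    if not node["nodes"]:
        return node  # leaves have nothing to merge
    merged = gather_text(node)
    if count_tokens(merged) < threshold:
        node["text"] = merged  # parent absorbs the text of all descendants
        node["nodes"] = []     # merged children are removed from the tree
    else:
        for child in node["nodes"]:
            thin(child, threshold)
    return node
```

A large parent is kept as is while its small subtrees are collapsed individually, so the threshold controls how fine-grained the retrieval units end up.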
Example
Before Thinning (with threshold = 5000 tokens):
When to Use Thinning
Advanced Examples
Generate with Thinning and Summaries
Full Text Extraction (No Summaries)
Complete Processing with Description
Minimal Processing (Structure Only)
Tips & Best Practices
Performance Considerations
Speed
Markdown processing is generally faster than PDF processing because:
- No OCR or PDF parsing required
- Header structure is explicit (no LLM calls for structure extraction)
- Only summaries require LLM calls (if enabled)
Cost
For a typical markdown document:
- Structure extraction: Free (no LLM calls)
- With summaries: 1 LLM call per section
- With doc description: 1 additional LLM call for the entire document
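The cost rules above translate into a simple upper bound on LLM calls for a given tree. The helper names below are assumptions for illustration. The count is an upper bound because sections shorter than the summary token threshold use their full text and would not need a call.

```python
def count_nodes(nodes):
    """Count every node in a nested structure, including all descendants."""
    return sum(1 + count_nodes(node.get("nodes", [])) for node in nodes)


def estimate_llm_calls(structure, add_summaries, add_doc_description):
    """Upper bound: one call per section for summaries, plus one for the description."""
    calls = count_nodes(structure) if add_summaries else 0
    if add_doc_description:
        calls += 1
    return calls
```

For a tree with one root and two children, enabling both options gives at most 3 + 1 = 4 calls, while structure-only processing gives 0.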
Token Calculation
PageIndex uses the model’s tokenizer to count tokens. Different models have different tokenization:
- GPT-4: ~750 words per 1000 tokens
- Claude: ~700 words per 1000 tokens
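When no tokenizer is at hand, the rough ratios above are enough for a back-of-the-envelope estimate, for example when choosing a thinning threshold. The function and dictionary names are assumptions, and the result is only an approximation of what a real tokenizer would report.

```python
# Approximate words per 1000 tokens, taken from the ratios listed above.
WORDS_PER_1000_TOKENS = {"gpt-4": 750, "claude": 700}


def estimate_tokens(text, model_family="gpt-4"):
    """Estimate the token count of text from its whitespace-separated word count."""
    words = len(text.split())
    return round(words * 1000 / WORDS_PER_1000_TOKENS[model_family])
```

By these ratios, a 3,000-word document comes to roughly 4,000 tokens for GPT-4 and slightly more for Claude.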
Comparison: Markdown vs PDF
| Feature | Markdown | PDF |
|---|---|---|
| Speed | Fast | Slower |
| Structure Extraction | Explicit from headers | Requires LLM |
| Accuracy | Very high | High (after verification) |
| Token Reference | Line numbers | Page numbers |
| Thinning | Supported | Not applicable |
| TOC Detection | Not needed | Automatic |
Troubleshooting
Headers Not Detected
If some headers are missing:
- Check that headers have a space after #: # Title, not #Title
- Ensure headers are not inside code blocks
- Verify the markdown file is properly formatted
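A quick way to find the first problem in this list is to scan for lines that start with hashes but have no space before the text. The sketch below is a hypothetical helper, not part of PageIndex. It can also flag lines that are not meant as headers, such as a hashtag at the start of a line.

```python
import re

FENCE = "`" * 3  # triple backticks
SUSPECT_RE = re.compile(r"^#{1,6}[^#\s]")  # hashes followed directly by text


def find_unspaced_headers(markdown_text):
    """Return (line_num, line) pairs that look like headers missing a space."""
    suspects = []
    in_code_block = False
    for line_num, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_code_block = not in_code_block  # skip content inside code blocks
            continue
        if not in_code_block and SUSPECT_RE.match(stripped):
            suspects.append((line_num, stripped))
    return suspects
```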
Incorrect Hierarchy
If the tree structure seems wrong:
- Check your heading levels are consistent (don’t skip levels)
- Example: Don’t go from # to ### without ##
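Skipped levels can be detected mechanically from the ordered list of header levels. This sketch assumes headers are dicts with level, title, and line_num keys, and the function name is invented for the example. It also treats a document that opens with a level deeper than # as a skip.

```python
def find_skipped_levels(headers):
    """Return headers that jump more than one level deeper than the previous header."""
    problems = []
    previous_level = 0  # before the first header, so a level-1 start is valid
    for header in headers:
        if header["level"] > previous_level + 1:
            problems.append(header)
        previous_level = header["level"]
    return problems
```

Moving back up by several levels at once (for example from ### to #) is normal and is not reported.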
File Not Found Error
Make sure:
- The file path is correct
- The file has a .md or .markdown extension
- You have read permissions for the file
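The three checks above can be run before calling PageIndex. The helper below is a hypothetical pre-flight check built only on the Python standard library. The case-insensitive extension match is an assumption and may differ from how PageIndex validates paths.

```python
import os
from pathlib import Path


def check_markdown_path(path):
    """Return a list of problems with the given path; an empty list means it looks fine."""
    p = Path(path)
    problems = []
    if not p.is_file():
        problems.append("file not found")
    if p.suffix.lower() not in {".md", ".markdown"}:
        problems.append("extension is not .md or .markdown")
    if p.is_file() and not os.access(p, os.R_OK):
        problems.append("no read permission")
    return problems
```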
Next Steps
- Learn about PDF processing
- Explore all configuration options
- Implement tree search strategies for retrieval