Function Signature
pageindex/page_index_md.py:243
Description
The md_to_tree() function generates a PageIndex tree structure from Markdown documents. It analyzes Markdown headers (#, ##, ###, etc.) to build a hierarchical structure similar to the one produced by PDF processing, but optimized for Markdown’s inherent structure.
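To make the header-to-hierarchy idea concrete, here is a minimal sketch of how # headers can be turned into a nested tree. It is not the PageIndex implementation: the function name build_header_tree and the node keys (title, level, text, nodes) are assumptions chosen for illustration.

```python
import re

# ATX-style header: 1 to 6 '#' characters, whitespace, then the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def build_header_tree(markdown_text):
    """Return a list of root nodes built from Markdown headers.

    Hypothetical helper for illustration only. Each node is a dict with
    'title', 'level', 'text', and 'nodes' (children). Text that appears
    before the first header is ignored in this sketch.
    """
    roots = []
    stack = []  # currently open nodes, outermost first
    in_fence = False
    for line in markdown_text.splitlines():
        # Lines inside fenced code blocks are never treated as headers.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADER_RE.match(line)
        if match is None:
            if stack:
                stack[-1]["text"] += line + "\n"
            continue
        node = {
            "title": match.group(2),
            "level": len(match.group(1)),
            "text": "",
            "nodes": [],
        }
        # Close every open node at the same or a deeper level.
        while stack and stack[-1]["level"] >= node["level"]:
            stack.pop()
        (stack[-1]["nodes"] if stack else roots).append(node)
        stack.append(node)
    return roots
```

The stack keeps track of the currently open ancestors, so each new header attaches to the nearest preceding header with a smaller level.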
Important: This function is asynchronous and must be called with asyncio.run() or await.
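The two calling styles can be sketched as follows. The coroutine md_to_tree_stub is a hypothetical stand-in so the example runs without PageIndex installed, and its return value is made up; in real code you would call md_to_tree() in its place.

```python
import asyncio


async def md_to_tree_stub(md_path):
    # Hypothetical stand-in for md_to_tree(); the returned dict is invented
    # for this example and does not reflect the real return value.
    await asyncio.sleep(0)
    return {"source": md_path}


def run_from_sync_code(md_path):
    # From ordinary synchronous code, asyncio.run() starts an event loop,
    # runs the coroutine to completion, and returns its result.
    return asyncio.run(md_to_tree_stub(md_path))


async def run_from_async_code(md_paths):
    # Inside a coroutine, use await. asyncio.gather() runs several calls
    # concurrently and returns their results in input order.
    return await asyncio.gather(*(md_to_tree_stub(p) for p in md_paths))
```

Note that asyncio.run() cannot be called from code that is already running inside an event loop (for example, a Jupyter notebook cell); use await there instead.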
Markdown files must use a proper header hierarchy with # symbols. If your Markdown was converted from PDF or HTML, ensure the conversion tool preserved the original heading structure. For best results with converted documents, use PageIndex OCR.
Parameters
Path to the Markdown file. Must be a valid file path ending in .md or .markdown.
Whether to apply tree thinning. When enabled, small nodes (below min_token_threshold) are merged with their parent nodes to simplify the tree structure.
Minimum token count for a node when thinning is enabled. Nodes with fewer tokens are merged with their parents. Only used if if_thinning=True.
Whether to generate AI summaries for each node. Valid values: "yes" or "no". For leaf nodes, creates a summary field. For parent nodes, creates a prefix_summary field.
Token threshold for summary generation. If a node’s text is shorter than this threshold, the full text is used instead of generating a summary. Only used if if_add_node_summary="yes".
OpenAI model to use for summary generation and token counting. Examples: "gpt-4o-2024-11-20", "gpt-4o", "gpt-4.1"
Whether to generate a one-sentence description for the entire document. Valid values: "yes" or "no". Only works if if_add_node_summary="yes".
Whether to include full text content in each node. Valid values: "yes" or "no"
Whether to add sequential node IDs to the tree structure. Valid values: "yes" or "no"
Return Value
Dictionary containing the Markdown structure:
Example Usage
Basic Usage
With Summaries and Description
With Tree Thinning
Full-Featured Processing
Async Context Usage
Markdown Format Requirements
Valid Markdown:
Tree Thinning Behavior
When if_thinning=True:
- Calculates the token count for each node (including all descendants)
- If a node has fewer tokens than min_token_threshold:
  - Merges all child content into the parent node
  - Removes the child nodes from the tree
  - Updates the parent’s text and token count
- Processes from leaves to root to ensure accurate merging
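The steps above can be sketched on a nested tree as follows. This is an illustration under stated assumptions, not the library's code: thin_tree and count_tokens are hypothetical names, nodes are assumed to be dicts with text and nodes keys, and a whitespace word count stands in for the model tokenizer.

```python
def count_tokens(text):
    # Assumption: a whitespace word count stands in for a real tokenizer.
    return len(text.split())


def thin_tree(node, min_token_threshold):
    """Thin a node in place and return its total token count.

    Hypothetical helper for illustration. The total includes all
    descendants. Children are processed first, so merging proceeds
    from the leaves toward the root.
    """
    total = count_tokens(node["text"])
    for child in node["nodes"]:
        total += thin_tree(child, min_token_threshold)

    if node["nodes"] and total < min_token_threshold:
        # Small subtree: fold the children's text into this node and
        # drop the children. Any child with descendants was already
        # collapsed, because its own total is below the threshold too.
        parts = [node["text"]] + [child["text"] for child in node["nodes"]]
        node["text"] = "\n".join(part for part in parts if part)
        node["nodes"] = []

    node["token_count"] = total
    return total
```

In this sketch the titles of merged children are discarded. A real implementation may prefer to keep them inside the merged text so the section boundaries remain visible.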
Performance Notes
- Processing is much faster than PDF processing (no OCR or complex parsing)
- Summary generation is the slowest step
- Token counting uses the specified model’s tokenizer
- Large documents may take several minutes if summaries are enabled
See Also
- page_index() - For PDF documents
- CLI Reference - Command-line interface for Markdown
- ConfigLoader - Configuration system