
Function Signature

async def md_to_tree(
    md_path,
    if_thinning=False,
    min_token_threshold=None,
    if_add_node_summary='no',
    summary_token_threshold=None,
    model=None,
    if_add_doc_description='no',
    if_add_node_text='no',
    if_add_node_id='yes'
)
Source: pageindex/page_index_md.py:243

Description

The md_to_tree() function generates a PageIndex tree structure from Markdown documents. It analyzes Markdown headers (#, ##, ###, etc.) to build a hierarchical structure similar to the PDF processing, but optimized for Markdown’s inherent structure. Important: This function is asynchronous and must be called with asyncio.run() or await.
Markdown files must use proper header hierarchy with # symbols. If your Markdown was converted from PDF or HTML, ensure the conversion tool preserved the original heading structure. For best results with converted documents, use PageIndex OCR.

Parameters

md_path
str
required
Path to the Markdown file. Must be a valid file path ending in .md or .markdown.
if_thinning
bool
default:"False"
Whether to apply tree thinning. When enabled, small nodes (below min_token_threshold) are merged with their parent nodes to simplify the tree structure.
min_token_threshold
int
default:"None"
Minimum token count for a node when thinning is enabled. Nodes with fewer tokens are merged with their parents. Only used if if_thinning=True.
if_add_node_summary
str
default:"no"
Whether to generate AI summaries for each node. Valid values: "yes" or "no". For leaf nodes, creates summary field. For parent nodes, creates prefix_summary field.
summary_token_threshold
int
default:"None"
Token threshold for summary generation. If a node’s text is shorter than this threshold, the full text is used instead of generating a summary. Only used if if_add_node_summary="yes".
model
str
default:"None"
OpenAI model to use for summary generation and token counting. Examples: "gpt-4o-2024-11-20", "gpt-4o", "gpt-4.1".
if_add_doc_description
str
default:"no"
Whether to generate a one-sentence description for the entire document. Valid values: "yes" or "no". Only works if if_add_node_summary="yes".
if_add_node_text
str
default:"no"
Whether to include full text content in each node. Valid values: "yes" or "no".
if_add_node_id
str
default:"yes"
Whether to add sequential node IDs to the tree structure. Valid values: "yes" or "no".
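For illustration, sequential IDs could be assigned with a pre-order walk like the sketch below (hypothetical code, not the library's implementation; the nodes child key and the zero-padded format are assumptions):

```python
# Hypothetical sketch of sequential node-ID assignment via a pre-order walk.
# The "nodes" child key and zero-padded format are assumptions.
def add_node_ids(nodes, counter=None):
    if counter is None:
        counter = [0]
    for node in nodes:
        node["node_id"] = f"{counter[0]:04d}"  # e.g. "0000", "0001", ...
        counter[0] += 1
        add_node_ids(node.get("nodes", []), counter)  # descend into children
    return nodes

tree = [
    {"title": "Chapter 1", "nodes": [{"title": "Section 1.1"}]},
    {"title": "Chapter 2"},
]
add_node_ids(tree)
```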

Return Value

result
dict
Dictionary containing the Markdown tree structure: the document name, an optional document description (when requested), and the hierarchical list of section nodes.
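Judging from the examples below, a result resembles the following sketch. Field presence depends on the if_add_* flags, and the nodes key name is an assumption, so treat this as illustrative rather than a guaranteed schema:

```python
# Illustrative result shape (not a guaranteed schema).
result = {
    "doc_name": "document.md",
    "doc_description": "One-sentence description",  # only if if_add_doc_description="yes"
    "structure": [
        {
            "title": "Chapter 1: Introduction",
            "line_num": 1,
            "node_id": "0000",   # if if_add_node_id="yes" (the default)
            "summary": "...",    # leaf nodes, when summaries are enabled
            # parent nodes get "prefix_summary" instead of "summary"
            # "text": "...",     # if if_add_node_text="yes"
            "nodes": [],         # child nodes (key name assumed)
        },
    ],
}
```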

Example Usage

Basic Usage

import asyncio
from pageindex.page_index_md import md_to_tree

# Process Markdown with default settings
result = asyncio.run(md_to_tree(
    md_path="document.md"
))

print(f"Document: {result['doc_name']}")
for node in result['structure']:
    print(f"- {node['title']} (line {node['line_num']})")

With Summaries and Description

import asyncio
from pageindex.page_index_md import md_to_tree

result = asyncio.run(md_to_tree(
    md_path="documentation.md",
    model="gpt-4o-2024-11-20",
    if_add_node_summary="yes",
    summary_token_threshold=200,
    if_add_doc_description="yes"
))

print(f"Document: {result['doc_name']}")
print(f"Description: {result.get('doc_description', 'N/A')}")

for node in result['structure']:
    summary = node.get('summary') or node.get('prefix_summary', 'N/A')
    print(f"\n{node['title']}")
    print(f"Summary: {summary}")
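The summary_token_threshold behavior used above can be sketched as follows (hypothetical code: count_tokens and summarize are stand-ins for the real tokenizer and LLM call):

```python
# Sketch of the summary_token_threshold decision (hypothetical code).
def count_tokens(text):
    return len(text.split())  # crude stand-in for a model tokenizer

def summarize(text):
    return text[:50] + "..."  # placeholder for an actual LLM summary

def node_summary(text, summary_token_threshold=200):
    if count_tokens(text) < summary_token_threshold:
        return text  # below the threshold: the full text is used as-is
    return summarize(text)

short_text = "Just a brief section."
long_text = " ".join(["word"] * 300)
```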

With Tree Thinning

import asyncio
from pageindex.page_index_md import md_to_tree

# Merge small sections into parents
result = asyncio.run(md_to_tree(
    md_path="large_document.md",
    if_thinning=True,
    min_token_threshold=5000,
    model="gpt-4o-2024-11-20"
))

print(f"Generated {len(result['structure'])} top-level sections")

Full Configuration

import asyncio
from pageindex.page_index_md import md_to_tree

result = asyncio.run(md_to_tree(
    md_path="technical_manual.md",
    if_thinning=True,
    min_token_threshold=3000,
    if_add_node_summary="yes",
    summary_token_threshold=200,
    model="gpt-4o-2024-11-20",
    if_add_doc_description="yes",
    if_add_node_text="yes",
    if_add_node_id="yes"
))

# Save to JSON
import json
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2, ensure_ascii=False)

Async Context Usage

import asyncio
from pageindex.page_index_md import md_to_tree

async def process_multiple_docs():
    # Process multiple Markdown files concurrently
    tasks = [
        md_to_tree("doc1.md", if_add_node_summary="yes", model="gpt-4o"),
        md_to_tree("doc2.md", if_add_node_summary="yes", model="gpt-4o"),
        md_to_tree("doc3.md", if_add_node_summary="yes", model="gpt-4o")
    ]
    
    results = await asyncio.gather(*tasks)
    return results

# Run
import asyncio
results = asyncio.run(process_multiple_docs())
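When processing many files, one failure aborts asyncio.gather by default; passing return_exceptions=True lets the remaining documents finish. A sketch with a stand-in for md_to_tree (safe_process is hypothetical):

```python
import asyncio

async def safe_process(path):
    # Stand-in for md_to_tree(path, ...); fails for a missing file.
    if path == "missing.md":
        raise FileNotFoundError(path)
    return {"doc_name": path}

async def main():
    tasks = [safe_process(p) for p in ["doc1.md", "missing.md", "doc3.md"]]
    # Failures come back as exception objects instead of cancelling
    # the other tasks.
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(main())
succeeded = [r for r in results if not isinstance(r, Exception)]
```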

Markdown Format Requirements

Headers must follow standard Markdown syntax:
  • Use # for level 1 headers
  • Use ## for level 2 headers
  • Use ### for level 3 headers, etc.
  • Headers inside code blocks (triple backticks) are ignored
Valid Markdown:
# Chapter 1: Introduction

Some content here.

## Section 1.1: Background

More content.

### Subsection 1.1.1: History

Detailed content.
Invalid/Problematic:
Chapter 1: Introduction  <!-- Setext-style underline header; use # instead -->
=======================

Content with random #hashtags  <!-- Not a header -->
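The rules above can be sketched as a fence-aware header scan (hypothetical code; the library's actual parser may differ):

```python
import re

# Minimal sketch of ATX header detection that skips fenced code blocks.
def find_headers(md_text):
    headers, in_fence = [], False
    for line_num, line in enumerate(md_text.splitlines(), start=1):
        if line.strip().startswith("```"):
            in_fence = not in_fence  # toggle on each fence delimiter
            continue
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m and not in_fence:
            # (line number, header level, title)
            headers.append((line_num, len(m.group(1)), m.group(2).strip()))
    return headers

doc = "# Title\n```\n# not a header\n```\n## Section"
find_headers(doc)  # → [(1, 1, 'Title'), (5, 2, 'Section')]
```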

Tree Thinning Behavior

When if_thinning=True:
  1. Calculates token count for each node (including all descendants)
  2. If a node has fewer tokens than min_token_threshold:
    • Merges all child content into parent node
    • Removes child nodes from tree
    • Updates parent’s text and token count
  3. Processes from leaves to root to ensure accurate merging
Example:
# Before thinning (small subsections)
# - Chapter 1 (1000 tokens)
#   - Section 1.1 (200 tokens)
#   - Section 1.2 (150 tokens)
#
# After thinning (min_token_threshold=500)
# - Chapter 1 (1350 tokens, includes 1.1 and 1.2 text)
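One way to read the steps above as code (a hypothetical sketch, not the library's implementation; the text, tokens, and nodes field names are assumptions):

```python
# Hypothetical thinning pass: process leaves first, then merge any child
# smaller than the threshold into its parent.
def thin(node, min_token_threshold):
    kept = []
    for child in node.get("nodes", []):
        thin(child, min_token_threshold)  # leaves-to-root order
        if child["tokens"] < min_token_threshold:
            # Fold the small child's text and token count into the parent.
            node["text"] += child["text"]
            node["tokens"] += child["tokens"]
        else:
            kept.append(child)
    node["nodes"] = kept
    return node

chapter = {"title": "Chapter 1", "text": "intro ", "tokens": 1000,
           "nodes": [{"title": "1.1", "text": "a ", "tokens": 200, "nodes": []},
                     {"title": "1.2", "text": "b ", "tokens": 150, "nodes": []}]}
thin(chapter, min_token_threshold=500)
# chapter now has 1350 tokens and no children, matching the example above
```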

Performance Notes

  • Processing is much faster than PDF (no OCR or complex parsing)
  • Summary generation is the slowest step
  • Token counting uses the specified model’s tokenizer
  • Large documents may take several minutes if summaries are enabled
