Skip to main content

Overview

Once you’ve generated a PageIndex tree structure, you need effective strategies to search and retrieve relevant sections. Unlike traditional vector-based RAG (Retrieval Augmented Generation), PageIndex leverages hierarchical structure for more targeted retrieval.
  • Structural Awareness: Understand document hierarchy and relationships
  • Explainable Results: Clear reasoning for why sections were selected
  • Context Preservation: Retrieve parent/child sections together when needed
  • Expert Knowledge Integration: Easily incorporate domain-specific search preferences
  • No Embedding Drift: Structure is explicit, not dependent on embedding model semantics
Tree search excels when:
  • Documents have clear hierarchical structure
  • Section relationships matter for understanding
  • You need to explain retrieval decisions
  • Domain expertise can guide search priority
The simplest strategy is to use an LLM to analyze the tree structure and select relevant nodes.

Simple Tree Search Prompt

from pageindex import page_index
import json

# Generate tree structure
result = page_index('document.pdf', if_add_node_summary='yes')
tree_structure = result['structure']

# Search query
query = "What are the key findings about customer retention?"

# Tree search prompt
prompt = f"""
You are given a query and the tree structure of a document.
You need to find all nodes that are likely to contain the answer.

Query: {query}

Document tree structure: {json.dumps(tree_structure, indent=2)}

Reply in the following JSON format:
{{
  "thinking": <your reasoning about which nodes are relevant>,
  "node_list": [node_id1, node_id2, ...]
}}
"""

# Call LLM (example with OpenAI).
# NOTE: openai.ChatCompletion was removed in openai>=1.0 — use the client
# API, matching the style used in the other examples in this document.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[{"role": "user", "content": prompt}],
    # Forces a JSON object reply so json.loads below cannot fail on prose.
    response_format={"type": "json_object"}
)

# Distinct name so the page_index() result above is not shadowed.
search_result = json.loads(response.choices[0].message.content)
selected_nodes = search_result['node_list']
print(f"Selected nodes: {selected_nodes}")
print(f"Reasoning: {search_result['thinking']}")

Retrieving Node Content

Once you have selected node IDs, retrieve their content:
def find_node_by_id(structure, node_id):
    """Depth-first search for the node whose 'node_id' equals *node_id*.

    Returns the matching node dict, or None when no node matches.
    """
    for candidate in structure:
        if candidate.get('node_id') == node_id:
            return candidate
        # Descend into children (absent 'nodes' key means a leaf).
        match = find_node_by_id(candidate.get('nodes', []), node_id)
        if match is not None:
            return match
    return None

# Retrieve the content of every node the LLM selected.
selected_content = []
for chosen_id in selected_nodes:
    match = find_node_by_id(tree_structure, chosen_id)
    if match is None:
        continue  # skip IDs not present in the tree
    selected_content.append({
        'node_id': chosen_id,
        'title': match['title'],
        'summary': match.get('summary', ''),
        'text': match.get('text', '')  # if you generated with --if-add-node-text yes
    })

print(json.dumps(selected_content, indent=2))

Advanced Search Strategies

1. Hierarchical Expansion

When a parent node is relevant, include its children:
def get_node_with_children(structure, node_id):
    """Get a node and all its descendants.

    Looks up *node_id* anywhere in *structure* (via find_node_by_id), then
    expands the full subtree rooted at that node.

    Args:
        structure: List of PageIndex tree nodes (each may hold child 'nodes').
        node_id: ID of the subtree root to expand.

    Returns:
        Nested dict with 'node_id', 'title', 'summary' and 'children',
        or None when the ID is not found.
    """
    node = find_node_by_id(structure, node_id)
    if not node:
        return None
    return _expand_subtree(node)


def _expand_subtree(node):
    """Convert one tree node into the nested result shape, recursing on children."""
    # Recurse on each child node directly. The original wrapped every child
    # in a one-element list and re-searched for it by ID — a redundant
    # lookup per child that this helper removes.
    return {
        'node_id': node['node_id'],
        'title': node['title'],
        'summary': node.get('summary', ''),
        'children': [_expand_subtree(child) for child in node.get('nodes', [])]
    }

# Get node with full subtree ('0003' is an example PageIndex node ID).
expanded_node = get_node_with_children(tree_structure, '0003')
2. Multi-Level Granularity Search

Search at different granularity levels:

Step 1: Top-Level Selection

First, identify relevant top-level sections:
# Summarize only the root-level nodes for the first (coarse) search pass.
section_overview = [{"node_id": n["node_id"], "title": n["title"], "summary": n.get("summary")} 
              for n in tree_structure]
top_level_prompt = f"""
Query: {query}

Top-level sections:
{json.dumps(section_overview, indent=2)}

Select the most relevant top-level sections.
"""
Step 2: Subsection Selection

For each selected top-level section, search its subsections:
# Drill into the immediate children of each chosen top-level section.
# NOTE(review): selected_top_level is expected to hold node IDs returned by
# the Level-1 prompt; the LLM call that consumes subsection_prompt is
# omitted from this excerpt — confirm against the full pipeline.
for parent_id in selected_top_level:
    parent_node = find_node_by_id(tree_structure, parent_id)
    # Leaves have no 'nodes' key, so there is nothing to drill into.
    if 'nodes' in parent_node:
        subsection_prompt = f"""
        Query: {query}
        
        Parent section: {parent_node['title']}
        
        Subsections:
        {json.dumps(parent_node['nodes'], indent=2)}
        
        Select relevant subsections.
        """
Step 3: Final Content Retrieval

Retrieve the most specific relevant content.

3. Similarity + Structure Hybrid

Combine embedding similarity with structural search:
import numpy as np
from openai import OpenAI

# Shared OpenAI client used by get_embedding() below.
client = OpenAI()

def get_embedding(text):
    """Embed *text* with OpenAI's text-embedding-3-small model."""
    result = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return result.data[0].embedding

def cosine_similarity(a, b):
    """Cosine similarity between vectors *a* and *b* (1.0 = same direction)."""
    numerator = np.dot(a, b)
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    return numerator / denominator

# Embed the query once; every node is compared against this vector.
query_embedding = get_embedding(query)

# Score each top-level node by embedding similarity of its title + summary.
node_scores = []
for entry in tree_structure:
    combined = f"{entry['title']} {entry.get('summary', '')}"
    entry_embedding = get_embedding(combined)
    node_scores.append({
        'node_id': entry['node_id'],
        'title': entry['title'],
        'similarity': cosine_similarity(query_embedding, entry_embedding)
    })

# Keep the five most similar nodes.
top_nodes = sorted(node_scores, key=lambda item: item['similarity'], reverse=True)[:5]

# Let the LLM refine the embedding shortlist using document structure.
refinement_prompt = f"""
Query: {query}

Top similar nodes by embedding:
{json.dumps(top_nodes, indent=2)}

Full tree structure:
{json.dumps(tree_structure, indent=2)}

Consider both similarity scores and document structure to select the most relevant nodes.
Also suggest any related parent/child nodes that should be included for context.
"""

Integrating Expert Knowledge

Unlike vector-based RAG, PageIndex makes it easy to incorporate domain expertise and user preferences.
Step 1: Define Expert Knowledge

Create domain-specific search rules:
# Domain-specific retrieval preferences, keyed by query category.
# Each entry holds trigger keywords, the 10-K sections to prefer, and
# free-text guidance that is injected into the search prompt.
expert_knowledge = {
    "financial_queries": {
        "keywords": ["revenue", "EBITDA", "earnings", "financial"],
        "priority_sections": [
            "Item 7 (Management Discussion & Analysis)",
            "Item 8 (Financial Statements)"
        ],
        "guidance": "For financial metrics, prioritize MD&A and financial statements over other sections."
    },
    "risk_queries": {
        "keywords": ["risk", "uncertainty", "challenge"],
        "priority_sections": [
            "Item 1A (Risk Factors)",
            "Forward-Looking Statements"
        ],
        "guidance": "Risk-related queries should focus on the Risk Factors section."
    }
}
Step 2: Match Query to Knowledge Domain

Identify which expert knowledge applies:
def detect_query_domain(query, knowledge_base):
    """Return the first (domain, info) pair whose keywords appear in *query*.

    Matching is case-insensitive substring containment; returns (None, None)
    when no domain's keywords match.
    """
    lowered = query.lower()
    for name, entry in knowledge_base.items():
        matched = any(keyword in lowered for keyword in entry['keywords'])
        if matched:
            return name, entry
    return None, None

domain, knowledge = detect_query_domain(query, expert_knowledge)
Step 3: Enhanced Search with Preferences

Incorporate knowledge into the search prompt:
# Inject the matched domain guidance into the search prompt; fall back to
# the plain tree-search prompt when no expert knowledge applies.
if knowledge:
    enhanced_prompt = f"""
    You are given a question and a tree structure of a document.
    You need to find all nodes that are likely to contain the answer.
    
    Query: {query}
    
    Document tree structure: {json.dumps(tree_structure, indent=2)}
    
    Expert Knowledge: {knowledge['guidance']}
    
    Priority Sections: {json.dumps(knowledge['priority_sections'])}
    
    Reply in the following JSON format:
    {{
      "thinking": <reasoning about which nodes are relevant, considering expert knowledge>,
      "node_list": [node_id1, node_id2, ...]
    }}
    """
else:
    # Use basic prompt
    # NOTE(review): basic_search_prompt is not defined in this excerpt —
    # presumably the simple tree-search prompt shown earlier; confirm.
    enhanced_prompt = basic_search_prompt
def search_10k_document(query, tree_structure):
    """Build an expert search prompt for a 10-K filing.

    Renders the query, the serialized tree, and a fixed map of 10-K item
    numbers into a single prompt string for the LLM (the caller sends it).
    """
    
    # Fixed mapping of query topics to 10-K item numbers.
    section_guide = """
    For 10-K filings:
    - Financial metrics (revenue, earnings, EBITDA) → Item 7 (MD&A) and Item 8
    - Business overview → Item 1 (Business)
    - Risk factors → Item 1A (Risk Factors)
    - Legal proceedings → Item 3 (Legal Proceedings)
    - Management team → Item 10 (Directors and Officers)
    """
    
    # Return the fully rendered prompt directly.
    return f"""
    You are an expert financial analyst searching a 10-K filing.
    
    Query: {query}
    
    Document tree structure: {json.dumps(tree_structure, indent=2)}
    
    Expert Knowledge:
    {section_guide}
    
    Select the most relevant sections based on both the query and your knowledge
    of 10-K document structure. Explain your reasoning.
    
    Reply in JSON format:
    {{
      "thinking": "<reasoning>",
      "node_list": ["node_id1", "node_id2", ...]
    }}
    """

# Example usage
# Build the tree for a 10-K filing, then produce the expert search prompt;
# send search_prompt to your LLM to obtain the selected node IDs.
query = "What were the EBITDA adjustments in Q4?"
tree = page_index('10k_filing.pdf', if_add_node_summary='yes')['structure']
search_prompt = search_10k_document(query, tree)

Advanced: Monte Carlo Tree Search (MCTS)

PageIndex’s dashboard and retrieval API use a combination of LLM tree search and value function-based Monte Carlo Tree Search (MCTS). Full implementation details will be released soon.

MCTS Overview

MCTS is a search algorithm that:
  1. Selection: Navigate the tree using a value function
  2. Expansion: Expand promising nodes
  3. Simulation: Evaluate node relevance
  4. Backpropagation: Update node values

Simplified MCTS Example

import math
from collections import defaultdict

class TreeSearchMCTS:
    """Simplified MCTS over a PageIndex tree.

    Repeatedly picks a top-level node via UCB1, scores it with an LLM
    relevance evaluation (placeholder here), and folds the score back into
    the node statistics; search() returns the best-scoring nodes.

    Fix over the original example: search() called select_node(),
    backpropagate() and get_best_nodes(), none of which were defined, so
    any call raised AttributeError. They are implemented below.
    """

    def __init__(self, tree_structure, query, model="gpt-4o-2024-11-20"):
        """Store the tree, query and model name; zero out MCTS statistics."""
        self.tree = tree_structure
        self.query = query
        self.model = model
        # Per-node visit counts and cumulative relevance scores, by node_id.
        self.visits = defaultdict(int)
        self.scores = defaultdict(float)
    
    def ucb1_score(self, node_id, parent_visits):
        """Upper Confidence Bound formula for balancing exploration/exploitation."""
        # Unvisited nodes score infinity so they are always explored first.
        if self.visits[node_id] == 0:
            return float('inf')
        
        exploitation = self.scores[node_id] / self.visits[node_id]
        exploration = math.sqrt(2 * math.log(parent_visits) / self.visits[node_id])
        return exploitation + exploration
    
    def select_node(self):
        """Pick the top-level node with the highest UCB1 score."""
        # +1 keeps math.log() defined on the very first iteration.
        total_visits = sum(self.visits[n['node_id']] for n in self.tree) + 1
        return max(self.tree, key=lambda n: self.ucb1_score(n['node_id'], total_visits))
    
    def evaluate_node(self, node):
        """Evaluate node relevance using LLM."""
        prompt = f"""
        Query: {self.query}
        
        Node:
        - Title: {node['title']}
        - Summary: {node.get('summary', '')}
        
        Rate the relevance of this node to the query on a scale of 0-1.
        Reply with just a number.
        """
        
        # Call LLM and parse score
        # score = call_llm(prompt)
        score = 0.8  # Placeholder
        return score
    
    def backpropagate(self, node_id, score):
        """Fold one evaluation result back into the node's statistics."""
        self.visits[node_id] += 1
        self.scores[node_id] += score
    
    def get_best_nodes(self, k=5):
        """Return up to *k* visited nodes ranked by mean relevance score."""
        visited = [n for n in self.tree if self.visits[n['node_id']] > 0]
        visited.sort(
            key=lambda n: self.scores[n['node_id']] / self.visits[n['node_id']],
            reverse=True,
        )
        return visited[:k]
    
    def search(self, iterations=100):
        """Perform MCTS iterations and return the top nodes."""
        for _ in range(iterations):
            # Selection phase
            current_node = self.select_node()
            
            # Evaluation phase
            score = self.evaluate_node(current_node)
            
            # Backpropagation phase
            self.backpropagate(current_node['node_id'], score)
        
        # Return top nodes
        return self.get_best_nodes(k=5)

Retrieval Best Practices

Summary-First Approach: Use if_add_node_summary=yes for search, then retrieve full text only for selected nodes.
Context Window Management: When retrieving multiple nodes, consider:
  • Parent node context (prefix_summary)
  • Sibling nodes for completeness
  • Total token count for LLM context window
Avoid Over-Retrieval: Don’t retrieve entire subtrees unless necessary. Be selective to stay within context limits.
Cache Embeddings: If using hybrid search, cache node embeddings to avoid recalculating for every query.

Evaluation Metrics

Evaluate your search strategy:

Precision and Recall

def evaluate_retrieval(retrieved_nodes, ground_truth_nodes):
    """Compute precision, recall and F1 for a retrieval result.

    Both arguments are collections of node IDs; duplicates are ignored.
    Empty inputs yield 0 for the affected metric instead of dividing by zero.
    """
    retrieved = set(retrieved_nodes)
    expected = set(ground_truth_nodes)
    
    hits = len(retrieved & expected)
    
    precision = hits / len(retrieved) if retrieved else 0
    recall = hits / len(expected) if expected else 0
    
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom > 0 else 0
    
    return {
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

Mean Reciprocal Rank (MRR)

def calculate_mrr(retrieved_nodes, relevant_nodes):
    """Reciprocal rank of the first relevant hit (0.0 when none is found)."""
    rank = 0
    for node_id in retrieved_nodes:
        rank += 1
        if node_id in relevant_nodes:
            return 1.0 / rank
    return 0.0
import json
from openai import OpenAI
from pageindex import page_index

# Shared client for the search and answer-generation calls in search_document().
client = OpenAI()

def search_document(pdf_path, query):
    """Complete end-to-end search pipeline.

    Builds the PageIndex tree (with summaries and node IDs), asks the LLM
    to pick relevant nodes, fetches those nodes from the tree, and answers
    the query from the retrieved summaries.

    Args:
        pdf_path: Path to the PDF handed to page_index().
        query: Natural-language question to answer.

    Returns:
        Dict with 'query', 'reasoning', 'selected_nodes' and 'answer'.

    NOTE(review): relies on find_node_by_id() defined earlier in this
    document and on network calls to OpenAI and page_index().
    """
    
    # Step 1: Generate tree structure with summaries
    print("Generating tree structure...")
    result = page_index(
        pdf_path,
        if_add_node_summary='yes',
        if_add_node_id='yes'
    )
    tree = result['structure']
    
    # Step 2: Search with LLM
    print("Searching tree structure...")
    search_prompt = f"""
    Query: {query}
    
    Document tree: {json.dumps(tree, indent=2)}
    
    Find the most relevant nodes. Reply in JSON:
    {{
      "thinking": "<reasoning>",
      "node_list": ["id1", "id2", ...]
    }}
    """
    
    # response_format forces a JSON object so json.loads below is safe.
    response = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{"role": "user", "content": search_prompt}],
        response_format={"type": "json_object"}
    )
    
    search_result = json.loads(response.choices[0].message.content)
    
    # Step 3: Retrieve node content
    print("Retrieving content...")
    retrieved_content = []
    for node_id in search_result['node_list']:
        node = find_node_by_id(tree, node_id)
        if node:  # skip IDs the LLM invented that are not in the tree
            retrieved_content.append(node)
    
    # Step 4: Generate answer
    print("Generating answer...")
    # Context is built from summaries only; swap in node['text'] if the
    # tree was generated with full node text.
    context = "\n\n".join([
        f"Section: {node['title']}\n{node.get('summary', '')}"
        for node in retrieved_content
    ])
    
    answer_prompt = f"""
    Query: {query}
    
    Relevant context:
    {context}
    
    Answer the query based on the context provided.
    """
    
    answer_response = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{"role": "user", "content": answer_prompt}]
    )
    
    return {
        'query': query,
        'reasoning': search_result['thinking'],
        'selected_nodes': search_result['node_list'],
        'answer': answer_response.choices[0].message.content
    }

# Usage: end-to-end run — builds the tree, searches it, and answers the
# query (requires network access to OpenAI and the pageindex package).
result = search_document('annual_report.pdf', 'What were the main revenue drivers in 2024?')
print(json.dumps(result, indent=2))

Next Steps

Build docs developers (and LLMs) love