
Overview

Multi-Query Rewriter RAG is an advanced retrieval strategy that:
  1. Generates multiple rewritten versions of the original query
  2. Retrieves documents for each query variation
  3. Combines and re-ranks results with weighted scoring
  4. Returns diverse, high-quality documents from the merged pool
This approach improves retrieval recall by exploring different phrasings, perspectives, and aspects of the original question.

How It Works

Pipeline Steps

  1. Query Analysis: Receive user’s original question
  2. Multi-Query Generation: Generate 3 query variations using different rewriting strategies:
    • Standalone rewrite: Make the query self-contained and specific
    • Synonym expansion: Rephrase using alternative medical terminology
    • Context expansion: Expand to include related aspects and complications
  3. Multi-Retrieval: Retrieve top-5 documents for each of the 3 rewritten queries (15 candidates total)
  4. Weighted Re-ranking: Combine results, weighting by query position so earlier, more faithful rewrites rank higher
  5. Deduplication: Remove duplicate documents using content-based identification
  6. Final Selection: Select top 8 diverse documents for answer generation
  7. Answer Generation: Generate final answer from merged, diverse context
Query position weighting penalizes later queries (more speculative rewrites) to balance precision and recall: Query 1 weight = 1.0, Query 2 = 0.95, Query 3 = 0.90.
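The fusion and deduplication in steps 4–6 can be sketched as follows. This is a minimal illustration, assuming distance scores in [0, 1]; `fuse_results` is not part of the source API:

```python
from typing import List, Tuple

def fuse_results(
    per_query_results: List[List[Tuple[str, float]]],  # (content, distance) per query
    max_final_docs: int = 8,
) -> List[Tuple[str, float]]:
    """Deduplicate by content prefix, weight by query position, re-rank."""
    seen = set()
    scored = []
    for i, results in enumerate(per_query_results, 1):
        query_weight = 1.0 - (i - 1) * 0.05  # 1.0, 0.95, 0.90, ...
        for content, distance in results:
            similarity = max(0.0, 1.0 - distance)
            doc_id = content[:100]  # content-based deduplication
            if doc_id not in seen:
                seen.add(doc_id)
                scored.append((content, similarity * query_weight))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_final_docs]
```

Because the first occurrence of a document wins, a document found by both query 1 and query 3 keeps the higher query-1 weight.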

Key Features

  • Three rewriting strategies: Covers different aspects of query reformulation
  • Multi-perspective retrieval: Each query variant surfaces different documents
  • Weighted fusion: Earlier (more faithful) queries have higher influence
  • Automatic deduplication: Prevents redundant documents in final context
  • Larger context window: Returns 8 documents vs. 5 in simpler methods
  • Detailed query tracking: Returns all rewritten queries for analysis

Implementation Details

Query Rewriting Templates

# Template 1: Standalone, specific rewrite
REPHRASE_TEMPLATE_1 = """
Rewrite this question to be a standalone, specific query about pregnancy and childbirth.

Original question: {question}

Instructions:
- Maintain the medical/obstetric context if relevant.
- Be specific and clear in medical terms.
- Focus on pregnancy, childbirth, prenatal care, or maternal health.
- Ensure the question is complete and self-contained.

Standalone question:
"""

# Template 2: Synonym expansion
REPHRASE_TEMPLATE_2 = """
Rephrase this question about pregnancy and childbirth using synonyms and 
alternative medical terms.

Original question: {question}

Instructions:
- Use precise medical terminology.
- Include synonyms and alternative terms.
- Maintain the meaning but change the wording.
- Focus on clinical and obstetric aspects.

Rephrased question:
"""

# Template 3: Context expansion
REPHRASE_TEMPLATE_3 = """
Expand this question to include related aspects and additional context about 
pregnancy and childbirth.

Base question: {question}

Instructions:
- Expand the question to include related aspects.
- Add context about complications, prevention, or care.
- Include possible variations or special cases.
- Keep the focus on maternal and perinatal health.

Expanded question:
"""

Core Processing Function

def process_rewriter_query(
    question: str, 
    custom_rewriter_llm: ChatOpenAI = None, 
    custom_answer_llm: ChatOpenAI = None, 
    max_final_docs: int = 8
) -> Dict[str, Any]:
    """
    Processes a query using the multi-query rewriting RAG pipeline.
    
    Args:
        question (str): The user's question.
        custom_rewriter_llm (ChatOpenAI, optional): Custom model for query rewriting.
        custom_answer_llm (ChatOpenAI, optional): Custom model for answer generation.
        max_final_docs (int): The maximum number of documents to return.
    
    Returns:
        Dict[str, Any]: Answer, contexts, rewritten queries, and detailed metrics.
    """
    # Resolve which models to use, falling back to module-level defaults
    current_rewriter_llm = custom_rewriter_llm or llm_rewriter
    current_answer_llm = custom_answer_llm or llm_answer

    # 1. Generate rewritten queries and track metrics
    rewritten_queries = []
    rewrite_input_tokens, rewrite_output_tokens, rewrite_cost = 0, 0, 0
    
    for prompt in REPHRASE_PROMPTS:
        rewritten_query, rewrite_metrics = _invoke_text_with_usage(
            current_rewriter_llm,
            prompt.format(question=question)
        )
        rewritten_queries.append(rewritten_query)
        rewrite_input_tokens += rewrite_metrics["input_tokens"]
        rewrite_output_tokens += rewrite_metrics["output_tokens"]
        rewrite_cost += rewrite_metrics["cost"]
    
    # 2. Retrieve documents for each rewritten query
    all_docs_with_scores = []
    doc_ids_seen = set()
    
    for i, query in enumerate(rewritten_queries, 1):
        results = vectorstore.similarity_search_with_score(query, k=5)
        for doc, distance in results:
            similarity = max(0.0, 1.0 - distance)  # convert distance to a similarity score
            doc_id = doc.page_content[:100]  # Use content prefix as ID
            
            if doc_id not in doc_ids_seen:
                doc_ids_seen.add(doc_id)
                # Penalize queries from later, more speculative prompts
                query_weight = 1.0 - (i - 1) * 0.05
                all_docs_with_scores.append((doc, similarity * query_weight))
    
    # 3. Re-rank and select the best documents
    all_docs_with_scores.sort(key=lambda x: x[1], reverse=True)
    retrieved_docs = [doc for doc, _ in all_docs_with_scores[:max_final_docs]]
    
    # 4. Format context and generate final answer
    formatted_context = format_docs(retrieved_docs)
    answer, answer_metrics = _invoke_text_with_usage(
        current_answer_llm,
        qa_prompt.format_messages(context=formatted_context, question=question)
    )
    
    # 5. Consolidate and return all information
    return {
        'answer': answer,
        'contexts': [doc.page_content for doc in retrieved_docs],
        'retrieved_documents': retrieved_docs,
        'rewritten_queries': rewritten_queries,
        'metrics': {...}
    }

Example Query Rewrites

For the original query “¿Qué debo hacer si tengo contracciones?” (“What should I do if I have contractions?”), the system might generate variants such as:
“¿Cuáles son los pasos a seguir cuando una mujer embarazada experimenta 
contracciones uterinas durante el tercer trimestre del embarazo?” (“What are the steps to follow when a pregnant woman experiences uterine contractions during the third trimester of pregnancy?”)
Each variant retrieves different documents, improving overall coverage.
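That coverage claim can be checked empirically by measuring how much the per-variant result sets overlap. The helper below is a diagnostic sketch, not part of the source:

```python
from itertools import combinations
from typing import List, Set

def retrieval_overlap(doc_sets: List[Set[str]]) -> float:
    """Mean pairwise Jaccard overlap; 0.0 means fully diverse retrievals."""
    scores = [
        len(doc_sets[a] & doc_sets[b]) / len(doc_sets[a] | doc_sets[b])
        for a, b in combinations(range(len(doc_sets)), 2)
        if doc_sets[a] | doc_sets[b]
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Values near 1.0 mean the rewrites are redundant and fewer variants would retrieve essentially the same pool.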

Usage with query_for_evaluation()

from src.rag.rewriter import query_for_evaluation

# Basic usage with default models
result = query_for_evaluation(
    question="¿Cuáles son los síntomas del parto prematuro?"
)

# With custom models for each stage
result = query_for_evaluation(
    question="¿Qué es la diabetes gestacional?",
    rewriter_model="gpt-3.5-turbo",  # Query rewriting
    answer_model="gpt-4o"             # Final answer generation
)

# With custom LLM instances
from langchain_openai import ChatOpenAI
rewriter_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.3)
answer_llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

result = query_for_evaluation(
    question="¿Qué cuidados necesito en el embarazo?",
    custom_rewriter_llm=rewriter_llm,
    custom_answer_llm=answer_llm
)

Return Structure

{
    "question": str,
    "answer": str,
    "contexts": List[str],           # Up to 8 contexts
    "source_documents": List,
    "metadata": {
        "num_contexts": 8,
        "retrieval_method": "multi_query_rewrite",
        "rewrite_count": 3,
        "llm_model": "gpt-4o",
        "rewriter_model": "gpt-3.5-turbo",
        "provider": "openai",
        "execution_time": 5.82,
        "input_tokens": 3124,          # Total across all LLM calls
        "output_tokens": 487,
        "total_cost": 0.004523,
        "usage_source": "provider"
    }
}
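A result with this structure can be consumed as shown below. The keys match the documentation above; the values here are placeholders:

```python
# Placeholder result mirroring the documented return structure
result = {
    "question": "…",
    "answer": "…",
    "contexts": ["context one", "context two"],
    "source_documents": [],
    "metadata": {
        "num_contexts": 2,
        "execution_time": 5.82,
        "total_cost": 0.004523,
    },
}

summary = (
    f"{result['metadata']['num_contexts']} contexts, "
    f"${result['metadata']['total_cost']:.4f}, "
    f"{result['metadata']['execution_time']:.1f}s"
)
print(summary)  # 2 contexts, $0.0045, 5.8s
```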

When to Use This Approach

Best For

  • Ambiguous queries: Questions that could be interpreted multiple ways
  • Incomplete information: Vague or underspecified questions
  • Maximum recall: When you need to find all relevant documents
  • Exploratory search: When users might not know exact terminology
  • Complex topics: Multi-faceted questions that span different aspects
  • Synonym-rich domains: Medical/technical fields with multiple terms for same concepts

Advantages Over Other Methods

  • Highest recall: Multiple queries cast a wider net for relevant documents
  • Handles ambiguity: Different rewrites explore different interpretations
  • Vocabulary robustness: Synonym expansion catches different terminologies
  • Comprehensive coverage: Expansion strategy includes related aspects
  • Explicit query diversity: Each rewrite targets different retrieval angles

Trade-offs

  • Highest cost: 3 rewrite LLM calls + 1 answer call (~$0.005-0.008 per query)
  • Highest latency: Multiple LLM calls + multiple retrievals (~5-7 seconds)
  • Potential noise: More retrievals may include less relevant documents
  • Complex metrics tracking: Must track costs across multiple LLM invocations
  • May over-expand: Expansion can drift from original intent
Multi-query rewriting is the most expensive architecture in terms of both cost and latency. Use it when retrieval quality is critical and you need maximum recall, but consider simpler methods for cost-sensitive or latency-sensitive applications.

Performance Characteristics

Speed

  • Query rewriting: ~2-3 seconds (3 × gpt-3.5-turbo calls)
  • Multi-retrieval: ~1-2 seconds (3 × semantic search)
  • Answer generation: ~1-2 seconds (1 × gpt-4o call)
  • Total: ~5-8 seconds end-to-end

Cost

  • Query rewrites: ~$0.0003-0.0006 (3 × gpt-3.5-turbo, ~50 tokens each)
  • Embeddings: ~$0.00003 (3 × query embeddings)
  • Answer generation: ~$0.003-0.006 (gpt-4o with larger context)
  • Total: ~$0.004-0.008 per query (highest among all architectures)

Quality

  • Excellent recall: Best at finding all relevant documents
  • Good for ambiguity: Multiple interpretations increase coverage
  • Variable precision: More documents may include some less relevant ones
  • Context richness: 8 documents provide comprehensive information
  • Query-dependent: Quality depends on rewrite quality

Configuration and Tuning

Rewriter Model Temperature

# Conservative (more faithful rewrites)
llm_rewriter = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.1)

# Balanced (default)
llm_rewriter = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.3)

# Creative (more diverse rewrites)
llm_rewriter = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.5)

Number of Final Documents

# Concise context (faster, cheaper LLM calls)
result = process_rewriter_query(question, max_final_docs=5)

# Balanced (default)
result = process_rewriter_query(question, max_final_docs=8)

# Comprehensive context (maximum information)
result = process_rewriter_query(question, max_final_docs=12)

Query Weighting Strategy

The current implementation uses linear decay:
# Current: Linear decay
query_weight = 1.0 - (i - 1) * 0.05  # 1.0, 0.95, 0.90

# Alternative: Exponential decay (more aggressive)
query_weight = 0.9 ** (i - 1)  # 1.0, 0.9, 0.81

# Alternative: Equal weighting (trust all rewrites equally)
query_weight = 1.0  # 1.0, 1.0, 1.0

Comparison with Other Architectures

| Feature | Simple Semantic | HyDE | Multi-Query (This) |
|---|---|---|---|
| Query processing | Direct | Generate hypothesis | Generate 3 variants |
| LLM calls | 1 | 2 | 4 |
| Retrieval passes | 1 | 1 | 3 |
| Final docs | 5 | 5 | 8 |
| Best for | Clear queries | Vocabulary gaps | Ambiguous queries |
| Recall | Good | Good | Excellent |
| Precision | Good | Variable | Variable |
| Cost | Lowest | Medium | Highest |
| Latency | ~2s | ~5s | ~6s |

Advanced: Custom Rewriting Strategies

You can define custom rewriting prompts for your domain:
# Add a fourth rewrite strategy focused on complications
COMPLICATIONS_TEMPLATE = """
Rewrite this pregnancy/childbirth question to focus on potential 
complications, risk factors, and warning signs.

Original question: {question}

Complication-focused question:
"""

REPHRASE_PROMPTS = [
    PromptTemplate.from_template(REPHRASE_TEMPLATE_1),
    PromptTemplate.from_template(REPHRASE_TEMPLATE_2),
    PromptTemplate.from_template(REPHRASE_TEMPLATE_3),
    PromptTemplate.from_template(COMPLICATIONS_TEMPLATE)  # Add custom
]
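Note that with more prompts, the linear position weight `1.0 - (i - 1) * 0.05` keeps shrinking and would eventually go negative (at position 21). A clamped generalization, offered as an assumption rather than the source behavior, is:

```python
def query_weight(position: int, decay: float = 0.05, floor: float = 0.5) -> float:
    """Linear-decay weight for a 1-indexed query position, clamped at `floor`."""
    return max(floor, 1.0 - (position - 1) * decay)
```

With the default decay, a fourth prompt would receive weight 0.85.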

Error Handling and Robustness

The system gracefully handles edge cases:
  • If two rewrites are identical, deduplication removes the duplicate
  • If a rewrite fails, the system can continue with successful rewrites
  • If no documents match a rewrite, it’s skipped without affecting other queries
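The fallback behavior described above can be sketched as follows; `safe_rewrites` is an illustrative helper, not the source implementation, and `rewrite_fns` stands in for the LLM calls:

```python
from typing import Callable, List

def safe_rewrites(question: str, rewrite_fns: List[Callable[[str], str]]) -> List[str]:
    """Collect successful, non-duplicate rewrites; fall back to the original."""
    queries: List[str] = []
    for fn in rewrite_fns:
        try:
            rewritten = fn(question).strip()
        except Exception:
            continue  # skip failed rewrites, keep the rest
        if rewritten and rewritten not in queries:  # drop identical rewrites
            queries.append(rewritten)
    return queries or [question]  # always return at least one query
```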

Metrics and Observability

The implementation provides detailed cost breakdowns:
result['metrics'] = {
    'rewrite_input_tokens': 234,
    'rewrite_output_tokens': 156,
    'rewrite_cost': 0.000456,
    'answer_input_tokens': 2890,
    'answer_output_tokens': 331,
    'answer_cost': 0.004067,
    'total_input_tokens': 3124,
    'total_output_tokens': 487,
    'total_cost': 0.004523
}
This allows you to track exactly where costs are incurred.
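A small helper (an assumption, not part of the source API) can turn that dict into per-stage cost shares:

```python
from typing import Dict

def cost_shares(metrics: Dict[str, float]) -> Dict[str, float]:
    """Percentage of total cost spent on rewriting vs. answering."""
    total = metrics["total_cost"]
    return {
        "rewrite_pct": round(100 * metrics["rewrite_cost"] / total, 1),
        "answer_pct": round(100 * metrics["answer_cost"] / total, 1),
    }
```

With the example metrics above, roughly 10% of the spend goes to rewriting and 90% to answer generation, so the answer model is the main cost lever.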

Source Files

  • Implementation: ~/workspace/source/src/rag/rewriter.py:163-244
  • Rewriting prompts: ~/workspace/source/src/rag/rewriter.py:64-104
  • Retrieval and fusion: ~/workspace/source/src/rag/rewriter.py:184-212
  • Evaluation interface: ~/workspace/source/src/rag/rewriter.py:247-323
