
Overview

Simple Semantic RAG is the baseline retrieval-augmented generation architecture that uses pure semantic search to find relevant documents in a ChromaDB vector store. It embeds the user’s query and retrieves the most semantically similar documents based on vector similarity. This approach is implemented in src/rag/simple.py and serves as the foundation for comparison with more advanced retrieval strategies.

How It Works

The Simple Semantic RAG pipeline follows these steps:
  1. Query Embedding: The user’s question is embedded using OpenAI’s text-embedding-3-small model
  2. Semantic Retrieval: ChromaDB performs a similarity search to find the top-k most similar document chunks
  3. Context Formatting: Retrieved documents are formatted with metadata (source, page number)
  4. Answer Generation: An LLM generates the final answer based on the retrieved context
The retriever is configured to return the top 5 most similar documents by default (k=5).
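
To make step 2 concrete, here is a minimal, self-contained sketch of the similarity ranking that ChromaDB performs internally, using cosine similarity over toy vectors (the vectors and chunk descriptions are illustrative, not taken from the real index):
import numpy as np

# Toy 4-dimensional "embeddings" for three document chunks (illustrative only;
# real text-embedding-3-small vectors have 1536 dimensions)
chunk_vectors = np.array([
    [0.1, 0.9, 0.2, 0.0],  # chunk about labor symptoms
    [0.8, 0.1, 0.1, 0.3],  # chunk about nutrition
    [0.2, 0.8, 0.3, 0.1],  # chunk about preterm labor
])
query_vector = np.array([0.15, 0.85, 0.25, 0.05])

# Cosine similarity between the query and every chunk
sims = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)

# Indices of the top-k chunks, most similar first (k=2 here; the pipeline uses k=5)
top_k = np.argsort(sims)[::-1][:2]
print(top_k, sims[top_k])  # -> [0 2] with their similarity scores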

Key Features

  • Pure semantic search: Relies entirely on vector similarity for retrieval
  • Simple and fast: No query preprocessing or result fusion steps
  • Consistent embeddings: Uses the same embedding model for indexing and retrieval
  • Ordered by relevance: Documents are naturally ranked by semantic similarity

Implementation Details

Retriever Configuration

# Imports (package paths may vary by LangChain version)
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Configure OpenAI embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Load ChromaDB vector store
vectorstore = Chroma(
    persist_directory=str(chroma_db_dir),
    embedding_function=embeddings,
    collection_name="guia_embarazo_parto",
)

# Configure the semantic retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
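
Once configured, the retriever can be invoked directly. A quick usage sketch (assuming the vector store has been populated; the source and page metadata keys follow the context formatting described above):
docs = retriever.invoke("¿Cuáles son los síntomas del parto prematuro?")
for doc in docs:
    print(doc.metadata.get("source"), doc.metadata.get("page"))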

Core Processing Function

The process_semantic_query() function handles the complete pipeline:
from typing import Any, Dict, Optional

from langchain_openai import ChatOpenAI

def process_semantic_query(query: str, custom_llm: Optional[ChatOpenAI] = None) -> Dict[str, Any]:
    """
    Processes a query using the simple semantic RAG pipeline.
    
    Args:
        query (str): The user's question.
        custom_llm (Optional[ChatOpenAI]): Custom LLM to use. If None, the module-level default llm is used.
    
    Returns:
        Dict[str, Any]: A dictionary with the final answer, contexts, and detailed metrics.
    """
    # 1. Retrieve similar documents
    retrieved_docs = retriever.invoke(query)
    
    # 2. Format context
    formatted_context = format_docs(retrieved_docs)
    
    # 3. Generate final answer
    current_llm = custom_llm if custom_llm else llm
    response = current_llm.invoke(qa_prompt.format_messages(
        context=formatted_context,
        question=query
    ))
    
    # 4. Return response and metrics
    return {
        'answer': response.content,
        'contexts': [doc.page_content for doc in retrieved_docs],
        'retrieved_documents': retrieved_docs,
        'metrics': {...}
    }
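
The format_docs() helper referenced above is not shown here. A minimal sketch consistent with the metadata formatting described in the pipeline steps (the actual implementation in src/rag/simple.py may differ):
def format_docs(docs) -> str:
    """Concatenates retrieved chunks, labeling each with its source and page."""
    parts = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        parts.append(f"[Document {i}] (source: {source}, page: {page})\n{doc.page_content}")
    return "\n\n".join(parts)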

Answer Prompt Template

The system uses a structured prompt that emphasizes:
  • Context-only answers: Base responses exclusively on provided medical context
  • Relevance ordering: Prioritize the first few documents as most relevant
  • Integrated responses: Provide direct, well-written paragraph answers
  • Spanish language: All medical answers are in Spanish
qa_template = """
You are a medical expert specializing in pregnancy and childbirth. 
Your task is to analyze the provided medical context and answer the user's question accurately and concisely.

STRICT INSTRUCTIONS:
1.  **Base your answer exclusively on the information within the MEDICAL CONTEXT section.**
2.  **The context is ordered by relevance.** Give the highest priority to the first few documents.
3.  **Provide a direct and integrated answer.** Start with a direct answer to the question.
4.  **If the context does not contain enough information to answer the question, state that clearly.**
5.  **Always answer in Spanish.**

MEDICAL CONTEXT (ordered by relevance):
{context}

QUESTION: {question}

DETAILED MEDICAL ANSWER:
"""

Usage with query_for_evaluation()

The query_for_evaluation() function provides a standardized interface for benchmark evaluation:
from src.rag.simple import query_for_evaluation

# Basic usage with default model (gpt-4o)
result = query_for_evaluation(
    question="¿Cuáles son los síntomas del parto prematuro?"
)

# With custom model name
result = query_for_evaluation(
    question="¿Qué es la preeclampsia?",
    llm_model="gpt-4o-mini"
)

# With custom LLM instance
from langchain_openai import ChatOpenAI
custom_llm = ChatOpenAI(model_name="gpt-4o", temperature=0.2)
result = query_for_evaluation(
    question="¿Cuándo debo ir al hospital?",
    custom_llm=custom_llm
)

Return Structure

{
    "question": str,           # Original question
    "answer": str,             # Generated answer
    "contexts": List[str],     # Retrieved document contents
    "source_documents": List,  # Full document objects
    "metadata": {
        "num_contexts": 5,
        "retrieval_method": "semantic_only",
        "llm_model": "gpt-4o",
        "provider": "openai",
        "embedding_model": "text-embedding-3-small",
        "execution_time": 2.34,
        "input_tokens": 1523,
        "output_tokens": 187,
        "total_cost": 0.002156,
        "usage_source": "provider",
        "cost_source": "calculated"
    }
}
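
A typical way to consume this structure in an evaluation script (field names as documented above):
result = query_for_evaluation(question="¿Qué es la preeclampsia?")

print(result["answer"])
print(f"{result['metadata']['num_contexts']} contexts, "
      f"{result['metadata']['execution_time']:.2f}s, "
      f"${result['metadata']['total_cost']:.6f}")

for ctx in result["contexts"]:
    print(ctx[:100], "...")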

When to Use This Approach

Best For

  • Well-structured queries: Questions that naturally align with document content
  • Baseline comparison: Establishing performance benchmarks for more complex methods
  • Fast prototyping: Quick setup with minimal configuration
  • Dense semantic content: Documents where meaning is more important than exact keywords

Limitations

  • Keyword mismatches: May miss documents that contain the query's exact terms when their overall semantic framing differs from the query's
  • Query ambiguity: Short or vague queries may not embed well
  • Vocabulary gap: Struggles when query vocabulary differs significantly from document vocabulary
  • No diversification: Can return very similar documents without variety
Simple semantic search performs best when queries and documents use similar vocabulary and phrasing. Consider hybrid approaches for queries with specific medical terminology or acronyms.
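
As one illustration of the hybrid direction mentioned above (not part of src/rag/simple.py), LangChain's EnsembleRetriever can fuse the semantic retriever with keyword-based BM25; here all_chunks is assumed to be the list of indexed Document objects:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# BM25 over the same chunks covers exact terminology and acronyms
bm25_retriever = BM25Retriever.from_documents(all_chunks)
bm25_retriever.k = 5

hybrid_retriever = EnsembleRetriever(
    retrievers=[retriever, bm25_retriever],
    weights=[0.6, 0.4],  # illustrative weighting, not tuned
)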

Performance Characteristics

Speed

  • Fast retrieval: Single embedding + vector search
  • Low latency: ~1-3 seconds total for typical queries
  • No preprocessing overhead: Direct query-to-embedding conversion

Cost

  • Embedding cost: ~$0.00001 per query (text-embedding-3-small)
  • LLM cost: Depends on model (gpt-4o: ~$0.002-0.005 per query)
  • Total: The most cost-efficient architecture in this comparison, since each query needs only one embedding call and one LLM call
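
For back-of-the-envelope estimates, per-query LLM cost is token counts times per-token rates. A hedged sketch (the rates below are illustrative placeholders; check current OpenAI pricing):
def estimate_query_cost(input_tokens: int, output_tokens: int,
                        in_rate_per_m: float, out_rate_per_m: float) -> float:
    """Returns the LLM cost in USD for one query, given per-million-token rates."""
    return input_tokens / 1e6 * in_rate_per_m + output_tokens / 1e6 * out_rate_per_m

# Example with the token counts from the Return Structure above and placeholder rates
print(estimate_query_cost(1523, 187, in_rate_per_m=2.50, out_rate_per_m=10.00))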

Quality

  • Good for clear queries: High precision when query intent is unambiguous
  • Baseline recall: May miss relevant documents with different phrasing
  • Context quality: Retrieved documents are semantically similar but may lack diversity

Source Files

  • Implementation: ~/workspace/source/src/rag/simple.py:105-146
  • Evaluation interface: ~/workspace/source/src/rag/simple.py:148-214
