
Overview

Simple Semantic RAG is the baseline retrieval-augmented generation architecture that uses pure semantic search to find relevant documents in a ChromaDB vector store. It embeds the user’s query and retrieves the most semantically similar documents based on vector similarity. This approach is implemented in src/rag/simple.py and serves as the foundation for comparison with more advanced retrieval strategies.

How It Works

The Simple Semantic RAG pipeline follows these steps:
  1. Query Embedding: The user’s question is embedded using OpenAI’s text-embedding-3-small model
  2. Semantic Retrieval: ChromaDB performs a similarity search to find the top-k most similar document chunks
  3. Context Formatting: Retrieved documents are formatted with metadata (source, page number)
  4. Answer Generation: An LLM generates the final answer based on the retrieved context
The retriever is configured to return the top 5 most similar documents by default (k=5).
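
To make step 2 concrete, here is a minimal, self-contained sketch of the similarity ranking that ChromaDB performs internally, using cosine similarity over toy vectors (the vectors and chunk descriptions are illustrative, not taken from the real index):
import numpy as np

# Toy 4-dimensional "embeddings" for three document chunks (illustrative only;
# real text-embedding-3-small vectors have 1536 dimensions)
chunk_vectors = np.array([
    [0.1, 0.9, 0.2, 0.0],  # chunk about labor symptoms
    [0.8, 0.1, 0.1, 0.3],  # chunk about nutrition
    [0.2, 0.8, 0.3, 0.1],  # chunk about preterm labor
])
query_vector = np.array([0.15, 0.85, 0.25, 0.05])

# Cosine similarity between the query and every chunk
sims = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)

# Indices of the top-k chunks, most similar first (k=2 here; the pipeline uses k=5)
top_k = np.argsort(sims)[::-1][:2]
print(top_k, sims[top_k])  # -> [0 2] with their similarity scores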

Key Features

  • Pure semantic search: Relies entirely on vector similarity for retrieval
  • Simple and fast: No query preprocessing or result fusion steps
  • Consistent embeddings: Uses the same embedding model for indexing and retrieval
  • Ordered by relevance: Documents are naturally ranked by semantic similarity

Implementation Details

Retriever Configuration

# Imports (package paths may vary by LangChain version)
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Configure OpenAI embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Load ChromaDB vector store
vectorstore = Chroma(
    persist_directory=str(chroma_db_dir),
    embedding_function=embeddings,
    collection_name="guia_embarazo_parto",
)

# Configure the semantic retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
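
Once configured, the retriever can be invoked directly. A quick usage sketch (assuming the vector store has been populated; the source and page metadata keys follow the context formatting described above):
docs = retriever.invoke("¿Cuáles son los síntomas del parto prematuro?")
for doc in docs:
    print(doc.metadata.get("source"), doc.metadata.get("page"))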

Core Processing Function

The process_semantic_query() function handles the complete pipeline:
from typing import Any, Dict, Optional

from langchain_openai import ChatOpenAI

def process_semantic_query(query: str, custom_llm: Optional[ChatOpenAI] = None) -> Dict[str, Any]:
    """
    Processes a query using the simple semantic RAG pipeline.
    
    Args:
        query (str): The user's question.
        custom_llm (Optional[ChatOpenAI]): Custom LLM to use. If None, the module-level default llm is used.
    
    Returns:
        Dict[str, Any]: A dictionary with the final answer, contexts, and detailed metrics.
    """
    # 1. Retrieve similar documents
    retrieved_docs = retriever.invoke(query)
    
    # 2. Format context
    formatted_context = format_docs(retrieved_docs)
    
    # 3. Generate final answer
    current_llm = custom_llm if custom_llm else llm
    response = current_llm.invoke(qa_prompt.format_messages(
        context=formatted_context,
        question=query
    ))
    
    # 4. Return response and metrics
    return {
        'answer': response.content,
        'contexts': [doc.page_content for doc in retrieved_docs],
        'retrieved_documents': retrieved_docs,
        'metrics': {...}
    }
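
The format_docs() helper referenced above is not shown here. A minimal sketch consistent with the metadata formatting described in the pipeline steps (the actual implementation in src/rag/simple.py may differ):
def format_docs(docs) -> str:
    """Concatenates retrieved chunks, labeling each with its source and page."""
    parts = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        parts.append(f"[Document {i}] (source: {source}, page: {page})\n{doc.page_content}")
    return "\n\n".join(parts)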

Answer Prompt Template

The system uses a structured prompt that emphasizes:
  • Context-only answers: Base responses exclusively on provided medical context
  • Relevance ordering: Prioritize the first few documents as most relevant
  • Integrated responses: Provide direct, well-written paragraph answers
  • Spanish language: All medical answers are in Spanish
qa_template = """
You are a medical expert specializing in pregnancy and childbirth. 
Your task is to analyze the provided medical context and answer the user's question accurately and concisely.

STRICT INSTRUCTIONS:
1.  **Base your answer exclusively on the information within the MEDICAL CONTEXT section.**
2.  **The context is ordered by relevance.** Give the highest priority to the first few documents.
3.  **Provide a direct and integrated answer.** Start with a direct answer to the question.
4.  **If the context does not contain enough information to answer the question, state that clearly.**
5.  **Always answer in Spanish.**

MEDICAL CONTEXT (ordered by relevance):
{context}

QUESTION: {question}

DETAILED MEDICAL ANSWER:
"""

Usage with query_for_evaluation()

The query_for_evaluation() function provides a standardized interface for benchmark evaluation:
from src.rag.simple import query_for_evaluation

# Basic usage with default model (gpt-4o)
result = query_for_evaluation(
    question="¿Cuáles son los síntomas del parto prematuro?"
)

# With custom model name
result = query_for_evaluation(
    question="¿Qué es la preeclampsia?",
    llm_model="gpt-4o-mini"
)

# With custom LLM instance
from langchain_openai import ChatOpenAI
custom_llm = ChatOpenAI(model_name="gpt-4o", temperature=0.2)
result = query_for_evaluation(
    question="¿Cuándo debo ir al hospital?",
    custom_llm=custom_llm
)

Return Structure

{
    "question": str,           # Original question
    "answer": str,             # Generated answer
    "contexts": List[str],     # Retrieved document contents
    "source_documents": List,  # Full document objects
    "metadata": {
        "num_contexts": 5,
        "retrieval_method": "semantic_only",
        "llm_model": "gpt-4o",
        "provider": "openai",
        "embedding_model": "text-embedding-3-small",
        "execution_time": 2.34,
        "input_tokens": 1523,
        "output_tokens": 187,
        "total_cost": 0.002156,
        "usage_source": "provider",
        "cost_source": "calculated"
    }
}
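
A typical way to consume this structure in an evaluation script (field names as documented above):
result = query_for_evaluation(question="¿Qué es la preeclampsia?")

print(result["answer"])
print(f"{result['metadata']['num_contexts']} contexts, "
      f"{result['metadata']['execution_time']:.2f}s, "
      f"${result['metadata']['total_cost']:.6f}")

for ctx in result["contexts"]:
    print(ctx[:100], "...")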

When to Use This Approach

Best For

  • Well-structured queries: Questions that naturally align with document content
  • Baseline comparison: Establishing performance benchmarks for more complex methods
  • Fast prototyping: Quick setup with minimal configuration
  • Dense semantic content: Documents where meaning is more important than exact keywords

Limitations

  • Keyword mismatches: May miss documents that contain the query's exact terms when their overall semantic framing differs from the query's
  • Query ambiguity: Short or vague queries may not embed well
  • Vocabulary gap: Struggles when query vocabulary differs significantly from document vocabulary
  • No diversification: Can return very similar documents without variety
Simple semantic search performs best when queries and documents use similar vocabulary and phrasing. Consider hybrid approaches for queries with specific medical terminology or acronyms.
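
As one illustration of the hybrid direction mentioned above (not part of src/rag/simple.py), LangChain's EnsembleRetriever can fuse the semantic retriever with keyword-based BM25; here all_chunks is assumed to be the list of indexed Document objects:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# BM25 over the same chunks covers exact terminology and acronyms
bm25_retriever = BM25Retriever.from_documents(all_chunks)
bm25_retriever.k = 5

hybrid_retriever = EnsembleRetriever(
    retrievers=[retriever, bm25_retriever],
    weights=[0.6, 0.4],  # illustrative weighting, not tuned
)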

Performance Characteristics

Speed

  • Fast retrieval: Single embedding + vector search
  • Low latency: ~1-3 seconds total for typical queries
  • No preprocessing overhead: Direct query-to-embedding conversion

Cost

  • Embedding cost: ~$0.00001 per query (text-embedding-3-small)
  • LLM cost: Depends on model (gpt-4o: ~$0.002-0.005 per query)
  • Total: The most cost-efficient architecture in this comparison, since each query needs only one embedding call and one LLM call
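
For back-of-the-envelope estimates, per-query LLM cost is token counts times per-token rates. A hedged sketch (the rates below are illustrative placeholders; check current OpenAI pricing):
def estimate_query_cost(input_tokens: int, output_tokens: int,
                        in_rate_per_m: float, out_rate_per_m: float) -> float:
    """Returns the LLM cost in USD for one query, given per-million-token rates."""
    return input_tokens / 1e6 * in_rate_per_m + output_tokens / 1e6 * out_rate_per_m

# Example with the token counts from the Return Structure above and placeholder rates
print(estimate_query_cost(1523, 187, in_rate_per_m=2.50, out_rate_per_m=10.00))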

Quality

  • Good for clear queries: High precision when query intent is unambiguous
  • Baseline recall: May miss relevant documents with different phrasing
  • Context quality: Retrieved documents are semantically similar but may lack diversity

Source Files

  • Implementation: ~/workspace/source/src/rag/simple.py:105-146
  • Evaluation interface: ~/workspace/source/src/rag/simple.py:148-214
