## Extension Points

The benchmark provides several extension points for researchers:

- **RAG Architectures**: Add novel retrieval strategies (`~/workspace/source/src/rag/`)
- **Model Integration**: Test new LLMs and SLMs (`~/workspace/source/src/common/model_provider.py`)
- **Evaluation Metrics**: Extend RAGAS evaluation (`~/workspace/source/src/evaluation/ragas_evaluator.py`)
- **Data Sources**: Integrate new medical corpora (`~/workspace/source/data/`)
## Adding New Data Sources
The benchmark uses a medical corpus on pregnancy and childbirth. To add new data sources:

### Prepare Document Chunks

Create a JSON file in `data/chunks/` with the following structure. Each chunk should contain:

- `content`: The text content of the chunk
- `source`: Original document filename
- `page_number`: Page number in the source document
- `chunk_id`: Unique identifier for the chunk
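For illustration, a minimal Python sketch that writes chunks in this format to the file the embedding step expects (the example values are hypothetical):

```python
import json

# Hypothetical example chunk; real chunks come from your own
# document-splitting pipeline.
chunks = [
    {
        "content": "Folic acid supplementation is recommended before conception...",
        "source": "prenatal_guidelines.pdf",
        "page_number": 12,
        "chunk_id": "prenatal_guidelines_p12_c01",
    },
]

# Write the chunks to data/chunks/ so the embedding step can pick them up.
with open("data/chunks/chunks_final.json", "w", encoding="utf-8") as f:
    json.dump(chunks, f, ensure_ascii=False, indent=2)
```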
### Generate Embeddings
Run the embedding generation script to create vector representations. This will:

- Load chunks from `data/chunks/chunks_final.json`
- Generate embeddings using OpenAI’s `text-embedding-3-small`
- Store them in ChromaDB at `data/embeddings/chroma_db/`
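As a rough sketch of what this step involves, assuming the `openai` and `chromadb` Python packages and an `OPENAI_API_KEY` in the environment (the collection name below is hypothetical):

```python
import json

import chromadb
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Load the prepared chunks.
with open("data/chunks/chunks_final.json", encoding="utf-8") as f:
    chunks = json.load(f)

# Embed the chunk contents (batch the input for large corpora).
response = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=[chunk["content"] for chunk in chunks],
)
embeddings = [item.embedding for item in response.data]

# Persist documents, embeddings, and metadata in ChromaDB.
chroma_client = chromadb.PersistentClient(path="data/embeddings/chroma_db")
collection = chroma_client.get_or_create_collection(name="medical_corpus")  # hypothetical name
collection.add(
    ids=[chunk["chunk_id"] for chunk in chunks],
    documents=[chunk["content"] for chunk in chunks],
    embeddings=embeddings,
    metadatas=[
        {"source": chunk["source"], "page_number": chunk["page_number"]}
        for chunk in chunks
    ],
)
```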
### Update Collection Name
If using a different medical domain, update the collection name in RAG implementations:
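Where exactly the name is set varies by implementation; a minimal sketch, assuming the ChromaDB client API used above (both collection names are hypothetical):

```python
import chromadb

chroma_client = chromadb.PersistentClient(path="data/embeddings/chroma_db")

# Hypothetical: point the retriever at your new domain's collection
# instead of the default pregnancy-and-childbirth collection.
collection = chroma_client.get_or_create_collection(name="cardiology_corpus")
```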
## Modifying Retrieval Parameters
Each RAG implementation allows retrieval parameter tuning.

### Adjusting Number of Retrieved Documents
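The parameter name varies by implementation; a minimal sketch using ChromaDB's query API, where `n_results` controls how many chunks are retrieved (client setup and collection name as in the hypothetical examples above):

```python
import chromadb
from openai import OpenAI

openai_client = OpenAI()
collection = chromadb.PersistentClient(path="data/embeddings/chroma_db").get_collection(
    name="medical_corpus"  # hypothetical name
)

# Embed the question with the same model used for the corpus.
question = "What supplements are recommended during pregnancy?"
query_embedding = openai_client.embeddings.create(
    model="text-embedding-3-small", input=[question]
).data[0].embedding

# n_results controls how many chunks are retrieved per question:
# raise it for broader context, lower it for tighter precision.
results = collection.query(query_embeddings=[query_embedding], n_results=5)
```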
### Tuning Hybrid Retrieval Weights
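A common formulation is a convex combination of normalized dense (embedding) and sparse (e.g. BM25) scores; the sketch below is illustrative, and its parameter names are not necessarily the repo's:

```python
def hybrid_score(dense_score: float, sparse_score: float, alpha: float = 0.7) -> float:
    """Blend dense and sparse relevance scores for one document.

    alpha=1.0 relies only on the dense (embedding) score;
    alpha=0.0 relies only on the sparse (keyword/BM25) score.
    Both scores are assumed to be normalized to [0, 1].
    """
    return alpha * dense_score + (1 - alpha) * sparse_score
```

Sweeping `alpha` between 0 and 1 is a cheap way to see whether your question set benefits more from semantic or keyword matching.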
### Modifying Temperature for Generation
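A sketch assuming an OpenAI-style chat completion call; the benchmark's `model_provider.py` may wrap this differently, and the model name is a placeholder:

```python
from openai import OpenAI

openai_client = OpenAI()
prompt = "Context: <retrieved chunks>\n\nQuestion: What supplements are recommended?"

response = openai_client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use the model you are benchmarking
    temperature=0.0,      # low temperature favors deterministic, grounded answers
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": prompt},
    ],
)
answer = response.choices[0].message.content
```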
## Experimentation Best Practices
### 1. Version Control Your Experiments
Create separate branches for experimental changes, e.g. `git checkout -b experiment/hybrid-weights`.

### 2. Track Configuration Changes

Document experiments in your evaluation metadata.
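For example (the field names and paths below are illustrative, not the repo's actual schema):

```python
import json

experiment_metadata = {
    "experiment_id": "hybrid-alpha-0.5",
    "description": "Lower the dense weight to favor keyword matches",
    "config": {
        "rag_architecture": "hybrid",
        "n_results": 5,
        "alpha": 0.5,
        "temperature": 0.0,
        "embedding_model": "text-embedding-3-small",
    },
    "baseline": "hybrid-alpha-0.7",
}

# Store the metadata alongside the run's results.
with open("results/hybrid-alpha-0.5/metadata.json", "w") as f:
    json.dump(experiment_metadata, f, indent=2)
```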
### 3. Run Comparative Evaluations

Always compare against the baseline configuration.
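A minimal sketch of the comparison step; the scores below are placeholders for what two runs of the RAGAS evaluator in `src/evaluation/ragas_evaluator.py` would produce:

```python
# Placeholder metric dictionaries; in practice these come from two
# evaluation runs (baseline vs. experimental configuration).
baseline = {"faithfulness": 0.91, "answer_relevancy": 0.84,
            "context_precision": 0.78, "context_recall": 0.80}
variant = {"faithfulness": 0.93, "answer_relevancy": 0.83,
           "context_precision": 0.81, "context_recall": 0.85}

for metric, base_score in baseline.items():
    delta = variant[metric] - base_score
    print(f"{metric}: {base_score:.3f} -> {variant[metric]:.3f} ({delta:+.3f})")
```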
### 4. Analyze Results Systematically
Compare metrics across configurations:

- **Faithfulness**: Did more context reduce hallucinations?
- **Answer Relevancy**: Are answers still focused on the question?
- **Context Precision**: Is the retrieved context more relevant?
- **Context Recall**: Are we capturing all necessary information?
## Contributing to the Project

We welcome research contributions that advance RAG techniques for medical Q&A.

### Contribution Areas
- **Novel RAG Architectures**: Implement and evaluate new retrieval strategies
- **Model Integration**: Add domain-specialized medical language models
- **Evaluation Extensions**: Propose additional metrics or analysis methods
- **Results & Analysis**: Contribute comparative insights and visualizations
### Contribution Workflow
1. **Implement Your Changes**: Follow the existing code structure and patterns. Add comprehensive docstrings.
2. **Document Your Methodology**: Include:
   - Hypothesis and motivation
   - Implementation details
   - Experimental setup
   - Results summary and analysis
## Research Guidelines
### Reproducibility
- **Fix Random Seeds**: Set random seeds for reproducible results (see the sketch after this list)
- **Document Dependencies**: Update `requirements.txt` with new packages
- **Save Configurations**: Store all hyperparameters in configuration files
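A minimal sketch, assuming NumPy is among your dependencies; seed any other stochastic library you use the same way:

```python
import random

import numpy as np

SEED = 42

random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy's global RNG
# If you use PyTorch or another framework, seed it here too,
# e.g. torch.manual_seed(SEED).
```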
### Statistical Rigor
- Run multiple trials to account for variance
- Report mean and standard deviation for metrics
- Use appropriate statistical tests for comparisons (see the sketch below)
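For example, aggregating one metric across trials and testing a variant against the baseline (assuming NumPy and SciPy are available; the per-trial scores are placeholders):

```python
import numpy as np
from scipy import stats

# Placeholder per-trial faithfulness scores from repeated evaluation runs.
baseline_trials = np.array([0.90, 0.92, 0.89, 0.91, 0.90])
variant_trials = np.array([0.93, 0.94, 0.92, 0.95, 0.93])

print(f"baseline: {baseline_trials.mean():.3f} +/- {baseline_trials.std(ddof=1):.3f}")
print(f"variant:  {variant_trials.mean():.3f} +/- {variant_trials.std(ddof=1):.3f}")

# Welch's t-test for a difference in means between the two configurations.
t_stat, p_value = stats.ttest_ind(variant_trials, baseline_trials, equal_var=False)
print(f"p-value: {p_value:.4f}")
```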
### Ethical Considerations

When extending to new medical domains:

- Ensure data sources are properly licensed
- Validate medical accuracy with domain experts
- Include appropriate disclaimers about clinical use
- Respect patient privacy in any real-world data
## Next Steps

- **Adding RAG Architectures**: Learn how to implement new RAG strategies
- **Integrating Models**: Add new LLMs to the model registry
- **Customizing Metrics**: Extend evaluation with custom metrics
- **API Reference**: Explore the complete API documentation
