Five QA Dataset Pipelines for DSPy-Opt RAG Optimization

DSPy-Opt ships five ready-to-use RAG pipeline implementations, each targeting a different question-answering benchmark. Together they cover a range of QA tasks: single-hop lookups and false-premise debunking, multi-hop entity chaining, reasoning over biomedical abstracts, trivia factoid recall, and general-knowledge retrieval over cleaned Wikipedia articles. Every pipeline shares the same five-stage architecture and the same optimizer script layout, so switching datasets or adding a new one requires only minimal changes.

Supported Datasets

Dataset	HuggingFace ID	Description	Complexity Type
FreshQA (SealQA)	`vtllms/sealqa`	Dynamic QA benchmark with diverse question types and false-premise debunking	Single-hop
HotpotQA	`hotpotqa/hotpot_qa`	Multi-hop questions with strong supervision for supporting facts	Multi-hop
PubMedQA	`qiaojin/PubMedQA`	Biomedical QA based on PubMed abstracts	Biomedical
TriviaQA	`mandarjoshi/trivia_qa`	Question-answer-evidence triples authored by trivia enthusiasts	Trivia, factoid
Wikipedia	`wikimedia/wikipedia` + WikiQA	Large-scale cleaned Wikipedia articles with WikiQA pairs for QA	General knowledge

Shared Five-Stage Pipeline Architecture

Although every dataset uses a different module class and metadata schema, all five pipelines execute the same logical stages in forward(). Each stage is implemented as a reusable component drawn from dspy_opt.utils.

Query Rewriting

QueryRewriter expands synonyms, removes conversational noise, and produces a search-optimized version of the original question.

Sub-Query Generation

SubQueryGenerator decomposes the rewritten query into multiple self-contained sub-queries that can be executed in parallel, boosting retrieval coverage for complex questions.
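Because the sub-queries are self-contained, they can be dispatched concurrently. The following is a minimal sketch of that idea using a thread pool; `fake_search` and `run_in_parallel` are hypothetical names introduced for illustration and do not reflect SubQueryGenerator's or the retriever's actual interface.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_search(query: str) -> list[str]:
    # Hypothetical stand-in for a retrieval call; returns made-up passage IDs.
    return [f"{query}::passage-{i}" for i in range(2)]


def run_in_parallel(queries: list[str], search=fake_search, max_workers: int = 4) -> dict[str, list[str]]:
    # pool.map preserves input order, so results line up with the queries.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(search, queries))
    return dict(zip(queries, results))


sub_queries = ["Who directed the film?", "When was that director born?"]
hits = run_in_parallel(sub_queries)
```

Threads suit this case because retrieval calls are typically I/O-bound; whether the real pipeline uses threads, async calls, or sequential execution is not stated here.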

Metadata Extraction

MetadataExtractor calls an LLM with structured-output generation to parse each query against a dataset-specific JSON schema, producing typed filter values for Weaviate.
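To illustrate what "typed filter values" can mean in practice, the sketch below validates a JSON string, such as an LLM might return, against a small schema and keeps only fields that are present and correctly typed. The schema, the field names, and `parse_filters` are assumptions made for this example; the actual MetadataExtractor API and dataset schemas may differ.

```python
import json

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"title": str, "category": str, "year": int}


def parse_filters(raw_json: str, schema: dict[str, type] = SCHEMA) -> dict[str, object]:
    """Keep only schema fields whose values are present and correctly typed."""
    data = json.loads(raw_json)
    filters = {}
    for field, expected in schema.items():
        value = data.get(field)
        # bool is a subclass of int, so reject it explicitly
        # (no field in this example schema is boolean).
        if value is None or isinstance(value, bool) or not isinstance(value, expected):
            continue
        filters[field] = value
    return filters


# "category" is null, "year" has the wrong type, and "extra" is not in the schema.
llm_output = '{"title": "Apollo 11", "category": null, "year": "1969", "extra": 1}'
filters = parse_filters(llm_output)
```

Dropping mistyped or missing fields rather than raising keeps retrieval running with fewer filters, which is one possible design choice among several.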

Hybrid Retrieval

WeaviateRetriever performs hybrid search (vector + keyword) on the main query and all sub-queries, applying the extracted metadata as filters. Results are aggregated and deduplicated.
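Aggregation and deduplication across the main query and its sub-queries can be sketched as follows. This is an illustrative approach, not WeaviateRetriever's actual logic: the `Passage` type, the `doc_id` key, and the keep-the-best-score rule are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc_id: str   # hypothetical unique identifier for a retrieved chunk
    text: str
    score: float


def merge_results(result_lists: list[list[Passage]]) -> list[Passage]:
    # Keep the highest-scoring copy of each doc_id across all result lists.
    best: dict[str, Passage] = {}
    for results in result_lists:
        for passage in results:
            current = best.get(passage.doc_id)
            if current is None or passage.score > current.score:
                best[passage.doc_id] = passage
    # Return unique passages, best score first.
    return sorted(best.values(), key=lambda p: p.score, reverse=True)


main_hits = [Passage("a", "alpha", 0.9), Passage("b", "beta", 0.4)]
sub_hits = [Passage("b", "beta", 0.7), Passage("c", "gamma", 0.5)]
merged = merge_results([main_hits, sub_hits])
```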

Answer Generation

dspy.ChainOfThought receives the unique retrieved passages and generates a final answer together with an explicit reasoning trace.
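The five stages above can be wired together roughly as shown below. This is a framework-free sketch with each stage passed in as a plain callable; `run_pipeline` and its parameter names are hypothetical, and the real forward() methods operate on DSPy modules rather than lambdas. Passing the original question to the answer stage is also an assumption.

```python
from typing import Callable


def run_pipeline(
    question: str,
    rewrite: Callable[[str], str],
    split: Callable[[str], list[str]],
    extract: Callable[[str], dict],
    retrieve: Callable[[str, dict], list[str]],
    answer: Callable[[str, list[str]], str],
) -> str:
    rewritten = rewrite(question)            # 1. query rewriting
    sub_queries = split(rewritten)           # 2. sub-query generation
    passages: list[str] = []
    for query in [rewritten, *sub_queries]:
        filters = extract(query)             # 3. metadata extraction per query
        for passage in retrieve(query, filters):   # 4. retrieval with filters
            if passage not in passages:      # deduplicate, keep first-seen order
                passages.append(passage)
    return answer(question, passages)        # 5. answer generation


# Stub stages, for illustration only.
answer_text = run_pipeline(
    "um, who wrote Dune?",
    rewrite=lambda q: q.replace("um, ", ""),
    split=lambda q: [q + " author"],
    extract=lambda q: {},
    retrieve=lambda q, f: ["p1", "p2"] if "author" in q else ["p1"],
    answer=lambda q, ps: f"{len(ps)} passages",
)
```

Injecting the stages as callables makes it easy to swap in stubs for testing, which mirrors how the shared architecture lets each dataset supply its own schema and module class.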

Optimizer Scripts Pattern

All five dataset directories follow the same file-naming convention. Each directory contains an indexing script, the pipeline module, one script per supported DSPy optimizer, and a dedicated evaluation script:

<dataset>_indexing.py                    # Index documents to Weaviate
<dataset>_rag_module.py                  # Pipeline class definition
<dataset>_rag_mipro.py                   # MIPROv2 optimization
<dataset>_rag_copro.py                   # COPRO optimization
<dataset>_rag_bootstrap_few_shot.py      # BootstrapFewShot optimization
<dataset>_rag_simba.py                   # SIMBA optimization
<dataset>_rag_gepa.py                    # GEPA optimization
<dataset>_rag_evaluation.py              # DeepEval evaluation
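The convention is regular enough to generate programmatically, for example to check that a new dataset directory is complete. The helper below is a hypothetical sketch; `expected_scripts` is not part of DSPy-Opt, and the `hotpotqa` prefix is an assumed value for `<dataset>`.

```python
# Optimizer suffixes taken from the listing above.
OPTIMIZERS = ["mipro", "copro", "bootstrap_few_shot", "simba", "gepa"]


def expected_scripts(dataset: str) -> list[str]:
    # Build the eight file names the naming convention implies for one dataset.
    names = [f"{dataset}_indexing.py", f"{dataset}_rag_module.py"]
    names += [f"{dataset}_rag_{opt}.py" for opt in OPTIMIZERS]
    names.append(f"{dataset}_rag_evaluation.py")
    return names


scripts = expected_scripts("hotpotqa")  # assumed dataset prefix
```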

Each script reads its hyperparameters from a matching _config.yml file that lives in the same directory. All scripts must be run from inside the dataset subdirectory so that relative config paths resolve correctly.
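The working-directory requirement follows from how relative paths resolve. The sketch below shows the effect with pure path arithmetic; the `<script stem>_config.yml` naming and the `/repo/hotpotqa` layout are assumptions for illustration, since the exact config file names are not spelled out here.

```python
from pathlib import PurePosixPath


def config_path_for(script: str) -> PurePosixPath:
    # Assumed convention: "<script stem>_config.yml" next to the script.
    path = PurePosixPath(script)
    return path.with_name(f"{path.stem}_config.yml")


def resolve_from_cwd(relative: PurePosixPath, cwd: str) -> PurePosixPath:
    # A relative path is interpreted against the current working directory.
    return PurePosixPath(cwd) / relative


config = config_path_for("hotpotqa_rag_mipro.py")
inside = resolve_from_cwd(config, "/repo/hotpotqa")   # run from the dataset directory
outside = resolve_from_cwd(config, "/repo")           # run from the repository root
```

Run from the dataset directory, the relative path lands next to the script; run from anywhere else, it points at a location where the config file does not exist.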

All pipelines share the same five DeepEval metrics: Answer Relevancy, Faithfulness, Contextual Precision, Contextual Recall, and Contextual Relevancy. Thresholds are configured per dataset in each _evaluation_config.yml.
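Per-dataset thresholds amount to a pass/fail check on each of the five metric scores. The sketch below illustrates that check without calling DeepEval; the metric keys, the threshold values, and the rule that a score passes when it is greater than or equal to its threshold are assumptions for this example.

```python
METRICS = [
    "answer_relevancy",
    "faithfulness",
    "contextual_precision",
    "contextual_recall",
    "contextual_relevancy",
]


def failing_metrics(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    # A metric fails when its score is below the dataset's threshold;
    # a missing score is treated as 0.0.
    return [m for m in METRICS if scores.get(m, 0.0) < thresholds[m]]


# Hypothetical thresholds, as might be loaded from an evaluation config.
thresholds = {m: 0.7 for m in METRICS}
scores = {
    "answer_relevancy": 0.91,
    "faithfulness": 0.70,
    "contextual_precision": 0.65,
    "contextual_recall": 0.80,
}
failed = failing_metrics(scores, thresholds)
```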

Dataset Pages

FreshQA

Single-hop and false-premise debunking on the SealQA longseal benchmark. Uses title and category metadata filters.

HotpotQA

Multi-hop reasoning over distractor documents. Uses title and category metadata filters with entity chaining.

PubMedQA

Biomedical domain QA from PubMed abstracts. Uses a rich six-field metadata schema for high-precision filtering.

TriviaQA

Factoid and trivia questions with typed metadata including an enum content_type and numeric year field.

Wikipedia

General knowledge QA over cleaned Wikipedia articles indexed from wikimedia/wikipedia with WikiQA pairs.
