The Wikipedia pipeline provides broad general-knowledge QA by combining two separate HuggingFace datasets. The wikimedia/wikipedia dataset supplies the document corpus — cleaned, full-text English Wikipedia articles from the November 2023 snapshot — which is indexed into a Wikipedia Weaviate collection. The microsoft/wiki_qa dataset then provides the question-answer pairs used for pipeline optimization and evaluation. A lightweight title / category metadata schema guides retrieval filtering across the large and topically diverse article corpus.
Dataset
| Property | Value |
|---|---|
| HuggingFace ID (indexing) | wikimedia/wikipedia |
| Subset (indexing) | 20231101.en |
| Split (indexing) | train |
| HuggingFace ID (QA pairs) | microsoft/wiki_qa |
| Subset (QA pairs) | default |
| Split (QA pairs) | train (90 % / 10 % train-test split) |
| Weaviate collection | Wikipedia |
| Complexity type | General knowledge QA |
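The 90 % / 10 % split listed in the table can be pictured with a small standard-library sketch. The pipeline scripts may well delegate this to the HuggingFace `datasets` library; the helper name `split_train_test`, the fixed seed, and the placeholder rows below are assumptions for illustration only.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    items: Sequence[T], test_size: float = 0.1, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically and hold out `test_size` of the items."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same split
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical rows standing in for microsoft/wiki_qa question-answer pairs.
qa_pairs = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(100)]
train_set, test_set = split_train_test(qa_pairs, test_size=0.1)
print(len(train_set), len(test_set))
```

Fixing the seed keeps the held-out examples stable between optimization and evaluation runs, which matters when comparing optimizers against each other.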
Pipeline Class
The WikipediaRAG class is defined in wikipedia_rag_module.py and subclasses dspy.Module. It applies the same five-stage architecture shared across all dataset pipelines, configured with the title / category metadata schema appropriate for the broad Wikipedia article corpus.
from dspy_opt.wikipedia.wikipedia_rag_module import WikipediaRAG
class WikipediaRAG(dspy.Module):
    def __init__(
        self,
        query_rewriter: QueryRewriter,
        sub_query_generator: SubQueryGenerator,
        metadata_extractor: MetadataExtractor,
        metadata_schema: Dict[str, Any],
        weaviate_retriever: WeaviateRetriever,
        embedding_model: SentenceTransformer,
        top_k: int = 3,
    ): ...

    def forward(self, question: str) -> dspy.Prediction:
        """Execute the complete RAG pipeline."""
forward() Return Fields
forward() returns a dspy.Prediction containing the following fields:
| Field | Description |
|---|---|
| question | The original input question (passed through unchanged) |
| rewritten_query | Search-optimized version of the question produced by QueryRewriter |
| sub_queries | List of decomposed sub-queries from SubQueryGenerator |
| retrieved_context | Deduplicated list of passages returned by WeaviateRetriever |
| answer | Concise answer generated by dspy.ChainOfThought |
| reasoning | Explanation of how the answer was derived |
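The `retrieved_context` field is described as a deduplicated list. One plausible way to build such a list, when the rewritten query and each sub-query return overlapping passages, is an order-preserving merge. The function name `merge_passages` and the sample passages are hypothetical; the actual logic inside `WikipediaRAG.forward()` is not shown here and may differ.

```python
from typing import Iterable, List


def merge_passages(results_per_query: Iterable[Iterable[str]]) -> List[str]:
    """Flatten per-query results, dropping repeats and keeping first-seen order."""
    seen = set()
    merged: List[str] = []
    for passages in results_per_query:
        for passage in passages:
            key = passage.strip()
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Hypothetical results for one rewritten query and two sub-queries.
results = [
    ["Light travels at 299,792,458 m/s in a vacuum.", "The metre is defined via c."],
    ["The metre is defined via c."],
    ["Light travels at 299,792,458 m/s in a vacuum.", "c is a physical constant."],
]
retrieved_context = merge_passages(results)
print(len(retrieved_context))  # 3 unique passages
```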
The MetadataExtractor parses each query against the following JSON schema to produce Weaviate filter values:
| Field | Type | Description |
|---|---|---|
| title | string | The main title or name of the subject |
| category | string | Primary category or type of content |
metadata_schema:
  properties:
    title:
      type: "string"
      description: "The main title or name of the subject"
    category:
      type: "string"
      description: "Primary category or type of content"
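As a rough illustration of how extracted values could be checked against this schema before they are used as filters, the sketch below keeps only the fields the schema declares and discards empty or non-string values. `clean_metadata` is a hypothetical helper, not part of `MetadataExtractor`, and the extracted dictionary is made up for the example.

```python
from typing import Any, Dict

METADATA_SCHEMA: Dict[str, Any] = {
    "properties": {
        "title": {"type": "string", "description": "The main title or name of the subject"},
        "category": {"type": "string", "description": "Primary category or type of content"},
    }
}


def clean_metadata(extracted: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, str]:
    """Keep only schema fields whose extracted value is a non-empty string."""
    cleaned: Dict[str, str] = {}
    for field, spec in schema.get("properties", {}).items():
        value = extracted.get(field)
        if spec.get("type") == "string" and isinstance(value, str) and value.strip():
            cleaned[field] = value.strip()
    return cleaned


# Hypothetical LLM output: one usable field, one null, one field outside the schema.
extracted = {"title": " Speed of light ", "category": None, "year": 1983}
filters = clean_metadata(extracted, METADATA_SCHEMA)
print(filters)
```

Dropping unknown or empty fields means a partially successful extraction still narrows the search instead of producing a filter that matches nothing.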
The Wikipedia corpus is significantly larger than the other four datasets. Setting top_k to 5 and filtering on the title and category metadata helps keep retrieval latency reasonable and limits how much context is passed to the answer LLM.
Models
| Role | Model |
|---|---|
| Answer LLM | groq/qwen3-32b |
| Extractor LLM | groq/llama-3.3-70b-versatile |
| Embedding | Qwen/Qwen3-Embedding-0.6B |
| Evaluator LLM | groq/qwen3-32b |
Scripts
| Script | Description |
|---|---|
| wikipedia_indexing.py | Load wikimedia/wikipedia from HuggingFace, extract metadata, embed, and store in Weaviate |
| wikipedia_rag_module.py | Pipeline class definition, imported by the optimizer and evaluation scripts |
| wikipedia_rag_mipro.py | Run MIPROv2 optimization |
| wikipedia_rag_copro.py | Run COPRO optimization |
| wikipedia_rag_bootstrap_few_shot.py | Run BootstrapFewShot optimization |
| wikipedia_rag_simba.py | Run SIMBA optimization |
| wikipedia_rag_gepa.py | Run GEPA optimization |
| wikipedia_rag_evaluation.py | Evaluate the optimized pipeline with DeepEval metrics |
Configuration Files
| File | Description |
|---|---|
| wikipedia_indexing_config.yml | Indexing parameters: embedding model, metadata schema, collection name |
| wikipedia_rag_mipro_config.yml | MIPROv2 parameters: max_bootstrapped_demos, max_labeled_demos, auto |
| wikipedia_rag_copro_config.yml | COPRO parameters: breadth, depth, init_temperature |
| wikipedia_rag_bootstrap_few_shot_config.yml | BootstrapFewShot parameters: max_bootstrapped_demos, max_rounds |
| wikipedia_rag_simba_config.yml | SIMBA parameters: bsize, num_candidates, max_steps, max_demos |
| wikipedia_rag_gepa_config.yml | GEPA parameters: max_full_evals, reflection_minibatch_size, candidate_selection_strategy |
| wikipedia_rag_evaluation_config.yml | Evaluation settings and DeepEval metric thresholds |
MIPROv2 Configuration
answer_llm:
  model: "groq/qwen3-32b"
  api_key_env: "GROQ_API_KEY"
extractor_llm:
  model: "groq/llama-3.3-70b-versatile"
  api_key_env: "GROQ_API_KEY"
embedding:
  embedding_model: "Qwen/Qwen3-Embedding-0.6B"
  tokenizer_kwargs:
    padding_side: "left"
weaviate:
  url_env: "WEAVIATE_URL"
  api_key_env: "WEAVIATE_API_KEY"
  collection_name: "Wikipedia"
  top_k: 5
dataset:
  name: "microsoft/wiki_qa"
  subset: "default"
  split: "train"
  test_size: 0.1
optimizer:
  max_bootstrapped_demos: 3
  max_labeled_demos: 16
  auto: "medium"
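The `*_env` keys in this configuration name environment variables instead of holding secrets directly. A minimal sketch of how such keys could be resolved at load time is shown below. The helper `resolve_env_keys` and the convention of stripping the `_env` suffix are assumptions; the repository's own loader is not shown in this page and may work differently.

```python
import os
from typing import Any, Dict, Mapping


def resolve_env_keys(
    section: Mapping[str, Any], env: Mapping[str, str] = os.environ
) -> Dict[str, Any]:
    """Replace each `<name>_env` key with `<name>`, read from the environment."""
    resolved: Dict[str, Any] = {}
    for key, value in section.items():
        if key.endswith("_env"):
            if value not in env:
                raise KeyError(f"environment variable {value!r} is not set")
            resolved[key[: -len("_env")]] = env[value]
        else:
            resolved[key] = value
    return resolved


# The `weaviate` section from the config above, as a plain dict.
weaviate_cfg = {
    "url_env": "WEAVIATE_URL",
    "api_key_env": "WEAVIATE_API_KEY",
    "collection_name": "Wikipedia",
    "top_k": 5,
}
# A stand-in environment so the example runs without real credentials.
fake_env = {"WEAVIATE_URL": "https://example.invalid", "WEAVIATE_API_KEY": "secret"}
settings = resolve_env_keys(weaviate_cfg, env=fake_env)
print(sorted(settings))
```

Failing fast on a missing variable surfaces a misconfigured shell before any indexing or optimization work starts.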
Running the Pipeline
All scripts must be run from the wikipedia/ directory so that relative config file paths resolve correctly.
# Index Wikipedia articles into Weaviate
cd src/dspy_opt/wikipedia
python wikipedia_indexing.py
# Run MIPROv2 optimization
cd src/dspy_opt/wikipedia
python wikipedia_rag_mipro.py
# Run SIMBA optimization
cd src/dspy_opt/wikipedia
python wikipedia_rag_simba.py
# Run GEPA optimization
cd src/dspy_opt/wikipedia
python wikipedia_rag_gepa.py
# Evaluate the optimized pipeline
cd src/dspy_opt/wikipedia
python wikipedia_rag_evaluation.py
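The working-directory requirement comes from the config files being referenced by relative path. For comparison, a script can locate a config file next to its own source file, which makes the lookup independent of where the command is run. This is only a sketch of that alternative: `resolve_config` is a hypothetical helper, the script path is a placeholder, and nothing here implies the repository's scripts do this.

```python
from pathlib import Path


def resolve_config(config_name: str, script_file: str) -> Path:
    """Return the config path located next to `script_file`."""
    return Path(script_file).resolve().parent / config_name


# Inside a real script, `script_file` would be `__file__`; a placeholder path is used here.
config_path = resolve_config(
    "wikipedia_rag_mipro_config.yml",
    "src/dspy_opt/wikipedia/wikipedia_rag_mipro.py",
)
print(config_path.name)
```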
Programmatic Usage
import dspy
from sentence_transformers import SentenceTransformer
from dspy_opt.wikipedia.wikipedia_rag_module import WikipediaRAG
from dspy_opt.utils.metadata_extractor import MetadataExtractor
from dspy_opt.utils.query_rewriter import QueryRewriter
from dspy_opt.utils.sub_query_generator import SubQueryGenerator
from dspy_opt.utils.weaviate_retriever import WeaviateRetriever
# Configure LLMs
answer_lm = dspy.LM("groq/qwen3-32b", api_key="your-groq-api-key")
extractor_lm = dspy.LM("groq/llama-3.3-70b-versatile", api_key="your-groq-api-key")
dspy.configure(lm=answer_lm)
# Initialize components
query_rewriter = QueryRewriter()
sub_query_generator = SubQueryGenerator()
metadata_extractor = MetadataExtractor(extractor_llm=extractor_lm)
embedding_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
retriever = WeaviateRetriever(
    weaviate_url="your-weaviate-url",
    weaviate_api_key="your-weaviate-api-key",
    collection_name="Wikipedia",
    top_k=5,
)
metadata_schema = {
    "properties": {
        "title": {"type": "string", "description": "The main title or name of the subject"},
        "category": {"type": "string", "description": "Primary category or type of content"},
    }
}
# Build and run the pipeline
pipeline = WikipediaRAG(
    query_rewriter=query_rewriter,
    sub_query_generator=sub_query_generator,
    metadata_extractor=metadata_extractor,
    metadata_schema=metadata_schema,
    weaviate_retriever=retriever,
    embedding_model=embedding_model,
    top_k=5,
)
result = pipeline("What is the speed of light in a vacuum?")
print(result.answer)
print(result.reasoning)
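Beyond `answer` and `reasoning`, the other documented return fields can be useful when debugging retrieval. The sketch below gathers them from any prediction-like object. `summarize_prediction` is a hypothetical helper, and a `SimpleNamespace` with made-up values stands in for the `dspy.Prediction` so the example runs without an LLM or a Weaviate instance.

```python
from types import SimpleNamespace
from typing import Any, Dict

# Field names taken from the forward() return table.
RETURN_FIELDS = (
    "question",
    "rewritten_query",
    "sub_queries",
    "retrieved_context",
    "answer",
    "reasoning",
)


def summarize_prediction(prediction: Any) -> Dict[str, Any]:
    """Collect the documented fields, using None for any that are missing."""
    return {name: getattr(prediction, name, None) for name in RETURN_FIELDS}


# Stand-in for a real result; the values are illustrative only.
fake_result = SimpleNamespace(
    question="What is the speed of light in a vacuum?",
    rewritten_query="speed of light vacuum value",
    sub_queries=["speed of light definition"],
    retrieved_context=["Light travels at 299,792,458 m/s in a vacuum."],
    answer="299,792,458 metres per second",
)
summary = summarize_prediction(fake_result)
for name, value in summary.items():
    print(f"{name}: {value}")
```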