The evaluation module provides standard information retrieval metrics for measuring the quality of document retrieval in RAG pipelines.
Overview
VectorDB implements five core retrieval metrics used in academic and industry benchmarking:
Recall@k - Fraction of relevant documents retrieved in top-k results
Precision@k - Fraction of top-k results that are relevant
MRR (Mean Reciprocal Rank) - Average of reciprocal ranks of first relevant document
NDCG@k (Normalized DCG) - Rank-aware metric normalized by ideal ranking
Hit Rate - Binary indicator if any relevant document appears in top-k
All metrics assume binary relevance (document is either relevant or not) and use 1-indexed ranks for mathematical correctness.
Recall@k
Measures the proportion of relevant documents that were successfully retrieved:
Recall@k = |relevant ∩ retrieved_top_k| / |relevant|
Range: 0.0 to 1.0 (higher is better)
Example: If there are 10 relevant documents and the top-5 results contain 3 of them, Recall@5 = 3/10 = 0.3
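A minimal standalone sketch of the formula above, reproducing the worked example. The function name `recall_at_k` is illustrative and is not the library's `compute_recall_at_k`; returning 0.0 when the relevant set is empty is an assumption.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0  # assumption: nothing to recall when no document is relevant
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# 10 relevant documents, 3 of which appear in the top 5
relevant = {f"rel{i}" for i in range(10)}
retrieved = ["rel0", "x1", "rel1", "x2", "rel2", "rel3"]
print(recall_at_k(retrieved, relevant, k=5))  # 0.3
```

Note that `rel3` sits at position 6 and is ignored at k=5, which is why the cutoff matters for this metric.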
Precision@k
Measures the proportion of retrieved documents that are actually relevant:
Precision@k = |relevant ∩ retrieved_top_k| / k
Range: 0.0 to 1.0 (higher is better)
Example: If the top-5 results contain 3 relevant documents, Precision@5 = 3/5 = 0.6
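The only change from recall is the denominator. A short sketch, assuming unique document IDs in the ranked list; `precision_at_k` is an illustrative name, not the library function.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k slots occupied by relevant documents."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k  # divides by k, following the formula above


retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = {"doc1", "doc3", "doc5", "doc9"}
print(precision_at_k(retrieved, relevant, k=5))  # 0.6
```

Because the sketch divides by k rather than by the number of results returned, a query that returns fewer than k documents is penalized for the empty slots.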
MRR (Mean Reciprocal Rank)
Measures how quickly the first relevant document appears:
MRR = mean(1 / rank_of_first_relevant)
Range: 0.0 to 1.0 (higher is better)
Example: If the first relevant document appears at position 3, the reciprocal rank for that query is 1/3 ≈ 0.333. MRR is the mean of these values over all queries.
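The per-query and averaged forms can be sketched as follows. The helper names are hypothetical, and scoring a query with no relevant document retrieved as 0.0 is an assumption consistent with the stated range.

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / (1-indexed rank of the first relevant document), or 0.0 if none."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0  # assumption: a miss contributes zero


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["a", "b", "c"], {"c"}),  # first relevant at rank 3 -> 1/3
    (["d", "e", "f"], {"d"}),  # first relevant at rank 1 -> 1
    (["g", "h", "i"], {"z"}),  # nothing relevant retrieved -> 0
]
print(round(mean_reciprocal_rank(runs), 3))  # 0.444
```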
NDCG@k (Normalized Discounted Cumulative Gain)
Rank-aware metric that gives more weight to relevant documents appearing earlier:
DCG@k = Σ(rel_i / log2(i + 2)) for i in range(k)
IDCG@k = DCG@k with all relevant docs ranked first (ideal)
NDCG@k = DCG@k / IDCG@k
Range: 0.0 to 1.0 (higher is better)
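Note: i is a 0-indexed position here, so the 1-indexed rank is i + 1 and the standard discount log2(rank + 1) becomes log2(i + 2). Using log2(i + 1) instead would divide by log2(1) = 0 at the first position.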
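A worked sketch of the three formulas with binary gains. The function names are illustrative, and capping the ideal ranking at min(|relevant|, k) hits is an assumption about how IDCG@k is computed.

```python
import math


def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain with 0-indexed positions: gain / log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    gains = [1.0 if doc_id in relevant else 0.0 for doc_id in retrieved[:k]]
    ideal = [1.0] * min(len(relevant), k)  # all relevant docs ranked first
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = {"doc1", "doc3", "doc7"}
# DCG  = 1/log2(2) + 1/log2(4)             = 1.5
# IDCG = 1/log2(2) + 1/log2(3) + 1/log2(4) ≈ 2.131
print(round(ndcg_at_k(retrieved, relevant, k=5), 3))  # 0.704
```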
Hit Rate
Binary success metric indicating whether any relevant document was retrieved:
Hit Rate = 1 if any relevant in top-k, else 0
Range: 0.0 to 1.0 (higher is better)
Aggregation: Computed as proportion of queries with at least one hit
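The per-query indicator and its aggregation can be sketched like this; `hit_at_k` is a hypothetical helper and the four queries are made-up data.

```python
def hit_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant document is in the top-k, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in retrieved[:k]) else 0.0


queries = [
    (["a", "b", "c"], {"b"}),       # hit
    (["d", "e", "f"], {"x"}),       # miss
    (["g", "h", "i"], {"g", "i"}),  # hit
    (["j", "k", "l"], {"l"}),       # hit at k=3, miss at k=2
]

hit_rate = sum(hit_at_k(r, rel, k=3) for r, rel in queries) / len(queries)
print(hit_rate)  # 0.75
```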
Data structures
RetrievalMetrics
Container for aggregated evaluation metrics:
@dataclass
class RetrievalMetrics:
    recall_at_k: float = 0.0     # Proportion of relevant docs retrieved
    precision_at_k: float = 0.0  # Proportion of retrieved docs that are relevant
    mrr: float = 0.0             # Mean reciprocal rank
    ndcg_at_k: float = 0.0       # Normalized discounted cumulative gain
    hit_rate: float = 0.0        # Proportion of queries with at least one hit
    num_queries: int = 0         # Number of queries evaluated
    k: int = 5                   # Cutoff value for top-k metrics
Usage:
from vectordb.utils.evaluation import RetrievalMetrics

metrics = RetrievalMetrics(
    recall_at_k=0.65,
    precision_at_k=0.80,
    mrr=0.72,
    ndcg_at_k=0.78,
    hit_rate=0.90,
    num_queries=100,
    k=5,
)

# Convert to dictionary for JSON serialization
metrics_dict = metrics.to_dict()
# {"recall@5": 0.65, "precision@5": 0.80, "mrr": 0.72, ...}
QueryResult
Result for a single query evaluation:
@dataclass
class QueryResult:
    query: str                     # Query string
    retrieved_ids: list[str]       # Retrieved document IDs (ranked)
    retrieved_contents: list[str]  # Retrieved document contents
    relevant_ids: set[str]         # Ground truth relevant IDs
    scores: list[float]            # Retrieval scores
Usage:
from vectordb.utils.evaluation import QueryResult

result = QueryResult(
    query="What is machine learning?",
    retrieved_ids=["doc1", "doc2", "doc3"],
    retrieved_contents=["ML is...", "AI involves...", "Deep learning..."],
    relevant_ids={"doc1", "doc4"},
    scores=[0.95, 0.87, 0.82],
)
EvaluationResult
Complete evaluation result for a retrieval pipeline:
@dataclass
class EvaluationResult:
    metrics: RetrievalMetrics         # Aggregated metrics
    query_results: list[QueryResult]  # Per-query results
    pipeline_name: str                # Pipeline identifier
    dataset_name: str                 # Dataset identifier
    config: dict[str, Any]            # Configuration used
Usage:
from vectordb.utils.evaluation import EvaluationResult

eval_result = EvaluationResult(
    metrics=metrics,
    query_results=query_results,
    pipeline_name="semantic_search_pinecone",
    dataset_name="triviaqa",
    config={"top_k": 5, "backend": "pinecone"},
)

# Convert to dictionary
result_dict = eval_result.to_dict()
Computing metrics
Single-query metrics
Compute metrics for individual queries:
from vectordb.utils.evaluation import (
    compute_recall_at_k,
    compute_precision_at_k,
    compute_mrr,
    compute_ndcg_at_k,
    compute_hit_rate,
)

retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = {"doc1", "doc3", "doc7"}
k = 5

recall = compute_recall_at_k(retrieved, relevant, k)
# 2/3 ≈ 0.667 (found 2 out of 3 relevant docs)

precision = compute_precision_at_k(retrieved, relevant, k)
# 2/5 = 0.4 (2 out of 5 retrieved docs are relevant)

mrr = compute_mrr(retrieved, relevant)
# 1/1 = 1.0 (first relevant doc at position 1)

ndcg = compute_ndcg_at_k(retrieved, relevant, k)
# Rank-aware score considering positions of relevant docs

hit = compute_hit_rate(retrieved, relevant, k)
# 1.0 (at least one relevant doc in top-5)
Aggregated metrics
Compute metrics across multiple queries:
from vectordb.utils.evaluation import QueryResult, evaluate_retrieval

query_results = [
    QueryResult(
        query="query1",
        retrieved_ids=["doc1", "doc2", "doc3"],
        retrieved_contents=[],
        relevant_ids={"doc1", "doc4"},
        scores=[0.95, 0.87, 0.82],
    ),
    QueryResult(
        query="query2",
        retrieved_ids=["doc5", "doc6", "doc7"],
        retrieved_contents=[],
        relevant_ids={"doc7", "doc8"},
        scores=[0.91, 0.88, 0.85],
    ),
    # ... more queries
]

# Compute aggregated metrics
metrics = evaluate_retrieval(query_results, k=5)

print(f"Recall@5: {metrics.recall_at_k:.3f}")
print(f"Precision@5: {metrics.precision_at_k:.3f}")
print(f"MRR: {metrics.mrr:.3f}")
print(f"NDCG@5: {metrics.ndcg_at_k:.3f}")
print(f"Hit Rate: {metrics.hit_rate:.3f}")
print(f"Queries: {metrics.num_queries}")
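To make the aggregation concrete, here is a standalone sketch that averages per-query values for the two example queries above. It assumes the aggregate is a plain mean over queries and applies the formulas from this page literally; it is not the implementation of `evaluate_retrieval`, whose handling of edge cases may differ.

```python
from statistics import mean

# (retrieved_ids, relevant_ids) pairs mirroring the two example queries
queries = [
    (["doc1", "doc2", "doc3"], {"doc1", "doc4"}),
    (["doc5", "doc6", "doc7"], {"doc7", "doc8"}),
]
k = 5

recalls, precisions, reciprocal_ranks, hits = [], [], [], []
for retrieved, relevant in queries:
    found = len(set(retrieved[:k]) & relevant)
    recalls.append(found / len(relevant))
    precisions.append(found / k)  # literal formula: divide by k, not by len(retrieved)
    # 1-indexed rank of the first relevant ID; a miss scores 0.0 (assumption)
    first = next((r for r, d in enumerate(retrieved, start=1) if d in relevant), None)
    reciprocal_ranks.append(1.0 / first if first else 0.0)
    hits.append(1.0 if found else 0.0)

summary = {
    "recall@5": mean(recalls),
    "precision@5": mean(precisions),
    "mrr": mean(reciprocal_ranks),
    "hit_rate": mean(hits),
    "num_queries": len(queries),
}
print(summary)
```

Under these assumptions the two queries give recall 0.5, precision 0.2, MRR ≈ 0.667, and hit rate 1.0. Precision is low here because only three results were returned while k is 5.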
Evaluation workflow
Typical evaluation pipeline:
from vectordb.dataloaders import DataloaderCatalog
from vectordb.utils.evaluation import QueryResult, evaluate_retrieval

# NOTE: db, documents, embed, and extract_evaluation_queries are placeholders
# for your own vector store client, corpus, embedding function, and query extraction.

# 1. Load dataset and evaluation queries
loader = DataloaderCatalog.create("triviaqa", split="test", limit=100)
dataset = loader.load()
eval_queries = extract_evaluation_queries(dataset)  # Custom function

# 2. Index documents in vector database
db.upsert(documents)

# 3. Execute evaluation queries
query_results = []
for eval_query in eval_queries:
    # Get query embedding
    query_vector = embed(eval_query.query)
    # Retrieve documents
    retrieved_docs = db.query(vector=query_vector, top_k=10)
    # Create QueryResult
    query_results.append(
        QueryResult(
            query=eval_query.query,
            retrieved_ids=[doc.id for doc in retrieved_docs],
            retrieved_contents=[doc.content for doc in retrieved_docs],
            relevant_ids=set(eval_query.relevant_doc_ids),
            scores=[doc.score for doc in retrieved_docs],
        )
    )

# 4. Compute metrics
metrics = evaluate_retrieval(query_results, k=10)

# 5. Display results
print("\nEvaluation Results:")
for key, value in metrics.to_dict().items():
    print(f"  {key}: {value}")
Choosing k values
The cutoff k determines how many top results are considered:
k=5 - Use case: strict precision requirements. Scenario: chat interfaces with a limited context window.
k=10 - Use case: balanced evaluation. Scenario: standard RAG pipelines with reranking.
k=20 - Use case: recall-focused evaluation. Scenario: multi-stage retrieval (retrieve many, rerank to few).
k=100 - Use case: first-stage retrieval quality. Scenario: evaluating the retriever before compression/filtering.
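The effect of the cutoff is easiest to see by scoring one ranked list at several values of k. The sketch below uses made-up document IDs and a hypothetical helper; it shows the usual pattern of recall rising and precision falling as k grows.

```python
retrieved = [f"d{i}" for i in range(1, 21)]  # ranked list of 20 IDs: d1 ... d20
relevant = {"d1", "d4", "d9", "d15"}


def recall_precision_at_k(retrieved, relevant, k):
    """Return (recall@k, precision@k) for one query under binary relevance."""
    found = len(set(retrieved[:k]) & relevant)
    return found / len(relevant), found / k


for k in (5, 10, 20):
    recall, precision = recall_precision_at_k(retrieved, relevant, k)
    print(f"k={k:<3} recall={recall:.2f} precision={precision:.2f}")
# k=5   recall=0.50 precision=0.40
# k=10  recall=0.75 precision=0.30
# k=20  recall=1.00 precision=0.20
```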
Binary vs graded relevance
VectorDB metrics assume binary relevance (document is relevant or not):
# Binary relevance
relevant_ids = {"doc1", "doc3", "doc7"}  # Either relevant or not
For graded relevance (documents have relevance scores 0-3), you would need custom implementations. The current NDCG implementation treats all relevant documents as having relevance score 1.0.
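One possible shape for such a custom implementation is sketched below. It is not part of VectorDB: the `grades` mapping, the function names, and the choice of a linear gain (the grade itself, matching the rel_i / log2(i + 2) form used on this page) are all assumptions. Some formulations use an exponential gain of 2^rel - 1 instead.

```python
import math


def graded_dcg(gains: list[float]) -> float:
    """DCG with linear gain and 0-indexed positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def graded_ndcg_at_k(retrieved: list[str], grades: dict[str, float], k: int) -> float:
    """NDCG@k where grades maps doc ID -> relevance score (missing IDs count as 0)."""
    gains = [grades.get(doc_id, 0.0) for doc_id in retrieved[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]  # best possible ordering
    idcg = graded_dcg(ideal)
    return graded_dcg(gains) / idcg if idcg > 0 else 0.0


retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
grades = {"doc1": 3, "doc3": 1, "doc7": 2}  # hypothetical 0-3 judgments
# DCG  = 3/log2(2) + 1/log2(4)             = 3.5
# IDCG = 3/log2(2) + 2/log2(3) + 1/log2(4) ≈ 4.762
print(round(graded_ndcg_at_k(retrieved, grades, k=5), 3))  # 0.735
```

With every grade set to 1 this reduces to the binary case described above.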
Comparing pipelines
Evaluate multiple retrieval strategies:
from vectordb.utils.evaluation import evaluate_retrieval

# semantic_results, hybrid_results, and reranked_results are placeholders:
# each is the list of QueryResult objects produced by running that pipeline
# on the same evaluation queries.
pipelines = [
    {"name": "semantic_search", "results": semantic_results},
    {"name": "hybrid_search", "results": hybrid_results},
    {"name": "with_reranking", "results": reranked_results},
]

for pipeline in pipelines:
    metrics = evaluate_retrieval(pipeline["results"], k=10)
    print(f"\n{pipeline['name']}:")
    print(f"  Recall@10: {metrics.recall_at_k:.3f}")
    print(f"  NDCG@10: {metrics.ndcg_at_k:.3f}")
    print(f"  MRR: {metrics.mrr:.3f}")
Example output:
semantic_search:
  Recall@10: 0.652
  NDCG@10: 0.701
  MRR: 0.745

hybrid_search:
  Recall@10: 0.712
  NDCG@10: 0.758
  MRR: 0.803

with_reranking:
  Recall@10: 0.718
  NDCG@10: 0.821
  MRR: 0.867
Interpreting metrics
Recall@k
High recall (greater than 0.8): System finds most relevant documents
Medium recall (0.5-0.8): System misses some relevant documents
Low recall (less than 0.5): System misses many relevant documents
Impact: Low recall means users may not see important information.
Precision@k
High precision (greater than 0.8): Most retrieved documents are relevant
Medium precision (0.5-0.8): Some irrelevant documents retrieved
Low precision (less than 0.5): Many irrelevant documents retrieved
Impact: Low precision means users see too much noise.
MRR
High MRR (greater than 0.8): First relevant result is at position 1 for most queries
Medium MRR (0.5-0.8): First relevant result typically at positions 1-2
Low MRR (less than 0.5): First relevant result typically at position 3 or later (a reciprocal rank of 1/3 ≈ 0.33), or missing for many queries
Impact: Low MRR means users must scroll to find relevant content.
NDCG@k
High NDCG (greater than 0.8): Relevant documents ranked highly
Medium NDCG (0.5-0.8): Relevant documents have mixed rankings
Low NDCG (less than 0.5): Relevant documents ranked poorly
Impact: Low NDCG means ranking quality is poor even if recall is high.
Hit Rate
High hit rate (greater than 0.9): Almost all queries retrieve at least one relevant document
Medium hit rate (0.7-0.9): Most queries successful
Low hit rate (less than 0.7): Many queries fail to find any relevant document
Impact: Low hit rate means many queries return zero useful results.
Statistical significance
For robust comparisons, consider:
Sample size: Evaluate on at least 100 queries for stable metrics
Multiple runs: Run evaluations multiple times if randomness is involved
Variance: Report standard deviation or confidence intervals
Domain coverage: Ensure evaluation queries cover all use cases
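One common way to report uncertainty is a percentile bootstrap over per-query scores. The sketch below uses only the standard library; `bootstrap_ci` is a hypothetical helper, the per-query hit values are made up, and the resample count and seed are arbitrary choices.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-query scores."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample queries with replacement and record the mean of each resample
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical per-query hit indicators for 100 queries (70 hits, 30 misses)
per_query_hits = [1.0] * 70 + [0.0] * 30
low, high = bootstrap_ci(per_query_hits)
print(f"hit rate = {mean(per_query_hits):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

When comparing two pipelines on the same queries, overlapping intervals suggest the difference may not be reliable at the current sample size.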
Best practices
Use multiple metrics - Don’t rely on a single metric. Precision and recall trade off, so evaluate both.
Choose appropriate k - Match k to your application’s context window (e.g., k=5 for chat, k=20 for reranking).
Segment by difficulty - Analyze performance on easy and hard queries separately.
Track over time - Monitor metrics across model/config changes to detect regressions.
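Tracking over time can be as simple as comparing each run's metrics dictionary against a stored baseline. In this sketch the helper name, the tolerance, and the numbers are illustrative, and the key names are assumed to follow the `recall@k` style shown for `to_dict()`.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Return metrics that dropped by more than `tolerance`, with the size of the drop."""
    drops = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is not None and old - new > tolerance:
            drops[name] = round(old - new, 4)
    return drops


baseline = {"recall@10": 0.712, "ndcg@10": 0.758, "mrr": 0.803}
current = {"recall@10": 0.718, "ndcg@10": 0.731, "mrr": 0.799}
print(find_regressions(baseline, current))  # {'ndcg@10': 0.027}
```

The tolerance absorbs small run-to-run noise; a value informed by the variance of your own evaluations is preferable to a fixed constant.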
Integration with pipelines
Evaluation metrics integrate with feature modules:
# haystack/semantic_search/search/pinecone.py
from vectordb.utils.evaluation import QueryResult, evaluate_retrieval

query_results = []
for query in eval_queries:
    results = pipeline.run(query)
    query_results.append(
        QueryResult(
            query=query.query,
            retrieved_ids=[doc.id for doc in results["documents"]],
            retrieved_contents=[doc.content for doc in results["documents"]],
            relevant_ids=set(query.relevant_doc_ids),
            scores=[doc.score for doc in results["documents"]],
        )
    )

metrics = evaluate_retrieval(query_results, k=10)
print(f"Pipeline evaluation: {metrics.to_dict()}")
This consistent evaluation approach allows fair comparison across different backends, frameworks, and retrieval strategies.