Benchmarking allows you to measure and compare retrieval quality across different vector databases, embedding models, and pipeline configurations. VectorDB provides standardized evaluation utilities and datasets to support rigorous, reproducible benchmarking.
Evaluation metrics
VectorDB includes five core retrieval metrics:
| Metric | What it measures | When to optimize |
| --- | --- | --- |
| Recall@k | Fraction of relevant documents retrieved in top-k | When missing relevant documents is costly |
| Precision@k | Fraction of top-k documents that are relevant | When irrelevant results harm user experience |
| MRR | Mean Reciprocal Rank of first relevant document | When users only look at top results |
| NDCG@k | Normalized Discounted Cumulative Gain | When ranking order matters |
| Hit Rate | Percentage of queries with at least one relevant result | When recall is binary (found vs not found) |
# Recall@k = (relevant docs in top-k) / (total relevant docs)
Recall@k = |relevant ∩ retrieved_top_k| / |relevant|
# Precision@k = (relevant docs in top-k) / k
Precision@k = |relevant ∩ retrieved_top_k| / k
# MRR = mean(1 / rank_of_first_relevant), averaged over all queries
MRR = mean(1 / rank_i)
# NDCG@k = DCG@k / IDCG@k, where i is the 0-based position in the top-k list
NDCG@k = Σ(rel_i / log2(i + 2)) / IDCG@k
# Hit Rate = 1 if any relevant in top-k, else 0 (per query; the reported value is the mean over queries)
Hit_Rate = 1 if |relevant ∩ top_k| > 0 else 0
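The formulas above can be sketched in plain Python for a single query. This is an illustrative sketch that assumes binary relevance and uses hypothetical function names and document IDs; it is not VectorDB's own implementation, which may handle edge cases differently.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant documents."""
    return len(set(retrieved[:k]) & relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document (1-based), or 0.0 if none is found."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG; i is 0-based, so the discount is log2(i + 2)."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def hit(retrieved, relevant, k):
    """1.0 if at least one relevant document is in the top-k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & relevant else 0.0


# Hypothetical query: two of the three relevant documents are retrieved.
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

print(f"Recall@5:    {recall_at_k(retrieved, relevant, 5):.3f}")     # 0.667
print(f"Precision@5: {precision_at_k(retrieved, relevant, 5):.3f}")  # 0.400
print(f"RR:          {reciprocal_rank(retrieved, relevant):.3f}")    # 0.500
print(f"NDCG@5:      {ndcg_at_k(retrieved, relevant, 5):.3f}")       # 0.498
print(f"Hit:         {hit(retrieved, relevant, 5):.1f}")             # 1.0
```

MRR and Hit Rate as reported in a benchmark are the means of `reciprocal_rank` and `hit` over all queries.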
Supported datasets
VectorDB includes loaders for five benchmark datasets:
TriviaQA
Open-domain question-answer pairs for general knowledge retrieval.
dataloader:
  type: "triviaqa"
  split: "test"
  limit: 100
Use case: General knowledge QA systems, broad domain retrieval
ARC (AI2 Reasoning Challenge)
Science reasoning questions requiring multi-hop inference.
dataloader:
  type: "arc"
  split: "test"
  limit: 200
Use case: Scientific and educational content retrieval
PopQA
Factoid questions about popular entities.
dataloader:
  type: "popqa"
  split: "test"
  limit: 100
Use case: Entity-focused retrieval, celebrity and popular culture
FactScore
Atomic facts for verification and hallucination detection.
dataloader:
  type: "factscore"
  split: "test"
  limit: 100
Use case: Fact verification, hallucination detection
Earnings Calls
Financial transcript Q&A for domain-specific RAG.
dataloader:
  type: "earnings_calls"
  split: "test"
  limit: 50
Use case: Financial domain, long-form transcripts
Running evaluations
Basic evaluation
Evaluate a single pipeline configuration:
from vectordb.utils.evaluation import evaluate_retrieval, QueryResult
from vectordb.dataloaders.evaluation import EvaluationExtractor
from vectordb.langchain.semantic_search import PineconeSemanticSearchPipeline

# Initialize pipeline
pipeline = PineconeSemanticSearchPipeline("configs/pinecone_triviaqa.yaml")

# Load evaluation queries
records = pipeline.load_dataset()
evaluation_queries = EvaluationExtractor.extract(records, limit=100)

# Run evaluation
query_results = []
for eval_query in evaluation_queries:
    result = pipeline.search(eval_query.query, top_k=10)
    query_results.append(
        QueryResult(
            query=eval_query.query,
            retrieved_ids=[doc.id for doc in result["documents"]],
            relevant_ids=set(eval_query.relevant_doc_ids),
        )
    )

# Compute metrics
metrics = evaluate_retrieval(query_results, k=10)
print("Results for Pinecone on TriviaQA (k=10):")
print(f"  Recall@10: {metrics.recall_at_k:.3f}")
print(f"  Precision@10: {metrics.precision_at_k:.3f}")
print(f"  MRR: {metrics.mrr:.3f}")
print(f"  NDCG@10: {metrics.ndcg_at_k:.3f}")
print(f"  Hit Rate: {metrics.hit_rate:.3f}")
print(f"  Queries: {metrics.num_queries}")
Cross-database comparison
Compare the same configuration across multiple databases:
import json

from vectordb.utils.evaluation import evaluate_retrieval, QueryResult
# Assumption: the other pipeline classes are importable from the same module as
# PineconeSemanticSearchPipeline; adjust the import path if your install differs.
from vectordb.langchain.semantic_search import (
    PineconeSemanticSearchPipeline,
    WeaviateSemanticSearchPipeline,
    MilvusSemanticSearchPipeline,
    QdrantSemanticSearchPipeline,
)

databases = [
    ("Pinecone", "configs/pinecone_triviaqa.yaml", PineconeSemanticSearchPipeline),
    ("Weaviate", "configs/weaviate_triviaqa.yaml", WeaviateSemanticSearchPipeline),
    ("Milvus", "configs/milvus_triviaqa.yaml", MilvusSemanticSearchPipeline),
    ("Qdrant", "configs/qdrant_triviaqa.yaml", QdrantSemanticSearchPipeline),
]

# evaluation_queries is the list built in the basic evaluation example.
results = {}
for db_name, config_path, pipeline_class in databases:
    print(f"\nEvaluating {db_name}...")
    pipeline = pipeline_class(config_path)
    pipeline.index()

    query_results = []
    for eval_query in evaluation_queries:
        result = pipeline.search(eval_query.query, top_k=10)
        query_results.append(
            QueryResult(
                query=eval_query.query,
                retrieved_ids=[doc.id for doc in result["documents"]],
                relevant_ids=set(eval_query.relevant_doc_ids),
            )
        )

    metrics = evaluate_retrieval(query_results, k=10)
    results[db_name] = metrics.to_dict()
    print(f"  Recall@10: {metrics.recall_at_k:.3f}")
    print(f"  MRR: {metrics.mrr:.3f}")

# Save results
with open("benchmark_results.json", "w") as f:
    json.dump(results, f, indent=2)
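Once results are saved, a small helper can rank the databases on a chosen metric and print them side by side. The sketch below assumes that `metrics.to_dict()` produces keys named `recall_at_k` and `mrr`, matching the attribute names used above; that key naming is an assumption, and the numbers are placeholders rather than measured results.

```python
def rank_by_metric(results, metric="recall_at_k"):
    """Return (name, score) pairs sorted from best to worst on one metric."""
    return sorted(
        ((name, scores[metric]) for name, scores in results.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


def format_table(results, metrics=("recall_at_k", "mrr")):
    """Render one row per database with the selected metrics as columns."""
    header = f"{'Database':<12}" + "".join(f"{m:>14}" for m in metrics)
    rows = [header]
    for name, scores in results.items():
        rows.append(f"{name:<12}" + "".join(f"{scores[m]:>14.3f}" for m in metrics))
    return "\n".join(rows)


# Placeholder values for illustration only, not measured results.
example_results = {
    "db_a": {"recall_at_k": 0.50, "mrr": 0.40},
    "db_b": {"recall_at_k": 0.75, "mrr": 0.30},
}

print(format_table(example_results))
print(rank_by_metric(example_results, metric="recall_at_k"))
```

In practice, `example_results` would be replaced by the dictionary loaded from `benchmark_results.json` with `json.load`.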
Comparing retrieval strategies
Benchmark different retrieval approaches:
configs = [
    ("Dense", "configs/semantic_search.yaml"),
    ("Sparse", "configs/sparse_search.yaml"),
    ("Hybrid", "configs/hybrid_search.yaml"),
    ("Hybrid + Reranking", "configs/hybrid_reranking.yaml"),
]

for strategy_name, config_path in configs:
    pipeline = PineconeSemanticSearchPipeline(config_path)
    # ... run evaluation
    print(f"{strategy_name}: Recall@10={metrics.recall_at_k:.3f}")
Evaluation with reranking metrics
When using reranking, track additional quality metrics:
reranker:
  type: "cross_encoder"
  model: "BAAI/bge-reranker-v2-m3"
  top_k: 5
evaluation:
  enabled: true
  metrics:
    - contextual_recall
    - contextual_precision
    - answer_relevancy
    - faithfulness
These metrics evaluate:
Contextual Recall: Do retrieved chunks contain information needed for the answer?
Contextual Precision: Are retrieved chunks relevant to the question?
Answer Relevancy: Does the generated answer address the question?
Faithfulness: Is the answer grounded in the retrieved context?
Cost-quality tradeoffs
Evaluate cost alongside quality for production deployments:
import time

import matplotlib.pyplot as plt

from vectordb.utils.evaluation import evaluate_retrieval

# load_config, initialize_pipeline, run_evaluation, and estimate_cost are not
# imported here; they stand in for your own helper functions.
candidate_pool_sizes = [5, 10, 15, 25, 50]
results = []
for pool_size in candidate_pool_sizes:
    config = load_config("config.yaml")
    config["search"]["candidate_pool_size"] = pool_size
    pipeline = initialize_pipeline(config)

    # perf_counter is a monotonic clock intended for measuring elapsed time
    start_time = time.perf_counter()
    query_results = run_evaluation(pipeline, evaluation_queries)
    elapsed = time.perf_counter() - start_time

    metrics = evaluate_retrieval(query_results, k=10)
    results.append({
        "pool_size": pool_size,
        "recall": metrics.recall_at_k,
        "latency_ms": (elapsed / len(evaluation_queries)) * 1000,
        "estimated_cost": estimate_cost(pool_size, len(evaluation_queries)),
    })

# Plot cost vs quality curve
plt.figure(figsize=(10, 6))
plt.scatter(
    [r["estimated_cost"] for r in results],
    [r["recall"] for r in results],
)
plt.xlabel("Estimated Cost per 1000 Queries")
plt.ylabel("Recall@10")
plt.title("Cost-Quality Tradeoff")
plt.savefig("cost_quality_curve.png")
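One way to read the resulting curve is to keep only the configurations that are not dominated, meaning no other configuration is both cheaper and better on recall. The sketch below is one possible approach under that assumption, not part of VectorDB; `pareto_frontier` is a hypothetical helper and the sample values are placeholders, not measurements.

```python
def pareto_frontier(points):
    """Keep configurations not dominated on (lower cost, higher recall)."""
    frontier = []
    best_recall = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, try higher recall first.
    ordered = sorted(points, key=lambda p: (p["estimated_cost"], -p["recall"]))
    for point in ordered:
        # A point survives only if it beats every cheaper (or equal-cost) point on recall.
        if point["recall"] > best_recall:
            frontier.append(point)
            best_recall = point["recall"]
    return frontier


# Placeholder values for illustration only, not measured results.
sample = [
    {"pool_size": 5, "estimated_cost": 1.0, "recall": 0.60},
    {"pool_size": 10, "estimated_cost": 2.0, "recall": 0.70},
    {"pool_size": 15, "estimated_cost": 3.0, "recall": 0.68},
    {"pool_size": 25, "estimated_cost": 5.0, "recall": 0.75},
]

for point in pareto_frontier(sample):
    print(point["pool_size"], point["estimated_cost"], point["recall"])
```

Here the pool size of 15 is dropped because the pool size of 10 is both cheaper and has higher recall in the sample data.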
Benchmark configuration best practices
Consistent evaluation sets
Use the same evaluation queries across all runs:
# Save evaluation queries for reproducibility
import pickle

with open("eval_queries.pkl", "wb") as f:
    pickle.dump(evaluation_queries, f)

# Load in future runs
with open("eval_queries.pkl", "rb") as f:
    evaluation_queries = pickle.load(f)
Run warm-up queries before timing measurements:
import time

# Warm up the pipeline
for _ in range(5):
    pipeline.search("warm up query", top_k=10)

# Now measure performance
start_time = time.time()
# ... run evaluation
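A single total elapsed time hides slow outliers, so it can help to record per-query latency and report percentiles as well as the mean. The following is a minimal sketch with hypothetical helper names (`time_queries`, `summarize`); it takes any callable that runs one search, so it makes no assumptions about the pipeline beyond that.

```python
import statistics
import time


def time_queries(search_fn, queries, warmup=5):
    """Return per-query latencies in milliseconds, excluding warm-up calls."""
    for query in queries[:warmup]:
        search_fn(query)  # warm caches and connections; results are discarded
    latencies = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def summarize(latencies):
    """Mean, median, and nearest-rank 95th percentile of a latency list."""
    ordered = sorted(latencies)
    # Nearest-rank p95 using integer arithmetic: ceil(0.95 * n) - 1
    p95_index = (95 * len(ordered) + 99) // 100 - 1
    return {
        "mean_ms": statistics.mean(ordered),
        "p50_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
    }


# Stand-in search function so the sketch runs without a vector database.
def fake_search(query):
    return sum(len(word) for word in query.split())


stats = summarize(time_queries(fake_search, ["what is recall", "what is mrr"], warmup=1))
print(sorted(stats))
```

With a real pipeline, the callable could be `lambda q: pipeline.search(q, top_k=10)`.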
Multiple runs for stability
Average metrics over multiple runs:
import numpy as np

num_runs = 3
all_metrics = []
for _ in range(num_runs):
    # run_evaluation is the same placeholder helper used in the cost-quality
    # example; it returns the query results for one pass over the queries.
    query_results = run_evaluation(pipeline, evaluation_queries)
    all_metrics.append(evaluate_retrieval(query_results, k=10))

# Compute average and standard deviation
avg_recall = np.mean([m.recall_at_k for m in all_metrics])
std_recall = np.std([m.recall_at_k for m in all_metrics])
print(f"Recall@10: {avg_recall:.3f} ± {std_recall:.3f}")
Set random seeds for reproducibility:
import random

import numpy as np
import torch

# Set seeds
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
# Now run evaluation
Interpreting results
When to optimize each metric
Recall: Optimize when missing relevant documents is costly, as in medical diagnosis, legal research, and safety-critical applications.
Precision: Optimize when showing irrelevant results harms UX, as in consumer search and recommendation systems.
MRR: Optimize when users only examine top results, as in web search and autocomplete.
NDCG: Optimize when ranking quality matters more than binary relevance, as in e-commerce and content discovery.
Typical metric ranges
| Configuration | Expected Recall@10 | Expected MRR |
| --- | --- | --- |
| Dense search (baseline) | 0.65-0.75 | 0.45-0.55 |
| Sparse search (BM25) | 0.60-0.70 | 0.40-0.50 |
| Hybrid search | 0.75-0.85 | 0.55-0.65 |
| Hybrid + Reranking | 0.80-0.90 | 0.65-0.75 |
| Agentic RAG | 0.85-0.95 | 0.70-0.80 |
These ranges assume well-tuned configurations on standard benchmarks like TriviaQA or ARC.
Advanced benchmarking
Per-query analysis
Identify queries where the pipeline struggles:
# compute_recall_at_k is a per-query recall helper that is not imported here;
# substitute your own implementation if needed.
failed_queries = []
for query_result in query_results:
    recall = compute_recall_at_k(
        query_result.retrieved_ids,
        query_result.relevant_ids,
        k=10,
    )
    if recall < 0.5:
        failed_queries.append({
            "query": query_result.query,
            "recall": recall,
            "retrieved": query_result.retrieved_ids[:3],
            "expected": list(query_result.relevant_ids),
        })

# Analyze failure patterns
print(f"Failed queries: {len(failed_queries)}")
for failure in failed_queries[:5]:
    print(f"\nQuery: {failure['query']}")
    print(f"Recall: {failure['recall']:.2f}")
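Beyond a single threshold, it can be useful to separate complete misses from partial hits, since the two often have different causes. The sketch below is a self-contained illustration with hypothetical names (`per_query_recall`, `bucket_outcomes`) and toy document IDs; it works on plain `(retrieved_ids, relevant_ids)` pairs rather than on the library's `QueryResult` objects.

```python
from collections import Counter


def per_query_recall(retrieved_ids, relevant_ids, k=10):
    """Recall for one query; queries with no relevant documents score 0.0."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def bucket_outcomes(pairs, k=10):
    """Count queries by outcome: 'miss' (recall 0), 'partial', or 'full'."""
    buckets = Counter()
    for retrieved_ids, relevant_ids in pairs:
        recall = per_query_recall(retrieved_ids, relevant_ids, k)
        if recall == 0.0:
            buckets["miss"] += 1
        elif recall < 1.0:
            buckets["partial"] += 1
        else:
            buckets["full"] += 1
    return buckets


# Toy data: one full hit, one partial hit, one miss.
pairs = [
    (["a", "b", "c"], {"a", "b"}),
    (["a", "x", "y"], {"a", "b"}),
    (["x", "y", "z"], {"a"}),
]
print(dict(bucket_outcomes(pairs, k=3)))
```

With the library objects, the pairs could be built as `[(qr.retrieved_ids, qr.relevant_ids) for qr in query_results]`.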
Ablation studies
Measure the impact of individual components:
ablation_configs = [
    ("Baseline", {"reranking": False, "query_enhancement": False}),
    ("+ Reranking", {"reranking": True, "query_enhancement": False}),
    ("+ Query Enhancement", {"reranking": False, "query_enhancement": True}),
    ("+ Both", {"reranking": True, "query_enhancement": True}),
]

# base_config, initialize_pipeline, and run_evaluation stand in for your own
# setup code, used the same way as in the cost-quality example.
for name, overrides in ablation_configs:
    config = base_config.copy()
    config.update(overrides)
    pipeline = initialize_pipeline(config)
    query_results = run_evaluation(pipeline, evaluation_queries)
    metrics = evaluate_retrieval(query_results, k=10)
    print(f"{name}: Recall@10={metrics.recall_at_k:.3f}")
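To make the ablation easier to read, each variant can be reported as a change relative to the baseline, along with a check of whether the two components combine additively. This is a minimal sketch with hypothetical helper names; the scores are placeholders chosen for easy arithmetic, not measured results.

```python
def deltas_vs_baseline(scores, baseline="Baseline"):
    """Difference between each variant's score and the baseline score."""
    base = scores[baseline]
    return {name: value - base for name, value in scores.items() if name != baseline}


def interaction(scores, first, second, both, baseline="Baseline"):
    """Gain of the combined variant beyond the sum of the individual gains.

    Positive means the components reinforce each other; negative means they overlap.
    """
    deltas = deltas_vs_baseline(scores, baseline)
    return deltas[both] - (deltas[first] + deltas[second])


# Placeholder Recall@10 values for illustration only, not measured results.
scores = {
    "Baseline": 0.5,
    "+ Reranking": 0.75,
    "+ Query Enhancement": 0.625,
    "+ Both": 0.875,
}

for name, delta in deltas_vs_baseline(scores).items():
    print(f"{name}: {delta:+.3f}")
print(interaction(scores, "+ Reranking", "+ Query Enhancement", "+ Both"))
```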
Next steps
Configuration: Tune pipeline settings based on benchmark results
Production deployment: Deploy your best-performing configuration
Building RAG pipelines: Learn to build complete RAG systems
Environment variables: Configure benchmarking environments