Measure the quality of search, recommendation, and retrieval systems using Precision, Recall, and NDCG
Ranking evaluation tells you how well a system orders results relative to what users actually find relevant. Use these metrics when evaluating search engines, recommendation feeds, or retrieval stages in RAG pipelines.
Suppose four relevant items exist (A, B, C, E). If the top-3 retrieves A, B, and C, then Precision@3 = 3/3 = 1.0 and Recall@3 = 3/4 (E is relevant but not retrieved).
Precision and recall trade off against each other. A high-precision system returns few results but most are relevant; a high-recall system returns many results to avoid missing anything. Choose K to match your product’s page size or cutoff.
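The worked example above can be sketched directly. This is a minimal illustration, not a library API; the function names and item labels are chosen for this example.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for item in top_k if item in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    top_k = retrieved[:k]
    return sum(1 for item in top_k if item in relevant) / len(relevant)

# Four relevant items exist; the top-3 retrieves three of them.
retrieved = ["A", "B", "C", "D"]
relevant = {"A", "B", "C", "E"}
print(precision_at_k(retrieved, relevant, 3))  # 1.0
print(recall_at_k(retrieved, relevant, 3))     # 0.75
```

Note the denominators: Precision@K divides by K, so it measures result quality; Recall@K divides by the total number of relevant items, so it measures coverage.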
NDCG (Normalized Discounted Cumulative Gain) accounts for both relevance and position. Highly relevant items appearing lower in the ranking are penalized.
The score is normalized against the ideal ranking (items sorted by relevance descending), so a perfect ranking scores 1.0. Use NDCG when relevance is graded and position quality matters.
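A compact sketch of the computation, using the linear-gain form of DCG (rel / log2(rank + 1)); some systems instead use the exponential gain 2^rel − 1, which weights highly relevant items more heavily. The graded labels below are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: graded relevances, top rank first."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the ideal (relevance-descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical graded labels: 3 = highly relevant, 0 = irrelevant.
print(ndcg([3, 2, 1, 0]))  # 1.0 -- already the ideal ordering
print(ndcg([3, 2, 0, 1]))  # < 1.0 -- a relevant item ranks below an irrelevant one
```

Because the discount is logarithmic in rank, swapping items near the top of the list costs far more than swapping items near the bottom.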
When to use Precision@K
Use when users look at only the top K results and you care about result quality more than coverage. Common for web search and featured recommendations.
When to use Recall@K
Use when missing relevant items is costly — for example, legal document retrieval or medical record search where completeness matters.
When to use MAP
Use when evaluating across many queries simultaneously with binary relevance labels. MAP is the standard offline benchmark metric for information retrieval research.
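With binary labels, MAP can be sketched as the mean over queries of average precision, where each query's AP averages the precision values at the ranks where relevant items appear. The function names below are illustrative, not a standard API.

```python
def average_precision(retrieved, relevant):
    """AP: mean of precision@rank at each rank holding a relevant item."""
    hits = 0
    precision_sum = 0.0
    for i, item in enumerate(retrieved):
        if item in relevant:
            hits += 1
            precision_sum += hits / (i + 1)
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP: average AP over (retrieved_list, relevant_set) query pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

# Two hypothetical queries with binary relevance judgments.
queries = [
    (["A", "X", "B"], {"A", "B"}),  # hits at ranks 1 and 3
    (["Y", "C"], {"C"}),            # hit at rank 2
]
print(mean_average_precision(queries))
```

Unretrieved relevant items contribute zero, so AP penalizes both misranking and missed documents.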
When to use NDCG
Use when relevance is graded (not just relevant/not-relevant) or when the ranking position of highly relevant items matters to your product. NDCG is preferred for e-commerce and recommendation systems.