This guide reproduces BGE-base-en-v1.5 dense retrieval across BEIR v1.0.0 datasets using DuckDB and PostgreSQL as backends. Pre-encoded document embeddings are stored as fixed-size vector arrays and retrieved using exact cosine similarity search. Dense retrieval encodes both documents and queries into 768-dimensional L2-normalized vectors using the
BAAI/bge-base-en-v1.5 model. At retrieval time, DuckDB’s array_cosine_similarity function scores all documents against the query vector. Because vectors are L2-normalized, cosine similarity equals dot product.
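To make the scoring step concrete, here is a minimal sketch (not the QuackIR API) of exact top-k cosine search in DuckDB; the documents table, its id/vector columns, and the index.duckdb file name are illustrative assumptions.

```python
import duckdb

# Hypothetical index file; QuackIR manages its own storage.
con = duckdb.connect("index.duckdb")

# Assumed schema: a document id plus a 768-dim fixed-size float array.
con.execute("CREATE TABLE IF NOT EXISTS documents (id VARCHAR, vector FLOAT[768])")

def search(query_vec: list[float], k: int = 10) -> list[tuple[str, float]]:
    # array_cosine_similarity scans every row (exact, brute-force search);
    # on L2-normalized vectors it is equivalent to a dot product.
    return con.execute(
        """
        SELECT id, array_cosine_similarity(vector, ?::FLOAT[768]) AS score
        FROM documents
        ORDER BY score DESC
        LIMIT ?
        """,
        [query_vec, k],
    ).fetchall()
```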
Some datasets from the full BEIR benchmark are not included because exact search over their large corpora makes dense retrieval impractically slow at this scale. The included corpora are:
nfcorpus, scifact, arguana, all cqadupstack-* subsets, scidocs, fiqa, trec-covid, webis-touche2020, quora, robust04, and trec-news.

Prerequisites
Make sure QuackIR is installed. See the installation guide. For PostgreSQL, ensure the database is initialized and the vector extension is enabled (required for dense vector storage).
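For PostgreSQL, a quick way to confirm this prerequisite is to enable the extension and create a vector-typed table by hand; this sketch uses psycopg2 with a placeholder connection string and an illustrative table name.

```python
import psycopg2

# Placeholder DSN; adjust to your PostgreSQL setup.
conn = psycopg2.connect("dbname=quackir user=postgres")
with conn, conn.cursor() as cur:
    # Fails here if pgvector is not installed on the server.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    # vector(768) matches the 768-dim BGE embeddings.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS documents (id TEXT PRIMARY KEY, vector vector(768))"
    )
```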
Download pre-encoded embeddings
All BEIR corpora pre-encoded with BGE-base-en-v1.5 and stored in Parquet format are available for download. The corresponding pre-encoded query embeddings live in tools/topics-and-qrels/ as gzipped JSONL files (e.g., topics.beir-v1.0.0-nfcorpus.test.bge-base-en-v1.5.jsonl.gz). These are part of the anserini-tools submodule; make sure you cloned the repository with --recurse-submodules.
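To sanity-check a downloaded topic file, you can peek at its first record; the per-line id and vector field names are an assumption carried over from the dense indexer's schema described below.

```python
import gzip
import json

path = "tools/topics-and-qrels/topics.beir-v1.0.0-nfcorpus.test.bge-base-en-v1.5.jsonl.gz"
with gzip.open(path, "rt") as f:
    topic = json.loads(next(f))

# Expect a query id and a 768-dim embedding.
print(topic.keys())
print(len(topic.get("vector", [])))
```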
Step-by-step dense retrieval
Index all corpora
Load the pre-encoded Parquet embeddings into DuckDB and PostgreSQL, or run the dedicated script. Unlike sparse indexing, there is no tokenization step and no --pretokenized flag: the Parquet files already contain only the id and vector fields that the dense indexer expects.
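As a hedged sketch of what this step amounts to, DuckDB can ingest the Parquet embeddings directly; the Parquet path and index file name are placeholders, and only the id/vector column names come from the schema above.

```python
import duckdb

con = duckdb.connect("index.duckdb")  # hypothetical index file
con.execute(
    """
    CREATE TABLE documents AS
    SELECT id, vector::FLOAT[768] AS vector
    FROM read_parquet('embeddings/nfcorpus.bge-base-en-v1.5.parquet')  -- placeholder path
    """
)
print(con.execute("SELECT count(*) FROM documents").fetchone()[0], "documents indexed")
```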
Run dense retrieval

Run retrieval for all corpora using pre-encoded query embeddings, or run the dedicated script. Note that there is no --pretokenized flag for dense retrieval: the topic files contain query embeddings (vectors), not tokenized text.
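The retrieval step boils down to scoring each topic vector against the index and writing a TREC-style run file; this sketch reuses the documents table from the loading example and assumes id and vector fields in the topic JSONL, with the run tag chosen arbitrarily.

```python
import gzip
import json
import duckdb

con = duckdb.connect("index.duckdb")  # hypothetical index file from the loading sketch
topics = "tools/topics-and-qrels/topics.beir-v1.0.0-nfcorpus.test.bge-base-en-v1.5.jsonl.gz"

with gzip.open(topics, "rt") as f, open("run.nfcorpus.txt", "w") as out:
    for line in f:
        topic = json.loads(line)  # assumed fields: "id" and "vector"
        hits = con.execute(
            """
            SELECT id, array_cosine_similarity(vector, ?::FLOAT[768]) AS score
            FROM documents
            ORDER BY score DESC
            LIMIT 1000
            """,
            [topic["vector"]],
        ).fetchall()
        for rank, (docid, score) in enumerate(hits, start=1):
            # Standard TREC run format: qid Q0 docid rank score tag
            out.write(f"{topic['id']} Q0 {docid} {rank} {score:.6f} quackir\n")
```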
Results

The following nDCG@10 scores are reproducible with the commands above. A dash (-) indicates the corpus was not included in the dense retrieval experiments.
DuckDB and PostgreSQL produce identical scores because both perform exact cosine similarity search over the same pre-encoded vectors.
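One way to verify a score from the table below is to evaluate the run file with pytrec_eval; the qrels path follows the anserini-tools naming convention and is an assumption, as is the run file name from the retrieval sketch.

```python
import pytrec_eval

def load_qrels(path: str) -> dict:
    # TREC qrels format: qid 0 docid relevance
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def load_run(path: str) -> dict:
    # TREC run format: qid Q0 docid rank score tag
    run = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, _, score, _ = line.split()
            run.setdefault(qid, {})[docid] = float(score)
    return run

qrels = load_qrels("tools/topics-and-qrels/qrels.beir-v1.0.0-nfcorpus.test.txt")  # assumed path
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"})
per_query = evaluator.evaluate(load_run("run.nfcorpus.txt"))
ndcg10 = sum(m["ndcg_cut_10"] for m in per_query.values()) / len(per_query)
print(f"nDCG@10: {ndcg10:.4f}")  # expect 0.3735 for nfcorpus
```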
| Corpus | DuckDB | PostgreSQL |
|---|---|---|
| trec-covid | 0.7814 | 0.7814 |
| nfcorpus | 0.3735 | 0.3735 |
| fiqa | 0.4065 | 0.4065 |
| trec-news | 0.4425 | 0.4425 |
| robust04 | 0.4465 | 0.4465 |
| arguana | 0.6361 | 0.6361 |
| webis-touche2020 | 0.2570 | 0.2570 |
| cqadupstack-android | 0.5075 | 0.5075 |
| cqadupstack-english | 0.4857 | 0.4857 |
| cqadupstack-gaming | 0.5965 | 0.5965 |
| cqadupstack-gis | 0.4127 | 0.4127 |
| cqadupstack-mathematica | 0.3163 | 0.3163 |
| cqadupstack-physics | 0.4722 | 0.4722 |
| cqadupstack-programmers | 0.4242 | 0.4242 |
| cqadupstack-stats | 0.3732 | 0.3732 |
| cqadupstack-tex | 0.3115 | 0.3115 |
| cqadupstack-unix | 0.4219 | 0.4219 |
| cqadupstack-webmasters | 0.4065 | 0.4065 |
| cqadupstack-wordpress | 0.3547 | 0.3547 |
| quora | 0.8890 | 0.8890 |
| scidocs | 0.2170 | 0.2170 |
| scifact | 0.7408 | 0.7408 |
| bioasq | - | - |
| nq | - | - |
| hotpotqa | - | - |
| signal1m | - | - |
| dbpedia-entity | - | - |
| fever | - | - |
| climate-fever | - | - |