The BEIR benchmark (v1.0.0) is a heterogeneous evaluation suite for zero-shot information retrieval covering 18 datasets across diverse domains. This guide reproduces BM25 sparse retrieval across all BEIR corpora using three database backends: DuckDB, SQLite, and PostgreSQL. Corpora are indexed in a flat manner by concatenating the title and text fields into a single contents field. DuckDB and SQLite implement BM25 scoring; PostgreSQL uses its own full-text search implementation.
Some larger datasets are skipped for PostgreSQL because the high latency makes experimentation impractical. Specifically, trec-covid, bioasq, nq, hotpotqa, signal1m, trec-news, robust04, webis-touche2020, quora, dbpedia-entity, fever, and climate-fever are DuckDB/SQLite only.
Prerequisites
Make sure QuackIR is installed. See the installation guide for setup instructions. For PostgreSQL, ensure the database is initialized and the vector extension is enabled.
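For example, with psql (the database name quackir is a placeholder; substitute your own):

```bash
# Enable pgvector in the target database (database name is hypothetical).
psql -d quackir -c "CREATE EXTENSION IF NOT EXISTS vector;"
```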
Download the BEIR corpora
Download and extract the full BEIR v1.0.0 corpus collection. The topics and qrels live in the tools/topics-and-qrels/ directory, which is linked as a submodule; make sure you cloned the repository with --recurse-submodules.
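As a sketch, the public corpora can be fetched from the BEIR authors' mirror (the URL and target layout are assumptions, not necessarily what the guide's original commands used). Note that bioasq, signal1m, trec-news, and robust04 are licensed datasets that are not hosted publicly and must be obtained separately:

```bash
# Fetch the publicly hosted BEIR v1.0.0 corpora. Target directory layout is
# an assumption; adjust paths to match your setup.
BEIR_URL="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets"
mkdir -p collections/beir
for dataset in trec-covid nfcorpus nq hotpotqa fiqa arguana webis-touche2020 \
               cqadupstack quora dbpedia-entity scidocs fever climate-fever \
               scifact; do
  wget -nc "${BEIR_URL}/${dataset}.zip" -P collections/beir   # -nc: skip if present
  unzip -n "collections/beir/${dataset}.zip" -d collections/beir
done
```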
Run all experiments with a single script
To reproduce all sparse, dense, and hybrid BEIR experiments at once, run the one-shot reproduction script. Logs are written to the logs/ directory. The sections below cover the sparse retrieval steps individually.
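For illustration, a hypothetical invocation (the script path scripts/repro_beir.sh is a placeholder, not necessarily the name shipped with QuackIR; only the logs/ directory is documented above):

```bash
# Run the full reproduction in the background; it covers every backend and
# corpus, so expect a long run. Output lands in the logs/ directory.
nohup bash scripts/repro_beir.sh > logs/repro_beir.log 2>&1 &
tail -f logs/repro_beir.log   # follow progress
```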
Step-by-step sparse retrieval
Tokenize and prepare the data
Tokenize all corpora and queries using QuackIR’s analysis module, which wraps Pyserini’s Lucene Porter analyzer; see the sketch below. This produces parsed_corpus.jsonl and parsed_queries.jsonl for each dataset. If your data is in an unsupported format, you can write your own script to munge it and tokenize with QuackIR during indexing. Alternatively, run the dedicated tokenization script.
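As an illustrative stand-in (not necessarily QuackIR's own command), the same tokenization can be expressed directly with Pyserini's analyzer API, which QuackIR's analysis module wraps; the input/output filenames follow the parsed_corpus.jsonl convention above:

```bash
python - <<'EOF'
import json
from pyserini.analysis import Analyzer, get_lucene_analyzer

# Porter-stemmed Lucene analysis, the same analyzer QuackIR wraps.
analyzer = Analyzer(get_lucene_analyzer(stemmer='porter'))

with open('corpus.jsonl') as fin, open('parsed_corpus.jsonl', 'w') as fout:
    for line in fin:
        doc = json.loads(line)
        # Flat indexing: concatenate title and text into one contents field,
        # then pre-tokenize so the RDBMS can ingest the tokens as-is.
        contents = (doc.get('title', '') + ' ' + doc.get('text', '')).strip()
        tokens = analyzer.analyze(contents)
        fout.write(json.dumps({'id': doc['_id'],
                               'contents': ' '.join(tokens)}) + '\n')
EOF
```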
Index all corpora
Index each corpus into DuckDB, SQLite, and PostgreSQL. The --pretokenized flag tells the indexer to use the pre-tokenized content as-is, as sketched below. Alternatively, run the dedicated indexing script.
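A hypothetical sketch of the per-corpus indexing step; only the --pretokenized flag is documented above, so the quackir.index entry point and the other flags are placeholders:

```bash
# Index one corpus into each backend (entry point and flags are
# illustrative placeholders, except --pretokenized).
for backend in duckdb sqlite postgres; do
  python -m quackir.index \
    --backend "${backend}" \
    --input collections/beir/scifact/parsed_corpus.jsonl \
    --pretokenized
done
```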
Run retrieval
After indexing, run sparse retrieval for all corpora, as sketched below. Alternatively, run the dedicated retrieval script.
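A hypothetical retrieval-plus-evaluation sketch: the quackir.search entry point and its flags are placeholders, while the trec_eval wrapper is Pyserini's real evaluation command (the qrels filename follows Anserini's naming convention in tools/topics-and-qrels/ and may differ in this repository):

```bash
# Retrieve with one backend (entry point and flags are illustrative).
python -m quackir.search \
  --backend duckdb \
  --queries collections/beir/scifact/parsed_queries.jsonl \
  --output runs/run.duckdb.scifact.txt

# Score the run with Pyserini's trec_eval wrapper at nDCG@10.
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 \
  tools/topics-and-qrels/qrels.beir-v1.0.0-scifact.test.txt \
  runs/run.duckdb.scifact.txt
```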
Results
The following nDCG@10 scores are reproducible with the commands above. A dash (-) indicates the corpus was skipped for that backend due to high latency.
| Corpus | DuckDB | SQLite | PostgreSQL |
|---|---|---|---|
| trec-covid | 0.5947 | 0.6011 | - |
| bioasq | 0.5210 | 0.5130 | - |
| nfcorpus | 0.3206 | 0.3223 | 0.2965 |
| nq | 0.3050 | 0.2921 | - |
| hotpotqa | 0.6357 | 0.5933 | - |
| fiqa | 0.2378 | 0.2518 | 0.0918 |
| signal1m | 0.3396 | 0.3308 | - |
| trec-news | 0.3849 | 0.4031 | - |
| robust04 | 0.4081 | 0.4243 | - |
| arguana | 0.3179 | 0.4806 | 0.0690 |
| webis-touche2020 | 0.4352 | 0.3471 | - |
| cqadupstack-android | 0.3812 | 0.3942 | 0.2607 |
| cqadupstack-english | 0.3441 | 0.3672 | 0.2252 |
| cqadupstack-gaming | 0.4827 | 0.4876 | 0.3436 |
| cqadupstack-gis | 0.2893 | 0.3002 | 0.1864 |
| cqadupstack-mathematica | 0.2036 | 0.2185 | 0.1215 |
| cqadupstack-physics | 0.3213 | 0.3474 | 0.2053 |
| cqadupstack-programmers | 0.2803 | 0.2965 | 0.1866 |
| cqadupstack-stats | 0.2728 | 0.2838 | 0.1828 |
| cqadupstack-tex | 0.2256 | 0.2419 | 0.1303 |
| cqadupstack-unix | 0.2779 | 0.2869 | 0.1678 |
| cqadupstack-webmasters | 0.3070 | 0.3078 | 0.2319 |
| cqadupstack-wordpress | 0.2485 | 0.2579 | 0.1280 |
| quora | 0.7893 | 0.8063 | - |
| dbpedia-entity | 0.3177 | 0.3191 | - |
| scidocs | 0.1502 | 0.1542 | 0.0907 |
| fever | 0.6475 | 0.5590 | - |
| climate-fever | 0.1486 | 0.1335 | - |
| scifact | 0.6795 | 0.6862 | 0.5692 |
PostgreSQL scoring note
PostgreSQL does not implement BM25 like DuckDB and SQLite. The current configuration uses the simple text search configuration with no stopwords, which is consistent with how DuckDB and SQLite operate (both tokenize using Pyserini’s Lucene analyzer, with native RDBMS text processing turned off as much as possible).
A modified english configuration using the Snowball stemmer and length normalization produces higher nDCG@10 scores, but at the cost of more than one second per query even on the smallest BEIR datasets. The table below shows both configurations:
| Corpus | Default (simple) | Modified (english + length norm) |
|---|---|---|
| nfcorpus | 0.2965 | 0.3055 |
| fiqa | 0.0918 | 0.1805 |
| arguana | 0.0690 | 0.2549 |
| cqadupstack-android | 0.2607 | 0.3423 |
| cqadupstack-gaming | 0.3436 | 0.4116 |
| scifact | 0.5692 | 0.6064 |
The “Default” PostgreSQL scores in the main results table use the simple configuration. Neither setup is an apples-to-apples comparison against BM25; the numbers reflect the current implementation shipped with QuackIR.
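For reference, the two configurations can be expressed with stock PostgreSQL full-text search as sketched below; QuackIR's actual SQL may differ. The sketch assumes a table docs(id, contents) and a database named quackir (both hypothetical). The normalization flag 1 passed to ts_rank_cd divides the rank by 1 plus the logarithm of the document length, which is the length normalization the modified configuration refers to:

```bash
# Default: 'simple' config (no stemming, no stopwords), unnormalized rank.
psql -d quackir -c "
  SELECT id,
         ts_rank_cd(to_tsvector('simple', contents),
                    plainto_tsquery('simple', 'effects of vitamin d')) AS score
  FROM docs
  ORDER BY score DESC
  LIMIT 10;"

# Modified: 'english' config (Snowball stemmer) with length normalization
# (flag 1 = divide rank by 1 + log of document length).
psql -d quackir -c "
  SELECT id,
         ts_rank_cd(to_tsvector('english', contents),
                    plainto_tsquery('english', 'effects of vitamin d'), 1) AS score
  FROM docs
  ORDER BY score DESC
  LIMIT 10;"
```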