The BEIR benchmark (v1.0.0) is a heterogeneous evaluation suite for zero-shot information retrieval covering 18 datasets across diverse domains. This guide reproduces BM25 sparse retrieval across all BEIR corpora using three database backends: DuckDB, SQLite, and PostgreSQL. Corpora are indexed in a flat manner by concatenating the title and text fields into a single contents field. DuckDB and SQLite implement BM25 scoring; PostgreSQL uses its own full-text search implementation.
Some larger datasets are skipped for PostgreSQL because the high latency makes experimentation impractical. Specifically, trec-covid, bioasq, nq, hotpotqa, signal1m, trec-news, robust04, webis-touche2020, quora, dbpedia-entity, fever, and climate-fever are DuckDB/SQLite only.
Prerequisites
Make sure QuackIR is installed. See the installation guide for setup instructions. For PostgreSQL, ensure the database is initialized and the vector extension is enabled.
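For example, with psql (the database name quackir is a placeholder; substitute your own):

```bash
# Enable pgvector in the target database (database name is hypothetical).
psql -d quackir -c "CREATE EXTENSION IF NOT EXISTS vector;"
```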
Download the BEIR corpora
Download and extract the full BEIR v1.0.0 corpus collection. The topics and qrels live in the tools/topics-and-qrels/ directory, which is linked as a submodule; make sure you cloned the repository with --recurse-submodules.
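As a sketch, the public corpora can be fetched from the BEIR authors' mirror (the URL and target layout are assumptions, not necessarily what the guide's original commands used). Note that bioasq, signal1m, trec-news, and robust04 are licensed datasets that are not hosted publicly and must be obtained separately:

```bash
# Fetch the publicly hosted BEIR v1.0.0 corpora. Target directory layout is
# an assumption; adjust paths to match your setup.
BEIR_URL="https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets"
mkdir -p collections/beir
for dataset in trec-covid nfcorpus nq hotpotqa fiqa arguana webis-touche2020 \
               cqadupstack quora dbpedia-entity scidocs fever climate-fever \
               scifact; do
  wget -nc "${BEIR_URL}/${dataset}.zip" -P collections/beir   # -nc: skip if present
  unzip -n "collections/beir/${dataset}.zip" -d collections/beir
done
```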
Run all experiments with a single script
To reproduce all sparse, dense, and hybrid BEIR experiments at once, run the one-shot reproduction script. Logs are written to the logs/ directory. The sections below cover the sparse retrieval steps individually.
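For illustration, a hypothetical invocation (the script path scripts/repro_beir.sh is a placeholder, not necessarily the name shipped with QuackIR; only the logs/ directory is documented above):

```bash
# Run the full reproduction in the background; it covers every backend and
# corpus, so expect a long run. Output lands in the logs/ directory.
nohup bash scripts/repro_beir.sh > logs/repro_beir.log 2>&1 &
tail -f logs/repro_beir.log   # follow progress
```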
Step-by-step sparse retrieval
Tokenize and prepare the data
Tokenize all corpora and queries using QuackIR’s analysis module, which wraps Pyserini’s Lucene Porter analyzer; see the sketch below. This produces parsed_corpus.jsonl and parsed_queries.jsonl for each dataset. If your data is in an unsupported format, you can write your own script to munge it and tokenize with QuackIR during indexing. Alternatively, run the dedicated tokenization script.
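As an illustrative stand-in (not necessarily QuackIR's own command), the same tokenization can be expressed directly with Pyserini's analyzer API, which QuackIR's analysis module wraps; the input/output filenames follow the parsed_corpus.jsonl convention above:

```bash
python - <<'EOF'
import json
from pyserini.analysis import Analyzer, get_lucene_analyzer

# Porter-stemmed Lucene analysis, the same analyzer QuackIR wraps.
analyzer = Analyzer(get_lucene_analyzer(stemmer='porter'))

with open('corpus.jsonl') as fin, open('parsed_corpus.jsonl', 'w') as fout:
    for line in fin:
        doc = json.loads(line)
        # Flat indexing: concatenate title and text into one contents field,
        # then pre-tokenize so the RDBMS can ingest the tokens as-is.
        contents = (doc.get('title', '') + ' ' + doc.get('text', '')).strip()
        tokens = analyzer.analyze(contents)
        fout.write(json.dumps({'id': doc['_id'],
                               'contents': ' '.join(tokens)}) + '\n')
EOF
```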
Index all corpora
Index each corpus into DuckDB, SQLite, and PostgreSQL. The --pretokenized flag tells the indexer to use the pre-tokenized content as-is, as sketched below. Alternatively, run the dedicated indexing script.
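A hypothetical sketch of the per-corpus indexing step; only the --pretokenized flag is documented above, so the quackir.index entry point and the other flags are placeholders:

```bash
# Index one corpus into each backend (entry point and flags are
# illustrative placeholders, except --pretokenized).
for backend in duckdb sqlite postgres; do
  python -m quackir.index \
    --backend "${backend}" \
    --input collections/beir/scifact/parsed_corpus.jsonl \
    --pretokenized
done
```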
Run retrieval
After indexing, run sparse retrieval for all corpora, as sketched below. Alternatively, run the dedicated retrieval script.
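A hypothetical retrieval-plus-evaluation sketch: the quackir.search entry point and its flags are placeholders, while the trec_eval wrapper is Pyserini's real evaluation command (the qrels filename follows Anserini's naming convention in tools/topics-and-qrels/ and may differ in this repository):

```bash
# Retrieve with one backend (entry point and flags are illustrative).
python -m quackir.search \
  --backend duckdb \
  --queries collections/beir/scifact/parsed_queries.jsonl \
  --output runs/run.duckdb.scifact.txt

# Score the run with Pyserini's trec_eval wrapper at nDCG@10.
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 \
  tools/topics-and-qrels/qrels.beir-v1.0.0-scifact.test.txt \
  runs/run.duckdb.scifact.txt
```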
Results
The following nDCG@10 scores are reproducible with the commands above. A dash (-) indicates the corpus was skipped for that backend due to high latency.
| Corpus | DuckDB | SQLite | PostgreSQL |
|---|---|---|---|
| trec-covid | 0.5947 | 0.6011 | - |
| bioasq | 0.5210 | 0.5130 | - |
| nfcorpus | 0.3206 | 0.3223 | 0.2965 |
| nq | 0.3050 | 0.2921 | - |
| hotpotqa | 0.6357 | 0.5933 | - |
| fiqa | 0.2378 | 0.2518 | 0.0918 |
| signal1m | 0.3396 | 0.3308 | - |
| trec-news | 0.3849 | 0.4031 | - |
| robust04 | 0.4081 | 0.4243 | - |
| arguana | 0.3179 | 0.4806 | 0.0690 |
| webis-touche2020 | 0.4352 | 0.3471 | - |
| cqadupstack-android | 0.3812 | 0.3942 | 0.2607 |
| cqadupstack-english | 0.3441 | 0.3672 | 0.2252 |
| cqadupstack-gaming | 0.4827 | 0.4876 | 0.3436 |
| cqadupstack-gis | 0.2893 | 0.3002 | 0.1864 |
| cqadupstack-mathematica | 0.2036 | 0.2185 | 0.1215 |
| cqadupstack-physics | 0.3213 | 0.3474 | 0.2053 |
| cqadupstack-programmers | 0.2803 | 0.2965 | 0.1866 |
| cqadupstack-stats | 0.2728 | 0.2838 | 0.1828 |
| cqadupstack-tex | 0.2256 | 0.2419 | 0.1303 |
| cqadupstack-unix | 0.2779 | 0.2869 | 0.1678 |
| cqadupstack-webmasters | 0.3070 | 0.3078 | 0.2319 |
| cqadupstack-wordpress | 0.2485 | 0.2579 | 0.1280 |
| quora | 0.7893 | 0.8063 | - |
| dbpedia-entity | 0.3177 | 0.3191 | - |
| scidocs | 0.1502 | 0.1542 | 0.0907 |
| fever | 0.6475 | 0.5590 | - |
| climate-fever | 0.1486 | 0.1335 | - |
| scifact | 0.6795 | 0.6862 | 0.5692 |
PostgreSQL scoring note
PostgreSQL does not implement BM25 like DuckDB and SQLite. The current configuration uses the simple text search configuration with no stopwords, which is consistent with how DuckDB and SQLite operate (both tokenize using Pyserini’s Lucene analyzer, with native RDBMS text processing turned off as much as possible).
A modified english configuration using the Snowball stemmer and length normalization produces higher nDCG@10 scores, but at the cost of more than one second per query even on the smallest BEIR datasets. The table below shows both configurations:
| Corpus | Default (simple) | Modified (english + length norm) |
|---|---|---|
| nfcorpus | 0.2965 | 0.3055 |
| fiqa | 0.0918 | 0.1805 |
| arguana | 0.0690 | 0.2549 |
| cqadupstack-android | 0.2607 | 0.3423 |
| cqadupstack-gaming | 0.3436 | 0.4116 |
| scifact | 0.5692 | 0.6064 |
The “Default” PostgreSQL scores in the main results table use the simple configuration. Neither setup is an apples-to-apples comparison against BM25; the numbers reflect the current implementation shipped with QuackIR.
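For reference, the two configurations can be expressed with stock PostgreSQL full-text search as sketched below; QuackIR's actual SQL may differ. The sketch assumes a table docs(id, contents) and a database named quackir (both hypothetical). The normalization flag 1 passed to ts_rank_cd divides the rank by 1 plus the logarithm of the document length, which is the length normalization the modified configuration refers to:

```bash
# Default: 'simple' config (no stemming, no stopwords), unnormalized rank.
psql -d quackir -c "
  SELECT id,
         ts_rank_cd(to_tsvector('simple', contents),
                    plainto_tsquery('simple', 'effects of vitamin d')) AS score
  FROM docs
  ORDER BY score DESC
  LIMIT 10;"

# Modified: 'english' config (Snowball stemmer) with length normalization
# (flag 1 = divide rank by 1 + log of document length).
psql -d quackir -c "
  SELECT id,
         ts_rank_cd(to_tsvector('english', contents),
                    plainto_tsquery('english', 'effects of vitamin d'), 1) AS score
  FROM docs
  ORDER BY score DESC
  LIMIT 10;"
```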