

The BEIR benchmark (v1.0.0) is a heterogeneous evaluation suite for zero-shot information retrieval covering 18 datasets across diverse domains (29 corpora once CQADupStack is split into its 12 constituent forums). This guide reproduces BM25 sparse retrieval across all BEIR corpora using three database backends: DuckDB, SQLite, and PostgreSQL. Each corpus is indexed flat: the title and text fields are concatenated into a single contents field. DuckDB and SQLite implement BM25 scoring; PostgreSQL uses its own full-text search implementation.
Some larger datasets are skipped for PostgreSQL because the high latency makes experimentation impractical. Specifically, trec-covid, bioasq, nq, hotpotqa, signal1m, trec-news, robust04, webis-touche2020, quora, dbpedia-entity, fever, and climate-fever are DuckDB/SQLite only.
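For example, a corpus.jsonl record of the form (illustrative values; BEIR corpora use _id, title, and text fields)
{"_id": "4983", "title": "Vitamin D and immunity", "text": "Vitamin D modulates the innate immune response ..."}
is indexed with a single contents value along the lines of "Vitamin D and immunity Vitamin D modulates the innate immune response ..."; the exact join is an implementation detail of QuackIR's munging step.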

Prerequisites

Make sure QuackIR is installed. See the installation guide for setup instructions. For PostgreSQL, ensure the database is initialized and the vector extension is enabled.
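If the pgvector extension is installed on the server, it can typically be enabled with a command along these lines (quackir is a placeholder database name here; adjust the connection settings for your setup):
psql -d quackir -c "CREATE EXTENSION IF NOT EXISTS vector;"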

Download the BEIR corpora

Download and extract the full BEIR v1.0.0 corpus collection:
wget https://rgw.cs.uwaterloo.ca/pyserini/data/beir-v1.0.0-corpus.tar -P collections/
tar xvf collections/beir-v1.0.0-corpus.tar -C collections/
The tarball is 14 GB with MD5 checksum faefd5281b662c72ce03d22021e4ff6b. Ensure you have sufficient disk space before downloading.
Topics and qrels are stored in the tools/topics-and-qrels/ directory, which is linked as a submodule. Make sure you cloned the repository with --recurse-submodules.
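Optionally, verify the download before extracting; the following should print the checksum above:
md5sum collections/beir-v1.0.0-corpus.tar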

Run all experiments with a single script

To reproduce all sparse, dense, and hybrid BEIR experiments at once, run:
bash ./scripts/beir/run.sh
Results are logged to the logs/ directory. The sections below cover the sparse retrieval steps individually.

Step-by-step sparse retrieval

Step 1: Tokenize and prepare the data

Tokenize all corpora and queries using QuackIR’s analysis module, which wraps Pyserini’s Lucene Porter analyzer:
CORPORA=(trec-covid bioasq nfcorpus nq hotpotqa fiqa signal1m trec-news robust04 arguana webis-touche2020 cqadupstack-android cqadupstack-english cqadupstack-gaming cqadupstack-gis cqadupstack-mathematica cqadupstack-physics cqadupstack-programmers cqadupstack-stats cqadupstack-tex cqadupstack-unix cqadupstack-webmasters cqadupstack-wordpress quora dbpedia-entity scidocs fever climate-fever scifact)
for c in "${CORPORA[@]}"
do
    echo $c

    # Tokenize and munge the corpus
    python -m quackir.analysis \
    --input ./collections/beir-v1.0.0/corpus/$c/corpus.jsonl \
    --output ./collections/beir-v1.0.0/corpus/$c/parsed_corpus.jsonl

    # Tokenize and munge the queries
    python -m quackir.analysis \
    --input ./tools/topics-and-qrels/topics.beir-v1.0.0-$c.test.tsv.gz \
    --output ./collections/beir-v1.0.0/corpus/$c/parsed_queries.jsonl
done
This produces parsed_corpus.jsonl and parsed_queries.jsonl for each dataset. If your data is in an unsupported format, you can write your own script to munge it and tokenize with QuackIR during indexing.

Alternatively, run the dedicated script:
bash ./scripts/beir/tokenize.sh > logs/tokenize.txt
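To spot-check the tokenized output, peek at the first record of a parsed corpus, for example scifact:
head -n 1 ./collections/beir-v1.0.0/corpus/scifact/parsed_corpus.jsonl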

Step 2: Index all corpora

Index each corpus into DuckDB, SQLite, and PostgreSQL. The --pretokenized flag tells the indexer to use the pre-tokenized content as-is:
CORPORA=(trec-covid bioasq nfcorpus nq hotpotqa fiqa signal1m trec-news robust04 arguana webis-touche2020 cqadupstack-android cqadupstack-english cqadupstack-gaming cqadupstack-gis cqadupstack-mathematica cqadupstack-physics cqadupstack-programmers cqadupstack-stats cqadupstack-tex cqadupstack-unix cqadupstack-webmasters cqadupstack-wordpress quora dbpedia-entity scidocs fever climate-fever scifact)
for c in "${CORPORA[@]}"
do
    echo $c

    # Index corpus in DuckDB
    python -m quackir.index \
    --input ./collections/beir-v1.0.0/corpus/$c/parsed_corpus.jsonl \
    --index-type sparse \
    --index $c \
    --pretokenized \
    --db-type duckdb \
    --db-path duck.db

    # Index corpus in SQLite
    python -m quackir.index \
    --input ./collections/beir-v1.0.0/corpus/$c/parsed_corpus.jsonl \
    --index-type sparse \
    --index $c \
    --pretokenized \
    --db-type sqlite \
    --db-path sqlite.db

    # Skip large corpora for PostgreSQL
    if [[ "$c" == "trec-covid" || "$c" == "webis-touche2020" || "$c" == "quora" || "$c" == "robust04" || "$c" == "trec-news" || "$c" == "nq" || "$c" == "signal1m" || "$c" == "dbpedia-entity" || "$c" == "hotpotqa" || "$c" == "fever" || "$c" == "climate-fever" || "$c" == "bioasq" ]]; then
      continue
    fi

    # Index corpus in PostgreSQL
    python -m quackir.index \
    --input ./collections/beir-v1.0.0/corpus/$c/parsed_corpus.jsonl \
    --index-type sparse \
    --index $c \
    --pretokenized \
    --db-type postgres
done
Alternatively, run the dedicated script:
bash ./scripts/beir/index_sparse.sh > logs/index_sparse.txt
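As a quick sanity check that the indexes were created, you can list the tables in the DuckDB and SQLite database files, assuming the duckdb and sqlite3 command-line shells are installed separately from the Python packages (the exact table names and schema are determined by quackir.index):
duckdb duck.db -c "SHOW TABLES;"
sqlite3 sqlite.db ".tables"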

Step 3: Run retrieval

After indexing, run sparse retrieval for all corpora:
CORPORA=(trec-covid bioasq nfcorpus nq hotpotqa fiqa signal1m trec-news robust04 arguana webis-touche2020 cqadupstack-android cqadupstack-english cqadupstack-gaming cqadupstack-gis cqadupstack-mathematica cqadupstack-physics cqadupstack-programmers cqadupstack-stats cqadupstack-tex cqadupstack-unix cqadupstack-webmasters cqadupstack-wordpress quora dbpedia-entity scidocs fever climate-fever scifact)
for c in "${CORPORA[@]}"
do
    echo $c

    # Retrieval with DuckDB
    python -m quackir.search \
    --topics ./collections/beir-v1.0.0/corpus/$c/parsed_queries.jsonl \
    --index $c \
    --pretokenized \
    --output runs/duckdb-beir-$c-sparse.txt \
    --db-type duckdb \
    --db-path duck.db

    # Retrieval with SQLite
    python -m quackir.search \
    --topics ./collections/beir-v1.0.0/corpus/$c/parsed_queries.jsonl \
    --index $c \
    --pretokenized \
    --output runs/sqlite-beir-$c-sparse.txt \
    --db-type sqlite \
    --db-path sqlite.db

    # Skip large corpora for PostgreSQL
    if [[ "$c" == "trec-covid" || "$c" == "webis-touche2020" || "$c" == "quora" || "$c" == "robust04" || "$c" == "trec-news" || "$c" == "nq" || "$c" == "signal1m" || "$c" == "dbpedia-entity" || "$c" == "hotpotqa" || "$c" == "fever" || "$c" == "climate-fever" || "$c" == "bioasq" ]]; then
      continue
    fi

    # Retrieval with PostgreSQL
    python -m quackir.search \
    --topics ./collections/beir-v1.0.0/corpus/$c/parsed_queries.jsonl \
    --index $c \
    --pretokenized \
    --output runs/postgres-beir-$c-sparse.txt \
    --db-type postgres
done
Alternatively, run the dedicated script:
bash ./scripts/beir/search_sparse.sh > logs/search_sparse.txt
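Run files are written in the standard six-column TREC format expected by trec_eval: query ID, the literal Q0, document ID, rank, score, and a run tag. To inspect a run before evaluating, for example:
head -n 3 runs/duckdb-beir-scifact-sparse.txt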

Step 4: Evaluate with trec_eval

Evaluate all run files using Pyserini’s trec_eval wrapper:
CORPORA=(trec-covid bioasq nfcorpus nq hotpotqa fiqa signal1m trec-news robust04 arguana webis-touche2020 cqadupstack-android cqadupstack-english cqadupstack-gaming cqadupstack-gis cqadupstack-mathematica cqadupstack-physics cqadupstack-programmers cqadupstack-stats cqadupstack-tex cqadupstack-unix cqadupstack-webmasters cqadupstack-wordpress quora dbpedia-entity scidocs fever climate-fever scifact)
for c in "${CORPORA[@]}"
do
    echo $c

    echo "duckdb"
    python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-$c.test.txt runs/duckdb-beir-$c-sparse.txt

    echo "sqlite"
    python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-$c.test.txt runs/sqlite-beir-$c-sparse.txt

    # Skip large corpora for PostgreSQL
    if [[ "$c" == "trec-covid" || "$c" == "webis-touche2020" || "$c" == "quora" || "$c" == "robust04" || "$c" == "trec-news" || "$c" == "nq" || "$c" == "signal1m" || "$c" == "dbpedia-entity" || "$c" == "hotpotqa" || "$c" == "fever" || "$c" == "climate-fever" || "$c" == "bioasq" ]]; then
      continue
    fi

    echo "postgres"
    python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-$c.test.txt runs/postgres-beir-$c-sparse.txt
done
Alternatively, run the dedicated script:
bash ./scripts/beir/eval_sparse.sh > logs/eval_sparse.txt
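For example, to evaluate only the DuckDB run on scifact:
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-scifact.test.txt runs/duckdb-beir-scifact-sparse.txt
The reported ndcg_cut_10 should match the scifact/DuckDB entry in the results table below (0.6795).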

Results

The following nDCG@10 scores are reproducible with the commands above. A dash (-) indicates the corpus was skipped for that backend due to high latency.
Corpus                    DuckDB  SQLite  PostgreSQL
trec-covid                0.5947  0.6011  -
bioasq                    0.5210  0.5130  -
nfcorpus                  0.3206  0.3223  0.2965
nq                        0.3050  0.2921  -
hotpotqa                  0.6357  0.5933  -
fiqa                      0.2378  0.2518  0.0918
signal1m                  0.3396  0.3308  -
trec-news                 0.3849  0.4031  -
robust04                  0.4081  0.4243  -
arguana                   0.3179  0.4806  0.0690
webis-touche2020          0.4352  0.3471  -
cqadupstack-android       0.3812  0.3942  0.2607
cqadupstack-english       0.3441  0.3672  0.2252
cqadupstack-gaming        0.4827  0.4876  0.3436
cqadupstack-gis           0.2893  0.3002  0.1864
cqadupstack-mathematica   0.2036  0.2185  0.1215
cqadupstack-physics       0.3213  0.3474  0.2053
cqadupstack-programmers   0.2803  0.2965  0.1866
cqadupstack-stats         0.2728  0.2838  0.1828
cqadupstack-tex           0.2256  0.2419  0.1303
cqadupstack-unix          0.2779  0.2869  0.1678
cqadupstack-webmasters    0.3070  0.3078  0.2319
cqadupstack-wordpress     0.2485  0.2579  0.1280
quora                     0.7893  0.8063  -
dbpedia-entity            0.3177  0.3191  -
scidocs                   0.1502  0.1542  0.0907
fever                     0.6475  0.5590  -
climate-fever             0.1486  0.1335  -
scifact                   0.6795  0.6862  0.5692

PostgreSQL scoring note

Unlike DuckDB and SQLite, PostgreSQL does not score with BM25. The current setup uses the simple text search configuration with no stopwords, which is consistent with how DuckDB and SQLite operate (both tokenize with Pyserini's Lucene analyzer and keep native RDBMS text processing turned off as much as possible). A modified english configuration, which adds the Snowball stemmer and length normalization, produces higher nDCG@10 scores, but at the cost of more than one second per query even on the smallest BEIR datasets. The table below shows both configurations:
Corpus                Default (simple)  Modified (english + length norm)
nfcorpus              0.2965            0.3055
fiqa                  0.0918            0.1805
arguana               0.0690            0.2549
cqadupstack-android   0.2607            0.3423
cqadupstack-gaming    0.3436            0.4116
scifact               0.5692            0.6064
The “Default” PostgreSQL scores in the main results table use the simple configuration. Neither configuration is an apples-to-apples comparison against BM25; the scores reflect the full-text search implementation currently shipped with QuackIR.
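As a rough illustration of what the simple configuration does (this is not the exact query QuackIR issues, and the database, table, and column names below are placeholders), a PostgreSQL full-text query looks roughly like this:
psql -d quackir -c "SELECT docid, ts_rank(to_tsvector('simple', contents), plainto_tsquery('simple', 'cystic fibrosis gene therapy')) AS score FROM scifact ORDER BY score DESC LIMIT 10;"
Swapping 'simple' for 'english' enables Snowball stemming and stopword removal, and ts_rank's optional normalization flag (for example, 1 divides the rank by 1 plus the logarithm of the document length) adds the length normalization discussed above.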
