

This guide reproduces hybrid retrieval across a subset of BEIR v1.0.0 datasets using Reciprocal Rank Fusion (RRF) of sparse BM25 and dense BGE-base-en-v1.5 results. Hybrid retrieval is available for the DuckDB and PostgreSQL backends. RRF combines ranked lists from two or more retrieval systems without requiring score normalization: each document's fused score is the sum, over the input lists, of 1 / (k + rank), where k is a small smoothing constant (conventionally 60). This allows sparse and dense rankings to be merged in a single query without tuning weighting parameters.
The sparse corpus uses a flat index with title and text concatenated into contents. The dense corpus uses BGE-base-en-v1.5 embeddings. Hybrid retrieval runs both in a single search call by passing both table names to quackir.search.
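The fusion step can be sketched as follows. This is an illustrative implementation of standard RRF with the conventional k = 60, not QuackIR's internal code:

```python
# Illustrative Reciprocal Rank Fusion (RRF); not QuackIR's internal code.
# Each input ranking is a list of doc IDs ordered from best to worst.
def rrf_fuse(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Fused score is the sum of 1 / (k + rank) across all input lists.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d1", "d2", "d3"]  # e.g. BM25 ranking
dense = ["d2", "d3", "d1"]   # e.g. cosine-similarity ranking
print(rrf_fuse([sparse, dense]))  # → ['d2', 'd1', 'd3']
```

Because only ranks enter the formula, documents that appear high in both lists rise to the top even when the two systems' raw scores live on incomparable scales.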

Prerequisites

Hybrid retrieval requires both sparse and dense indexes to already be built. Complete the following guides first:

Sparse retrieval

Build BM25 sparse indexes for DuckDB and PostgreSQL.

Dense retrieval

Build BGE-base-en-v1.5 dense indexes for DuckDB and PostgreSQL.
Also ensure the repository was cloned with --recurse-submodules so the tools/topics-and-qrels/ submodule is available.

Download data

If you have not already downloaded the corpora, get both the raw BEIR corpus and the pre-encoded BGE embeddings:
# Raw corpus (14 GB, MD5: faefd5281b662c72ce03d22021e4ff6b)
wget https://rgw.cs.uwaterloo.ca/pyserini/data/beir-v1.0.0-corpus.tar -P collections/
tar xvf collections/beir-v1.0.0-corpus.tar -C collections/

# Pre-encoded BGE embeddings (127 GB, MD5: 5f8dce18660cc8ac0318500bea5993ac)
wget https://rgw.cs.uwaterloo.ca/pyserini/data/beir-v1.0.0-bge-base-en-v1.5.parquet.tar -P collections/
tar xvf collections/beir-v1.0.0-bge-base-en-v1.5.parquet.tar -C collections/

Hybrid retrieval corpora

The hybrid experiments cover the following 17 datasets:
nfcorpus scifact arguana cqadupstack-mathematica cqadupstack-webmasters
cqadupstack-android scidocs cqadupstack-programmers cqadupstack-gis
cqadupstack-physics cqadupstack-english cqadupstack-stats cqadupstack-gaming
cqadupstack-unix cqadupstack-wordpress fiqa cqadupstack-tex
Larger datasets (e.g., trec-covid, quora, hotpotqa) are excluded because the high query latency makes them impractical for hybrid search at this scale.

Step-by-step hybrid retrieval

Step 1: Tokenize corpora and prepare query files

Tokenize the sparse corpus and queries, then combine the tokenized queries with their pre-encoded BGE embeddings into a single file used for hybrid search:
CORPORA=(nfcorpus scifact arguana cqadupstack-mathematica cqadupstack-webmasters cqadupstack-android scidocs cqadupstack-programmers cqadupstack-gis cqadupstack-physics cqadupstack-english cqadupstack-stats cqadupstack-gaming cqadupstack-unix cqadupstack-wordpress fiqa cqadupstack-tex)
for c in "${CORPORA[@]}"
do
    echo $c

    # Tokenize and munge the corpus
    python -m quackir.analysis \
    --input ./collections/beir-v1.0.0/corpus/$c/corpus.jsonl \
    --output ./collections/beir-v1.0.0/corpus/$c/parsed_corpus.jsonl

    # Tokenize and munge the queries
    python -m quackir.analysis \
    --input ./tools/topics-and-qrels/topics.beir-v1.0.0-$c.test.tsv.gz \
    --output ./collections/beir-v1.0.0/corpus/$c/parsed_queries.jsonl

    # Combine parsed queries and query embeddings into one file
    python scripts/combine_contents_vector.py \
    --parsed-file collections/beir-v1.0.0/corpus/$c/parsed_queries.jsonl \
    --embedding-file tools/topics-and-qrels/topics.beir-v1.0.0-$c.test.bge-base-en-v1.5.jsonl.gz \
    --output-file collections/beir-v1.0.0/combined_queries/$c/queries.jsonl
done
The combine_contents_vector.py script merges tokenized query text with pre-encoded query vectors into a single JSONL file. This combined format is required because hybrid search needs both the tokenized text (for sparse BM25) and the embedding vector (for dense cosine similarity) in a single pass.

Alternatively, run the dedicated scripts:
bash ./scripts/beir/tokenize.sh > logs/tokenize.txt
bash ./scripts/beir/combine.sh > logs/combine.txt
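Conceptually, the combine step joins the two per-query files on the query ID. The sketch below illustrates the merge; the field names (id, contents, vector) are illustrative assumptions, not the exact schema of combine_contents_vector.py:

```python
# Join tokenized query text with pre-encoded query vectors by query ID.
# Field names here are illustrative assumptions, not the script's exact schema.
def combine(parsed, embedded):
    vectors = {rec["id"]: rec["vector"] for rec in embedded}
    return [{**rec, "vector": vectors[rec["id"]]} for rec in parsed]

parsed = [{"id": "q1", "contents": "treatment heart disease"}]
embedded = [{"id": "q1", "vector": [0.12, -0.03, 0.44]}]
print(combine(parsed, embedded))
```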
Step 2: Build sparse and dense indexes

Index both the sparse and dense representations for each corpus. Note that hybrid retrieval uses distinct table names with _sparse and _dense suffixes to keep the two indexes separate within the same database file:
CORPORA=(nfcorpus scifact arguana cqadupstack-mathematica cqadupstack-webmasters cqadupstack-android scidocs cqadupstack-programmers cqadupstack-gis cqadupstack-physics cqadupstack-english cqadupstack-stats cqadupstack-gaming cqadupstack-unix cqadupstack-wordpress fiqa cqadupstack-tex)
for c in "${CORPORA[@]}"
do
    echo $c

    # Index sparse corpus in DuckDB
    python -m quackir.index \
    --input ./collections/beir-v1.0.0/corpus/$c/parsed_corpus.jsonl \
    --index-type sparse \
    --index "${c}_sparse" \
    --pretokenized \
    --db-type duckdb \
    --db-path duck.db

    # Index dense corpus in DuckDB
    python -m quackir.index \
    --input ./collections/beir-v1.0.0/bge-base-en-v1.5/$c.parquet/ \
    --index-type dense \
    --index "${c}_dense" \
    --db-type duckdb \
    --db-path duck.db

    # Index sparse corpus in PostgreSQL
    python -m quackir.index \
    --input ./collections/beir-v1.0.0/corpus/$c/parsed_corpus.jsonl \
    --index-type sparse \
    --index "${c}_sparse" \
    --pretokenized \
    --db-type postgres

    # Index dense corpus in PostgreSQL
    python -m quackir.index \
    --input ./collections/beir-v1.0.0/bge-base-en-v1.5/$c.parquet/ \
    --index-type dense \
    --index "${c}_dense" \
    --db-type postgres
done
Alternatively, run the dedicated scripts:
bash ./scripts/beir/index_sparse.sh > logs/index_sparse.txt
bash ./scripts/beir/index_bge.sh > logs/index_bge.txt
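After indexing, each backend should hold two tables per corpus, one per suffix. A quick stdlib-only sanity check of the expected names (compare, for instance, against the output of SHOW TABLES in DuckDB):

```python
# The indexing loop above creates one _sparse and one _dense table per corpus.
CORPORA = [
    "nfcorpus", "scifact", "arguana", "cqadupstack-mathematica",
    "cqadupstack-webmasters", "cqadupstack-android", "scidocs",
    "cqadupstack-programmers", "cqadupstack-gis", "cqadupstack-physics",
    "cqadupstack-english", "cqadupstack-stats", "cqadupstack-gaming",
    "cqadupstack-unix", "cqadupstack-wordpress", "fiqa", "cqadupstack-tex",
]
expected_tables = {f"{c}_{kind}" for c in CORPORA for kind in ("sparse", "dense")}
print(len(expected_tables))  # → 34
```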
Step 3: Run hybrid retrieval

Pass both the sparse and dense table names to quackir.search. QuackIR automatically detects that two table names are provided and performs RRF fusion:
CORPORA=(nfcorpus scifact arguana cqadupstack-mathematica cqadupstack-webmasters cqadupstack-android scidocs cqadupstack-programmers cqadupstack-gis cqadupstack-physics cqadupstack-english cqadupstack-stats cqadupstack-gaming cqadupstack-unix cqadupstack-wordpress fiqa cqadupstack-tex)
for c in "${CORPORA[@]}"
do
    echo $c

    # Retrieval with DuckDB
    python -m quackir.search \
    --topics ./collections/beir-v1.0.0/combined_queries/$c/queries.jsonl \
    --index "${c}_sparse" "${c}_dense" \
    --pretokenized \
    --output runs/duckdb-beir-$c-hybrid.txt \
    --db-type duckdb \
    --db-path duck.db

    # Retrieval with PostgreSQL
    python -m quackir.search \
    --topics ./collections/beir-v1.0.0/combined_queries/$c/queries.jsonl \
    --index "${c}_sparse" "${c}_dense" \
    --pretokenized \
    --output runs/postgres-beir-$c-hybrid.txt \
    --db-type postgres
done
The --index flag accepts two table names: the sparse index (used for BM25) and the dense index (used for cosine similarity). The combined query file provides both the tokenized text and the embedding vector for each query in a single pass.

Alternatively, run the dedicated script:
bash ./scripts/beir/search_hybrid.sh > logs/search_hybrid.txt
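The run files written to runs/ follow the standard six-column TREC format (query ID, the literal Q0, document ID, rank, score, run tag), so they can be parsed with a simple split. The sample line below is illustrative, not real QuackIR output:

```python
# Standard TREC run format: qid Q0 docid rank score run_tag
line = "q1 Q0 doc42 1 0.0325 quackir"  # illustrative line, not real output
qid, _, docid, rank, score, tag = line.split()
print(qid, docid, int(rank), float(score))  # → q1 doc42 1 0.0325
```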
Step 4: Evaluate with trec_eval

Evaluate all hybrid run files using Pyserini’s trec_eval wrapper:
CORPORA=(nfcorpus scifact arguana cqadupstack-mathematica cqadupstack-webmasters cqadupstack-android scidocs cqadupstack-programmers cqadupstack-gis cqadupstack-physics cqadupstack-english cqadupstack-stats cqadupstack-gaming cqadupstack-unix cqadupstack-wordpress fiqa cqadupstack-tex)
for c in "${CORPORA[@]}"
do
    echo $c

    echo "duckdb"
    python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-$c.test.txt runs/duckdb-beir-$c-hybrid.txt

    echo "postgres"
    python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-$c.test.txt runs/postgres-beir-$c-hybrid.txt
done
Alternatively, run the dedicated script:
bash ./scripts/beir/eval_hybrid.sh > logs/eval_hybrid.txt
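trec_eval prints one whitespace-separated metric per line ("metric query_id value"); with -c the aggregate row uses query ID "all". A minimal parser for pulling nDCG@10 out of the logs (the sample value is illustrative):

```python
# Extract the aggregate nDCG@10 value from trec_eval output.
def parse_ndcg(output):
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "ndcg_cut_10" and parts[1] == "all":
            return float(parts[2])
    return None

sample = "ndcg_cut_10           all     0.3621"  # illustrative value
print(parse_ndcg(sample))  # → 0.3621
```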

Results

The following nDCG@10 scores are reproducible with the commands above.
| Corpus | DuckDB | PostgreSQL |
| --- | --- | --- |
| nfcorpus | 0.3621 | 0.3626 |
| fiqa | 0.3683 | 0.2881 |
| arguana | 0.5063 | 0.3449 |
| cqadupstack-android | 0.4653 | 0.4117 |
| cqadupstack-english | 0.4436 | 0.3913 |
| cqadupstack-gaming | 0.5628 | 0.5022 |
| cqadupstack-gis | 0.3683 | 0.3290 |
| cqadupstack-mathematica | 0.2744 | 0.2325 |
| cqadupstack-physics | 0.4138 | 0.3593 |
| cqadupstack-programmers | 0.3732 | 0.3293 |
| cqadupstack-stats | 0.3400 | 0.3130 |
| cqadupstack-tex | 0.2930 | 0.2581 |
| cqadupstack-unix | 0.3620 | 0.3302 |
| cqadupstack-webmasters | 0.3723 | 0.3391 |
| cqadupstack-wordpress | 0.3362 | 0.2805 |
| scidocs | 0.1943 | 0.1750 |
| scifact | 0.7440 | 0.6800 |

Comparison to sparse and dense alone

RRF fusion improves over sparse retrieval on every dataset, though on these corpora dense retrieval alone usually remains stronger. The table below shows nDCG@10 for all three methods on the DuckDB backend for the datasets covered by all three experiments:
| Corpus | Sparse (BM25) | Dense (BGE) | Hybrid (RRF) |
| --- | --- | --- | --- |
| nfcorpus | 0.3206 | 0.3735 | 0.3621 |
| fiqa | 0.2378 | 0.4065 | 0.3683 |
| arguana | 0.3179 | 0.6361 | 0.5063 |
| cqadupstack-android | 0.3812 | 0.5075 | 0.4653 |
| cqadupstack-english | 0.3441 | 0.4857 | 0.4436 |
| cqadupstack-gaming | 0.4827 | 0.5965 | 0.5628 |
| cqadupstack-gis | 0.2893 | 0.4127 | 0.3683 |
| cqadupstack-mathematica | 0.2036 | 0.3163 | 0.2744 |
| cqadupstack-physics | 0.3213 | 0.4722 | 0.4138 |
| cqadupstack-programmers | 0.2803 | 0.4242 | 0.3732 |
| cqadupstack-stats | 0.2728 | 0.3732 | 0.3400 |
| cqadupstack-tex | 0.2256 | 0.3115 | 0.2930 |
| cqadupstack-unix | 0.2779 | 0.4219 | 0.3620 |
| cqadupstack-webmasters | 0.3070 | 0.4065 | 0.3723 |
| cqadupstack-wordpress | 0.2485 | 0.3547 | 0.3362 |
| scidocs | 0.1502 | 0.2170 | 0.1943 |
| scifact | 0.6795 | 0.7408 | 0.7440 |
RRF consistently outperforms sparse retrieval alone. On most datasets the fused score falls between sparse and dense; on scifact it exceeds both. The degree of improvement depends on how complementary the sparse and dense rankings are for each domain.
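This pattern can be spot-checked directly from the comparison table. The tuples below are (sparse, dense, hybrid) nDCG@10 values copied from three of the DuckDB rows above:

```python
# (sparse, dense, hybrid) nDCG@10 values copied from the comparison table.
scores = {
    "nfcorpus": (0.3206, 0.3735, 0.3621),
    "fiqa": (0.2378, 0.4065, 0.3683),
    "scifact": (0.6795, 0.7408, 0.7440),
}
for name, (sparse, dense, hybrid) in scores.items():
    # Does fusion beat each individual method on this dataset?
    print(name, hybrid > sparse, hybrid > dense)
# → nfcorpus True False
# → fiqa True False
# → scifact True True
```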
