

This guide reproduces hybrid retrieval across a subset of BEIR v1.0.0 datasets using Reciprocal Rank Fusion (RRF) of sparse BM25 and dense BGE-base-en-v1.5 results. Hybrid retrieval is available for the DuckDB and PostgreSQL backends. RRF combines ranked lists from two or more retrieval systems without requiring score normalization: each document's fused score is the sum, over the input lists, of 1 / (k + rank), where k is a small smoothing constant (conventionally 60). This allows sparse and dense rankings to be merged in a single query without tuning weighting parameters.
The sparse corpus uses a flat index with title and text concatenated into contents. The dense corpus uses BGE-base-en-v1.5 embeddings. Hybrid retrieval runs both in a single search call by passing both table names to quackir.search.
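The fusion step can be sketched as follows. This is an illustrative implementation of standard RRF with the conventional k = 60, not QuackIR's internal code:

```python
# Illustrative Reciprocal Rank Fusion (RRF); not QuackIR's internal code.
# Each input ranking is a list of doc IDs ordered from best to worst.
def rrf_fuse(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Fused score is the sum of 1 / (k + rank) across all input lists.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d1", "d2", "d3"]  # e.g. BM25 ranking
dense = ["d2", "d3", "d1"]   # e.g. cosine-similarity ranking
print(rrf_fuse([sparse, dense]))  # → ['d2', 'd1', 'd3']
```

Because only ranks enter the formula, documents that appear high in both lists rise to the top even when the two systems' raw scores live on incomparable scales.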

Prerequisites

Hybrid retrieval requires both sparse and dense indexes to already be built. Complete the following guides first:

Sparse retrieval

Build BM25 sparse indexes for DuckDB and PostgreSQL.

Dense retrieval

Build BGE-base-en-v1.5 dense indexes for DuckDB and PostgreSQL.
Also ensure the repository was cloned with --recurse-submodules so the tools/topics-and-qrels/ submodule is available.

Download data

If you have not already downloaded the corpora, get both the raw BEIR corpus and the pre-encoded BGE embeddings:
# Raw corpus (14 GB, MD5: faefd5281b662c72ce03d22021e4ff6b)
wget https://rgw.cs.uwaterloo.ca/pyserini/data/beir-v1.0.0-corpus.tar -P collections/
tar xvf collections/beir-v1.0.0-corpus.tar -C collections/

# Pre-encoded BGE embeddings (127 GB, MD5: 5f8dce18660cc8ac0318500bea5993ac)
wget https://rgw.cs.uwaterloo.ca/pyserini/data/beir-v1.0.0-bge-base-en-v1.5.parquet.tar -P collections/
tar xvf collections/beir-v1.0.0-bge-base-en-v1.5.parquet.tar -C collections/

Hybrid retrieval corpora

The hybrid experiments cover the following 17 datasets:
nfcorpus scifact arguana cqadupstack-mathematica cqadupstack-webmasters
cqadupstack-android scidocs cqadupstack-programmers cqadupstack-gis
cqadupstack-physics cqadupstack-english cqadupstack-stats cqadupstack-gaming
cqadupstack-unix cqadupstack-wordpress fiqa cqadupstack-tex
Larger datasets (e.g., trec-covid, quora, hotpotqa) are excluded because the high query latency makes them impractical for hybrid search at this scale.

Step-by-step hybrid retrieval

Step 1: Tokenize corpora and prepare query files

Tokenize the sparse corpus and queries, then combine the tokenized queries with their pre-encoded BGE embeddings into a single file used for hybrid search:
CORPORA=(nfcorpus scifact arguana cqadupstack-mathematica cqadupstack-webmasters cqadupstack-android scidocs cqadupstack-programmers cqadupstack-gis cqadupstack-physics cqadupstack-english cqadupstack-stats cqadupstack-gaming cqadupstack-unix cqadupstack-wordpress fiqa cqadupstack-tex)
for c in "${CORPORA[@]}"
do
    echo $c

    # Tokenize and munge the corpus
    python -m quackir.analysis \
    --input ./collections/beir-v1.0.0/corpus/$c/corpus.jsonl \
    --output ./collections/beir-v1.0.0/corpus/$c/parsed_corpus.jsonl

    # Tokenize and munge the queries
    python -m quackir.analysis \
    --input ./tools/topics-and-qrels/topics.beir-v1.0.0-$c.test.tsv.gz \
    --output ./collections/beir-v1.0.0/corpus/$c/parsed_queries.jsonl

    # Combine parsed queries and query embeddings into one file
    python scripts/combine_contents_vector.py \
    --parsed-file collections/beir-v1.0.0/corpus/$c/parsed_queries.jsonl \
    --embedding-file tools/topics-and-qrels/topics.beir-v1.0.0-$c.test.bge-base-en-v1.5.jsonl.gz \
    --output-file collections/beir-v1.0.0/combined_queries/$c/queries.jsonl
done
The combine_contents_vector.py script merges tokenized query text with pre-encoded query vectors into a single JSONL file. This combined format is required because hybrid search needs both the tokenized text (for sparse BM25) and the embedding vector (for dense cosine similarity) in a single pass.

Alternatively, run the dedicated scripts:
bash ./scripts/beir/tokenize.sh > logs/tokenize.txt
bash ./scripts/beir/combine.sh > logs/combine.txt
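Conceptually, the combine step joins the two per-query files on the query ID. The sketch below illustrates the merge; the field names (id, contents, vector) are illustrative assumptions, not the exact schema of combine_contents_vector.py:

```python
# Join tokenized query text with pre-encoded query vectors by query ID.
# Field names here are illustrative assumptions, not the script's exact schema.
def combine(parsed, embedded):
    vectors = {rec["id"]: rec["vector"] for rec in embedded}
    return [{**rec, "vector": vectors[rec["id"]]} for rec in parsed]

parsed = [{"id": "q1", "contents": "treatment heart disease"}]
embedded = [{"id": "q1", "vector": [0.12, -0.03, 0.44]}]
print(combine(parsed, embedded))
```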
Step 2: Build sparse and dense indexes

Index both the sparse and dense representations for each corpus. Note that hybrid retrieval uses distinct table names with _sparse and _dense suffixes to keep the two indexes separate within the same database file:
CORPORA=(nfcorpus scifact arguana cqadupstack-mathematica cqadupstack-webmasters cqadupstack-android scidocs cqadupstack-programmers cqadupstack-gis cqadupstack-physics cqadupstack-english cqadupstack-stats cqadupstack-gaming cqadupstack-unix cqadupstack-wordpress fiqa cqadupstack-tex)
for c in "${CORPORA[@]}"
do
    echo $c

    # Index sparse corpus in DuckDB
    python -m quackir.index \
    --input ./collections/beir-v1.0.0/corpus/$c/parsed_corpus.jsonl \
    --index-type sparse \
    --index "${c}_sparse" \
    --pretokenized \
    --db-type duckdb \
    --db-path duck.db

    # Index dense corpus in DuckDB
    python -m quackir.index \
    --input ./collections/beir-v1.0.0/bge-base-en-v1.5/$c.parquet/ \
    --index-type dense \
    --index "${c}_dense" \
    --db-type duckdb \
    --db-path duck.db

    # Index sparse corpus in PostgreSQL
    python -m quackir.index \
    --input ./collections/beir-v1.0.0/corpus/$c/parsed_corpus.jsonl \
    --index-type sparse \
    --index "${c}_sparse" \
    --pretokenized \
    --db-type postgres

    # Index dense corpus in PostgreSQL
    python -m quackir.index \
    --input ./collections/beir-v1.0.0/bge-base-en-v1.5/$c.parquet/ \
    --index-type dense \
    --index "${c}_dense" \
    --db-type postgres
done
Alternatively, run the dedicated scripts:
bash ./scripts/beir/index_sparse.sh > logs/index_sparse.txt
bash ./scripts/beir/index_bge.sh > logs/index_bge.txt
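After indexing, each backend should hold two tables per corpus, one per suffix. A quick stdlib-only sanity check of the expected names (compare, for instance, against the output of SHOW TABLES in DuckDB):

```python
# The indexing loop above creates one _sparse and one _dense table per corpus.
CORPORA = [
    "nfcorpus", "scifact", "arguana", "cqadupstack-mathematica",
    "cqadupstack-webmasters", "cqadupstack-android", "scidocs",
    "cqadupstack-programmers", "cqadupstack-gis", "cqadupstack-physics",
    "cqadupstack-english", "cqadupstack-stats", "cqadupstack-gaming",
    "cqadupstack-unix", "cqadupstack-wordpress", "fiqa", "cqadupstack-tex",
]
expected_tables = {f"{c}_{kind}" for c in CORPORA for kind in ("sparse", "dense")}
print(len(expected_tables))  # → 34
```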
Step 3: Run hybrid retrieval

Pass both the sparse and dense table names to quackir.search. QuackIR automatically detects that two table names are provided and performs RRF fusion:
CORPORA=(nfcorpus scifact arguana cqadupstack-mathematica cqadupstack-webmasters cqadupstack-android scidocs cqadupstack-programmers cqadupstack-gis cqadupstack-physics cqadupstack-english cqadupstack-stats cqadupstack-gaming cqadupstack-unix cqadupstack-wordpress fiqa cqadupstack-tex)
for c in "${CORPORA[@]}"
do
    echo $c

    # Retrieval with DuckDB
    python -m quackir.search \
    --topics ./collections/beir-v1.0.0/combined_queries/$c/queries.jsonl \
    --index "${c}_sparse" "${c}_dense" \
    --pretokenized \
    --output runs/duckdb-beir-$c-hybrid.txt \
    --db-type duckdb \
    --db-path duck.db

    # Retrieval with PostgreSQL
    python -m quackir.search \
    --topics ./collections/beir-v1.0.0/combined_queries/$c/queries.jsonl \
    --index "${c}_sparse" "${c}_dense" \
    --pretokenized \
    --output runs/postgres-beir-$c-hybrid.txt \
    --db-type postgres
done
The --index flag accepts two table names: the sparse index (used for BM25) and the dense index (used for cosine similarity). The combined query file provides both the tokenized text and the embedding vector for each query in a single pass.

Alternatively, run the dedicated script:
bash ./scripts/beir/search_hybrid.sh > logs/search_hybrid.txt
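The run files written to runs/ follow the standard six-column TREC format (query ID, the literal Q0, document ID, rank, score, run tag), so they can be parsed with a simple split. The sample line below is illustrative, not real QuackIR output:

```python
# Standard TREC run format: qid Q0 docid rank score run_tag
line = "q1 Q0 doc42 1 0.0325 quackir"  # illustrative line, not real output
qid, _, docid, rank, score, tag = line.split()
print(qid, docid, int(rank), float(score))  # → q1 doc42 1 0.0325
```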
Step 4: Evaluate with trec_eval

Evaluate all hybrid run files using Pyserini’s trec_eval wrapper:
CORPORA=(nfcorpus scifact arguana cqadupstack-mathematica cqadupstack-webmasters cqadupstack-android scidocs cqadupstack-programmers cqadupstack-gis cqadupstack-physics cqadupstack-english cqadupstack-stats cqadupstack-gaming cqadupstack-unix cqadupstack-wordpress fiqa cqadupstack-tex)
for c in "${CORPORA[@]}"
do
    echo $c

    echo "duckdb"
    python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-$c.test.txt runs/duckdb-beir-$c-hybrid.txt

    echo "postgres"
    python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-$c.test.txt runs/postgres-beir-$c-hybrid.txt
done
Alternatively, run the dedicated script:
bash ./scripts/beir/eval_hybrid.sh > logs/eval_hybrid.txt
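trec_eval prints one whitespace-separated metric per line ("metric query_id value"); with -c the aggregate row uses query ID "all". A minimal parser for pulling nDCG@10 out of the logs (the sample value is illustrative):

```python
# Extract the aggregate nDCG@10 value from trec_eval output.
def parse_ndcg(output):
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "ndcg_cut_10" and parts[1] == "all":
            return float(parts[2])
    return None

sample = "ndcg_cut_10           all     0.3621"  # illustrative value
print(parse_ndcg(sample))  # → 0.3621
```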

Results

The following nDCG@10 scores are reproducible with the commands above.
| Corpus | DuckDB | PostgreSQL |
| --- | --- | --- |
| nfcorpus | 0.3621 | 0.3626 |
| fiqa | 0.3683 | 0.2881 |
| arguana | 0.5063 | 0.3449 |
| cqadupstack-android | 0.4653 | 0.4117 |
| cqadupstack-english | 0.4436 | 0.3913 |
| cqadupstack-gaming | 0.5628 | 0.5022 |
| cqadupstack-gis | 0.3683 | 0.3290 |
| cqadupstack-mathematica | 0.2744 | 0.2325 |
| cqadupstack-physics | 0.4138 | 0.3593 |
| cqadupstack-programmers | 0.3732 | 0.3293 |
| cqadupstack-stats | 0.3400 | 0.3130 |
| cqadupstack-tex | 0.2930 | 0.2581 |
| cqadupstack-unix | 0.3620 | 0.3302 |
| cqadupstack-webmasters | 0.3723 | 0.3391 |
| cqadupstack-wordpress | 0.3362 | 0.2805 |
| scidocs | 0.1943 | 0.1750 |
| scifact | 0.7440 | 0.6800 |

Comparison to sparse and dense alone

RRF fusion improves over sparse retrieval on every dataset, though on these corpora dense retrieval alone usually remains stronger. The table below shows nDCG@10 for all three methods on the DuckDB backend for the datasets covered by all three experiments:
| Corpus | Sparse (BM25) | Dense (BGE) | Hybrid (RRF) |
| --- | --- | --- | --- |
| nfcorpus | 0.3206 | 0.3735 | 0.3621 |
| fiqa | 0.2378 | 0.4065 | 0.3683 |
| arguana | 0.3179 | 0.6361 | 0.5063 |
| cqadupstack-android | 0.3812 | 0.5075 | 0.4653 |
| cqadupstack-english | 0.3441 | 0.4857 | 0.4436 |
| cqadupstack-gaming | 0.4827 | 0.5965 | 0.5628 |
| cqadupstack-gis | 0.2893 | 0.4127 | 0.3683 |
| cqadupstack-mathematica | 0.2036 | 0.3163 | 0.2744 |
| cqadupstack-physics | 0.3213 | 0.4722 | 0.4138 |
| cqadupstack-programmers | 0.2803 | 0.4242 | 0.3732 |
| cqadupstack-stats | 0.2728 | 0.3732 | 0.3400 |
| cqadupstack-tex | 0.2256 | 0.3115 | 0.2930 |
| cqadupstack-unix | 0.2779 | 0.4219 | 0.3620 |
| cqadupstack-webmasters | 0.3070 | 0.4065 | 0.3723 |
| cqadupstack-wordpress | 0.2485 | 0.3547 | 0.3362 |
| scidocs | 0.1502 | 0.2170 | 0.1943 |
| scifact | 0.6795 | 0.7408 | 0.7440 |
RRF consistently outperforms sparse retrieval alone. On most datasets the fused score falls between sparse and dense; on scifact it exceeds both. The degree of improvement depends on how complementary the sparse and dense rankings are for each domain.
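This pattern can be spot-checked directly from the comparison table. The tuples below are (sparse, dense, hybrid) nDCG@10 values copied from three of the DuckDB rows above:

```python
# (sparse, dense, hybrid) nDCG@10 values copied from the comparison table.
scores = {
    "nfcorpus": (0.3206, 0.3735, 0.3621),
    "fiqa": (0.2378, 0.4065, 0.3683),
    "scifact": (0.6795, 0.7408, 0.7440),
}
for name, (sparse, dense, hybrid) in scores.items():
    # Does fusion beat each individual method on this dataset?
    print(name, hybrid > sparse, hybrid > dense)
# → nfcorpus True False
# → fiqa True False
# → scifact True True
```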
