NFCorpus is a full-text learning-to-rank dataset for medical information retrieval containing just 3,633 documents — small enough to run everything on a laptop. This guide walks through both sparse BM25 and dense BGE-base-en-v1.5 retrieval using DuckDB as the backend, demonstrating that a relational database can match the effectiveness of dedicated IR systems like Lucene and Faiss.
The key insight from this guide: the same bi-encoder conceptual framework applies across implementations. Sparse and dense retrieval differ only in their encoder representations — sparse lexical vectors versus dense embedding vectors.
Learning outcomes
After completing this guide, you will be able to:
- Index NFCorpus in DuckDB with QuackIR and build an FTS index for sparse retrieval.
- Encode documents and queries with the BGE-base-en-v1.5 model using Pyserini, producing L2-normalized 768-dimensional vectors.
- Compute query–document scores for dense retrieval using cosine similarity.
- Write TREC-format run files for both sparse and dense retrieval.
- Evaluate runs with trec_eval (nDCG@10) and compare to Lucene/Faiss baselines.
Installation
Make sure QuackIR is installed before proceeding. See the installation guide for setup instructions. Ensure you are running commands inside your conda environment.

Part 1: Sparse retrieval with BM25
Download the NFCorpus dataset
Fetch and extract the NFCorpus data, then inspect the first document to see what the corpus looks like. Each line is a JSON object with _id, title, text, and metadata fields.

Prepare the corpus
QuackIR expects documents in {"id": ..., "contents": ...} format. Run a short Python script to merge the title and text fields into contents, convert the queries from JSONL to TSV format, and convert the relevance judgments (qrels) to TREC format.
Index the corpus

Index the documents into DuckDB and build the FTS index. Here is what each step does:
- init_table — creates a DuckDB table with the appropriate schema for storing documents. For sparse retrieval, this includes a text column for contents.
- load_table — inserts all documents from the JSONL file into the database table.
- fts_index — builds the full-text search (FTS) index using DuckDB's BM25-style scoring.
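You never write this SQL yourself, but to make the steps concrete, here is roughly what they boil down to, issued directly through the duckdb Python package. Table and file names are illustrative, and QuackIR's generated SQL may differ in its details.

```python
import duckdb

con = duckdb.connect("nfcorpus.db")

# init_table + load_table: create a documents table straight from the JSONL file.
con.execute("""
    CREATE OR REPLACE TABLE documents AS
    SELECT id, contents
    FROM read_json_auto('nfcorpus/corpus_quackir.jsonl')
""")

# fts_index: build the full-text search index with DuckDB's fts extension.
con.execute("INSTALL fts")
con.execute("LOAD fts")
con.execute("PRAGMA create_fts_index('documents', 'id', 'contents')")
```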
Run sparse retrieval
Run retrieval for all queries and write the results to a TREC-format run file. QuackIR translates your Python calls into SQL queries that DuckDB executes using its FTS capabilities, so you do not need to write any SQL yourself.

Single-query example

You can also run retrieval for an individual query, inspect its results, and verify that they match the batch run. Notice how similar the QuackIR API is to Pyserini's LuceneSearcher interface — both provide a clean, Pythonic API even though they use different backends.
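This is not the QuackIR API itself, but a minimal sketch of the equivalent raw DuckDB query and of writing results in the six-column TREC run format. The query file path, run tag, and run depth are illustrative.

```python
import duckdb

con = duckdb.connect("nfcorpus.db")
con.execute("LOAD fts")

def bm25_search(query_text):
    # fts_main_documents is the schema created by create_fts_index on "documents".
    return con.execute("""
        SELECT id, score
        FROM (
            SELECT id, fts_main_documents.match_bm25(id, ?) AS score
            FROM documents
        ) AS scored
        WHERE score IS NOT NULL
        ORDER BY score DESC
        LIMIT 1000
    """, [query_text]).fetchall()

# Write a TREC-format run: qid Q0 docid rank score run_tag
with open("run.nfcorpus.bm25.txt", "w") as run, open("nfcorpus/queries.tsv") as queries:
    for line in queries:
        qid, qtext = line.rstrip("\n").split("\t", 1)
        for rank, (docid, score) in enumerate(bm25_search(qtext), start=1):
            run.write(f"{qid} Q0 {docid} {rank} {score:.6f} quackir-bm25\n")
```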
Evaluate sparse retrieval

Evaluate the run with trec_eval. The resulting nDCG@10 score of 0.3206 is very close to the Lucene BM25 baseline of 0.3218. The small difference is due to minor formula variations between DuckDB and Lucene: DuckDB's BM25 explicitly includes a (k1 + 1) multiplier and does not use Lucene's document-length caching strategy. Despite these implementation differences, the effectiveness is nearly identical.
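For reference, the textbook BM25 scoring function with the explicit (k1 + 1) factor is:

$$
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1\left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}
$$

Here f(t, d) is the frequency of term t in document d, |d| is the document length, and avgdl is the average document length in the corpus. Because (k1 + 1) is a constant, dropping it rescales scores without changing the ranking, so the small effectiveness gap is attributable to the other differences, such as the document-length handling.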
Part 2: Dense retrieval with BGE-base-en-v1.5

Dense retrieval uses the BAAI/bge-base-en-v1.5 encoder to produce 768-dimensional L2-normalized vectors. QuackIR does not include encoding functionality, so Pyserini handles the encoding step.
Encode documents with Pyserini
Encode the corpus using the BGE-base-en-v1.5 model with Pyserini. This takes a few minutes on a laptop since it performs neural inference on the CPU. Inspect the first output line to verify the encoding worked: you should see a JSON line with id, contents, and a vector field containing 768 floats.
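The Pyserini command does the heavy lifting, but conceptually the encoding step looks like the following sketch, which uses Hugging Face Transformers directly; the example text is illustrative.

```python
# A conceptual sketch of the encoding step; Pyserini wraps the same model
# and handles batching and writing the JSONL output.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5")
model.eval()

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    # BGE uses the [CLS] token embedding, followed by L2 normalization,
    # which yields unit-length 768-dimensional vectors.
    cls = output.last_hidden_state[:, 0]
    return torch.nn.functional.normalize(cls, p=2, dim=1)

vectors = encode(["Vitamin D and the risk of chronic disease."])  # illustrative text
print(vectors.shape)  # torch.Size([1, 768])
```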
Convert embeddings to Parquet

QuackIR's DuckDB indexer expects documents with only id and vector fields. Convert the Pyserini JSONL output to Parquet format. You can keep the data in JSONL format instead of converting to Parquet, as long as you include only the id and vector fields; Parquet is used here to demonstrate that the indexer can read it.
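A minimal sketch of the conversion using pandas (paths are illustrative; pyarrow must be installed for Parquet output):

```python
import pandas as pd

# Read the Pyserini JSONL output (one JSON object per line with id,
# contents, and vector fields) and keep only what the indexer needs.
df = pd.read_json("embeddings/embeddings.jsonl", lines=True)
df[["id", "vector"]].to_parquet("embeddings/nfcorpus.parquet", index=False)
```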
Index embeddings into DuckDB

Load the pre-encoded vectors into a DuckDB table. Here is what each step does:
- init_table — creates a DuckDB table with columns for document ID and a 768-dimensional embedding vector.
- load_table — reads the Parquet file and inserts the pre-encoded vectors.
There is no fts_index step for dense retrieval; vector similarity is used instead of BM25. This step completes in a few seconds since it only loads precomputed vectors.
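As with sparse indexing, here is roughly what these steps amount to in raw DuckDB SQL. Names are illustrative, and QuackIR may declare the schema differently.

```python
import duckdb

con = duckdb.connect("nfcorpus-dense.db")

# init_table + load_table: a table with a fixed-size FLOAT[768] column,
# filled directly from the Parquet file of pre-encoded vectors.
con.execute("""
    CREATE OR REPLACE TABLE embeddings AS
    SELECT id, vector::FLOAT[768] AS vector
    FROM read_parquet('embeddings/nfcorpus.parquet')
""")
```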
Run dense retrieval

Run retrieval using the encoded query vectors and write another TREC-format run file. QuackIR uses DuckDB's array_cosine_similarity function for vector similarity; under the hood it performs an exact brute-force search over all document vectors. Because documents and queries are encoded with --l2-norm, all embeddings are unit vectors, so cosine similarity equals the dot product: cos(θ) = u · v when ‖u‖ = ‖v‖ = 1.
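A minimal sketch of the similarity query, again issued through the duckdb package directly. The query vector here is assumed to be the 768-dimensional list produced for a query by the BGE encoder.

```python
import duckdb

con = duckdb.connect("nfcorpus-dense.db")

def dense_search(query_vector, k=10):
    # Exact brute-force scan: score every stored document vector with
    # cosine similarity and keep the top k.
    return con.execute(f"""
        SELECT id, array_cosine_similarity(vector, ?::FLOAT[768]) AS score
        FROM embeddings
        ORDER BY score DESC
        LIMIT {int(k)}
    """, [query_vector]).fetchall()
```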
Summary of results

| Model | Backend | nDCG@10 | Baseline |
|---|---|---|---|
| BM25 (sparse) | DuckDB | 0.3206 | Lucene: 0.3218 |
| BGE-base-en-v1.5 (dense) | DuckDB | 0.3808 | Faiss: 0.3808 |
What have we learned?
- Sparse retrieval and dense retrieval are both instantiations of a bi-encoder architecture. The only difference is the encoder: sparse uses lexical term vectors, dense uses neural embeddings.
- With DuckDB, you build an FTS index for sparse and load pre-encoded embeddings for dense — there is no separate search engine required.
- DuckDB achieves nDCG@10 = 0.3206 for sparse (vs. Lucene’s 0.3218) and 0.3808 for dense (identical to the Faiss baseline). Relational databases are viable for retrieval, especially for RAG applications.
- For enterprises with existing relational databases, QuackIR adds retrieval capability without introducing new infrastructure like Elasticsearch or dedicated vector databases.
Reproduction log
Before moving on, add an entry to the Reproduction Log at the bottom of the source document: use yyyy-mm-dd for the date and a commit ID from the main trunk of QuackIR, with its 7-character hexadecimal prefix as the link anchor text.
- Results reproduced by @brandonzhou2002 on 2025-10-30 (commit c9a80ed)