QuackIR is a Python toolkit for reproducible information retrieval (IR) research built on top of relational database management systems. It supports sparse BM25 retrieval, dense vector search, and hybrid retrieval via Reciprocal Rank Fusion — all without requiring a dedicated search engine or vector database.

Installation

Set up QuackIR with conda and install all dependencies including DuckDB, PostgreSQL, and Pyserini.

Quickstart

Index a corpus and run your first sparse or dense retrieval query in minutes.

Guides

Learn how to index, search, and analyze text across DuckDB, SQLite, and PostgreSQL.

API Reference

Explore the full Python API for indexers, searchers, and analysis utilities.

What is QuackIR?

QuackIR demonstrates that relational database management systems (RDBMSes) like DuckDB can perform information retrieval with effectiveness comparable to established IR toolkits such as Lucene and Faiss. It is designed for researchers who want reproducible IR experiments and for practitioners who want to add retrieval capabilities to an existing relational database infrastructure.

Sparse Retrieval

BM25 full-text search using DuckDB FTS, SQLite FTS5, or PostgreSQL GIN indexes.
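As a rough illustration of what BM25 ranking computes, here is a minimal pure-Python sketch. It is not QuackIR's or any database's implementation: the function name `bm25_scores`, the whitespace tokenization, the IDF variant, and the parameter values `k1=1.2` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query (illustrative sketch)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # One common IDF variant; engines differ in the exact formula.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency, dampened and normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus, tokenized by whitespace (an assumption for this sketch).
docs = [
    "duckdb is an in process analytical database".split(),
    "bm25 ranks documents by term frequency".split(),
    "postgresql supports full text search".split(),
]
scores = bm25_scores(["full", "text", "search"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only the third document contains the query terms, so it is the only one with a non-zero score.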

Dense Retrieval

Cosine similarity vector search with pre-encoded embeddings in DuckDB or PostgreSQL.
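The idea behind cosine-similarity search can be sketched in a few lines of plain Python. This is a brute-force illustration only; the names `cosine_similarity` and `dense_search` and the toy two-dimensional vectors are hypothetical, and a real setup would use embeddings produced by an encoder model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dense_search(query_vec, doc_vecs, top_k=2):
    """Rank every document by similarity to the query and keep the top_k."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy pre-encoded embeddings (made up for this sketch).
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.6, 0.8], "d3": [0.0, 1.0]}
hits = dense_search([1.0, 0.0], doc_vecs)
print(hits)
```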

Hybrid Retrieval

Reciprocal Rank Fusion combining sparse and dense results in DuckDB or PostgreSQL.
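Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over the ranked lists it appears in. The sketch below shows that computation in plain Python; the function name `reciprocal_rank_fusion`, the constant `k=60` (a commonly used value), and the toy rankings are assumptions for illustration, not QuackIR's code.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list."""
    fused = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1; documents missing from a list contribute nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

sparse = ["d1", "d2", "d3"]  # hypothetical BM25 ranking
dense = ["d3", "d1", "d4"]   # hypothetical dense ranking
fused = reciprocal_rank_fusion([sparse, dense])
print([doc_id for doc_id, _ in fused])
```

Because RRF uses only ranks, it needs no calibration between BM25 scores and cosine similarities, which live on different scales.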

Getting started

1. Install QuackIR

Clone the repository and install all dependencies using conda and pip.
git clone https://github.com/castorini/quackir.git --recurse-submodules
cd quackir
conda create -n quackir python=3.10
conda activate quackir
pip install -r requirements.txt
2. Index your corpus

Load a JSONL corpus into DuckDB and build a full-text search index.
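from quackir.index import DuckDBIndexer
from quackir import IndexType

indexer = DuckDBIndexer()
# Create a table laid out for sparse (full-text) retrieval.
indexer.init_table("corpus", IndexType.SPARSE)
# Load the documents from the JSONL file into that table.
indexer.load_table("corpus", "corpus.jsonl")
# Build the full-text search index used for BM25 queries.
indexer.fts_index("corpus")
indexer.close()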
3. Run your first search

Query the index using BM25 sparse retrieval.
from quackir.search import DuckDBSearcher
from quackir import SearchType

searcher = DuckDBSearcher()
# Run a BM25 query against the "corpus" table built in the previous step.
results = searcher.search(
    SearchType.SPARSE,
    query_string="your query here",
    table_names=["corpus"],
)
print(results)
searcher.close()
4. Evaluate results

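Use Pyserini’s trec_eval wrapper to measure retrieval effectiveness with metrics such as nDCG and MAP. The command below reports nDCG@10 for the run file run.txt against the relevance judgments in qrels.txt.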
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 qrels.txt run.txt
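
trec_eval reads run files in the standard six-column TREC format: query ID, the literal Q0, document ID, rank, score, and a run tag. The search step above prints its results rather than saving them, so a small conversion may be needed first. The sketch below assumes the hits for one query are available as (document ID, score) pairs; that shape, the helper name `format_trec_run`, and the tag `quackir` are assumptions, so check what `searcher.search` actually returns before relying on it.

```python
def format_trec_run(query_id, hits, tag="quackir"):
    """Turn (doc_id, score) pairs, best first, into TREC run-file lines."""
    lines = []
    for rank, (doc_id, score) in enumerate(hits, start=1):
        # Columns: query ID, literal "Q0", doc ID, rank, score, run tag.
        lines.append(f"{query_id} Q0 {doc_id} {rank} {score:.6f} {tag}")
    return lines

# Hypothetical hits for a single query, already sorted by descending score.
hits = [("doc42", 12.5), ("doc7", 9.25)]
run_lines = format_trec_run("q1", hits)
print("\n".join(run_lines))
```

Writing these lines to run.txt, one block per query, gives a file the evaluation command can consume.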

Supported databases

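| Feature        | DuckDB | SQLite | PostgreSQL |
|----------------|--------|--------|------------|
| Sparse (BM25)  | Yes    | Yes    | Yes        |
| Dense (vector) | Yes    | No     | Yes        |
| Hybrid (RRF)   | Yes    | No     | Yes        |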

DuckDB runs in-process and requires no server setup, which makes it the recommended starting point for most use cases and the quickest way to get running with QuackIR.
