This quickstart walks you through indexing a corpus and running your first retrieval query with QuackIR. By the end you will have a working sparse or dense retrieval pipeline running locally on DuckDB, with no server setup required.
Input documents must be in JSONL format. Each line should be a JSON object with an `id` field and a `contents` field (for sparse) or a `vector` field (for dense):

```json
{"id": "doc1", "contents": "the quick brown fox jumps over the lazy dog"}
```
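If your documents start out as Python objects, a corpus file in this shape can be produced with the standard library alone. This is a minimal sketch; the documents below are made up for illustration:

```python
import json

# Toy documents; a real corpus would come from your own data.
docs = [
    {"id": "doc1", "contents": "the quick brown fox jumps over the lazy dog"},
    {"id": "doc2", "contents": "a lobster roll is a sandwich served in New England"},
]

# Write one JSON object per line (JSONL).
with open("corpus.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```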
Sparse retrieval with DuckDB
Sparse retrieval uses BM25 full-text search, which is fast and requires no embedding model.
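For intuition about what BM25 computes, here is a minimal scorer over pre-tokenized documents. This is a sketch of the standard Okapi BM25 formula with common defaults (k1=1.2, b=0.75), not how QuackIR computes scores internally; QuackIR delegates scoring to the database's full-text search:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized doc against query terms with Okapi BM25."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n  # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)  # term frequency in this doc
        # Length-normalized term frequency, saturated by k1.
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "a lobster roll is a sandwich".split(),
]
print(bm25_score("lobster roll".split(), corpus[1], corpus))
```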
Index your corpus
Load a JSONL corpus into DuckDB and build the full-text search index.

```python
from quackir.index import DuckDBIndexer
from quackir import IndexType

table_name = "corpus"
index_type = IndexType.SPARSE
corpus_file = "corpus.jsonl"  # path to your JSONL file

indexer = DuckDBIndexer()
indexer.init_table(table_name, index_type)
indexer.load_table(table_name, corpus_file)
indexer.fts_index(table_name)
indexer.close()
```
`init_table` creates the schema, `load_table` inserts documents (tokenizing text automatically), and `fts_index` builds the BM25 index.

Search the index
Query the index using BM25 sparse retrieval.

```python
from quackir.search import DuckDBSearcher
from quackir import SearchType

table_name = "corpus"
query = "what is a lobster roll"

searcher = DuckDBSearcher()
results = searcher.search(
    SearchType.SPARSE,
    query_string=query,
    table_names=[table_name]
)
print(results)  # list of (doc_id, score) tuples
searcher.close()
```
Dense retrieval with DuckDB
Dense retrieval uses pre-encoded vector embeddings and cosine similarity scoring.
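The score behind dense retrieval is a cosine similarity between the query vector and each document vector. A tiny self-contained sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions); note that when embeddings are L2-normalized at encoding time, cosine similarity reduces to a plain dot product:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = [0.1, 0.3, 0.9]
docs = {"doc1": [0.1, 0.2, 0.95], "doc2": [0.9, 0.1, 0.0]}

# Rank documents by similarity to the query, highest first.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)
```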
Encode your corpus
Use Pyserini to encode documents with a model like BGE-base-en-v1.5:

```shell
python -m pyserini.encode \
  input   --corpus corpus.jsonl \
          --fields contents \
  output  --embeddings indexes/corpus-embeddings \
  encoder --encoder BAAI/bge-base-en-v1.5 --l2-norm \
          --device cpu \
          --pooling mean \
          --batch 32
```
Index the embeddings
Load the encoded embeddings into DuckDB:

```python
from quackir.index import DuckDBIndexer
from quackir import IndexType

indexer = DuckDBIndexer()
indexer.init_table("corpus_dense", IndexType.DENSE, embedding_dim=768)
indexer.load_table("corpus_dense", "indexes/corpus-embeddings/embeddings.jsonl")
indexer.close()
```
Search with dense retrieval
Run a dense vector search using pre-encoded query vectors:

```python
from quackir.search import DuckDBSearcher
from quackir import SearchType
import json

searcher = DuckDBSearcher()
with open("indexes/query-embeddings/embeddings.jsonl") as f:
    for line in f:
        query = json.loads(line)
        results = searcher.search(
            SearchType.DENSE,
            query_id=query["id"],
            query_embedding=query["vector"],
            table_names=["corpus_dense"],
            top_n=10,
        )
        print(query["id"], results[:3])
searcher.close()
```
CLI usage
QuackIR also exposes a command-line interface for batch indexing and searching:
```shell
# Index a corpus (sparse)
python -m quackir.index \
  --db-type duckdb \
  --db-path database.db \
  --input corpus.jsonl \
  --index-type sparse \
  --index corpus
```

```shell
# Search with a topics file (sparse)
python -m quackir.search \
  --db-type duckdb \
  --db-path database.db \
  --topics queries.jsonl \
  --search-method sparse \
  --index corpus \
  --output run.txt
```
Results are saved in TREC run format: `query_id Q0 doc_id rank score run_tag`.
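If you want to post-process a run programmatically, the format is easy to parse. A small helper sketch; the file contents below are illustrative, not real output:

```python
from collections import defaultdict

def read_trec_run(path):
    """Parse a TREC run file into {query_id: [(doc_id, rank, score), ...]}."""
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            qid, _q0, doc_id, rank, score, _tag = line.split()
            run[qid].append((doc_id, int(rank), float(score)))
    return run

# Tiny made-up run file for demonstration.
with open("run.txt", "w") as f:
    f.write("q1 Q0 doc7 1 14.2 quackir\n")
    f.write("q1 Q0 doc3 2 11.8 quackir\n")

run = read_trec_run("run.txt")
print(run["q1"])
```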
Evaluate results
Use Pyserini’s trec_eval wrapper to measure retrieval effectiveness:
```shell
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 qrels.txt run.txt
```