

This quickstart walks you through indexing a corpus and running your first retrieval query with QuackIR. By the end you will have a working sparse or dense retrieval pipeline running locally using DuckDB — no server setup required.
Input documents must be in JSONL format: one JSON object per line with an id field plus a contents field (for sparse retrieval) or a vector field (for dense retrieval):
{"id": "doc1", "contents": "the quick brown fox jumps over the lazy dog"}
{"id": "doc1", "vector": [0.12, -0.08, 0.45]}
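
As a sanity check, a few lines of plain Python (stdlib json only) can generate a toy corpus in this shape; the filename corpus.jsonl matches the snippets below:

```python
import json

# Toy documents; "contents" is the text that the sparse BM25 index tokenizes.
docs = [
    {"id": "doc1", "contents": "the quick brown fox jumps over the lazy dog"},
    {"id": "doc2", "contents": "a lobster roll is a sandwich from New England"},
]

with open("corpus.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")  # one JSON object per line
```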

Sparse retrieval with DuckDB

Sparse retrieval uses BM25 full-text search, which is fast and requires no embedding model.
1. Index your corpus

Load a JSONL corpus into DuckDB and build the full-text search index.
from quackir.index import DuckDBIndexer
from quackir import IndexType

table_name = "corpus"
index_type = IndexType.SPARSE
corpus_file = "corpus.jsonl"  # path to your JSONL file

indexer = DuckDBIndexer()
indexer.init_table(table_name, index_type)
indexer.load_table(table_name, corpus_file)
indexer.fts_index(table_name)

indexer.close()
init_table creates the schema, load_table inserts documents (tokenizing text automatically), and fts_index builds the BM25 index.
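
DuckDB's full-text search extension computes BM25 internally, so you never score documents yourself. Purely to illustrate what the index enables, here is a minimal self-contained BM25 sketch over pre-tokenized documents (illustrative only, not QuackIR or DuckDB code):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized doc in docs against query_terms with classic BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "the quick brown fox".split(),
    "a lazy brown dog".split(),
]
print(bm25_scores(["brown", "fox"], docs))  # first doc matches both terms
```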
2. Search the index

Query the index using BM25 sparse retrieval.
from quackir.search import DuckDBSearcher
from quackir import SearchType

table_name = "corpus"
query = "what is a lobster roll"

searcher = DuckDBSearcher()
results = searcher.search(
    SearchType.SPARSE,
    query_string=query,
    table_names=[table_name]
)
print(results)  # list of (doc_id, score) tuples

searcher.close()

Dense retrieval with DuckDB

Dense retrieval uses pre-encoded vector embeddings and cosine similarity scoring.
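
Cosine similarity measures the angle between two vectors; a minimal pure-Python sketch of the scoring function (not QuackIR's SQL implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

Note that for L2-normalized embeddings (as produced with the --l2-norm flag below), cosine similarity reduces to a plain dot product.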
1. Encode your corpus

Use Pyserini to encode documents with a model like BGE-base-en-v1.5:
python -m pyserini.encode \
  input   --corpus corpus.jsonl \
          --fields contents \
  output  --embeddings indexes/corpus-embeddings \
  encoder --encoder BAAI/bge-base-en-v1.5 \
          --l2-norm \
          --device cpu \
          --pooling mean \
          --batch 32
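
The encoder writes an embeddings.jsonl file whose lines carry a document id and its vector (the shape consumed in the next step). Before indexing, it is worth confirming that the vector length matches the embedding_dim you pass to init_table; a sketch against a stub line rather than a real output file:

```python
import json

# Stub of one encoder output line; a real run of the command above writes
# lines of this shape to indexes/corpus-embeddings/embeddings.jsonl.
line = json.dumps({"id": "doc1", "vector": [0.1, -0.2, 0.3]})

record = json.loads(line)
dim = len(record["vector"])
print(record["id"], dim)  # dim is what embedding_dim must match (768 for BGE-base)
```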
2. Index the embeddings

Load the encoded embeddings into DuckDB:
from quackir.index import DuckDBIndexer
from quackir import IndexType

indexer = DuckDBIndexer()
indexer.init_table("corpus_dense", IndexType.DENSE, embedding_dim=768)
indexer.load_table("corpus_dense", "indexes/corpus-embeddings/embeddings.jsonl")
indexer.close()
3. Search with dense retrieval

Run a dense vector search using a pre-encoded query vector:
from quackir.search import DuckDBSearcher
from quackir import SearchType
import json

searcher = DuckDBSearcher()

with open("indexes/query-embeddings/embeddings.jsonl") as f:
    for line in f:
        query = json.loads(line)
        results = searcher.search(
            SearchType.DENSE,
            query_id=query["id"],
            query_embedding=query["vector"],
            table_names=["corpus_dense"],
            top_n=10,
        )
        print(query["id"], results[:3])

searcher.close()

CLI usage

QuackIR also exposes a command-line interface for batch indexing and searching:
# Index a corpus (sparse)
python -m quackir.index \
  --db-type duckdb \
  --db-path database.db \
  --input corpus.jsonl \
  --index-type sparse \
  --index corpus

# Search with a topics file (sparse)
python -m quackir.search \
  --db-type duckdb \
  --db-path database.db \
  --topics queries.jsonl \
  --search-method sparse \
  --index corpus \
  --output run.txt
Results are saved in TREC run format: query_id Q0 doc_id rank score run_tag.
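
The same format is easy to produce by hand if you run searches from Python instead of the CLI; a sketch assuming results is a ranked list of (doc_id, score) tuples and using a hypothetical run tag:

```python
results = [("doc3", 7.41), ("doc1", 5.02)]  # stand-in for searcher output
query_id = "q1"
run_tag = "quackir"  # hypothetical run tag

with open("run.txt", "w") as f:
    for rank, (doc_id, score) in enumerate(results, start=1):
        # TREC run format: query_id Q0 doc_id rank score run_tag
        f.write(f"{query_id} Q0 {doc_id} {rank} {score} {run_tag}\n")
```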

Evaluate results

Use Pyserini’s trec_eval wrapper to measure retrieval effectiveness:
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 qrels.txt run.txt
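
For intuition about the metric: nDCG@10 compares the discounted gain of your ranking against an ideal reordering of the same relevance judgments. A minimal sketch of the formula (not a replacement for trec_eval, which also handles qrels parsing and graded relevance across many queries):

```python
import math

def ndcg_at_k(ranked_rels, k=10):
    """nDCG@k over a list of relevance grades in retrieved rank order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rels[:k]))
    ideal = sorted(ranked_rels, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([1, 0, 1, 0]))  # relevant docs retrieved at ranks 1 and 3
```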
For a complete end-to-end worked example including data preparation and evaluation on a real IR benchmark, see the NFCorpus experiment walkthrough.
