This quickstart walks you through indexing a corpus and running your first retrieval query with QuackIR. By the end you will have a working sparse or dense retrieval pipeline running locally on DuckDB, with no server setup required.
Input documents must be in JSONL format. Each line should be a JSON object with an `id` field and a `contents` field (for sparse) or a `vector` field (for dense):

```json
{"id": "doc1", "contents": "the quick brown fox jumps over the lazy dog"}
```
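If your documents start out as Python objects, a corpus file in this shape can be produced with the standard library alone. This is a minimal sketch; the documents below are made up for illustration:

```python
import json

# Toy documents; a real corpus would come from your own data.
docs = [
    {"id": "doc1", "contents": "the quick brown fox jumps over the lazy dog"},
    {"id": "doc2", "contents": "a lobster roll is a sandwich served in New England"},
]

# Write one JSON object per line (JSONL).
with open("corpus.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```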
Sparse retrieval with DuckDB
Sparse retrieval uses BM25 full-text search, which is fast and requires no embedding model.
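For intuition about what BM25 computes, here is a minimal scorer over pre-tokenized documents. This is a sketch of the standard Okapi BM25 formula with common defaults (k1=1.2, b=0.75), not how QuackIR computes scores internally; QuackIR delegates scoring to the database's full-text search:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized doc against query terms with Okapi BM25."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n  # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)  # term frequency in this doc
        # Length-normalized term frequency, saturated by k1.
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "a lobster roll is a sandwich".split(),
]
print(bm25_score("lobster roll".split(), corpus[1], corpus))
```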
Index your corpus
Load a JSONL corpus into DuckDB and build the full-text search index.

```python
from quackir.index import DuckDBIndexer
from quackir import IndexType

table_name = "corpus"
index_type = IndexType.SPARSE
corpus_file = "corpus.jsonl"  # path to your JSONL file

indexer = DuckDBIndexer()
indexer.init_table(table_name, index_type)
indexer.load_table(table_name, corpus_file)
indexer.fts_index(table_name)
indexer.close()
```
`init_table` creates the schema, `load_table` inserts documents (tokenizing text automatically), and `fts_index` builds the BM25 index.

Search the index
Query the index using BM25 sparse retrieval.

```python
from quackir.search import DuckDBSearcher
from quackir import SearchType

table_name = "corpus"
query = "what is a lobster roll"

searcher = DuckDBSearcher()
results = searcher.search(
    SearchType.SPARSE,
    query_string=query,
    table_names=[table_name]
)
print(results)  # list of (doc_id, score) tuples
searcher.close()
```
Dense retrieval with DuckDB
Dense retrieval uses pre-encoded vector embeddings and cosine similarity scoring.
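The score behind dense retrieval is a cosine similarity between the query vector and each document vector. A tiny self-contained sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions); note that when embeddings are L2-normalized at encoding time, cosine similarity reduces to a plain dot product:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = [0.1, 0.3, 0.9]
docs = {"doc1": [0.1, 0.2, 0.95], "doc2": [0.9, 0.1, 0.0]}

# Rank documents by similarity to the query, highest first.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)
```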
Encode your corpus
Use Pyserini to encode documents with a model like BGE-base-en-v1.5:

```shell
python -m pyserini.encode \
  input   --corpus corpus.jsonl \
          --fields contents \
  output  --embeddings indexes/corpus-embeddings \
  encoder --encoder BAAI/bge-base-en-v1.5 --l2-norm \
          --device cpu \
          --pooling mean \
          --batch 32
```
Index the embeddings
Load the encoded embeddings into DuckDB:

```python
from quackir.index import DuckDBIndexer
from quackir import IndexType

indexer = DuckDBIndexer()
indexer.init_table("corpus_dense", IndexType.DENSE, embedding_dim=768)
indexer.load_table("corpus_dense", "indexes/corpus-embeddings/embeddings.jsonl")
indexer.close()
```
Search with dense retrieval
Run a dense vector search using pre-encoded query vectors:

```python
from quackir.search import DuckDBSearcher
from quackir import SearchType
import json

searcher = DuckDBSearcher()
with open("indexes/query-embeddings/embeddings.jsonl") as f:
    for line in f:
        query = json.loads(line)
        results = searcher.search(
            SearchType.DENSE,
            query_id=query["id"],
            query_embedding=query["vector"],
            table_names=["corpus_dense"],
            top_n=10,
        )
        print(query["id"], results[:3])
searcher.close()
```
CLI usage
QuackIR also exposes a command-line interface for batch indexing and searching:
```shell
# Index a corpus (sparse)
python -m quackir.index \
  --db-type duckdb \
  --db-path database.db \
  --input corpus.jsonl \
  --index-type sparse \
  --index corpus
```

```shell
# Search with a topics file (sparse)
python -m quackir.search \
  --db-type duckdb \
  --db-path database.db \
  --topics queries.jsonl \
  --search-method sparse \
  --index corpus \
  --output run.txt
```
Results are saved in TREC run format: `query_id Q0 doc_id rank score run_tag`.
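If you want to post-process a run programmatically, the format is easy to parse. A small helper sketch; the file contents below are illustrative, not real output:

```python
from collections import defaultdict

def read_trec_run(path):
    """Parse a TREC run file into {query_id: [(doc_id, rank, score), ...]}."""
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            qid, _q0, doc_id, rank, score, _tag = line.split()
            run[qid].append((doc_id, int(rank), float(score)))
    return run

# Tiny made-up run file for demonstration.
with open("run.txt", "w") as f:
    f.write("q1 Q0 doc7 1 14.2 quackir\n")
    f.write("q1 Q0 doc3 2 11.8 quackir\n")

run = read_trec_run("run.txt")
print(run["q1"])
```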
Evaluate results
Use Pyserini’s trec_eval wrapper to measure retrieval effectiveness:
```shell
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 qrels.txt run.txt
```