DuckDBSearcher runs sparse (BM25), dense (cosine similarity), or hybrid (Reciprocal Rank Fusion) retrieval over tables stored in a DuckDB database file. It implements the abstract Searcher base class.
from quackir.search import DuckDBSearcher

Constructor

DuckDBSearcher(db_path="duck.db")
Opens a DuckDB connection to the specified file.
db_path
string
default:"duck.db"
Path to the DuckDB database file produced by DuckDBIndexer.

Methods

search

searcher.search(
    method,
    query_id=None,
    query_string=None,
    query_embedding=None,
    top_n=5,
    tokenize_query=True,
    table_names=["corpus"],
    rrf_k=60,
)
Main entry point for retrieval. Dispatches to fts_search, embedding_search, or rrf_search based on method, then filters out the query_id document from the results (useful when the query is itself a document in the index).
method
SearchType
required
SearchType.SPARSE, SearchType.DENSE, or SearchType.HYBRID.
query_id
string
default:"None"
Document ID to exclude from results. Pass the query document’s own ID to avoid self-matches.
query_string
string
default:"None"
Text query for sparse or hybrid search. Required when method is SPARSE or HYBRID.
query_embedding
number[]
default:"None"
Query vector (list of floats) for dense or hybrid search. Required when method is DENSE or HYBRID.
top_n
number
default:"5"
Maximum number of results to return.
tokenize_query
boolean
default:"True"
When True and method is SPARSE or HYBRID, the query_string is tokenized with Pyserini’s Lucene Analyzer before querying.
table_names
string[]
default:"[\"corpus\"]"
Table(s) to search. For HYBRID, provide two names, one sparse table and one dense table, e.g. [sparse_table, dense_table]; rrf_search detects which is which.
rrf_k
number
default:"60"
RRF rank smoothing constant. Only used when method is SearchType.HYBRID.
return
list
List of (doc_id, score) tuples ordered by descending score. The query_id document is excluded if provided.
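The argument requirements listed above can be checked on the caller's side before a query is issued. The following is a minimal sketch, not part of QuackIR: `check_search_args` is a hypothetical helper, and the `SearchType` enum is a local stand-in so the snippet runs on its own. Whether the library itself raises errors for missing arguments is not stated here.

```python
from enum import Enum


class SearchType(Enum):
    # Local stand-in for quackir.SearchType so the sketch is self-contained.
    SPARSE = "sparse"
    DENSE = "dense"
    HYBRID = "hybrid"


def check_search_args(method, query_string=None, query_embedding=None,
                      table_names=("corpus",)):
    """Return a list of problems with the arguments; an empty list means OK."""
    problems = []
    if method in (SearchType.SPARSE, SearchType.HYBRID) and not query_string:
        problems.append("query_string is required for SPARSE and HYBRID")
    if method in (SearchType.DENSE, SearchType.HYBRID) and query_embedding is None:
        problems.append("query_embedding is required for DENSE and HYBRID")
    if method is SearchType.HYBRID and len(table_names) != 2:
        problems.append("HYBRID expects two table names")
    return problems


# A hybrid call with no embedding and only the default table has two problems.
print(check_search_args(SearchType.HYBRID, query_string="neural retrieval"))
```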

fts_search

searcher.fts_search(query_string, top_n=5, table_name="corpus")
Executes a BM25 full-text search using DuckDB’s FTS extension with BM25 parameters k=0.9 (term-frequency saturation) and b=0.4 (document-length normalization).
query_string
string
required
Pre-processed query string (tokenized if called via search()).
top_n
number
default:"5"
Maximum number of results to return.
table_name
string
default:"corpus"
Name of the sparse table to search.
return
list
List of (id, score) tuples.
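To show what k and b control, here is a small pure-Python BM25 scorer using the same parameter values. It is an illustration only: it uses one common BM25 variant (Lucene-style IDF), and the exact formula inside DuckDB's FTS extension may differ. The function name `bm25_scores` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k=0.9, b=0.4):
    """Score each document in `docs` (id -> list of tokens) against the query."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in docs.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # b scales the penalty for documents longer than average.
            norm = 1 - b + b * len(tokens) / avg_len
            # k caps how much repeated occurrences of a term can add.
            score += idf * tf[term] * (k + 1) / (tf[term] + k * norm)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


docs = {
    "d1": ["neural", "retrieval", "models"],
    "d2": ["sparse", "retrieval"],
    "d3": ["cooking", "recipes"],
}
ranked = bm25_scores(["neural", "retrieval"], docs)
for doc_id, score in ranked:
    print(doc_id, round(score, 4))
```

Documents that match no query term are left out of the result, so the toy query ranks only d1 and d2.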

embedding_search

searcher.embedding_search(query_embedding, top_n=5, table_name="corpus")
Computes cosine similarity between the query vector and all stored embeddings using array_cosine_similarity.
query_embedding
number[]
required
Query vector as a list of floats.
top_n
number
default:"5"
Maximum number of results to return.
table_name
string
default:"corpus"
Name of the dense table to search.
return
list
List of (id, score) tuples ordered by descending cosine similarity.
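The brute-force comparison described above can be sketched in plain Python. This is a conceptual equivalent, not the SQL the library runs: `cosine_similarity` and `brute_force_search` are hypothetical names, and the two-dimensional vectors are toy data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query_embedding, embeddings, top_n=5):
    """Compare the query with every stored vector and keep the best top_n."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, vector))
        for doc_id, vector in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


embeddings = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
top = brute_force_search([1.0, 0.0], embeddings, top_n=2)
print(top)
```

Because every stored embedding is compared with the query, the cost grows linearly with the number of rows in the table.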

rrf_search

searcher.rrf_search(query_string, query_embedding, top_n=5, k=60, table_names=["sparse", "dense"])
Combines BM25 and cosine similarity rankings using Reciprocal Rank Fusion (RRF). Each result’s RRF score is:
rrf_score = 1 / (k + sparse_rank) + 1 / (k + dense_rank)
The sparse and dense tables are auto-detected from table_names using get_search_type.
query_string
string
required
Query string for BM25 retrieval. Should already be tokenized.
query_embedding
number[]
required
Query vector for cosine similarity retrieval.
top_n
number
default:"5"
Number of candidates fetched from each sub-ranker before fusion.
k
number
default:"60"
RRF rank smoothing constant.
table_names
string[]
default:"[\"sparse\", \"dense\"]"
Two table names. The method detects which is sparse and which is dense automatically.
return
list
List of (id, rrf_score) tuples ordered by descending RRF score.
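The RRF formula above can be worked through on two small rankings. The sketch below makes two assumptions that are not confirmed here: ranks start at 1, and a document missing from one ranking contributes nothing for that ranker. `rrf_fuse` and the document IDs are hypothetical.

```python
def rrf_fuse(sparse_ids, dense_ids, k=60, top_n=5):
    """Fuse two ranked ID lists with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in (sparse_ids, dense_ids):
        # Assumption: ranks are 1-based, so the top hit adds 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return fused[:top_n]


sparse_ids = ["d1", "d2", "d3"]  # best BM25 hit first
dense_ids = ["d2", "d4", "d1"]   # best cosine-similarity hit first
fused = rrf_fuse(sparse_ids, dense_ids)
for doc_id, score in fused:
    print(doc_id, round(score, 6))
```

Here d2 (ranks 2 and 1) edges out d1 (ranks 1 and 3), and documents found by only one ranker fall to the bottom. A larger k flattens the differences between ranks.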

get_search_type

searcher.get_search_type(table_name)
Inspects the table’s column names to determine whether it is a sparse or dense table.
table_name
string
required
Table to inspect.
return
SearchType
SearchType.SPARSE if a contents column exists; SearchType.DENSE if an embedding column exists.
Raises ValueError if neither column is found.
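The column-based detection can be sketched as a pure function over a list of column names. This is an illustration under assumptions: `detect_search_type` is a hypothetical name, the `SearchType` enum is a local stand-in, and checking `contents` before `embedding` is a guess about the order.

```python
from enum import Enum


class SearchType(Enum):
    # Local stand-in for quackir.SearchType so the sketch is self-contained.
    SPARSE = "sparse"
    DENSE = "dense"
    HYBRID = "hybrid"


def detect_search_type(column_names):
    """Classify a table as sparse or dense from its column names."""
    columns = set(column_names)
    if "contents" in columns:
        return SearchType.SPARSE
    if "embedding" in columns:
        return SearchType.DENSE
    raise ValueError("Table has neither a contents nor an embedding column")


print(detect_search_type(["id", "contents"]))
print(detect_search_type(["id", "embedding"]))
```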

filter_id

DuckDBSearcher.filter_id(results, query_id)
Static method. Removes the entry whose id matches query_id from a results list. Called automatically by search().
results
list
required
List of (id, score) tuples.
query_id
string
required
Document ID to remove.
return
list
Filtered list with the matching entry removed.
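The filtering step amounts to a single list comprehension. The function below is a stand-alone sketch of the described behavior, not the library's source.

```python
def filter_id(results, query_id):
    """Drop the (id, score) entry whose id equals query_id, keeping order."""
    return [(doc_id, score) for doc_id, score in results if doc_id != query_id]


results = [("d1", 0.9), ("d2", 0.8), ("d3", 0.7)]
filtered = filter_id(results, "d2")
print(filtered)
```

If the filter runs after the top_n cut, as the description of search() suggests, a self-match leaves one result fewer than requested. Callers that need exactly top_n hits may want to request top_n + 1.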

close

searcher.close()
Closes the underlying DuckDB connection.
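Because the searcher exposes a plain close() method, the standard library's contextlib.closing can guarantee the connection is released even if a query raises. The sketch uses a hypothetical `FakeSearcher` stand-in so it runs without a database; the same pattern should apply to any object with a close() method.

```python
from contextlib import closing


class FakeSearcher:
    """Hypothetical stand-in that mimics only search() and close()."""

    def __init__(self):
        self.closed = False

    def search(self, **kwargs):
        return [("d1", 1.0)]

    def close(self):
        self.closed = True


searcher = FakeSearcher()
# closing() calls searcher.close() when the block exits, even on error.
with closing(searcher) as active:
    results = active.search(query_string="information retrieval")

print(results, searcher.closed)
```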

Examples

Sparse search (BM25)

from quackir import SearchType
from quackir.search import DuckDBSearcher

searcher = DuckDBSearcher("sparse.db")

results = searcher.search(
    method=SearchType.SPARSE,
    query_string="information retrieval benchmarks",
    top_n=10,
)

for doc_id, score in results:
    print(doc_id, score)

searcher.close()

Dense search (cosine similarity)

from quackir import SearchType
from quackir.search import DuckDBSearcher

query_vector = [0.12, -0.34, 0.56]  # replace with a real embedding

searcher = DuckDBSearcher("dense.db")

results = searcher.search(
    method=SearchType.DENSE,
    query_embedding=query_vector,
    top_n=10,
    table_names=["dense_corpus"],
)

for doc_id, score in results:
    print(doc_id, score)

searcher.close()

Hybrid search (RRF)

from quackir import SearchType
from quackir.search import DuckDBSearcher

searcher = DuckDBSearcher("hybrid.db")

results = searcher.search(
    method=SearchType.HYBRID,
    query_string="neural retrieval",
    query_embedding=[0.12, -0.34, 0.56],
    top_n=10,
    table_names=["sparse_corpus", "dense_corpus"],
    rrf_k=60,
)

for doc_id, score in results:
    print(doc_id, score)

searcher.close()
