DuckDBIndexer creates and populates BM25 (sparse) or vector (dense) indexes stored in a DuckDB database file. It implements the abstract Indexer base class and supports loading data from both JSONL and Parquet files.
from quackir.index import DuckDBIndexer

Constructor

DuckDBIndexer(db_path="duck.db")
Opens a DuckDB connection to the specified file. The file is created if it does not already exist.
db_path
string
default:"duck.db"
Path to the DuckDB database file.

Methods

init_table

indexer.init_table(table_name, index_type, embedding_dim=768)
Drops any existing table with the given name and creates a new one with the schema appropriate for the requested index type.
  • Sparse schema: (id VARCHAR, contents VARCHAR)
  • Dense schema: (id VARCHAR, embedding DOUBLE[embedding_dim])
table_name
string
required
Name of the table to create.
index_type
IndexType
required
IndexType.SPARSE or IndexType.DENSE.
embedding_dim
number
default:"768"
Dimension of the embedding vectors. Only used when index_type is IndexType.DENSE.
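The two schemas above correspond roughly to the following CREATE TABLE statements. This is an illustrative sketch only (the helper name `schema_sql` is hypothetical, and the exact SQL QuackIR emits may differ):

```python
def schema_sql(table_name: str, index_type: str, embedding_dim: int = 768) -> str:
    """Illustrative sketch of the CREATE TABLE statement for each index type."""
    if index_type == "sparse":
        return f"CREATE TABLE {table_name} (id VARCHAR, contents VARCHAR)"
    if index_type == "dense":
        return f"CREATE TABLE {table_name} (id VARCHAR, embedding DOUBLE[{embedding_dim}])"
    raise ValueError(f"unknown index type: {index_type}")

print(schema_sql("corpus", "sparse"))
# CREATE TABLE corpus (id VARCHAR, contents VARCHAR)
```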

load_table

indexer.load_table(table_name, file_path, index_type=None, pretokenized=False)
Dispatches to load_jsonl_table or load_parquet_table based on the file extension. If index_type is None, it is detected automatically via get_index_type.
Parquet loading is currently restricted to IndexType.DENSE. Passing a .parquet file with IndexType.SPARSE raises a ValueError.
table_name
string
required
Target table name.
file_path
string
required
Path to the .jsonl or .parquet source file.
index_type
IndexType
default:"None"
Override the index type. When None, get_index_type(table_name) is called to detect it.
pretokenized
boolean
default:"false"
When False, contents values are tokenized with Pyserini’s Lucene Analyzer before insertion. Set True if your data is already tokenized.
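The extension-based dispatch described above can be sketched as follows (a simplified illustration; `pick_loader` is a hypothetical name, not part of the QuackIR API):

```python
from pathlib import Path

def pick_loader(file_path: str) -> str:
    # Sketch of load_table's extension-based dispatch.
    suffix = Path(file_path).suffix
    if suffix == ".jsonl":
        return "load_jsonl_table"
    if suffix == ".parquet":
        return "load_parquet_table"
    raise ValueError(f"unsupported file extension: {suffix}")

print(pick_loader("embeddings.parquet"))  # load_parquet_table
```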

load_jsonl_table

indexer.load_jsonl_table(table_name, file_path, index_type, pretokenized=False)
Reads a JSONL file and inserts rows into the table. Each line must be a JSON object with an id field and either a contents field (sparse) or a vector field (dense).
table_name
string
required
Target table name.
file_path
string
required
Path to the .jsonl file.
index_type
IndexType
required
IndexType.SPARSE or IndexType.DENSE.
pretokenized
boolean
default:"false"
Skip tokenization when True. Applies to sparse indexing only.
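For reference, one record per line in the two expected shapes (the field values here are made up):

```python
import json

# Hypothetical JSONL records in the shapes load_jsonl_table expects:
# sparse rows carry a contents field, dense rows carry a vector field.
sparse_line = json.dumps({"id": "doc1", "contents": "the quick brown fox"})
dense_line = json.dumps({"id": "doc1", "vector": [0.12, -0.03, 0.97]})

print(sparse_line)
print(dense_line)
```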

load_parquet_table

indexer.load_parquet_table(table_name, file_path, index_type, pretokenized=False)
Reads a Parquet file using DuckDB’s native read_parquet function and inserts the first two columns as id and embedding. Only used for dense indexes.
table_name
string
required
Target table name.
file_path
string
required
Path to the .parquet file.
index_type
IndexType
required
Must be IndexType.DENSE.
pretokenized
boolean
default:"false"
Unused for Parquet loading; present for interface consistency.

fts_index

indexer.fts_index(table_name="corpus")
Creates a DuckDB FTS index on the contents column of the specified table using the following parameters:
  • stemmer = 'none'
  • stopwords = 'none'
  • strip_accents = 0
  • lower = 0
  • overwrite = 1
This matches the preprocessing applied during sparse indexing, where Pyserini’s Lucene Analyzer has already handled stemming and stopword removal.
table_name
string
default:"corpus"
Name of the sparse table to index.
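The parameters above map onto DuckDB's create_fts_index PRAGMA. A sketch of the approximate statement behind the call, assuming the document-ID column is named id (this is an illustration, not the exact SQL QuackIR issues):

```python
# Approximate DuckDB PRAGMA behind fts_index("corpus").
pragma = (
    "PRAGMA create_fts_index('corpus', 'id', 'contents', "
    "stemmer = 'none', stopwords = 'none', "
    "strip_accents = 0, lower = 0, overwrite = 1)"
)
print(pragma)
```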

get_index_type

indexer.get_index_type(table_name)
Inspects the table’s column names to determine its index type.
table_name
string
required
Table to inspect.
return
IndexType
IndexType.SPARSE if the table has a contents column; IndexType.DENSE if it has an embedding column.
Raises ValueError if neither column is found.
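The detection heuristic can be sketched as a column-name check (illustrative only; the real method queries the table's column names from DuckDB rather than taking them as an argument):

```python
def detect_index_type(columns):
    # Sketch of get_index_type's heuristic over a table's column names.
    if "contents" in columns:
        return "SPARSE"
    if "embedding" in columns:
        return "DENSE"
    raise ValueError("table has neither a 'contents' nor an 'embedding' column")

print(detect_index_type(["id", "contents"]))  # SPARSE
```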

get_num_rows

indexer.get_num_rows(table_name)
Returns the number of rows in the given table.
table_name
string
required
Table to count.
return
number
Row count as an integer.

close

indexer.close()
Closes the underlying DuckDB connection. Always call this when done indexing to flush pending writes.

Examples

Sparse (BM25) indexing

from quackir import IndexType
from quackir.index import DuckDBIndexer

indexer = DuckDBIndexer(db_path="sparse.db")

indexer.init_table("corpus", IndexType.SPARSE)
indexer.load_table("corpus", "corpus.jsonl")  # auto-tokenizes contents
indexer.fts_index("corpus")

print(indexer.get_num_rows("corpus"), "documents indexed")
indexer.close()

Dense (vector) indexing from Parquet

from quackir import IndexType
from quackir.index import DuckDBIndexer

indexer = DuckDBIndexer(db_path="dense.db")

indexer.init_table("dense_corpus", IndexType.DENSE, embedding_dim=768)
indexer.load_table("dense_corpus", "embeddings.parquet")

print(indexer.get_num_rows("dense_corpus"), "vectors indexed")
indexer.close()

Pre-tokenized JSONL

from quackir import IndexType
from quackir.index import DuckDBIndexer

indexer = DuckDBIndexer(db_path="pretok.db")
indexer.init_table("corpus", IndexType.SPARSE)

# Data is already tokenized — skip the tokenization step
indexer.load_table("corpus", "pretokenized.jsonl", pretokenized=True)
indexer.fts_index("corpus")
indexer.close()
