DuckDBIndexer creates and populates BM25 (sparse) or vector (dense) indexes stored in a DuckDB database file. It implements the abstract Indexer base class and supports loading data from both JSONL and Parquet files.
Constructor
Path to the DuckDB database file.
Methods
init_table
Creates a table with one of the following schemas, depending on the index type:
- Sparse schema: (id VARCHAR, contents VARCHAR)
- Dense schema: (id VARCHAR, embedding DOUBLE[embedding_dim])
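The two schemas can be sketched as plain SQL strings. This is a minimal illustration, not QuackIR's implementation: `build_create_table_sql` is a hypothetical helper, and the `IndexType` enum below is a local stand-in for the library's own class.

```python
from enum import Enum
from typing import Optional


class IndexType(Enum):
    """Local stand-in for QuackIR's IndexType, defined only for this sketch."""
    SPARSE = "sparse"
    DENSE = "dense"


def build_create_table_sql(
    table_name: str,
    index_type: IndexType,
    embedding_dim: Optional[int] = None,
) -> str:
    """Return a CREATE TABLE statement matching the documented schemas (hypothetical helper)."""
    if index_type is IndexType.SPARSE:
        columns = "id VARCHAR, contents VARCHAR"
    else:
        if embedding_dim is None:
            # Assumption: a dense table cannot be declared without a dimension.
            raise ValueError("embedding_dim is required for a dense table")
        # DOUBLE[n] declares a fixed-length array of n doubles in DuckDB.
        columns = f"id VARCHAR, embedding DOUBLE[{embedding_dim}]"
    return f"CREATE TABLE {table_name} ({columns})"


print(build_create_table_sql("corpus", IndexType.SPARSE))
print(build_create_table_sql("corpus_dense", IndexType.DENSE, embedding_dim=4))
```

The dimension is part of the column type, which is why it only matters for dense tables.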
Name of the table to create.
IndexType.SPARSE or IndexType.DENSE.
Dimension of the embedding vectors. Only used when index_type is IndexType.DENSE.
load_table
Dispatches to load_jsonl_table or load_parquet_table based on the file extension. If index_type is None, it is detected automatically via get_index_type.
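Extension-based dispatch of this kind might look like the sketch below. `pick_loader` is a hypothetical helper that only returns the name of the method that would be called; how the real method handles other extensions is not documented here, so rejecting them is an assumption.

```python
from pathlib import Path


def pick_loader(file_path: str) -> str:
    """Return the name of the loader for a source file (hypothetical helper)."""
    suffix = Path(file_path).suffix
    if suffix == ".jsonl":
        return "load_jsonl_table"
    if suffix == ".parquet":
        return "load_parquet_table"
    # Assumption: any other extension is rejected.
    raise ValueError(f"Unsupported file extension: {suffix!r}")


print(pick_loader("corpus.jsonl"))
print(pick_loader("embeddings.parquet"))
```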
Parquet loading is currently restricted to IndexType.DENSE. Passing a .parquet file with IndexType.SPARSE raises a ValueError.
Target table name.
Path to the .jsonl or .parquet source file.
Override the index type. When None, get_index_type(table_name) is called to detect it.
When False, contents values are tokenized with Pyserini’s Lucene Analyzer before insertion. Set True if your data is already tokenized.
load_jsonl_table
Loads a JSONL file in which each record has an id field and either a contents field (sparse) or a vector field (dense).
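As an illustration of the expected record shape, the sketch below builds one sparse and one dense JSONL line and reads the relevant fields back. The record values and the `parse_record` helper are made up for this example and are not part of QuackIR.

```python
import json

# Hypothetical records: one for a sparse table, one for a dense table.
sparse_line = json.dumps({"id": "doc1", "contents": "duckdb is an in-process database"})
dense_line = json.dumps({"id": "doc1", "vector": [0.1, 0.2, 0.3]})


def parse_record(line: str, dense: bool) -> tuple:
    """Return (id, payload) from one JSONL line; payload is the vector or the contents."""
    record = json.loads(line)
    key = "vector" if dense else "contents"
    return record["id"], record[key]


print(parse_record(sparse_line, dense=False))
print(parse_record(dense_line, dense=True))
```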
Target table name.
Path to the .jsonl file.
IndexType.SPARSE or IndexType.DENSE.
Skip tokenization when True. Applies to sparse indexing only.
load_parquet_table
Reads the file with DuckDB's read_parquet function and inserts the first two columns as id and embedding. Only used for dense indexes.
Target table name.
Path to the .parquet file.
Must be IndexType.DENSE.
Unused for Parquet loading; present for interface consistency.
fts_index
Creates a full-text search (FTS) index on the contents column of the specified table using the following parameters:
- stemmer = 'none'
- stopwords = 'none'
- strip_accents = 0
- lower = 0
- overwrite = 1
Name of the sparse table to index.
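With DuckDB's fts extension, these parameter values correspond to a `create_fts_index` PRAGMA roughly like the one built below. This is a sketch under assumptions: `build_fts_pragma` is a hypothetical helper, and using the id column as the document identifier is inferred from the sparse schema rather than stated by the library.

```python
def build_fts_pragma(table_name: str, id_column: str = "id", text_column: str = "contents") -> str:
    """Build a create_fts_index PRAGMA with the documented parameter values (hypothetical helper)."""
    return (
        f"PRAGMA create_fts_index('{table_name}', '{id_column}', '{text_column}', "
        "stemmer = 'none', stopwords = 'none', strip_accents = 0, lower = 0, overwrite = 1)"
    )


# The resulting string could be passed to a DuckDB connection's execute() method.
print(build_fts_pragma("corpus"))
```

Disabling stemming, stopword removal, accent stripping, and lowercasing leaves the stored text untouched, which fits a pipeline where tokenization happens before insertion.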
get_index_type
Table to inspect.
IndexType.SPARSE if the table has a contents column; IndexType.DENSE if it has an embedding column.
Raises ValueError if neither column is found.
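The detection rule can be sketched as a check on column names. `detect_index_type` is a hypothetical helper that takes the column names directly, the `IndexType` enum is a local stand-in, and checking contents before embedding is an assumption for tables that happen to have both.

```python
from enum import Enum
from typing import Iterable


class IndexType(Enum):
    """Local stand-in for QuackIR's IndexType, defined only for this sketch."""
    SPARSE = "sparse"
    DENSE = "dense"


def detect_index_type(column_names: Iterable[str]) -> IndexType:
    """Infer the index type from a table's column names (hypothetical helper)."""
    names = set(column_names)
    if "contents" in names:
        return IndexType.SPARSE
    if "embedding" in names:
        return IndexType.DENSE
    raise ValueError("Table has neither a contents nor an embedding column")


print(detect_index_type(["id", "contents"]))
print(detect_index_type(["id", "embedding"]))
```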
get_num_rows
Table to count.
Row count as an integer.