DuckDBIndexer creates and populates BM25 (sparse) or vector (dense) indexes stored in a DuckDB database file. It implements the abstract Indexer base class and supports loading data from both JSONL and Parquet files.
from quackir.index import DuckDBIndexer

Constructor

DuckDBIndexer(db_path="duck.db")
Opens a DuckDB connection to the specified file. The file is created if it does not already exist.
db_path
string
default:"duck.db"
Path to the DuckDB database file.

Methods

init_table

indexer.init_table(table_name, index_type, embedding_dim=768)
Drops any existing table with the given name and creates a new one with the schema appropriate for the requested index type.
  • Sparse schema: (id VARCHAR, contents VARCHAR)
  • Dense schema: (id VARCHAR, embedding DOUBLE[embedding_dim])
table_name
string
required
Name of the table to create.
index_type
IndexType
required
IndexType.SPARSE or IndexType.DENSE.
embedding_dim
number
default:"768"
Dimension of the embedding vectors. Only used when index_type is IndexType.DENSE.
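The two schemas above correspond roughly to the following CREATE TABLE statements. This is an illustrative sketch only (the helper name `schema_sql` is hypothetical, and the exact SQL QuackIR emits may differ):

```python
def schema_sql(table_name: str, index_type: str, embedding_dim: int = 768) -> str:
    """Illustrative sketch of the CREATE TABLE statement for each index type."""
    if index_type == "sparse":
        return f"CREATE TABLE {table_name} (id VARCHAR, contents VARCHAR)"
    if index_type == "dense":
        return f"CREATE TABLE {table_name} (id VARCHAR, embedding DOUBLE[{embedding_dim}])"
    raise ValueError(f"unknown index type: {index_type}")

print(schema_sql("corpus", "sparse"))
# CREATE TABLE corpus (id VARCHAR, contents VARCHAR)
```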

load_table

indexer.load_table(table_name, file_path, index_type=None, pretokenized=False)
Dispatches to load_jsonl_table or load_parquet_table based on the file extension. If index_type is None, it is detected automatically via get_index_type.
Parquet loading is currently restricted to IndexType.DENSE. Passing a .parquet file with IndexType.SPARSE raises a ValueError.
table_name
string
required
Target table name.
file_path
string
required
Path to the .jsonl or .parquet source file.
index_type
IndexType
default:"None"
Override the index type. When None, get_index_type(table_name) is called to detect it.
pretokenized
boolean
default:"false"
When False, contents values are tokenized with Pyserini’s Lucene Analyzer before insertion. Set True if your data is already tokenized.
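The extension-based dispatch described above can be sketched as follows (a simplified illustration; `pick_loader` is a hypothetical name, not part of the QuackIR API):

```python
from pathlib import Path

def pick_loader(file_path: str) -> str:
    # Sketch of load_table's extension-based dispatch.
    suffix = Path(file_path).suffix
    if suffix == ".jsonl":
        return "load_jsonl_table"
    if suffix == ".parquet":
        return "load_parquet_table"
    raise ValueError(f"unsupported file extension: {suffix}")

print(pick_loader("embeddings.parquet"))  # load_parquet_table
```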

load_jsonl_table

indexer.load_jsonl_table(table_name, file_path, index_type, pretokenized=False)
Reads a JSONL file and inserts rows into the table. Each line must be a JSON object with an id field and either a contents field (sparse) or a vector field (dense).
table_name
string
required
Target table name.
file_path
string
required
Path to the .jsonl file.
index_type
IndexType
required
IndexType.SPARSE or IndexType.DENSE.
pretokenized
boolean
default:"false"
Skip tokenization when True. Applies to sparse indexing only.
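For reference, one record per line in the two expected shapes (the field values here are made up):

```python
import json

# Hypothetical JSONL records in the shapes load_jsonl_table expects:
# sparse rows carry a contents field, dense rows carry a vector field.
sparse_line = json.dumps({"id": "doc1", "contents": "the quick brown fox"})
dense_line = json.dumps({"id": "doc1", "vector": [0.12, -0.03, 0.97]})

print(sparse_line)
print(dense_line)
```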

load_parquet_table

indexer.load_parquet_table(table_name, file_path, index_type, pretokenized=False)
Reads a Parquet file using DuckDB’s native read_parquet function and inserts the first two columns as id and embedding. Only used for dense indexes.
table_name
string
required
Target table name.
file_path
string
required
Path to the .parquet file.
index_type
IndexType
required
Must be IndexType.DENSE.
pretokenized
boolean
default:"false"
Unused for Parquet loading; present for interface consistency.

fts_index

indexer.fts_index(table_name="corpus")
Creates a DuckDB FTS index on the contents column of the specified table using the following parameters:
  • stemmer = 'none'
  • stopwords = 'none'
  • strip_accents = 0
  • lower = 0
  • overwrite = 1
This matches the preprocessing applied during sparse indexing, where Pyserini’s Lucene Analyzer has already handled stemming and stopword removal.
table_name
string
default:"corpus"
Name of the sparse table to index.
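The parameters above map onto DuckDB's create_fts_index PRAGMA. A sketch of the approximate statement behind the call, assuming the document-ID column is named id (this is an illustration, not the exact SQL QuackIR issues):

```python
# Approximate DuckDB PRAGMA behind fts_index("corpus").
pragma = (
    "PRAGMA create_fts_index('corpus', 'id', 'contents', "
    "stemmer = 'none', stopwords = 'none', "
    "strip_accents = 0, lower = 0, overwrite = 1)"
)
print(pragma)
```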

get_index_type

indexer.get_index_type(table_name)
Inspects the table’s column names to determine its index type.
table_name
string
required
Table to inspect.
return
IndexType
IndexType.SPARSE if the table has a contents column; IndexType.DENSE if it has an embedding column.
Raises ValueError if neither column is found.
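The detection heuristic can be sketched as a column-name check (illustrative only; the real method queries the table's column names from DuckDB rather than taking them as an argument):

```python
def detect_index_type(columns):
    # Sketch of get_index_type's heuristic over a table's column names.
    if "contents" in columns:
        return "SPARSE"
    if "embedding" in columns:
        return "DENSE"
    raise ValueError("table has neither a 'contents' nor an 'embedding' column")

print(detect_index_type(["id", "contents"]))  # SPARSE
```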

get_num_rows

indexer.get_num_rows(table_name)
Returns the number of rows in the given table.
table_name
string
required
Table to count.
return
number
Row count as an integer.

close

indexer.close()
Closes the underlying DuckDB connection. Always call this when done indexing to flush pending writes.

Examples

Sparse (BM25) indexing

from quackir import IndexType
from quackir.index import DuckDBIndexer

indexer = DuckDBIndexer(db_path="sparse.db")

indexer.init_table("corpus", IndexType.SPARSE)
indexer.load_table("corpus", "corpus.jsonl")  # auto-tokenizes contents
indexer.fts_index("corpus")

print(indexer.get_num_rows("corpus"), "documents indexed")
indexer.close()

Dense (vector) indexing from Parquet

from quackir import IndexType
from quackir.index import DuckDBIndexer

indexer = DuckDBIndexer(db_path="dense.db")

indexer.init_table("dense_corpus", IndexType.DENSE, embedding_dim=768)
indexer.load_table("dense_corpus", "embeddings.parquet")

print(indexer.get_num_rows("dense_corpus"), "vectors indexed")
indexer.close()

Pre-tokenized JSONL

from quackir import IndexType
from quackir.index import DuckDBIndexer

indexer = DuckDBIndexer(db_path="pretok.db")
indexer.init_table("corpus", IndexType.SPARSE)

# Data is already tokenized — skip the tokenization step
indexer.load_table("corpus", "pretokenized.jsonl", pretokenized=True)
indexer.fts_index("corpus")
indexer.close()
