SQLiteIndexer creates and populates BM25 (sparse) indexes stored in a SQLite database file. It implements the abstract Indexer base class and loads data from JSONL files. Dense indexing is not supported.
from quackir.index import SQLiteIndexer
SQLiteIndexer only supports IndexType.SPARSE. Attempting to create a dense table or load non-sparse data raises a ValueError.

Constructor

SQLiteIndexer(db_path="sqlite.db")
Opens a SQLite connection to the specified file. The file is created if it does not already exist.
db_path
string
default:"sqlite.db"
Path to the SQLite database file.

Methods

init_table

indexer.init_table(table_name, index_type, embedding_dim=768)
Drops any existing table with the given name and creates a new (id TEXT PRIMARY KEY, contents TEXT) table. Raises ValueError if index_type is not IndexType.SPARSE.
table_name
string
required
Name of the table to create.
index_type
IndexType
required
Must be IndexType.SPARSE. Any other value raises ValueError.
embedding_dim
number
default:"768"
Accepted for consistency with the Indexer interface but ignored, since SQLiteIndexer never creates dense tables.
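The drop-and-recreate behaviour of init_table can be sketched with Python's standard sqlite3 module. This is an illustration only, not QuackIR's source: the helper name init_sparse_table is hypothetical, and the real method may differ in details such as identifier quoting or when it commits.

```python
import sqlite3


def init_sparse_table(conn: sqlite3.Connection, table_name: str) -> None:
    """Hypothetical helper mirroring the documented init_table behaviour."""
    # Identifiers cannot be bound as SQL parameters, so the name is quoted inline.
    quoted = '"' + table_name.replace('"', '""') + '"'
    conn.execute(f"DROP TABLE IF EXISTS {quoted}")
    conn.execute(f"CREATE TABLE {quoted} (id TEXT PRIMARY KEY, contents TEXT)")
    conn.commit()


conn = sqlite3.connect(":memory:")
init_sparse_table(conn, "corpus")
conn.execute("INSERT INTO corpus VALUES (?, ?)", ("d1", "hello world"))

# Re-initialising drops the old table, so previously inserted rows are lost.
init_sparse_table(conn, "corpus")
rows_after_reinit = conn.execute("SELECT COUNT(*) FROM corpus").fetchone()[0]
columns = [row[1] for row in conn.execute("PRAGMA table_info(corpus)")]
conn.close()
```

Because the table is dropped first, calling init_table on a populated table discards its rows. Call it once before loading, not between loads.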

load_table

indexer.load_table(table_name, file_path, index_type=None, pretokenized=False)
Loads file_path into table_name, dispatching on the file extension: .jsonl files are handled by load_jsonl_table. Parquet loading is not implemented.
table_name
string
required
Target table name.
file_path
string
required
Path to a .jsonl source file.
index_type
IndexType
default:"None"
Overrides the index type. When None, the type is inferred by calling get_index_type(table_name).
pretokenized
boolean
default:"false"
When False, contents values are tokenized with Pyserini’s Lucene Analyzer before insertion.

load_jsonl_table

indexer.load_jsonl_table(table_name, file_path, index_type, pretokenized=False)
Reads a JSONL file and inserts (id, contents) rows. Each line must be a JSON object with id and contents fields. Raises ValueError if index_type is not IndexType.SPARSE.
table_name
string
required
Target table name.
file_path
string
required
Path to the .jsonl file.
index_type
IndexType
required
Must be IndexType.SPARSE.
pretokenized
boolean
default:"false"
When True, tokenization is skipped. When False, contents values are tokenized before insertion, as described for load_table.
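The expected input format (one JSON object per line, with id and contents fields) and the row insertion can be sketched as follows. This is a minimal stand-in built on the standard library, not QuackIR's implementation: load_jsonl_rows is a hypothetical name, and the sketch deliberately omits the Pyserini tokenization step, so it corresponds most closely to pretokenized=True.

```python
import json
import os
import sqlite3
import tempfile


def load_jsonl_rows(conn, table_name, file_path):
    """Hypothetical loader: inserts (id, contents) rows without tokenizing."""
    rows = []
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines in the file
            record = json.loads(line)
            rows.append((str(record["id"]), record["contents"]))
    conn.executemany(
        f'INSERT INTO "{table_name}" (id, contents) VALUES (?, ?)', rows
    )
    conn.commit()
    return len(rows)


# Write a two-document sample file in the expected JSONL shape.
docs = [
    {"id": "d1", "contents": "first document"},
    {"id": "d2", "contents": "second document"},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for doc in docs:
        tmp.write(json.dumps(doc) + "\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (id TEXT PRIMARY KEY, contents TEXT)")
inserted = load_jsonl_rows(conn, "corpus", tmp.name)
stored = conn.execute("SELECT id, contents FROM corpus ORDER BY id").fetchall()
conn.close()
os.remove(tmp.name)
```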

fts_index

indexer.fts_index(table_name="corpus")
Creates an FTS5 virtual table named fts_{table_name} backed by the base table, using the porter tokenizer. The virtual table is populated immediately from the base table.
CREATE VIRTUAL TABLE fts_corpus USING fts5(
    id, contents,
    content='corpus', content_rowid='rowid', tokenize = 'porter'
)
fts_index applies the porter stemmer at the SQLite FTS5 level. QuackIR’s tokenization pipeline (Pyserini’s Lucene Analyzer) already stems tokens before they are stored, so the FTS5 stemmer runs as a second pass over text that has already been stemmed. For consistent retrieval, keep the default pretokenized=False so that index and query terms go through the same pipeline.
table_name
string
default:"corpus"
Name of the base table to build the FTS5 virtual table from.
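To see what an external-content FTS5 table with the porter tokenizer does in practice, the following sketch builds one directly with Python's sqlite3 module and runs a BM25-ranked query. It assumes the underlying SQLite build includes FTS5, and it populates the index with FTS5's 'rebuild' command. How QuackIR itself populates the virtual table is not shown in this page, so treat that step as an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (id TEXT PRIMARY KEY, contents TEXT)")
conn.executemany(
    "INSERT INTO corpus (id, contents) VALUES (?, ?)",
    [("d1", "the dog runs fast"), ("d2", "cats sleep all day")],
)

# External-content FTS5 table: the text lives in corpus, the index in fts_corpus.
conn.execute(
    "CREATE VIRTUAL TABLE fts_corpus USING fts5("
    "id, contents, content='corpus', content_rowid='rowid', tokenize='porter')"
)
# Assumption: populate the index from the content table via 'rebuild'.
conn.execute("INSERT INTO fts_corpus(fts_corpus) VALUES('rebuild')")

# The porter stemmer reduces both "running" and "runs" to the same stem,
# so the query matches d1 even though the surface forms differ.
rows = conn.execute(
    "SELECT id FROM fts_corpus WHERE fts_corpus MATCH ? ORDER BY bm25(fts_corpus)",
    ("running",),
).fetchall()
matched_ids = [row[0] for row in rows]
conn.close()
```

FTS5's bm25() returns lower (more negative) values for better matches, which is why the query sorts in ascending order.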

get_index_type

indexer.get_index_type(table_name)
Inspects column names. Returns IndexType.SPARSE if a contents column is present, otherwise raises ValueError.
table_name
string
required
Table to inspect.
return
IndexType
Always IndexType.SPARSE for valid SQLite tables.
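The column inspection described for get_index_type can be approximated with SQLite's PRAGMA table_info. The sketch below is hypothetical: infer_index_type is not a QuackIR function, and it returns the string "sparse" in place of IndexType.SPARSE so that it runs without QuackIR installed.

```python
import sqlite3


def infer_index_type(conn, table_name):
    """Hypothetical stand-in: returns "sparse" instead of IndexType.SPARSE."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    columns = [
        row[1] for row in conn.execute(f'PRAGMA table_info("{table_name}")')
    ]
    if "contents" in columns:
        return "sparse"
    raise ValueError(f"Table {table_name!r} has no contents column")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (id TEXT PRIMARY KEY, contents TEXT)")
conn.execute("CREATE TABLE other (id TEXT PRIMARY KEY, payload BLOB)")

detected = infer_index_type(conn, "corpus")

# A table without a contents column is rejected.
try:
    infer_index_type(conn, "other")
    rejected = False
except ValueError:
    rejected = True
conn.close()
```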

get_num_rows

indexer.get_num_rows(table_name)
Returns the number of rows in the given table.
table_name
string
required
Table to count.
return
number
Row count as an integer.

close

indexer.close()
Closes the underlying SQLite connection.

Example

from quackir import IndexType
from quackir.index import SQLiteIndexer

indexer = SQLiteIndexer(db_path="sparse.db")

indexer.init_table("corpus", IndexType.SPARSE)
indexer.load_table("corpus", "corpus.jsonl")  # auto-tokenizes contents
indexer.fts_index("corpus")

print(indexer.get_num_rows("corpus"), "documents indexed")
indexer.close()

Pre-tokenized JSONL

from quackir import IndexType
from quackir.index import SQLiteIndexer

indexer = SQLiteIndexer(db_path="pretok.db")
indexer.init_table("corpus", IndexType.SPARSE)
indexer.load_table("corpus", "pretokenized.jsonl", pretokenized=True)
indexer.fts_index("corpus")
indexer.close()
