Indexing in QuackIR reads a corpus from JSONL or Parquet files, optionally tokenizes document text, and writes a structured table into a database that can later be queried by quackir.search. For sparse indexes the table stores id and contents columns; for dense indexes it stores id and embedding columns. Once the table is populated, fts_index builds the full-text or vector index on top.
Input data format
- Sparse (JSONL)
- Dense (JSONL)
- Dense (Parquet)
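As a small illustration of the sparse JSONL layout this section describes (one JSON object per line, each with id and contents fields), the sketch below writes a two-document corpus and checks that every line carries both fields. It uses only the Python standard library. The sample documents and the helpers write_jsonl and check_sparse_jsonl are hypothetical and are not part of QuackIR.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample documents; only the "id" and "contents" keys
# come from the sparse format described in this section.
docs = [
    {"id": "doc1", "contents": "QuackIR builds retrieval indexes inside relational databases."},
    {"id": "doc2", "contents": "DuckDB ships a full-text search extension."},
]


def write_jsonl(path, records):
    """Write one JSON object per line (hypothetical helper, not a QuackIR function)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def check_sparse_jsonl(path):
    """Return the number of lines; raise ValueError if a line lacks id or contents."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            missing = {"id", "contents"} - set(record)
            if missing:
                raise ValueError(f"line {line_no} is missing {sorted(missing)}")
            count += 1
    return count


# Write the corpus to a temporary directory and validate it.
with tempfile.TemporaryDirectory() as tmp:
    corpus_path = Path(tmp) / "corpus.jsonl"
    write_jsonl(corpus_path, docs)
    n_docs = check_sparse_jsonl(corpus_path)

print(n_docs)
```

Running a check like this before indexing can surface malformed lines early, rather than partway through a long load.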
Each line must be a JSON object with id and contents fields.
Unless --pretokenized is set, QuackIR tokenizes contents automatically using Pyserini’s default Lucene analyzer before writing to the database.
Dashes in table names are replaced with underscores automatically. For example, my-corpus becomes my_corpus. The same normalization applies during search.
Python API
- DuckDB (sparse)
- DuckDB (dense)
- PostgreSQL
- SQLite
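The table-preparation behavior described in this section (dash-to-underscore normalization of table names, dropping any existing table, then creating a fresh id/contents schema and loading rows) can be pictured with a standard-library analogy. The sketch below uses Python's built-in sqlite3 module rather than QuackIR itself; reset_table, load_rows, and normalize_table_name are hypothetical helpers, and QuackIR's own implementation may differ.

```python
import sqlite3


def normalize_table_name(name):
    """Replace dashes with underscores (hypothetical helper mirroring the documented rule)."""
    return name.replace("-", "_")


def reset_table(conn, table_name):
    """Drop any existing table with this name and create a fresh (id, contents) table."""
    table = normalize_table_name(table_name)
    # Table names cannot be bound as SQL parameters, so validate before formatting.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} (id VARCHAR, contents VARCHAR)")
    return table


def load_rows(conn, table, records):
    """Insert (id, contents) pairs into an already-created table."""
    conn.executemany(
        f"INSERT INTO {table} (id, contents) VALUES (?, ?)",
        [(r["id"], r["contents"]) for r in records],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")

# First load: two rows go into my_corpus.
table = reset_table(conn, "my-corpus")
load_rows(conn, table, [{"id": "d1", "contents": "first"}, {"id": "d2", "contents": "second"}])

# Re-initializing drops the earlier rows, so only the new row remains.
table = reset_table(conn, "my-corpus")
load_rows(conn, table, [{"id": "d3", "contents": "third"}])

row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
conn.close()

print(table, row_count)
```

The practical consequence of the drop-and-recreate step is that indexing into an existing table name replaces its contents instead of appending to them.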
DuckDBIndexer supports both sparse and dense indexes. Use IndexType.SPARSE for BM25 full-text retrieval.
init_table drops any existing table with that name and creates a fresh (id VARCHAR, contents VARCHAR) schema. load_table tokenizes each document unless pretokenized=True is passed. fts_index calls DuckDB’s create_fts_index pragma with stemming and stopwords disabled, matching the behavior of Pyserini’s default analyzer.
CLI usage
Required arguments
Database backend to use. Accepted values: duckdb, sqlite, postgres.
Path to a JSONL or Parquet file, or a directory. When a directory is provided, every file ending in .jsonl or .parquet is processed; subdirectories and other files are skipped. Progress is printed after each file.
Type of index to build. sparse creates an id/contents table and runs fts_index. dense creates an id/embedding table.
Database connection arguments
Path to the database file. Used by DuckDB and SQLite. Ignored for PostgreSQL.
PostgreSQL database name. Ignored for DuckDB and SQLite.
PostgreSQL username. Ignored for DuckDB and SQLite.
Optional arguments
Name of the table to create. Any existing table with this name is dropped. Dashes are replaced with underscores.
When --pretokenized is set, tokenization is skipped during indexing. Use this when your corpus has already been processed by quackir.analysis. Has no effect for dense indexes.
Embedding vector dimension. Only applies to dense indexes. Must match the dimensionality of your embeddings.
Examples
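As an illustration of the directory handling described under the required arguments, the sketch below lists the files that would be picked up when a directory is given as input: files ending in .jsonl or .parquet are kept, while subdirectories and other files are skipped. It uses only the standard library. discover_corpus_files is a hypothetical helper, not part of QuackIR, and both the sorted order and the handling of a single-file path are assumptions.

```python
import tempfile
from pathlib import Path


def discover_corpus_files(input_path):
    """Return the corpus files under input_path (hypothetical helper, not QuackIR code)."""
    path = Path(input_path)
    if path.is_file():
        # Assumption: a single file is passed through as-is.
        return [path]
    # Only direct children are considered; subdirectories are not descended into.
    return sorted(
        entry
        for entry in path.iterdir()
        if entry.is_file() and entry.name.endswith((".jsonl", ".parquet"))
    )


# Build a throwaway directory with a mix of entries to show what is kept.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.jsonl").write_text('{"id": "1", "contents": "x"}\n', encoding="utf-8")
    (root / "b.parquet").write_bytes(b"")   # placeholder, not a valid Parquet file
    (root / "notes.txt").write_text("ignored", encoding="utf-8")
    (root / "sub").mkdir()
    (root / "sub" / "c.jsonl").write_text("{}\n", encoding="utf-8")

    found_names = [p.name for p in discover_corpus_files(root)]

print(found_names)
```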
After indexing, run quackir.search against the same database file and table name. See searching indexes for details.