
Indexing in QuackIR reads a corpus from JSONL or Parquet files, optionally tokenizes document text, and writes a structured table into a database that can later be queried by quackir.search. For sparse indexes the table stores id and contents columns; for dense indexes it stores id and embedding columns. Once the table is populated, fts_index builds the full-text or vector index on top.

Input data format

Each line must be a JSON object with id and contents fields:
{"id": "doc1", "contents": "A lobster roll is a seafood dish native to New England."}
{"id": "doc2", "contents": "DuckDB is an in-process analytical database management system."}
Unless --pretokenized is set, QuackIR tokenizes contents automatically using Pyserini’s default Lucene analyzer before writing to the database.
Dashes in table names are replaced with underscores automatically. For example, my-corpus becomes my_corpus. The same normalization applies during search.
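The documented normalization amounts to a simple character substitution; a one-line sketch of that behavior (a helper written for illustration, not QuackIR's actual code):

```python
def normalize_table_name(name: str) -> str:
    # Mirrors the documented behavior: dashes in table names
    # become underscores, both at index and search time.
    return name.replace("-", "_")

print(normalize_table_name("my-corpus"))  # my_corpus
```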

Python API

DuckDBIndexer supports both sparse and dense indexes. Use IndexType.SPARSE for BM25 full-text retrieval.
```python
from quackir.index import DuckDBIndexer
from quackir import IndexType

indexer = DuckDBIndexer(db_path="database.db")
indexer.init_table("corpus", IndexType.SPARSE)
indexer.load_table("corpus", "corpus.jsonl", IndexType.SPARSE)
indexer.fts_index("corpus")
indexer.close()
```

init_table drops any existing table with that name and creates a fresh (id VARCHAR, contents VARCHAR) schema. load_table tokenizes each document unless pretokenized=True is passed. fts_index calls DuckDB's create_fts_index pragma with stemming and stopwords disabled, matching the behavior of Pyserini's default analyzer.
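The drop-and-recreate behavior described for init_table can be illustrated with the standard library's sqlite3 module. This is a behavioral sketch, not QuackIR's implementation, which targets whichever backend you configure:

```python
import sqlite3

def init_table(conn: sqlite3.Connection, name: str) -> None:
    # Mirror the documented behavior: drop any existing table with this
    # name, then create a fresh (id VARCHAR, contents VARCHAR) schema.
    name = name.replace("-", "_")  # dashes are normalized to underscores
    conn.execute(f"DROP TABLE IF EXISTS {name}")
    conn.execute(f"CREATE TABLE {name} (id VARCHAR, contents VARCHAR)")

conn = sqlite3.connect(":memory:")
init_table(conn, "my-corpus")
conn.execute("INSERT INTO my_corpus VALUES ('doc1', 'some text')")
init_table(conn, "my-corpus")  # re-running drops the old rows
print(conn.execute("SELECT COUNT(*) FROM my_corpus").fetchone()[0])  # 0
```

Note that because the table is always dropped, re-indexing into an existing table name silently discards previous contents.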

CLI usage

```shell
python -m quackir.index \
  --db-type <duckdb|sqlite|postgres> \
  --input <path> \
  --index-type <sparse|dense> \
  [options]
```

Required arguments

--db-type (string, required)
Database backend to use. Accepted values: duckdb, sqlite, postgres.

--input (string, required)
Path to a JSONL or Parquet file, or a directory. When a directory is provided, every file ending in .jsonl or .parquet is processed; subdirectories and other files are skipped. Progress is printed after each file.

--index-type (string, required)
Type of index to build. sparse creates an id/contents table and runs fts_index. dense creates an id/embedding table.
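The directory handling described for --input above can be sketched with pathlib. This is a behavioral sketch of the stated rules, not QuackIR's actual code:

```python
from pathlib import Path

def collect_input_files(path: str) -> list[Path]:
    # A file path is used as-is; a directory yields every top-level
    # .jsonl/.parquet file, skipping subdirectories and other files.
    p = Path(path)
    if p.is_file():
        return [p]
    return sorted(
        child for child in p.iterdir()
        if child.is_file() and child.suffix in {".jsonl", ".parquet"}
    )
```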

Database connection arguments

--db-path (string, default: "database.db")
Path to the database file. Used by DuckDB and SQLite; ignored for PostgreSQL.

--db-name (string, default: "quackir")
PostgreSQL database name. Ignored for DuckDB and SQLite.

--db-user (string, default: "postgres")
PostgreSQL username. Ignored for DuckDB and SQLite.

Optional arguments

--index (string, default: "corpus")
Name of the table to create. Any existing table with this name is dropped, and dashes in the name are replaced with underscores.

--pretokenized (boolean, default: false)
When set, skips tokenization during indexing. Use this when your corpus has already been processed by quackir.analysis. Has no effect on dense indexes.

--dimension (integer, default: 768)
Embedding vector dimension. Applies only to dense indexes and must match the dimensionality of your embeddings.

Examples

```shell
# Sparse BM25 index with DuckDB
python -m quackir.index \
  --db-type duckdb \
  --db-path database.db \
  --input corpus.jsonl \
  --index-type sparse \
  --index corpus

# Dense vector index with DuckDB (1024-dimensional embeddings)
python -m quackir.index \
  --db-type duckdb \
  --db-path database.db \
  --input embeddings.parquet \
  --index-type dense \
  --index corpus_dense \
  --dimension 1024

# Sparse index with PostgreSQL using pretokenized input
python -m quackir.index \
  --db-type postgres \
  --db-name quackir \
  --db-user postgres \
  --input tokenized_corpus.jsonl \
  --index-type sparse \
  --pretokenized

# Sparse index with SQLite
python -m quackir.index \
  --db-type sqlite \
  --db-path sqlite.db \
  --input corpus.jsonl \
  --index-type sparse
```

After indexing, run quackir.search against the same database file and table name. See searching indexes for details.
