Documentation Index
Fetch the complete documentation index at: https://mintlify.com/castorini/quackir/llms.txt
Use this file to discover all available pages before exploring further.
The quackir.analysis module wraps Pyserini's default Lucene analyzer to tokenize text for sparse indexing and retrieval. The analyzer lowercases input text, removes English stopwords, and applies Porter stemming via get_lucene_analyzer() from Pyserini. The CLI tool also handles format conversion: it reads JSONL or TSV files, extracts the relevant text fields, tokenizes them, and writes output in the {"id": ..., "contents": ...} JSONL format that quackir.index expects for sparse indexes.
Tokenization is applied automatically during sparse indexing (quackir.index) and sparse search (quackir.search) unless the --pretokenized flag is set. You only need to run quackir.analysis directly if you want to pre-process data in a separate step, inspect the tokenized output, or reuse tokenized files across multiple indexing runs.
Python API
The tokenize function takes a single string argument, the text to tokenize, and returns the token sequence produced by Pyserini's default Lucene analyzer, joined with single spaces.
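The analyzer itself lives in Pyserini, but its three steps (lowercasing, English stopword removal, Porter stemming) can be sketched in plain Python. The stopword set and the suffix-stripping stemmer below are simplified stand-ins for illustration, not Pyserini's or Lucene's actual implementations:

```python
import re

# Tiny stand-in for Lucene's English stopword list (assumption, not the real list).
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by",
             "for", "if", "in", "into", "is", "it", "no", "not", "of",
             "on", "or", "such", "that", "the", "their", "then", "there",
             "these", "they", "this", "to", "was", "will", "with"}

def naive_stem(token: str) -> str:
    """Crude suffix stripping; real Porter stemming has many more rules."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tokenize_sketch(text: str) -> str:
    """Mimic tokenize(): lowercase, drop stopwords, stem, join with spaces."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(naive_stem(t) for t in tokens if t not in STOPWORDS)
```

For example, `tokenize_sketch("The quick brown foxes jumped")` yields `"quick brown fox jump"`; the real analyzer produces output of the same shape, though individual stems may differ.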
CLI usage
Required arguments
- Input path: path to an input file or directory. Accepted file formats: .jsonl and .tsv (gzip-compressed variants such as .jsonl.gz and .tsv.gz are also supported). When a directory is given, every file containing .jsonl or .tsv in its name is processed; other files and subdirectories are skipped. A progress message is printed after each file.
- Output path: path to the output JSONL file. Each line is written as {"id": "<id>", "contents": "<tokenized text>"}. This format is exactly what quackir.index expects for sparse indexes and what quackir.search expects for sparse retrieval queries.
Input field extraction rules
The field used as text input depends on the file type and the fields present.
- JSONL
- TSV
For JSONL input, the first field of each JSON object is taken as the identifier. The text to tokenize is selected using the following priority:
- If both title and text fields are present, their values are concatenated (title + " " + text) and tokenized. All other fields are ignored.
- If a contents field is present, its value is tokenized. All other fields are ignored.
- Otherwise, the values of all fields except the first (the id field) are concatenated and tokenized.
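The JSONL priority rules above amount to a small selector over each parsed record. A pure-Python sketch, where the function name is illustrative and dict insertion order stands in for JSON field order:

```python
def extract_id_and_text(record: dict) -> tuple:
    """Apply the JSONL field-selection priority to one parsed record."""
    fields = list(record)              # dict preserves JSON field order
    doc_id = str(record[fields[0]])    # first field is the identifier
    if "title" in record and "text" in record:
        # Highest priority: title + " " + text, all other fields ignored.
        text = f'{record["title"]} {record["text"]}'
    elif "contents" in record:
        # Next priority: the contents field alone.
        text = str(record["contents"])
    else:
        # Fallback: concatenate every field except the id field.
        text = " ".join(str(record[f]) for f in fields[1:])
    return doc_id, text
```

So a record like {"id": "d1", "title": "Ducks", "text": "Ducks quack."} selects the title + text pair, while a record with only a contents field uses that field directly.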
Output format
Every line of the output file is a JSON object with exactly two fields, id and contents. This output format is exactly what quackir.index expects for sparse indexing. Pass the output file to --input of quackir.index --index-type sparse --pretokenized to skip re-tokenization during indexing.
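Producing this output format needs only the standard json module. A minimal sketch of the writing step, where the function name and the sample records are illustrative:

```python
import json

def write_tokenized_jsonl(records, path):
    """Write (id, tokenized_text) pairs as the JSONL lines quackir.index expects.

    Each output line has exactly two fields: "id" and "contents".
    """
    with open(path, "w", encoding="utf-8") as out:
        for doc_id, contents in records:
            out.write(json.dumps({"id": doc_id, "contents": contents}) + "\n")
```

A file written this way can then be passed to quackir.index with --index-type sparse --pretokenized so the already-tokenized contents are indexed as-is.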