Documentation Index
Fetch the complete documentation index at: https://mintlify.com/castorini/quackir/llms.txt
Use this file to discover all available pages before exploring further.
The quackir.analysis module wraps Pyserini's default Lucene analyzer to tokenize text for sparse indexing and retrieval. The analyzer lowercases input text, removes English stopwords, and applies Porter stemming via get_lucene_analyzer() from Pyserini. The CLI tool also handles format conversion: it reads JSONL or TSV files, extracts the relevant text fields, tokenizes them, and writes output in the {"id": ..., "contents": ...} JSONL format that quackir.index expects for sparse indexes.
Tokenization is applied automatically during sparse indexing (quackir.index) and sparse search (quackir.search) unless the --pretokenized flag is set. You only need to run quackir.analysis directly if you want to pre-process data in a separate step, inspect the tokenized output, or reuse tokenized files across multiple indexing runs.
Python API
The tokenize function takes a single string argument, the text to tokenize, and returns the token sequence produced by Pyserini's default Lucene analyzer, joined with single spaces.
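The analyzer itself lives in Pyserini, but its three steps (lowercasing, English stopword removal, Porter stemming) can be sketched in plain Python. The stopword set and the suffix-stripping stemmer below are simplified stand-ins for illustration, not Pyserini's or Lucene's actual implementations:

```python
import re

# Tiny stand-in for Lucene's English stopword list (assumption, not the real list).
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by",
             "for", "if", "in", "into", "is", "it", "no", "not", "of",
             "on", "or", "such", "that", "the", "their", "then", "there",
             "these", "they", "this", "to", "was", "will", "with"}

def naive_stem(token: str) -> str:
    """Crude suffix stripping; real Porter stemming has many more rules."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tokenize_sketch(text: str) -> str:
    """Mimic tokenize(): lowercase, drop stopwords, stem, join with spaces."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(naive_stem(t) for t in tokens if t not in STOPWORDS)
```

For example, `tokenize_sketch("The quick brown foxes jumped")` yields `"quick brown fox jump"`; the real analyzer produces output of the same shape, though individual stems may differ.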
CLI usage
Required arguments
- Input path: path to an input file or directory. Accepted file formats: .jsonl and .tsv (gzip-compressed variants such as .jsonl.gz and .tsv.gz are also supported). When a directory is given, every file containing .jsonl or .tsv in its name is processed; other files and subdirectories are skipped. A progress message is printed after each file.
- Output path: path to the output JSONL file. Each line is written as {"id": "<id>", "contents": "<tokenized text>"}. This format is exactly what quackir.index expects for sparse indexes and what quackir.search expects for sparse retrieval queries.
Input field extraction rules
The field used as text input depends on the file type and the fields present.
- JSONL
- TSV
For JSONL input, the first field of each JSON object is taken as the identifier. The text to tokenize is selected using the following priority:
- If both title and text fields are present, their values are concatenated (title + " " + text) and tokenized. All other fields are ignored.
- If a contents field is present, its value is tokenized. All other fields are ignored.
- Otherwise, the values of all fields except the first (the id field) are concatenated and tokenized.
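The JSONL priority rules above amount to a small selector over each parsed record. A pure-Python sketch, where the function name is illustrative and dict insertion order stands in for JSON field order:

```python
def extract_id_and_text(record: dict) -> tuple:
    """Apply the JSONL field-selection priority to one parsed record."""
    fields = list(record)              # dict preserves JSON field order
    doc_id = str(record[fields[0]])    # first field is the identifier
    if "title" in record and "text" in record:
        # Highest priority: title + " " + text, all other fields ignored.
        text = f'{record["title"]} {record["text"]}'
    elif "contents" in record:
        # Next priority: the contents field alone.
        text = str(record["contents"])
    else:
        # Fallback: concatenate every field except the id field.
        text = " ".join(str(record[f]) for f in fields[1:])
    return doc_id, text
```

So a record like {"id": "d1", "title": "Ducks", "text": "Ducks quack."} selects the title + text pair, while a record with only a contents field uses that field directly.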
Output format
Every line of the output file is a JSON object with exactly two fields, id and contents. This output format is exactly what quackir.index expects for sparse indexing. Pass the output file to --input of quackir.index --index-type sparse --pretokenized to skip re-tokenization during indexing.
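Producing this output format needs only the standard json module. A minimal sketch of the writing step, where the function name and the sample records are illustrative:

```python
import json

def write_tokenized_jsonl(records, path):
    """Write (id, tokenized_text) pairs as the JSONL lines quackir.index expects.

    Each output line has exactly two fields: "id" and "contents".
    """
    with open(path, "w", encoding="utf-8") as out:
        for doc_id, contents in records:
            out.write(json.dumps({"id": doc_id, "contents": contents}) + "\n")
```

A file written this way can then be passed to quackir.index with --index-type sparse --pretokenized so the already-tokenized contents are indexed as-is.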