`tokenize` is a thin wrapper around Pyserini's default Lucene analyzer. It converts raw text into a space-joined string of tokens, so the same normalization pipeline applies to documents at indexing time and to queries at search time.
Signature
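The signature code block did not survive extraction; below is a plausible reconstruction. The parameter name `text` and the type annotations are assumptions inferred from the surrounding description, not taken from the QuackIR source.

```python
def tokenize(text: str) -> str:
    """Analyze raw text with Pyserini's default Lucene analyzer and
    return the resulting tokens joined by single spaces."""
    ...
```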
Parameters
The raw text string to analyze and tokenize.
Return value
A space-joined string of tokens produced by Pyserini's default Lucene analyzer (the analyzer returned by `get_lucene_analyzer()`). The analyzer applies lowercasing, stopword removal, and Porter stemming.
Implementation
The module-level `analyzer` object is created once at import time:
get_lucene_analyzer() returns Pyserini’s default English analyzer, which performs:
- Lowercasing
- Stopword removal (English stopword list)
- Porter stemming
When it is called automatically
You do not need to call `tokenize` directly in most workflows; QuackIR calls it internally in the following places:
| Location | When |
|---|---|
| `DuckDBIndexer.load_jsonl_table` | `index_type == IndexType.SPARSE` and `pretokenized=False` |
| `SQLiteIndexer.load_jsonl_table` | `index_type == IndexType.SPARSE` and `pretokenized=False` |
| `PostgresIndexer.load_jsonl_table` | `index_type == IndexType.SPARSE` and `pretokenized=False` |
| `Searcher.search` | `method != SearchType.DENSE` and `tokenize_query=True` |
Pass `pretokenized=True` to indexers or `tokenize_query=False` to searchers to skip automatic tokenization when your data is already tokenized.