tokenize

tokenize is a thin wrapper around Pyserini’s default Lucene Analyzer. It converts raw text into a space-joined string of tokens, so documents at indexing time and queries at search time pass through the same normalization pipeline.
from quackir.analysis import tokenize
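
Because both sides share one pipeline, differently inflected forms reduce to the same token. A minimal illustration (expected outputs follow the Porter-stemmed forms shown in the Example section below):

doc_tokens = tokenize("Foxes are jumping")  # expected: 'fox jump'
query_tokens = tokenize("a jumping fox")    # expected: 'jump fox'
# Both sides reduce to the same stemmed vocabulary, so exact term
# matching (e.g. BM25 over an FTS index) behaves as expected.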

Signature

def tokenize(to_tokenize: str) -> str

Parameters

to_tokenize (string, required)
The raw text string to analyze and tokenize.

Return value

string
A space-joined string of tokens produced by Pyserini’s default Lucene Analyzer (the analyzer returned by get_lucene_analyzer()). The analyzer applies lowercasing, stopword removal, and Porter stemming.

Implementation

The module-level analyzer object is created once at import time:
from pyserini.analysis import Analyzer, get_lucene_analyzer

analyzer = Analyzer(get_lucene_analyzer())

def tokenize(to_tokenize: str) -> str:
    return ' '.join(analyzer.analyze(to_tokenize))

get_lucene_analyzer() returns Pyserini’s default English analyzer, which performs:
  • Lowercasing
  • Stopword removal (English stopword list)
  • Porter stemming
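
Each step is easy to verify on small inputs. A quick sanity check (expected outputs assume the same stopword list and stemmer behavior as the Example section below):

from quackir.analysis import tokenize

print(tokenize("JUMPING"))     # lowercased, then stemmed: 'jump'
print(tokenize("the are to"))  # every token is a stopword: ''
print(tokenize("foxes dogs"))  # plurals are stemmed: 'fox dog'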

When it is called automatically

You do not need to call tokenize directly in most workflows. QuackIR calls it internally in the following places:
Location                            When
DuckDBIndexer.load_jsonl_table      index_type == IndexType.SPARSE and pretokenized=False
SQLiteIndexer.load_jsonl_table      index_type == IndexType.SPARSE and pretokenized=False
PostgresIndexer.load_jsonl_table    index_type == IndexType.SPARSE and pretokenized=False
Searcher.search                     method != SearchType.DENSE and tokenize_query=True
Pass pretokenized=True to indexers or tokenize_query=False to searchers to skip automatic tokenization when your data is already tokenized.
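
As a query-side sketch, you can pre-tokenize a query yourself and disable the automatic call. The Searcher import path and constructor argument below are assumptions for illustration; only the tokenize_query parameter comes from the table above:

from quackir.analysis import tokenize
from quackir.search import Searcher  # assumed import path, not confirmed by this page

searcher = Searcher("my_index.db")   # assumption: constructed from the database path
query = tokenize("Where do quick foxes jump?")       # pre-tokenize the query manually
hits = searcher.search(query, tokenize_query=False)  # skip the automatic tokenize call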

Example

from quackir.analysis import tokenize

raw = "The quick brown foxes are jumping over lazy dogs"
tokens = tokenize(raw)
print(tokens)
# 'quick brown fox jump lazi dog'

Use tokenize directly when you want to pre-process a corpus offline and store already-tokenized text. This avoids redundant tokenization at index load time when you later pass pretokenized=True:
import json
from quackir.analysis import tokenize
from quackir import IndexType
from quackir.index import DuckDBIndexer

# Pre-tokenize and write a new JSONL file
with open("corpus.jsonl") as fin, open("tokenized.jsonl", "w") as fout:
    for line in fin:
        doc = json.loads(line)
        doc["contents"] = tokenize(doc["contents"])
        fout.write(json.dumps(doc) + "\n")

# Load the pre-tokenized file, skipping the tokenization step
indexer = DuckDBIndexer("my_index.db")
indexer.init_table("corpus", IndexType.SPARSE)
indexer.load_table("corpus", "tokenized.jsonl", pretokenized=True)
indexer.fts_index("corpus")
indexer.close()
