tokenize

tokenize is a thin wrapper around Pyserini’s default Lucene Analyzer. It converts raw text into a space-joined string of tokens, so documents at indexing time and queries at search time pass through the same normalization pipeline.
from quackir.analysis import tokenize
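
Because both sides share one pipeline, differently inflected forms reduce to the same token. A minimal illustration (expected outputs follow the Porter-stemmed forms shown in the Example section below):

doc_tokens = tokenize("Foxes are jumping")  # expected: 'fox jump'
query_tokens = tokenize("a jumping fox")    # expected: 'jump fox'
# Both sides reduce to the same stemmed vocabulary, so exact term
# matching (e.g. BM25 over an FTS index) behaves as expected.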

Signature

def tokenize(to_tokenize: str) -> str

Parameters

to_tokenize (string, required)
The raw text string to analyze and tokenize.

Return value

string
A space-joined string of tokens produced by Pyserini’s default Lucene Analyzer (the analyzer returned by get_lucene_analyzer()). The analyzer applies lowercasing, stopword removal, and Porter stemming.

Implementation

The module-level analyzer object is created once at import time:
from pyserini.analysis import Analyzer, get_lucene_analyzer

analyzer = Analyzer(get_lucene_analyzer())

def tokenize(to_tokenize: str) -> str:
    return ' '.join(analyzer.analyze(to_tokenize))

get_lucene_analyzer() returns Pyserini’s default English analyzer, which performs:
  • Lowercasing
  • Stopword removal (English stopword list)
  • Porter stemming
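
Each step is easy to verify on small inputs. A quick sanity check (expected outputs assume the same stopword list and stemmer behavior as the Example section below):

from quackir.analysis import tokenize

print(tokenize("JUMPING"))     # lowercased, then stemmed: 'jump'
print(tokenize("the are to"))  # every token is a stopword: ''
print(tokenize("foxes dogs"))  # plurals are stemmed: 'fox dog'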

When it is called automatically

You do not need to call tokenize directly in most workflows. QuackIR calls it internally in the following places:
Location                            When
DuckDBIndexer.load_jsonl_table      index_type == IndexType.SPARSE and pretokenized=False
SQLiteIndexer.load_jsonl_table      index_type == IndexType.SPARSE and pretokenized=False
PostgresIndexer.load_jsonl_table    index_type == IndexType.SPARSE and pretokenized=False
Searcher.search                     method != SearchType.DENSE and tokenize_query=True
Pass pretokenized=True to indexers or tokenize_query=False to searchers to skip automatic tokenization when your data is already tokenized.
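
As a query-side sketch, you can pre-tokenize a query yourself and disable the automatic call. The Searcher import path and constructor argument below are assumptions for illustration; only the tokenize_query parameter comes from the table above:

from quackir.analysis import tokenize
from quackir.search import Searcher  # assumed import path, not confirmed by this page

searcher = Searcher("my_index.db")   # assumption: constructed from the database path
query = tokenize("Where do quick foxes jump?")       # pre-tokenize the query manually
hits = searcher.search(query, tokenize_query=False)  # skip the automatic tokenize call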

Example

from quackir.analysis import tokenize

raw = "The quick brown foxes are jumping over lazy dogs"
tokens = tokenize(raw)
print(tokens)
# 'quick brown fox jump lazi dog'

Use tokenize directly when you want to pre-process a corpus offline and store already-tokenized text. This avoids redundant tokenization at index load time when you later pass pretokenized=True:
import json
from quackir.analysis import tokenize
from quackir import IndexType
from quackir.index import DuckDBIndexer

# Pre-tokenize and write a new JSONL file
with open("corpus.jsonl") as fin, open("tokenized.jsonl", "w") as fout:
    for line in fin:
        doc = json.loads(line)
        doc["contents"] = tokenize(doc["contents"])
        fout.write(json.dumps(doc) + "\n")

# Load the pre-tokenized file, skipping the tokenization step
indexer = DuckDBIndexer("my_index.db")
indexer.init_table("corpus", IndexType.SPARSE)
indexer.load_table("corpus", "tokenized.jsonl", pretokenized=True)
indexer.fts_index("corpus")
indexer.close()
