TextField is added to a document) and query time (when a query string is parsed). The same analyzer must produce the same terms at both ends, or queries will not match the indexed content.
What an Analyzer does
An Analyzer builds a TokenStream from input text. The stream passes through a pipeline:
- Tokenizer — splits raw text into an initial sequence of tokens (e.g. splitting on whitespace or Unicode word boundaries).
- TokenFilter chain — each filter transforms the token stream: lowercasing, removing stop words, stemming, synonym expansion, etc.
A custom Analyzer overrides createComponents(String fieldName) to define this pipeline. The fieldName parameter lets you vary the pipeline per field within a single anonymous Analyzer, though PerFieldAnalyzerWrapper is the cleaner approach for most cases.
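A minimal sketch of such an anonymous Analyzer, assuming Lucene 9.x package names (the field name "body" and the class name are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CustomAnalyzerSketch {

  // Anonymous Analyzer: tokenize, lowercase, drop English stop words, stem.
  static final Analyzer ANALYZER = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer tokenizer = new StandardTokenizer();
      TokenStream stream = new LowerCaseFilter(tokenizer);
      stream = new StopFilter(stream, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
      stream = new PorterStemFilter(stream);
      return new TokenStreamComponents(tokenizer, stream);
    }
  };

  // Helper: run text through the analyzer and collect the emitted tokens.
  static List<String> analyze(String text) throws Exception {
    List<String> tokens = new ArrayList<>();
    try (TokenStream ts = ANALYZER.tokenStream("body", text)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) tokens.add(term.toString());
      ts.end();
    }
    return tokens;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(analyze("The running dogs")); // [run, dog]
  }
}
```

Note the consumer contract on the helper: reset() before the first incrementToken(), end() after the last, and close the stream (the try-with-resources handles closing).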
Built-in analyzers
StandardAnalyzer
Tokenizes on Unicode text segmentation boundaries, lowercases, and removes a configurable stop word list. A good general-purpose default, and the analyzer used by the no-argument IndexWriterConfig constructor when none is supplied.
WhitespaceAnalyzer
Splits only on whitespace. Does not lowercase or remove stop words. Useful when case and punctuation are significant.
KeywordAnalyzer
Treats the entire input as a single token, the analyzer-level equivalent of StringField's unanalyzed behavior. Useful in PerFieldAnalyzerWrapper for ID or category fields.
StandardAnalyzer ships in lucene-core. Additional analyzers, including WhitespaceAnalyzer and KeywordAnalyzer, are in the lucene-analysis-common module.
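The difference between the three is easiest to see on a concrete input. A hedged sketch (Lucene 9.x; the field name "f" is arbitrary, and the no-argument StandardAnalyzer constructor is assumed to use an empty stop set, as in recent versions):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class BuiltInAnalyzers {

  // Run text through an analyzer and collect the emitted tokens.
  static List<String> tokens(Analyzer analyzer, String text) throws Exception {
    List<String> out = new ArrayList<>();
    try (TokenStream ts = analyzer.tokenStream("f", text)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) out.add(term.toString());
      ts.end();
    }
    return out;
  }

  public static void main(String[] args) throws Exception {
    String text = "Quick-Brown FOX";
    // StandardAnalyzer splits on the hyphen and lowercases.
    System.out.println(tokens(new StandardAnalyzer(), text));   // [quick, brown, fox]
    // WhitespaceAnalyzer splits only on whitespace, preserving case.
    System.out.println(tokens(new WhitespaceAnalyzer(), text)); // [Quick-Brown, FOX]
    // KeywordAnalyzer emits the whole input as one token.
    System.out.println(tokens(new KeywordAnalyzer(), text));    // [Quick-Brown FOX]
  }
}
```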
Common token filters
| Filter | Package | Effect |
|---|---|---|
| LowerCaseFilter | lucene-core | Lowercases every token. Almost always the first filter after tokenization. |
| StopFilter | lucene-analysis-common | Removes tokens that appear in a configurable stop word set (e.g. “the”, “a”, “is”). |
| PorterStemFilter | lucene-analysis-common | Applies the Porter stemming algorithm to reduce words to their root form (“running” → “run”). |
| EnglishPossessiveFilter | lucene-analysis-common | Strips possessive ’s from tokens. |
| ASCIIFoldingFilter | lucene-analysis-common | Converts accented characters to their ASCII equivalents (“café” → “cafe”). |
| SynonymGraphFilter | lucene-analysis-common | Expands tokens with configured synonyms at index or query time. |
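These filters compose by wrapping one stream in the next. A hedged sketch (Lucene 9.x packages) chaining possessive stripping, lowercasing, and accent folding:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class FilterChainSketch {

  static final Analyzer ANALYZER = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer tokenizer = new StandardTokenizer();
      TokenStream stream = new EnglishPossessiveFilter(tokenizer); // Renée's -> Renée
      stream = new LowerCaseFilter(stream);                        // Renée -> renée
      stream = new ASCIIFoldingFilter(stream);                     // renée -> renee
      return new TokenStreamComponents(tokenizer, stream);
    }
  };

  static List<String> analyze(String text) throws Exception {
    List<String> tokens = new ArrayList<>();
    try (TokenStream ts = ANALYZER.tokenStream("f", text)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) tokens.add(term.toString());
      ts.end();
    }
    return tokens;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(analyze("Renée's café")); // [renee, cafe]
  }
}
```

Order matters: stripping the possessive before lowercasing and folding keeps each filter's job simple.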
Language analyzers in lucene-analysis-common
The lucene-analysis-common module ships pre-built analyzers for many languages. They combine a language-appropriate tokenizer, stop word list, and stemmer.
EnglishAnalyzer
Combines StandardTokenizer, LowerCaseFilter, StopFilter with English stop words, EnglishPossessiveFilter, and PorterStemFilter.
FrenchAnalyzer
Uses StandardTokenizer with French stop words, ElisionFilter (removes elided articles like “l’”), and FrenchLightStemFilter.
GermanAnalyzer
Uses StandardTokenizer with German stop words, GermanNormalizationFilter, and GermanLightStemFilter.
CJKAnalyzer
Handles Chinese, Japanese, and Korean text using bigram tokenization and CJK-specific stop words.
Per-field analysis with PerFieldAnalyzerWrapper
Different fields often need different analysis strategies. A title field might use EnglishAnalyzer, while a category field should use KeywordAnalyzer to preserve exact values.
PerFieldAnalyzerWrapper (in lucene-analysis-common) routes each field to its own analyzer, falling back to a default for any field not explicitly mapped:
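A hedged sketch of the routing, assuming Lucene 9.x (the field names "title" and "category" are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PerFieldExample {

  // title -> EnglishAnalyzer, category -> KeywordAnalyzer,
  // any other field -> StandardAnalyzer (the default).
  static final Analyzer ANALYZER = new PerFieldAnalyzerWrapper(
      new StandardAnalyzer(),
      Map.of("title", new EnglishAnalyzer(),
             "category", new KeywordAnalyzer()));

  static List<String> analyze(String field, String text) throws Exception {
    List<String> tokens = new ArrayList<>();
    try (TokenStream ts = ANALYZER.tokenStream(field, text)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) tokens.add(term.toString());
      ts.end();
    }
    return tokens;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(analyze("title", "Running Shoes"));    // [run, shoe]
    System.out.println(analyze("category", "Running Shoes")); // [Running Shoes]
  }
}
```

The same text produces stemmed, lowercased tokens for title but a single exact-value token for category.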
PerFieldAnalyzerWrapper can be used for both indexing and query parsing. Pass the same wrapper to your QueryParser so that the query-time analysis matches index-time analysis for each field.
Analysis affects both indexing and querying
When you index a TextField, Lucene runs the field value through the analyzer and stores the resulting tokens. When you search that field using a text query, the query string is analyzed the same way before matching against the index.
StringField and KeywordField are not analyzed at index time. Do not use an analyzing query parser with those field types — use TermQuery or KeywordField.newExactQuery() directly.
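A minimal in-memory roundtrip illustrating the exact-match path, as a sketch assuming Lucene 9.x (ByteBuffersDirectory is the in-memory Directory; the field name "category" is arbitrary):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ExactMatchExample {

  // Index one document with an unanalyzed StringField, then query it back
  // with a TermQuery, which bypasses analysis entirely.
  static int hits() throws Exception {
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter writer =
             new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      doc.add(new StringField("category", "Science Fiction", Field.Store.NO));
      writer.addDocument(doc);
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // The term must equal the indexed value exactly, including case and spaces.
      TermQuery query = new TermQuery(new Term("category", "Science Fiction"));
      return searcher.search(query, 10).scoreDocs.length;
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(hits()); // 1
  }
}
```

An analyzing query parser would have lowercased and split "Science Fiction" into two terms, neither of which exists in the index; the TermQuery matches because it carries the value through untouched.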