The read/write split
Lucene enforces a strict separation between writing to an index and reading from it. This is fundamental to how the library works.

| Concern | Class | Notes |
|---|---|---|
| Write | IndexWriter | Creates, updates, and deletes documents. Holds a write lock on the Directory. Only one IndexWriter may be open per directory at a time. |
| Read | DirectoryReader | Opens the index read-only. A single reader instance represents a consistent point-in-time snapshot. |
| Search | IndexSearcher | Wraps a DirectoryReader and executes Query objects against it. Instances are fully thread-safe. |
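The split can be sketched end to end. A minimal illustration using an in-memory directory (class name, field names, and values are invented for the example):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ReadWriteSplit {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory(); // in-memory; use FSDirectory.open(path) for disk
    int totalDocs;

    // Write side: a single IndexWriter holds the directory's write lock.
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      doc.add(new TextField("body", "hello lucene", Field.Store.YES));
      writer.addDocument(doc);
      writer.commit();

      // Read side: a point-in-time snapshot; later writes are invisible to it.
      DirectoryReader reader = DirectoryReader.open(dir);
      IndexSearcher searcher = new IndexSearcher(reader); // share across threads

      Document second = new Document();
      second.add(new TextField("body", "hello again", Field.Store.YES));
      writer.addDocument(second);
      writer.commit();

      // Refresh cheaply: openIfChanged returns null if nothing changed.
      DirectoryReader fresh = DirectoryReader.openIfChanged(reader);
      if (fresh != null) {
        reader.close();
        reader = fresh;
        searcher = new IndexSearcher(reader);
      }
      totalDocs = reader.numDocs();
      reader.close();
    }
    System.out.println(totalDocs);
  }
}
```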
IndexSearcher instances are completely thread-safe. For performance, share a single IndexSearcher across all threads rather than creating one per request. If the index changes, use DirectoryReader.openIfChanged(DirectoryReader) to get a refreshed reader cheaply.
Index
An index is the on-disk (or in-memory) data structure that Lucene builds from your documents. It lives inside a Directory — an abstraction over a file system path or a byte buffer.
Two common implementations:

- FSDirectory (and its subclass NIOFSDirectory) — writes index files to a directory on disk.
- ByteBuffersDirectory — stores everything in JVM heap memory. Useful for tests and short-lived indexes.

An index contains:

- An inverted index mapping each term to the list of documents containing it
- Stored field values for fields indexed with Field.Store.YES
- Doc values for numeric fields used in sorting and faceting
- Norms used during relevance scoring
Segment
Internally, Lucene never modifies existing data. Instead, each flush from IndexWriter produces a new, immutable segment. A segment is a self-contained mini-index with its own inverted index, stored fields, and deletion bitset.
Over time, Lucene’s merge policy (by default TieredMergePolicy) combines small segments into larger ones in the background. Merging keeps query performance high and reclaims space from deleted documents.
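The merge policy can be tuned through IndexWriterConfig. A minimal sketch (the values shown are only examples, not recommendations):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class MergeTuning {
  public static void main(String[] args) {
    // TieredMergePolicy is already the default; constructing it explicitly
    // just makes the tuning knobs visible.
    TieredMergePolicy policy = new TieredMergePolicy();
    policy.setSegmentsPerTier(10.0);        // allow ~10 segments per size tier before merging
    policy.setMaxMergedSegmentMB(5 * 1024); // cap merged segments at ~5 GB

    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    config.setMergePolicy(policy);
  }
}
```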
Document
A document is the unit of indexing and retrieval in Lucene. It is an unordered collection of fields. There is no schema — every document can have different fields. To retrieve a field's original value at search time, add it with Field.Store.YES.
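Building a document is just adding fields. A small sketch (field names and values are invented for the example):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class BuildDocument {
  public static void main(String[] args) {
    Document doc = new Document();
    // Analyzed full text; Store.YES keeps the original value retrievable at search time.
    doc.add(new TextField("body", "Apache Lucene is fast", Field.Store.YES));
    // Exact-match identifier; never analyzed.
    doc.add(new StringField("path", "docs/index.md", Field.Store.YES));
    // Point-indexed number for range queries and sorting (LongField with a Store
    // argument assumes a recent Lucene 9.x release).
    doc.add(new LongField("modified", 1700000000L, Field.Store.NO));
    System.out.println(doc.getFields().size());
  }
}
```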
Field
A field associates a name with a value. The type of field determines how Lucene indexes and stores the value.
TextField — full-text, analyzed
Use for human-readable text that should be tokenized and searched with terms. The configured Analyzer runs on the value before indexing. Good for: article bodies, product descriptions, commit messages.
StringField — exact match, not analyzed
Use for identifiers, categories, paths, and other values that should match exactly. The analyzer is never applied; the whole string becomes a single term. Good for: file paths, UUIDs, status codes, tags.
KeywordField — exact match with doc values
Similar to StringField but also writes doc values, enabling efficient sorting and faceting in addition to exact-match search. Good for: fields you need to both filter on exactly and sort/facet by.
LongField — numeric range and sorting
Indexes a long value using a point-based data structure. Supports efficient range queries via LongField.newRangeQuery and sorting via LongField.newSortField. Good for: timestamps, prices, counts, version numbers.
KnnFloatVectorField — dense vector for similarity search
Stores a float array and builds an HNSW graph for approximate nearest-neighbor search. Requires specifying a VectorSimilarityFunction. Good for: semantic similarity, embedding-based retrieval, hybrid search.
Term
A term is the atomic unit of search. It is a (field, value) pair — for example, (title, "lucene") or (path, "docs/index.md").
When Lucene analyzes a TextField value, it produces a stream of terms. The string "Apache Lucene is fast" with StandardAnalyzer becomes the terms apache, lucene, fast (stop word “is” is removed).
Inverted index
The inverted index is the data structure at the heart of Lucene. Rather than mapping documents to the words they contain, it maps each word to the documents that contain it.
Conceptual structure

An illustrative index over five small documents might look like this:

| Term | Documents |
|---|---|
| apache | 0, 2 |
| lucene | 0, 2, 4 |
| search | 0, 1, 4 |

For the query "lucene search", Lucene intersects the postings lists for lucene and search, finding doc 0 and doc 4. BM25 then scores each matching document based on term frequency and inverse document frequency.
Why this is fast: looking up a term requires a single seek into a sorted structure (a finite-state automaton in Lucene’s case), returning a compressed list of matching document IDs. Scanning every document’s content is never necessary.
Analysis
Analysis is the process of converting raw text into a stream of terms suitable for indexing or querying. An Analyzer chains together a Tokenizer and zero or more TokenFilters.
For example, StandardAnalyzer applies:
- Unicode-aware tokenization (splits on whitespace and punctuation)
- Lowercase filter
- Stop-word filter (removes common words like “the”, “is”, “and”)
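Running an analyzer by hand shows the resulting terms. A minimal sketch (the field name "body" is arbitrary; whether stop words such as "is" are dropped depends on the stop set the analyzer is configured with):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzeText {
  public static void main(String[] args) throws Exception {
    List<String> terms = new ArrayList<>();
    try (Analyzer analyzer = new StandardAnalyzer();
         TokenStream ts = analyzer.tokenStream("body", "Apache Lucene is fast")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                    // required before the first incrementToken()
      while (ts.incrementToken()) {
        terms.add(term.toString());  // each token, lowercased by the filter chain
      }
      ts.end();
    }
    System.out.println(terms);
  }
}
```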
The lucene-analysis-common module provides many built-in analyzers, including language-specific ones (EnglishAnalyzer, FrenchAnalyzer, etc.) and building blocks for constructing custom pipelines.
Query
A Query describes what to find. Lucene provides many query types, all under org.apache.lucene.search.
TermQuery — exact term match
Matches documents containing an exact term. This is the most basic query and the building block for higher-level queries.
BooleanQuery — combine clauses
Combines multiple queries using MUST (AND), SHOULD (OR), and MUST_NOT (NOT) clauses.
PhraseQuery — ordered word sequence
Matches documents where the specified terms appear in order, optionally with a configurable slop (number of intervening positions allowed).
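The three query types above can be constructed directly. A short sketch (field names and values are invented for the example):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class BuildQueries {
  public static void main(String[] args) {
    // Exact term match, e.g. against a StringField value.
    Query exact = new TermQuery(new Term("path", "docs/index.md"));

    // body MUST contain "lucene" AND MUST NOT contain "solr".
    BooleanQuery bool = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.MUST)
        .add(new TermQuery(new Term("body", "solr")), BooleanClause.Occur.MUST_NOT)
        .build();

    // "apache lucene" as an ordered phrase, allowing one intervening position (slop = 1).
    PhraseQuery phrase = new PhraseQuery(1, "body", "apache", "lucene");

    System.out.println(exact + " | " + bool + " | " + phrase);
  }
}
```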
QueryParser — parse user input
Parses a query string using Lucene's query syntax. Suitable for user-facing search boxes. Supports AND, OR, NOT, field prefixes (title:lucene), wildcards (luce*), and phrase queries ("apache lucene").
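A minimal parsing sketch (requires the lucene-queryparser module; the default field "body" and the query string are arbitrary examples):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class ParseUserInput {
  public static void main(String[] args) throws Exception {
    // "body" is the default field applied to bare terms and phrases.
    QueryParser parser = new QueryParser("body", new StandardAnalyzer());
    Query q = parser.parse("title:lucene AND \"apache lucene\"");
    System.out.println(q); // Lucene's toString rendering of the parsed query tree
  }
}
```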
KnnFloatVectorQuery — approximate nearest neighbor
Finds the k documents whose stored float vector is closest to a query vector, using the HNSW graph built at index time.
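Index-time field and search-time query fit together as follows; a toy sketch with tiny 2-dimensional vectors (real embeddings are much wider, and all names and values here are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class VectorSearch {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      for (float[] v : new float[][] {{1f, 0f}, {0f, 1f}, {0.9f, 0.1f}}) {
        Document doc = new Document();
        // Each document stores one vector; the similarity function is fixed per field.
        doc.add(new KnnFloatVectorField("embedding", v, VectorSimilarityFunction.COSINE));
        writer.addDocument(doc);
      }
    }
    TopDocs top;
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // Find the 2 vectors nearest to the query vector via the HNSW graph.
      top = searcher.search(new KnnFloatVectorQuery("embedding", new float[] {1f, 0f}, 2), 2);
    }
    System.out.println(top.scoreDocs.length);
  }
}
```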
TopDocs and results
IndexSearcher.search(Query, int) returns a TopDocs object containing the top-N matching documents sorted by score.
search and searchAfter count top hits accurately up to 1,000 matches. For result sets larger than 1,000, totalHits.value() may return a lower bound. The scoreDocs array is always accurate. To count all hits precisely, use a TotalHitCountCollectorManager.
Scoring
Relevance is computed by a Similarity implementation; the default is BM25. BM25 scores a document higher when:
- The query term appears frequently in the document (term frequency)
- The term is rare across all documents (inverse document frequency)
- The document is shorter than average (length normalization)
To swap in a different scoring model, call IndexSearcher.setSimilarity(Similarity) before searching.
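Iterating the results looks like this; a minimal sketch (field names and values are invented, and searcher.storedFields() assumes Lucene 9.5 or later):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.StoredFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SearchAndLoad {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      doc.add(new TextField("title", "Apache Lucene in practice", Field.Store.YES));
      writer.addDocument(doc);
    }
    String firstTitle = null;
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // The analyzed TextField lowercased "Lucene", so the term to match is "lucene".
      TopDocs top = searcher.search(new TermQuery(new Term("title", "lucene")), 10);
      StoredFields storedFields = searcher.storedFields();
      for (ScoreDoc sd : top.scoreDocs) {
        Document hit = storedFields.document(sd.doc); // load stored values for this hit
        firstTitle = hit.get("title");
      }
    }
    System.out.println(firstTitle);
  }
}
```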
Putting it all together
- You feed raw text and structured data to an IndexWriter as Document objects.
- The Analyzer tokenizes TextField values into terms.
- IndexWriter writes terms and stored values into a Directory as immutable segments.
- A DirectoryReader opens a consistent snapshot of the index.
- An IndexSearcher executes a Query against the reader, scoring each matching document.
- TopDocs returns the highest-scoring document IDs; you load stored fields to get the original values.
Quickstart
See these concepts in action with a working code example.
Introduction
Back to the overview of what Lucene is and its architecture.