IndexWriter buffer tuning
IndexWriterConfig controls when buffered documents are flushed to a new segment on disk.
RAMPerThreadHardLimitMB (default: 1945 MB) acts as a safety valve that triggers a forced flush on a single indexing thread before it exhausts 32-bit internal addressing. You generally do not need to change it.
TieredMergePolicy
TieredMergePolicy is the default merge policy. It merges segments of approximately equal byte size, subject to a budget of allowed segments per tier.
TieredMergePolicy can merge non-adjacent segments. If monotonically increasing doc IDs are required (e.g. for time-ordered data), use LogByteSizeMergePolicy or LogDocMergePolicy instead.
forceMerge vs. natural merging
Calling IndexWriter.forceMerge(maxNumSegments) bypasses the setMaxMergedSegmentMB constraint when the two settings conflict. For example, if you have fifty 1 GB segments and call forceMerge(5), Lucene will produce five segments of roughly 10 GB each to satisfy the segment count target.
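The segment-size arithmetic in this example can be sketched as follows. This is a minimal illustration, not Lucene code: the class and method names are hypothetical, and it assumes the total index size is spread evenly across the resulting segments and that no space is reclaimed from deleted documents.

```java
public class ForceMergeEstimate {

    /**
     * Hypothetical helper (not a Lucene API): average size in GB of each
     * segment left after merging down to at most maxNumSegments.
     */
    public static double averageMergedSegmentGb(
            int segmentCount, double segmentSizeGb, int maxNumSegments) {
        if (segmentCount <= 0 || maxNumSegments <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        double totalGb = segmentCount * segmentSizeGb;
        // An index that already has fewer segments than the target stays as it is.
        int resultingSegments = Math.min(segmentCount, maxNumSegments);
        return totalGb / resultingSegments;
    }

    public static void main(String[] args) {
        // Fifty 1 GB segments merged down to five segments.
        System.out.println(averageMergedSegmentGb(50, 1.0, 5)); // prints 10.0
    }
}
```

Under these assumptions the average merged segment is larger than any setMaxMergedSegmentMB limit configured below that size, which is the conflict described above.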
ConcurrentMergeScheduler
ConcurrentMergeScheduler (the default) runs each merge in its own background thread.
AUTO_DETECT_MERGES_AND_THREADS (the default) sets maxThreadCount to max(1, min(4, cpuCoreCount/2)) and maxMergeCount to maxThreadCount + 5. On a machine with a spinning disk, call setDefaultMaxMergesAndThreads(true) to use more conservative defaults; the boolean argument tells Lucene whether the storage spins.
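The auto-detected defaults can be reproduced with a few lines of arithmetic. The sketch below only mirrors the formula quoted above. The class and method names are illustrative and not part of Lucene, and reading the core count from Runtime.getRuntime().availableProcessors() is an assumption made for the demo.

```java
public class MergeSchedulerDefaults {

    /** maxThreadCount = max(1, min(4, cpuCoreCount / 2)), using integer division. */
    public static int maxThreadCount(int cpuCoreCount) {
        return Math.max(1, Math.min(4, cpuCoreCount / 2));
    }

    /** maxMergeCount = maxThreadCount + 5. */
    public static int maxMergeCount(int cpuCoreCount) {
        return maxThreadCount(cpuCoreCount) + 5;
    }

    public static void main(String[] args) {
        // Assumption: the JVM-reported processor count stands in for cpuCoreCount.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("cores=" + cores
                + " maxThreadCount=" + maxThreadCount(cores)
                + " maxMergeCount=" + maxMergeCount(cores));
    }
}
```

By this formula a 2-core machine gets a maxThreadCount of 1 and a maxMergeCount of 6, a 6-core machine gets 3 and 8, and any machine with 8 or more cores is capped at 4 and 9.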
Directory choice
The Directory implementation determines how Lucene reads index files. Choose based on your operating system and access pattern.
- MMapDirectory (recommended)
- FSDirectory (auto-select)
- NIOFSDirectory
MMapDirectory uses mmap for reads and delegates to FSIndexOutput for writes. The OS page cache is leveraged directly, avoiding an extra copy from kernel to JVM heap. This is the best choice on 64-bit JVMs with sufficient virtual address space.
On Linux and macOS, Lucene invokes madvise() to advise the kernel on paging behaviour. To enable native access in a modularised application, pass --enable-native-access=org.apache.lucene.core on the Java command line. For classpath-based applications use --enable-native-access=ALL-UNNAMED.
MMapDirectory uses java.lang.foreign.MemorySegment (available since Java 21) to safely unmap files immediately after closing an IndexInput, preventing the stale-mapping bugs common with the legacy sun.misc.Unsafe approach.
Preloading files into memory
MMapDirectory can preload selected files into physical memory immediately on open, reducing cold-cache latency at the cost of slower startup.
Query caching with LRUQueryCache
IndexSearcher uses an LRUQueryCache by default. You can replace or configure it to suit your workload.
The cache works best when shared across multiple searcher instances on the same index. Cache eviction runs in linear time with the number of segments that have cache entries, so avoid sharing one cache across many unrelated indices.
NRT search with SearcherManager
Near-real-time (NRT) search makes newly indexed documents visible without a full commit(). SearcherManager handles the lifecycle of IndexSearcher instances across multiple threads and periodic refreshes.
NRT vs. commit-based search tradeoffs
NRT (DirectoryReader.open(writer))
New documents visible within milliseconds of indexing. No I/O required for a refresh — only in-memory segment state is exchanged. Best for interactive or near-real-time workloads.
Commit-based (DirectoryReader.open(directory))
Reader only sees fully committed, durable segments. Required when the reader and writer are in separate processes or JVMs. Each refresh requires opening a new reader from disk.
Calling maybeRefresh() on a hot path (e.g. before every query) adds latency to the requests that end up reopening the reader. Prefer calling it from a dedicated background thread on a fixed schedule.
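The background-refresh pattern can be sketched with the JDK scheduler alone. Everything named here is hypothetical: Refreshable is a stand-in interface for any object exposing maybeRefresh(), and BackgroundRefresher is not a Lucene class. The sketch uses a fixed delay between runs, which is one reasonable reading of "fixed schedule", so a slow refresh never causes runs to pile up.

```java
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BackgroundRefresher implements AutoCloseable {

    /** Hypothetical stand-in for an object exposing maybeRefresh(), such as a SearcherManager. */
    public interface Refreshable {
        boolean maybeRefresh() throws IOException;
    }

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "nrt-refresh");
                thread.setDaemon(true); // do not keep the JVM alive just for refreshes
                return thread;
            });

    public BackgroundRefresher(Refreshable target, long periodMillis) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                target.maybeRefresh();
            } catch (IOException | RuntimeException e) {
                // Report and continue: an exception escaping this task would
                // cancel all future runs of the schedule.
                System.err.println("refresh failed: " + e);
            }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops the schedule; threads that run queries are unaffected. */
    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With Lucene on the classpath, a method reference such as searcherManager::maybeRefresh should fit the Refreshable shape, since maybeRefresh() returns a boolean and may throw IOException. Query threads then only need acquire() and release(), and never pay for a reopen themselves.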
DocValues vs. stored fields
Use DocValues for sorting and aggregation
Doc values are stored in a columnar format optimised for sequential access. Sorting, faceting, and field collapsing over doc values are significantly faster than loading the same values from stored fields, which require random access per document.
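To see why a columnar layout suits sorting and aggregation, consider a toy model in plain Java. It is not Lucene's on-disk format or API: a long[] indexed by doc ID stands in for a doc-values column, and the class and method names are made up for illustration. Both operations touch only the one column they need and never decode a whole document.

```java
import java.util.Arrays;
import java.util.Comparator;

public class ColumnarSortSketch {

    /**
     * Returns doc IDs ordered by ascending value, where column[docId] holds
     * the value for that document (a toy stand-in for a doc-values column).
     */
    public static int[] sortDocsByColumn(long[] column) {
        Integer[] docIds = new Integer[column.length];
        for (int i = 0; i < docIds.length; i++) {
            docIds[i] = i;
        }
        // The comparator reads only the column; no per-document record is loaded.
        Arrays.sort(docIds, Comparator.comparingLong((Integer docId) -> column[docId]));
        int[] result = new int[docIds.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = docIds[i];
        }
        return result;
    }

    /** Aggregates with a single sequential scan over the column. */
    public static long sum(long[] column) {
        long total = 0;
        for (long value : column) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        long[] price = {30, 10, 20}; // price[docId]
        System.out.println(Arrays.toString(sortDocsByColumn(price))); // prints [1, 2, 0]
        System.out.println(sum(price)); // prints 60
    }
}
```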
NumericDocValues vs. stored fields for numeric data
NumericDocValuesField stores one long per document in a dense array. Access is O(1) and the data is often already in the OS page cache after the first scan. Stored fields are compressed blocks that must be decompressed and decoded to retrieve a single value.
Use NumericDocValuesField whenever you need to:
- Sort search results by a numeric field
- Compute aggregates (sum, min, max) across result sets
- Use the value as a scoring factor in a custom Similarity or FunctionQuery
Use StoredField for values that the application needs to display verbatim and that are not used for sorting or scoring.
SortedNumericDocValues for multi-valued numeric fields
When a document may have multiple values for one numeric field, use SortedNumericDocValuesField. It stores all values for a document in sorted order and supports min/max selection at query time.
Stored fields compression
Lucene104Codec supports two compression modes for stored fields, BEST_SPEED and BEST_COMPRESSION, selected at index-creation time.
BEST_COMPRESSION is a good choice when I/O bandwidth is scarce (e.g. network-attached storage or HDDs) and CPU is plentiful.
Summary checklist
Indexing throughput
- Increase RAMBufferSizeMB (e.g. 256–512 MB) to produce larger initial segments.
- Use ConcurrentMergeScheduler.setMaxMergesAndThreads() to match your core count.
- Disable compound files with setUseCompoundFile(false) for batch indexing to avoid extra I/O during the compound-file creation step.
Search latency
- Use MMapDirectory so the OS page cache serves reads without JVM heap pressure.
- Keep the segment count low with a well-tuned TieredMergePolicy.
- Warm the SearcherManager after each refresh using SearcherFactory.
- Size LRUQueryCache to cover your hot query set.
Index size
- Switch to Lucene104Codec.Mode.BEST_COMPRESSION.
- Lower TieredMergePolicy.setDeletesPctAllowed() to reclaim deleted-doc space sooner.
- Avoid storing fields that are only needed for sorting; use NumericDocValuesField instead.