The indexing service transforms raw book content into a searchable inverted index. It listens on the ActiveMQ queue documents.ingested, reads each book’s BookContent from the Hazelcast "datalake" IMap, runs term frequency analysis, and writes the results into the "inverted-index" IMap. Metadata extracted from the Gutenberg header (title, author, language, year) is stored separately in "bookMetadata". On startup the service walks the local filesystem datalake and re-indexes any books that are missing from the index, providing crash recovery without operator intervention.
Indexing pipeline
1. Consume message: ActiveMQMessageConsumer reads a {"bookId": N} message from documents.ingested using CLIENT_ACKNOWLEDGE. The message is acknowledged only after the full indexing flow completes successfully.
2. Read BookContent from Hazelcast: HazelcastBookStore retrieves the BookContent (header and body strings) from the "datalake" IMap using the book ID as the key.
3. Run TermFrequencyAnalyzer: The body text is passed to TermFrequencyAnalyzer.analyze(), which delegates to TextTokenizer and then uses a parallel stream with groupingByConcurrent to count occurrences of each token.
4. Write to inverted index: For every term–frequency pair, HazelcastIndexStore.addEntry() buffers the entry in a local ConcurrentHashMap. When the batch is complete, pushEntries() merges the local buffer into the distributed "inverted-index" IMap in parallel using IMap.merge().
Tokenization
TextTokenizer applies the following transformations in order:
- Lowercase: the entire input is lowercased.
- Strip non-alphanumeric characters: every character that does not match [a-z0-9\s] is replaced with a space.
- Split on whitespace: the cleaned string is split on one or more whitespace characters.
- Drop short tokens: tokens with length ≤ 2 characters are discarded.
- Remove stopwords: tokens present in the loaded stopword set are dropped.
Stopwords are loaded from stopwords-iso.json at startup by JsonStopWordsLoader, which merges the English and Spanish word lists into a single Set<String>.
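The tokenization steps above, followed by the parallel counting step from the indexing pipeline, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the service's actual code: the class name TokenizerSketch, its method names, and the four-word sample stopword set are hypothetical, and the real set comes from JsonStopWordsLoader.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch; names are illustrative, not taken from the service.
final class TokenizerSketch {

    // Assumed sample only; the real set is loaded from stopwords-iso.json.
    static final Set<String> SAMPLE_STOPWORDS = Set.of("the", "and", "los", "las");

    static List<String> tokenize(String text, Set<String> stopwords) {
        // Steps 1 and 2: lowercase, then replace anything outside [a-z0-9\s] with a space.
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9\\s]", " ");
        // Step 3: split on runs of whitespace.
        return Arrays.stream(cleaned.trim().split("\\s+"))
                // Step 4: drop tokens of length <= 2 (this also drops an empty token).
                .filter(token -> token.length() > 2)
                // Step 5: drop stopwords.
                .filter(token -> !stopwords.contains(token))
                .collect(Collectors.toList());
    }

    static Map<String, Long> termFrequencies(List<String> tokens) {
        // Count occurrences per token on a parallel stream into a concurrent map.
        return tokens.parallelStream()
                .collect(Collectors.groupingByConcurrent(Function.identity(), Collectors.counting()));
    }
}
```

With the sample stopword set, tokenizing "The Monster and the monster's DOG, ok!" yields the tokens monster, monster, dog, and termFrequencies then maps monster to 2 and dog to 1.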
Inverted index entry format
Each key in the "inverted-index" IMap is a lowercase term string. Its value is a Set<String> in which every element encodes a document ID and term frequency as docId:frequency. For example, "84:7" means book 84 contains the term 7 times. The set grows atomically via IMap.merge() so concurrent indexer nodes can write to the same key without overwriting each other’s data.
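To make the entry format and the merge behaviour concrete, here is a small sketch that uses a local ConcurrentHashMap as a stand-in for the distributed "inverted-index" IMap. The class PostingSketch and its helper methods are hypothetical, and the set-union remapping function is an assumption about how postings are combined, not a copy of HazelcastIndexStore.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; a ConcurrentHashMap passed in as `index` stands in for the IMap.
final class PostingSketch {

    // Encodes one posting in the docId:frequency format.
    static String encode(int docId, long frequency) {
        return docId + ":" + frequency;
    }

    static int docIdOf(String posting) {
        return Integer.parseInt(posting.substring(0, posting.indexOf(':')));
    }

    static long frequencyOf(String posting) {
        return Long.parseLong(posting.substring(posting.indexOf(':') + 1));
    }

    // Adds a posting for `term`, merging with any postings already stored under that key.
    static void addPosting(Map<String, Set<String>> index, String term, int docId, long frequency) {
        index.merge(term, Set.of(encode(docId, frequency)), (existing, incoming) -> {
            // Assumed merge rule: union of the existing and incoming posting sets.
            Set<String> union = new HashSet<>(existing);
            union.addAll(incoming);
            return union;
        });
    }
}
```

Adding (84, 7) and then (1342, 2) for the term "monster" leaves a set containing both "84:7" and "1342:2"; the second write extends the set instead of replacing it.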
Startup recovery
On every startup, InvertedIndexRecovery.executeRecovery() walks the datalake filesystem tree and finds all *_body.txt files. For each file it:
- Extracts the book ID from the filename (the integer prefix before _body.txt).
- Reads the corresponding *_header.txt file from the same directory.
- Saves the BookContent back to the "datalake" IMap via BookStore.save().
- Calls indexBook.execute(bookId) to (re-)index the book.
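The filesystem part of the recovery steps above might look something like the sketch below. It covers only the scan: finding body files, deriving the book ID, and reading the sibling header. RecoveryScanSketch and RecoveredBook are hypothetical names, and the sketch assumes every body file has a header next to it (a missing header raises an IOException here). It requires Java 11 or later for Files.readString.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the scan; it does not touch Hazelcast or the indexer.
final class RecoveryScanSketch {
    static final String BODY_SUFFIX = "_body.txt";
    static final String HEADER_SUFFIX = "_header.txt";

    // Illustrative holder, not the service's BookContent type.
    static final class RecoveredBook {
        final int bookId;
        final String header;
        final String body;

        RecoveredBook(int bookId, String header, String body) {
            this.bookId = bookId;
            this.header = header;
            this.body = body;
        }
    }

    // The book ID is the integer prefix before _body.txt.
    static int bookIdOf(Path bodyFile) {
        String name = bodyFile.getFileName().toString();
        return Integer.parseInt(name.substring(0, name.length() - BODY_SUFFIX.length()));
    }

    static List<RecoveredBook> scan(Path root) throws IOException {
        List<Path> bodies;
        try (Stream<Path> paths = Files.walk(root)) {
            bodies = paths.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(BODY_SUFFIX))
                    .sorted()
                    .collect(Collectors.toList());
        }
        List<RecoveredBook> books = new ArrayList<>();
        for (Path body : bodies) {
            String name = body.getFileName().toString();
            String prefix = name.substring(0, name.length() - BODY_SUFFIX.length());
            // The header lives in the same directory and shares the filename prefix.
            Path header = body.resolveSibling(prefix + HEADER_SUFFIX);
            books.add(new RecoveredBook(bookIdOf(body), Files.readString(header), Files.readString(body)));
        }
        return books;
    }
}
```

In the service, each recovered book would then be saved through BookStore.save() and passed to indexBook.execute(bookId); those calls are left out so the sketch stays self-contained.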
Rebuild coordination
A rebuild can be triggered through the HTTP API (POST /index/rebuild). CoordinateRebuild publishes a {"epoch": <timestampMs>} command to the ActiveMQ topic index.rebuild.command. RebuildMessageListener on every indexing node subscribes to this topic and delegates to ReindexingExecutor, which clears the "inverted-index", "bookMetadata", and "indexingRegistry" distributed structures and re-runs InvertedIndexRecovery. Completion is coordinated via a Hazelcast CP CountDownLatch named "rebuild-latch".
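The latch-based completion pattern can be illustrated in a single JVM. In the sketch below, java.util.concurrent.CountDownLatch and a thread pool stand in for the Hazelcast CP latch and the indexing nodes, so it shows the coordination idea only and not the distributed implementation. RebuildLatchSketch, its method names, and the timeout parameter are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical single-JVM sketch; threads play the role of indexing nodes.
final class RebuildLatchSketch {

    // Builds the command payload in the {"epoch": <timestampMs>} shape.
    static String rebuildCommand(long epochMs) {
        return "{\"epoch\": " + epochMs + "}";
    }

    // Runs `reindexOnNode` once per simulated node and waits until all have finished.
    // Returns false if the timeout elapses first.
    static boolean rebuild(int nodeCount, Runnable reindexOnNode, long timeoutSeconds)
            throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(nodeCount);
        ExecutorService nodes = Executors.newFixedThreadPool(nodeCount);
        try {
            for (int i = 0; i < nodeCount; i++) {
                nodes.execute(() -> {
                    try {
                        reindexOnNode.run(); // clear structures and re-run recovery
                    } finally {
                        latch.countDown();   // signal completion even if re-indexing fails
                    }
                });
            }
            return latch.await(timeoutSeconds, TimeUnit.SECONDS);
        } finally {
            nodes.shutdown();
        }
    }
}
```

Counting down in a finally block is a design choice of this sketch: it keeps the coordinator from waiting on a node whose re-index threw an exception. Whether ReindexingExecutor behaves this way is not stated in this page.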
HTTP endpoints
| Method | Path | Description |
|---|---|---|
| POST | /index/document/{documentId} | Manually triggers indexing of a single document by integer ID. Returns 400 if the ID is not a valid integer. |
| POST | /index/rebuild | Coordinates a full index rebuild across all cluster nodes via the index.rebuild.command topic. |
| GET | /health | Returns {"status": "healthy"}. |
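As a usage illustration, the requests for these endpoints can be built with the JDK's java.net.http client (Java 11 or later). The base URL is supplied by the caller because the service's host and port are not given here, and sending the POST requests without a body is an assumption. IndexApiRequests is a hypothetical helper class.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper; only the paths and methods come from the table above.
final class IndexApiRequests {

    // POST /index/document/{documentId}
    static HttpRequest indexDocument(String baseUrl, int documentId) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/index/document/" + documentId))
                .POST(HttpRequest.BodyPublishers.noBody()) // assumed: no request body needed
                .build();
    }

    // POST /index/rebuild
    static HttpRequest rebuild(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/index/rebuild"))
                .POST(HttpRequest.BodyPublishers.noBody()) // assumed: no request body needed
                .build();
    }

    // GET /health
    static HttpRequest health(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/health"))
                .GET()
                .build();
    }
}
```

A request built this way can be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). Taking the document ID as an int means a non-integer ID, the case that returns 400, cannot be constructed through this helper.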