The indexing service transforms raw book content into a searchable inverted index. It listens on the ActiveMQ queue documents.ingested, reads each book’s BookContent from the Hazelcast "datalake" IMap, runs term frequency analysis, and writes the results into the "inverted-index" IMap. Metadata extracted from the Gutenberg header (title, author, language, year) is stored separately in "bookMetadata". On startup the service walks the local filesystem datalake and re-indexes any books that are missing from the index, providing crash recovery without operator intervention.

Indexing pipeline

1. Consume message: ActiveMQMessageConsumer reads a {"bookId": N} message from documents.ingested using CLIENT_ACKNOWLEDGE. The message is acknowledged only after the full indexing flow completes successfully.
2. Read BookContent from Hazelcast: HazelcastBookStore retrieves the BookContent (header and body strings) from the "datalake" IMap using the book ID as the key.
3. Run TermFrequencyAnalyzer: the body text is passed to TermFrequencyAnalyzer.analyze(), which delegates to TextTokenizer and then uses a parallel stream with groupingByConcurrent to count occurrences of each token.
4. Write to inverted index: for every term–frequency pair, HazelcastIndexStore.addEntry() buffers the entry in a local ConcurrentHashMap. When the batch is complete, pushEntries() merges the local buffer into the distributed "inverted-index" IMap in parallel using IMap.merge().
5. Parse and store metadata: MetadataParser scans the header section for Gutenberg fields (title, author, language, release year) and HazelcastMetadataStore writes a BookMetadata record into the "bookMetadata" IMap.
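The counting in step 3 can be sketched as follows. Tokenization is omitted; only the parallel stream with groupingByConcurrent is named in the text, the class and method shape here are a minimal stand-in:

```java
import java.util.List;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class TermFrequency {
    // Minimal sketch of the counting step: a parallel stream collected
    // with groupingByConcurrent, as described for TermFrequencyAnalyzer.
    public static ConcurrentMap<String, Long> count(List<String> tokens) {
        return tokens.parallelStream()
                .collect(Collectors.groupingByConcurrent(
                        Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        var tf = count(List.of("whale", "sea", "whale", "ship", "whale"));
        System.out.println(tf.get("whale")); // 3
    }
}
```

groupingByConcurrent collects directly into a single ConcurrentMap, which avoids the per-thread map merging that plain groupingBy would do under a parallel stream.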

Tokenization

TextTokenizer applies the following transformations in order:
  1. Lowercase — the entire input is lowercased.
  2. Strip non-alphanumeric characters — everything that is not [a-z0-9\s] is replaced with a space.
  3. Split on whitespace — the cleaned string is split on one or more whitespace characters.
  4. Drop short tokens — tokens with length ≤ 2 characters are discarded.
  5. Remove stopwords — tokens present in the loaded stopword set are dropped.
Steps 3–5 run in a parallel stream. Stopwords are loaded from stopwords-iso.json at startup by JsonStopWordsLoader, which merges the English and Spanish word lists into a single Set<String>.
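The five steps can be sketched as one pipeline. This is a minimal stand-in for TextTokenizer, with the stopword set passed in rather than loaded from stopwords-iso.json:

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class Tokenizer {
    private static final Pattern NON_ALNUM = Pattern.compile("[^a-z0-9\\s]");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static List<String> tokenize(String text, Set<String> stopwords) {
        String cleaned = NON_ALNUM.matcher(text.toLowerCase())
                .replaceAll(" ");                     // steps 1-2: lowercase, strip
        return WHITESPACE.splitAsStream(cleaned.trim()) // step 3: split on whitespace
                .parallel()
                .filter(t -> t.length() > 2)          // step 4: drop short tokens
                .filter(t -> !stopwords.contains(t))  // step 5: remove stopwords
                .collect(Collectors.toList());
    }
}
```

Collecting with toList() preserves encounter order even under a parallel stream, so the output tokens stay in document order.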

Inverted index entry format

Each key in the "inverted-index" IMap is a lowercase term string. Its value is a Set<String> where every element encodes a document ID and term frequency:
"inverted-index"
  key:   "whale"
  value: { "2489:14", "84:7", "9147:3" }
Each entry in the set has the form docId:frequency, for example "84:7" means book 84 contains the term 7 times. The set grows atomically via IMap.merge() so concurrent indexer nodes can write to the same key without overwriting each other’s data.
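The encoding and set-union merge can be sketched with a local ConcurrentHashMap standing in for the distributed IMap; the method names mirror the text, but the implementation details are assumptions:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IndexEntries {
    // Local stand-in for the "inverted-index" IMap:
    // term -> set of "docId:frequency" strings.
    public static final Map<String, Set<String>> index = new ConcurrentHashMap<>();

    // Encodes one posting in the documented docId:frequency form.
    public static String encode(int docId, long frequency) {
        return docId + ":" + frequency;
    }

    // Merges a posting into the index by set union, so concurrent writers
    // to the same term never overwrite each other's entries.
    public static void addEntry(String term, int docId, long frequency) {
        index.merge(term, Set.of(encode(docId, frequency)),
                (existing, incoming) -> {
                    Set<String> merged = new HashSet<>(existing);
                    merged.addAll(incoming);
                    return merged;
                });
    }
}
```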

Startup recovery

On every startup, InvertedIndexRecovery.executeRecovery() walks the datalake filesystem tree and finds all *_body.txt files. For each file it:
  1. Extracts the book ID from the filename (the integer prefix before _body.txt).
  2. Reads the corresponding *_header.txt file from the same directory.
  3. Saves the BookContent back to the "datalake" IMap via BookStore.save().
  4. Calls indexBook.execute(bookId) to (re-)index the book.
This means a node that joins a cluster after a crash — or a brand-new node with a pre-populated volume — will automatically catch up with all existing data before it starts consuming new ActiveMQ messages.
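The ID-extraction step (1) above can be sketched as follows, assuming file names of the form 84_body.txt as described:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecoveryPaths {
    // Matches the documented naming scheme: an integer prefix before _body.txt.
    private static final Pattern BODY_FILE = Pattern.compile("(\\d+)_body\\.txt");

    // Returns the book ID for a *_body.txt file name, or empty for other files.
    public static Optional<Integer> bookIdFromFileName(String fileName) {
        Matcher m = BODY_FILE.matcher(fileName);
        return m.matches() ? Optional.of(Integer.parseInt(m.group(1)))
                           : Optional.empty();
    }
}
```

Returning empty for non-matching names lets the filesystem walk skip header files and stray entries instead of failing the whole recovery pass.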

Rebuild coordination

A rebuild can be triggered through the HTTP API (POST /index/rebuild). CoordinateRebuild publishes a {"epoch": <timestampMs>} command to the ActiveMQ topic index.rebuild.command. RebuildMessageListener on every indexing node subscribes to this topic and delegates to ReindexingExecutor, which clears the "inverted-index", "bookMetadata", and "indexingRegistry" distributed structures and re-runs InvertedIndexRecovery. Completion is coordinated via a Hazelcast CP CountDownLatch named "rebuild-latch".
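The completion handshake can be sketched locally with a JDK CountDownLatch standing in for the Hazelcast CP latch named "rebuild-latch"; the node threads and the nodeCount parameter here are illustrative assumptions, not the service's actual wiring:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class RebuildCoordination {
    // Each "node" re-runs its recovery and counts down; the coordinator
    // waits until every node has signalled completion or the timeout fires.
    public static boolean awaitRebuild(int nodeCount, Runnable reindexOnNode,
                                       long timeout, TimeUnit unit)
            throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(nodeCount);
        for (int i = 0; i < nodeCount; i++) {
            new Thread(() -> {
                reindexOnNode.run();   // clear structures + re-run recovery
                latch.countDown();     // signal this node's completion
            }).start();
        }
        return latch.await(timeout, unit);
    }
}
```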

HTTP endpoints

Method  Path                          Description
POST    /index/document/{documentId}  Manually triggers indexing of a single document by integer ID. Returns 400 if the ID is not a valid integer.
POST    /index/rebuild                Coordinates a full index rebuild across all cluster nodes via the Hazelcast topic.
GET     /health                       Returns {"status": "healthy"}.
All responses are JSON. The service listens on port 7002.
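The integer-ID check on POST /index/document/{documentId} can be sketched as below. Only the 400 on a non-integer ID comes from the table above; the success status code is an illustrative assumption:

```java
public class RouteValidation {
    // Documented behavior: non-integer document IDs are rejected with 400.
    public static int statusFor(String documentId) {
        try {
            Integer.parseInt(documentId);
            return 200; // assumed success status, not stated in the docs
        } catch (NumberFormatException e) {
            return 400; // documented: ID is not a valid integer
        }
    }
}
```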
