
A full index rebuild wipes all distributed index state and reconstructs the inverted index from the raw files stored in the datalake filesystem. You would trigger one after recovering from severe data corruption, after restoring a datalake backup that predates the current in-memory state, or after a configuration change that requires re-tokenizing all documents with a different strategy. On normal startup, the indexing service also performs an automatic recovery pass using the same underlying mechanism — no manual trigger is needed for that case.
A manual rebuild clears all distributed state before re-indexing: the inverted-index IMap, the bookMetadata IMap, the indexingRegistry and log ISets, the books IQueue, and the queueInitialized CP atomic long. Searches that run while the rebuild is in progress will return incomplete results, and ingestion is paused for the duration. Do not trigger a rebuild unless you intend to fully reconstruct the index from disk.

Automatic startup recovery

When an indexing node starts, ReindexingExecutor.executeRecovery() is called automatically. It invokes InvertedIndexRecovery, which walks the local datalake/ filesystem, reads every {id}_header.txt and {id}_body.txt pair, saves the content back to Hazelcast, and runs the indexing use case for each book. It returns the highest book ID found, which is then used to seed the "books" IQueue so that ingestion resumes from that point forward rather than re-crawling already-stored books. This recovery runs on each node independently at startup and does not require coordination. It is not a full cluster-wide rebuild — it only indexes the files that are locally present on that node’s datalake volume.
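
As a concrete illustration, here is a minimal sketch of how this startup pass might be wired. ReindexingExecutor, executeRecovery(), InvertedIndexRecovery, and IngestionQueueManager.setupBookQueue() are named in these docs; the interface shapes and the recoverFromDatalake() method name are assumptions for illustration.

```java
// Minimal stand-ins for collaborators described in these docs; the real
// method names and signatures may differ.
interface InvertedIndexRecovery { long recoverFromDatalake(); }
interface IngestionQueueManager { void setupBookQueue(long maxBookId); }

public class ReindexingExecutor {

    private final InvertedIndexRecovery recovery;
    private final IngestionQueueManager queueManager;

    public ReindexingExecutor(InvertedIndexRecovery recovery,
                              IngestionQueueManager queueManager) {
        this.recovery = recovery;
        this.queueManager = queueManager;
    }

    /** Invoked automatically when an indexing node starts. */
    public void executeRecovery() {
        // Walk datalake/, re-save each header/body pair to Hazelcast, and
        // index each book; returns the highest book ID found on this volume.
        long maxBookId = recovery.recoverFromDatalake();
        // Seed the "books" IQueue so ingestion resumes past stored books.
        queueManager.setupBookQueue(maxBookId);
    }
}
```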

Manual coordinated rebuild

To trigger a full coordinated rebuild across all active indexing nodes, send a POST request to any indexing service:
```bash
curl -X POST http://localhost:7002/index/rebuild
```
The endpoint returns immediately with a confirmation message. The actual rebuild runs asynchronously and is coordinated through the flow described below.
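
For reference, a minimal sketch of what this endpoint could look like, assuming a Javalin-style HTTP layer (the docs do not state which web framework the service uses); the response text and the CompletableFuture hand-off are illustrative:

```java
import io.javalin.Javalin;
import java.util.concurrent.CompletableFuture;

interface CoordinateRebuild { void execute(); }  // class named in these docs

public class RebuildEndpoint {

    public static void register(Javalin app, CoordinateRebuild coordinateRebuild) {
        app.post("/index/rebuild", ctx -> {
            // Kick off the coordinated rebuild on a background thread ...
            CompletableFuture.runAsync(coordinateRebuild::execute);
            // ... and acknowledge the request immediately.
            ctx.result("Index rebuild triggered");
        });
    }
}
```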

Rebuild flow

1. Pause ingestion

CoordinateRebuild.execute() counts the number of active cluster members with the role=indexer attribute, then publishes an INGESTION_PAUSE command to the ingestion.control ActiveMQ topic. All ingestion nodes receive this command (via their durable subscribers) and stop emitting new indexing events.
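
A hedged sketch of both actions, assuming Hazelcast member attributes and the classic javax.jms ActiveMQ client; the broker URL and the plain-text INGESTION_PAUSE payload are assumptions:

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory;
import com.hazelcast.core.HazelcastInstance;

public class PauseIngestion {

    /** Counts cluster members tagged with the role=indexer attribute. */
    static long countIndexers(HazelcastInstance hz) {
        return hz.getCluster().getMembers().stream()
                .filter(m -> "indexer".equals(m.getAttribute("role")))
                .count();
    }

    /** Publishes INGESTION_PAUSE to the ingestion.control topic. */
    static void publishPause() throws Exception {
        ConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616"); // assumed URL
        Connection conn = factory.createConnection();
        try {
            conn.start();
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = session.createTopic("ingestion.control");
            MessageProducer producer = session.createProducer(topic);
            // Every ingestion node's durable subscriber receives this command.
            producer.send(session.createTextMessage("INGESTION_PAUSE"));
        } finally {
            conn.close();
        }
    }
}
```
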
2. Size the coordination latch

A Hazelcast CP CountDownLatch named "rebuild-latch" is created (or reset) with a count equal to the number of active indexer nodes. This ensures the coordinator waits for every participating node to finish before resuming ingestion.
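
With Hazelcast's CP subsystem this step reduces to a couple of calls, sketched below; trySetCount() only succeeds while the count is zero, which matches the "created (or reset)" behavior described above:

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.ICountDownLatch;

public class RebuildLatch {

    /** Sizes the rebuild latch to one count per active indexer node. */
    static ICountDownLatch size(HazelcastInstance hz, int indexerCount) {
        ICountDownLatch latch =
                hz.getCPSubsystem().getCountDownLatch("rebuild-latch");
        // trySetCount only takes effect when the current count is zero,
        // i.e. between rebuilds; an in-flight rebuild keeps its count.
        latch.trySetCount(indexerCount);
        return latch;
    }
}
```
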
3. Broadcast the rebuild command

CoordinateRebuild publishes a {"epoch": <timestampMs>} JSON message to the ActiveMQ topic index.rebuild.command. Every indexing node receives this message via its RebuildMessageListener (each subscribes with a unique UUID-based client ID).
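
A sketch of the producer side and the subscriber connection setup; the payload construction and the client-ID prefix are illustrative:

```java
import java.util.UUID;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.Topic;

public class RebuildBroadcast {

    /** Coordinator side: broadcast the rebuild command with an epoch stamp. */
    static void publish(Session session) throws JMSException {
        Topic topic = session.createTopic("index.rebuild.command");
        MessageProducer producer = session.createProducer(topic);
        // The epoch lets listeners tell this rebuild apart from stale commands.
        String payload = "{\"epoch\": " + System.currentTimeMillis() + "}";
        producer.send(session.createTextMessage(payload));
    }

    /** Listener side: a unique client ID gives every node its own copy. */
    static Connection subscriberConnection(ConnectionFactory factory) throws JMSException {
        Connection conn = factory.createConnection();
        conn.setClientID("rebuild-listener-" + UUID.randomUUID());
        return conn;
    }
}
```
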
4. Each node waits for cluster sync, then clears and re-indexes

On receipt of the RebuildCommand, each RebuildMessageListener waits 10 seconds for cluster state to stabilize, then calls ReindexingExecutor.rebuildIndex(). That method:
  1. Stops the queue population loop.
  2. Clears the "log", "indexingRegistry", "inverted-index", "bookMetadata", and "books" distributed structures.
  3. Resets the "queueInitialized" CP atomic long to 0.
  4. Calls executeRecovery(), which walks the datalake filesystem and re-indexes every book found.
Once complete, the node calls countDown() on the "rebuild-latch" CP latch.
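
Put together, the node-side sequence might look like the sketch below; the Runnable hooks stand in for internals these docs do not show (the queue population loop and the executeRecovery() wiring), while the structure names match the ones listed above:

```java
import java.util.concurrent.TimeUnit;
import com.hazelcast.core.HazelcastInstance;

public class RebuildMessageHandler {

    static void onRebuildCommand(HazelcastInstance hz,
                                 Runnable stopQueueLoop,
                                 Runnable executeRecovery) throws InterruptedException {
        // Give cluster state time to stabilize after the broadcast.
        TimeUnit.SECONDS.sleep(10);

        // 1. Stop the queue population loop.
        stopQueueLoop.run();

        // 2. Clear every distributed structure that holds index state.
        hz.getSet("log").clear();
        hz.getSet("indexingRegistry").clear();
        hz.getMap("inverted-index").clear();
        hz.getMap("bookMetadata").clear();
        hz.getQueue("books").clear();

        // 3. Reset the CP atomic long that guards queue initialization.
        hz.getCPSubsystem().getAtomicLong("queueInitialized").set(0);

        // 4. Walk the datalake filesystem and re-index every book found.
        executeRecovery.run();

        // Signal the coordinator that this node has finished.
        hz.getCPSubsystem().getCountDownLatch("rebuild-latch").countDown();
    }
}
```
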
5. Coordinator waits for all nodes

The coordinator thread (started in step 1) blocks on latch.await(1, TimeUnit.HOURS). It waits until all indexing nodes have counted down, confirming that the full cluster has finished rebuilding.
6. Resume ingestion

After the latch reaches zero, the coordinator publishes INGESTION_RESUME to the ingestion.control topic. All ingestion nodes receive this and resume crawling from the highest book ID found during the rebuild.
The coordinator waits up to one hour for all nodes to count down the CP latch. If a node crashes mid-rebuild, it never counts down, and the coordinator times out after one hour, logging REBUILD TIMEOUT at ERROR level; ingestion remains paused. If this happens, restart the crashed node (it will re-execute its rebuild on startup and count down the latch), or restart the ingestion service containers directly to resume ingestion.
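The coordinator's wait-and-resume logic reduces to a timed latch await, sketched below; the Runnable stands in for publishing INGESTION_RESUME to ingestion.control, and java.util.logging stands in for the service's actual logger:

```java
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;
import com.hazelcast.cp.ICountDownLatch;

public class RebuildCoordinator {

    private static final Logger LOG =
            Logger.getLogger(RebuildCoordinator.class.getName());

    static void awaitAndResume(ICountDownLatch latch, Runnable resumeIngestion)
            throws InterruptedException {
        // Block until every indexer has counted down, or one hour passes.
        if (latch.await(1, TimeUnit.HOURS)) {
            resumeIngestion.run();  // publish INGESTION_RESUME
        } else {
            // A crashed node never counts down; ingestion stays paused until
            // an operator restarts that node or the ingestion containers.
            LOG.severe("REBUILD TIMEOUT");
        }
    }
}
```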

How InvertedIndexRecovery reads the datalake

InvertedIndexRecovery walks the entire directory tree rooted at the datalake path. For each file whose name ends with _body.txt, it:
  1. Extracts the numeric book ID from the filename.
  2. Resolves the corresponding {id}_header.txt in the same directory.
  3. Reads both files and saves the BookContent to Hazelcast via bookStore.save().
  4. Calls indexBookUseCase.execute(bookId) to tokenize and write posting-list entries into the inverted-index IMap.
  5. Tracks the maximum book ID seen across all files.
The maximum book ID is returned to ReindexingExecutor, which passes it to IngestionQueueManager.setupBookQueue() to populate the "books" IQueue starting from that ID. Books with IDs below the maximum are assumed to already be on disk and are skipped.
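A sketch of that walk using java.nio.file; bookStore.save() and indexBookUseCase.execute() are named above, but the stub types and the BookContent shape here are assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Minimal stand-ins for the real types named in these docs.
interface BookStore { void save(BookContent content); }
interface IndexBookUseCase { void execute(long bookId); }
record BookContent(long id, String header, String body) {}

public class InvertedIndexRecoverySketch {

    static long recover(Path datalakeRoot, BookStore bookStore,
                        IndexBookUseCase indexBookUseCase) throws IOException {
        List<Path> bodies;
        try (Stream<Path> walk = Files.walk(datalakeRoot)) {
            // Consider only body files; headers are resolved from them.
            bodies = walk.filter(p -> p.getFileName().toString().endsWith("_body.txt"))
                         .collect(Collectors.toList());
        }
        long maxId = 0;
        for (Path body : bodies) {
            // 1. Extract the numeric book ID from "{id}_body.txt".
            String name = body.getFileName().toString();
            long id = Long.parseLong(name.substring(0, name.indexOf('_')));
            // 2. Resolve "{id}_header.txt" in the same directory.
            Path headerPath = body.resolveSibling(id + "_header.txt");
            // 3. Save the combined content back to Hazelcast.
            bookStore.save(new BookContent(id, Files.readString(headerPath),
                                           Files.readString(body)));
            // 4. Tokenize and write posting-list entries for this book.
            indexBookUseCase.execute(id);
            // 5. Track the highest ID; it later seeds the "books" IQueue.
            maxId = Math.max(maxId, id);
        }
        return maxId;
    }
}
```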
