GuancheData is built to stay operational when individual nodes fail. Every layer of the pipeline — in-memory data, on-disk documents, message delivery, and HTTP routing — has an independent redundancy mechanism. No single node failure causes data loss or requires manual intervention to resume service. This page explains what each mechanism does and what actually happens when each service type stops responding.
Hazelcast data replication
The three distributed IMaps that form the core of the cluster are each configured with synchronous and asynchronous backups in HazelcastConfig.java:
| IMap | Sync backups | Async backups |
|---|---|---|
| inverted-index | 2 | 1 |
| bookMetadata | 2 | 1 |
| datalake | 2 | 1 |
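As a sketch of what this policy looks like in code, the table above could be expressed with Hazelcast's programmatic configuration API. The class name and structure here are illustrative, not the actual contents of HazelcastConfig.java:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;

public class HazelcastConfigSketch {
    // Each core IMap keeps 2 synchronous and 1 asynchronous backup
    // copy on other cluster members, matching the table above.
    public static Config build() {
        Config config = new Config();
        for (String mapName : new String[] {"inverted-index", "bookMetadata", "datalake"}) {
            config.addMapConfig(new MapConfig(mapName)
                    .setBackupCount(2)         // sync: writes wait for these copies
                    .setAsyncBackupCount(1));  // async: copied in the background
        }
        return config;
    }
}
```

With this policy, three copies of every entry exist in total (one primary plus three backups across two sync and one async), so losing any single member never loses the last copy.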
Filesystem replication
In addition to in-memory replication, the ingestion service stores book files on the local filesystem under ./mnt/datalake/. The REPLICATION_FACTOR environment variable controls how many cluster nodes hold a local filesystem copy of each ingested document. Cross-node replication is coordinated by HazelcastReplicationExecuter and HazelcastDatalakeListener, which distribute file copies to the appropriate number of nodes after each ingestion event.
If a node holding filesystem copies fails, the surviving copies on other nodes remain available for the InvertedIndexRecovery process to read during a startup or manual rebuild.
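The source does not show how HazelcastReplicationExecuter picks target nodes; the following is a hypothetical, self-contained sketch of one way REPLICATION_FACTOR copies could be assigned deterministically across cluster members:

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicaPlacement {
    // Hypothetical sketch: choose REPLICATION_FACTOR members to hold a
    // filesystem copy of a book. Starting from a hash of the book ID keeps
    // placement deterministic and spreads load across the cluster.
    public static List<String> replicaNodes(String bookId, List<String> members, int replicationFactor) {
        int copies = Math.min(replicationFactor, members.size());
        int start = Math.floorMod(bookId.hashCode(), members.size());
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < copies; i++) {
            targets.add(members.get((start + i) % members.size()));
        }
        return targets;
    }

    public static void main(String[] args) {
        List<String> members = List.of("node-a", "node-b", "node-c", "node-d");
        System.out.println(replicaNodes("book-1342", members, 2).size()); // prints 2
    }
}
```

Clamping to the cluster size matters: with REPLICATION_FACTOR=3 and only two members, each book simply gets two copies rather than failing.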
ActiveMQ message delivery guarantees
The two destinations — one queue and one topic — use different durability strategies:

documents.ingested queue — Indexing consumers use CLIENT_ACKNOWLEDGE mode. The broker holds the message until the indexing node explicitly acknowledges it. If the node crashes after receiving the message but before completing indexing, the broker re-delivers the message to another available indexing node. No indexed document is silently lost due to a mid-processing failure.
ingestion.control topic — Ingestion pause and resume commands are sent as durable topic messages. Durable subscribers retain messages while they are offline, so a node that restarts after an INGESTION_PAUSE event was broadcast will still receive the command and apply the correct state when it reconnects.
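A consumer using CLIENT_ACKNOWLEDGE can be sketched as follows. The broker URL and the indexing step are illustrative, and this uses the classic ActiveMQ JMS client rather than the project's actual consumer code:

```java
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class IndexingConsumerSketch {
    public static void main(String[] args) throws JMSException {
        // Illustrative broker URL; the real deployment supplies its own.
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://broker:61616");
        Connection connection = factory.createConnection();
        connection.start();
        // CLIENT_ACKNOWLEDGE: the broker retains the message until we ack it.
        Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(session.createQueue("documents.ingested"));
        while (true) {
            Message message = consumer.receive();
            indexDocument(message);  // hypothetical indexing step
            message.acknowledge();   // only now is the message removed from the broker;
                                     // a crash before this line triggers redelivery
        }
    }

    private static void indexDocument(Message message) { /* ... */ }
}
```

The key property is the ordering: the acknowledgment happens strictly after indexing completes, so a crash anywhere in between leaves the message on the broker for another node.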
Nginx failover
The Nginx load balancer in front of the search tier is configured with passive health checking: a backend that accumulates max_fails failed connection attempts within fail_timeout is temporarily removed from rotation, and once fail_timeout expires, Nginx probes it again. No configuration change or restart is needed — traffic silently shifts to the remaining healthy backends.
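In nginx.conf, these parameters live on the server lines of the upstream block. A sketch with illustrative backend names, using the max_fails=10 and fail_timeout=30s values this deployment uses:

```nginx
upstream search_backends {
    # Passive health checks: after 10 failed attempts within 30s,
    # the server is skipped for the next 30s, then probed again.
    server search-node-1:8080 max_fails=10 fail_timeout=30s;
    server search-node-2:8080 max_fails=10 fail_timeout=30s;
    server search-node-3:8080 max_fails=10 fail_timeout=30s;
}
```

Note that fail_timeout plays two roles in Nginx: it is both the window in which failures are counted and the duration the server is considered unavailable afterward.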
What happens when each service fails
Ingestion node fails
In-memory datalake entries and book metadata held by that node are covered by the 2 sync + 1 async backup policy, so no IMap data is lost. Hazelcast rebalances affected partitions to surviving members within seconds.

Any documents that were being crawled at the moment of failure are not automatically retried — the ingestion service pulls book IDs from the "books" IQueue, and a consumed ID is removed from the queue. If the node crashed before writing the file and publishing to documents.ingested, that book ID is dropped. To recover those IDs, trigger a manual index rebuild, which resets queueInitialized and re-populates the queue from the highest recovered book ID.

Because multiple ingestion nodes can run in parallel, surviving nodes continue crawling without interruption. Throughput drops proportionally to the number of nodes lost.
Indexing node fails
Any documents.ingested message that was received but not yet acknowledged is re-delivered by the broker to another indexing node (due to CLIENT_ACKNOWLEDGE mode). The document will be indexed by a different node — no token data is lost.

The Hazelcast "inverted-index", "bookMetadata", and "datalake" IMaps are unaffected because they are replicated. Hazelcast promotes backup partitions and rebalances.

The INDEXING_BUFFER_FACTOR back-pressure threshold drops because there is now one fewer node with the role=indexer attribute. If ingestion was not already paused, it may now pause sooner. Restore full capacity by starting a replacement indexing node.

If a rebuild is in progress when the node fails, the CP CountDownLatch will never reach zero (the failed node cannot count down). The coordinator will wait up to 1 hour before logging a timeout error. Ingestion will remain paused for that duration unless manually resumed.
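The stuck-latch behavior is easy to illustrate with java.util.concurrent, whose CountDownLatch has the same timed-await semantics as Hazelcast's CP implementation. This is an analogy with made-up numbers, not the coordinator's actual code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class RebuildLatchSketch {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(3); // 3 indexing nodes expected
        latch.countDown(); // node 1 finishes its rebuild shard
        latch.countDown(); // node 2 finishes; node 3 crashed and never counts down
        // The coordinator's timeout is 1 hour; 100 ms keeps the demo fast.
        boolean completed = latch.await(100, TimeUnit.MILLISECONDS);
        System.out.println(completed); // prints false: treated as a rebuild timeout
    }
}
```

Because await returns false rather than throwing, the coordinator can log the timeout and move on, but ingestion stays paused until someone resumes it.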
Search node fails
Nginx detects the failure passively after max_fails=10 connection errors within fail_timeout=30s. Until that threshold is reached, a small number of requests may receive errors. After the threshold, Nginx stops routing to the failed node and distributes all traffic across surviving search nodes.

Search nodes hold no writable state — they read from the shared inverted-index IMap. A failed search node has no impact on the accuracy or consistency of search results served by other nodes. When the node recovers, Nginx automatically re-adds it to the rotation after the fail_timeout window.
ActiveMQ broker fails
This is the most impactful single-node failure. The broker is a central component: ingestion nodes cannot publish new indexing events, and ingestion control (pause/resume) commands cannot be delivered.

Indexing nodes that already have unconsumed messages in their local consumer buffer will continue processing until that buffer is exhausted. Once it empties, indexing stops.

The broker is deployed with restart: always in docker-compose.yml, so Docker will attempt to restart it automatically. After the broker recovers, producers and consumers re-establish their JMS connections and processing resumes. Durable subscribers on ingestion.control receive any control messages that were sent while they were disconnected.

For production workloads, consider running ActiveMQ in a highly-available configuration (e.g., using a shared store or network of brokers) to eliminate this as a single point of failure.
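The restart policy is a single line in the broker's service definition. A sketch, with an illustrative service and image name rather than the project's actual compose file:

```yaml
services:
  activemq:
    image: apache/activemq-classic  # illustrative image name
    restart: always                 # Docker restarts the broker after any exit
```

restart: always also restarts the container after a host reboot, which is what lets the cluster recover from a broker outage without operator action.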