GuancheData is built to stay operational when individual nodes fail. Every layer of the pipeline — in-memory data, on-disk documents, message delivery, and HTTP routing — has an independent redundancy mechanism. No single node failure causes data loss or requires manual intervention to resume service. This page explains what each mechanism does and what actually happens when each service type stops responding.
Hazelcast data replication
The three distributed IMaps that form the core of the cluster are each configured with synchronous and asynchronous backups in HazelcastConfig.java:
| IMap | Sync backups | Async backups |
|---|---|---|
| inverted-index | 2 | 1 |
| bookMetadata | 2 | 1 |
| datalake | 2 | 1 |
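As a sketch of what this policy looks like in code, the table above could be expressed with Hazelcast's programmatic configuration API. The class name and structure here are illustrative, not the actual contents of HazelcastConfig.java:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;

public class HazelcastConfigSketch {
    // Each core IMap keeps 2 synchronous and 1 asynchronous backup
    // copy on other cluster members, matching the table above.
    public static Config build() {
        Config config = new Config();
        for (String mapName : new String[] {"inverted-index", "bookMetadata", "datalake"}) {
            config.addMapConfig(new MapConfig(mapName)
                    .setBackupCount(2)         // sync: writes wait for these copies
                    .setAsyncBackupCount(1));  // async: copied in the background
        }
        return config;
    }
}
```

With this policy, three copies of every entry exist in total (one primary plus three backups across two sync and one async), so losing any single member never loses the last copy.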
Filesystem replication
In addition to in-memory replication, the ingestion service stores book files on the local filesystem under ./mnt/datalake/. The REPLICATION_FACTOR environment variable controls how many cluster nodes hold a local filesystem copy of each ingested document. Cross-node replication is coordinated by HazelcastReplicationExecuter and HazelcastDatalakeListener, which distribute file copies to the appropriate number of nodes after each ingestion event.
If a node holding filesystem copies fails, the surviving copies on other nodes remain available for the InvertedIndexRecovery process to read during a startup or manual rebuild.
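The source does not show how HazelcastReplicationExecuter picks target nodes; the following is a hypothetical, self-contained sketch of one way REPLICATION_FACTOR copies could be assigned deterministically across cluster members:

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicaPlacement {
    // Hypothetical sketch: choose REPLICATION_FACTOR members to hold a
    // filesystem copy of a book. Starting from a hash of the book ID keeps
    // placement deterministic and spreads load across the cluster.
    public static List<String> replicaNodes(String bookId, List<String> members, int replicationFactor) {
        int copies = Math.min(replicationFactor, members.size());
        int start = Math.floorMod(bookId.hashCode(), members.size());
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < copies; i++) {
            targets.add(members.get((start + i) % members.size()));
        }
        return targets;
    }

    public static void main(String[] args) {
        List<String> members = List.of("node-a", "node-b", "node-c", "node-d");
        System.out.println(replicaNodes("book-1342", members, 2).size()); // prints 2
    }
}
```

Clamping to the cluster size matters: with REPLICATION_FACTOR=3 and only two members, each book simply gets two copies rather than failing.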
ActiveMQ message delivery guarantees
The two destinations — one queue and one topic — use different durability strategies:

documents.ingested queue — Indexing consumers use CLIENT_ACKNOWLEDGE mode. The broker holds the message until the indexing node explicitly acknowledges it. If the node crashes after receiving the message but before completing indexing, the broker re-delivers the message to another available indexing node. No indexed document is silently lost due to a mid-processing failure.
ingestion.control topic — Ingestion pause and resume commands are sent as durable topic messages. Durable subscribers retain messages while they are offline, so a node that restarts after an INGESTION_PAUSE event was broadcast will still receive the command and apply the correct state when it reconnects.
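A consumer using CLIENT_ACKNOWLEDGE can be sketched as follows. The broker URL and the indexing step are illustrative, and this uses the classic ActiveMQ JMS client rather than the project's actual consumer code:

```java
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class IndexingConsumerSketch {
    public static void main(String[] args) throws JMSException {
        // Illustrative broker URL; the real deployment supplies its own.
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://broker:61616");
        Connection connection = factory.createConnection();
        connection.start();
        // CLIENT_ACKNOWLEDGE: the broker retains the message until we ack it.
        Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(session.createQueue("documents.ingested"));
        while (true) {
            Message message = consumer.receive();
            indexDocument(message);  // hypothetical indexing step
            message.acknowledge();   // only now is the message removed from the broker;
                                     // a crash before this line triggers redelivery
        }
    }

    private static void indexDocument(Message message) { /* ... */ }
}
```

The key property is the ordering: the acknowledgment happens strictly after indexing completes, so a crash anywhere in between leaves the message on the broker for another node.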
Nginx failover
The Nginx load balancer in front of the search tier is configured with passive health checking: a backend that accumulates max_fails failed connection attempts within fail_timeout is temporarily removed from rotation, and once fail_timeout expires, Nginx probes it again. No configuration change or restart is needed — traffic silently shifts to the remaining healthy backends.
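In nginx.conf, these parameters live on the server lines of the upstream block. A sketch with illustrative backend names, using the max_fails=10 and fail_timeout=30s values this deployment uses:

```nginx
upstream search_backends {
    # Passive health checks: after 10 failed attempts within 30s,
    # the server is skipped for the next 30s, then probed again.
    server search-node-1:8080 max_fails=10 fail_timeout=30s;
    server search-node-2:8080 max_fails=10 fail_timeout=30s;
    server search-node-3:8080 max_fails=10 fail_timeout=30s;
}
```

Note that fail_timeout plays two roles in Nginx: it is both the window in which failures are counted and the duration the server is considered unavailable afterward.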
What happens when each service fails
Ingestion node fails
In-memory datalake entries and book metadata held by that node are covered by the 2 sync + 1 async backup policy, so no IMap data is lost. Hazelcast rebalances affected partitions to surviving members within seconds.

Any documents that were being crawled at the moment of failure are not automatically retried — the ingestion service pulls book IDs from the "books" IQueue, and a consumed ID is removed from the queue. If the node crashed before writing the file and publishing to documents.ingested, that book ID is dropped. To recover those IDs, trigger a manual index rebuild, which resets queueInitialized and re-populates the queue from the highest recovered book ID.

Because multiple ingestion nodes can run in parallel, surviving nodes continue crawling without interruption. Throughput drops proportionally to the number of nodes lost.
Indexing node fails
Any documents.ingested message that was received but not yet acknowledged is re-delivered by the broker to another indexing node (due to CLIENT_ACKNOWLEDGE mode). The document will be indexed by a different node — no token data is lost.

The Hazelcast "inverted-index", "bookMetadata", and "datalake" IMaps are unaffected because they are replicated. Hazelcast promotes backup partitions and rebalances.

The INDEXING_BUFFER_FACTOR back-pressure threshold drops because there is now one fewer node with the role=indexer attribute. If ingestion was not already paused, it may now pause sooner. Restore full capacity by starting a replacement indexing node.

If a rebuild is in progress when the node fails, the CP CountDownLatch will never reach zero (the failed node cannot count down). The coordinator will wait up to 1 hour before logging a timeout error. Ingestion will remain paused for that duration unless manually resumed.
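The stuck-latch behavior is easy to illustrate with java.util.concurrent, whose CountDownLatch has the same timed-await semantics as Hazelcast's CP implementation. This is an analogy with made-up numbers, not the coordinator's actual code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class RebuildLatchSketch {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(3); // 3 indexing nodes expected
        latch.countDown(); // node 1 finishes its rebuild shard
        latch.countDown(); // node 2 finishes; node 3 crashed and never counts down
        // The coordinator's timeout is 1 hour; 100 ms keeps the demo fast.
        boolean completed = latch.await(100, TimeUnit.MILLISECONDS);
        System.out.println(completed); // prints false: treated as a rebuild timeout
    }
}
```

Because await returns false rather than throwing, the coordinator can log the timeout and move on, but ingestion stays paused until someone resumes it.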
Search node fails
Nginx detects the failure passively after max_fails=10 connection errors within fail_timeout=30s. Until that threshold is reached, a small number of requests may receive errors. After the threshold, Nginx stops routing to the failed node and distributes all traffic across surviving search nodes.

Search nodes hold no writable state — they read from the shared inverted-index IMap. A failed search node has no impact on the accuracy or consistency of search results served by other nodes. When the node recovers, Nginx automatically re-adds it to the rotation after the fail_timeout window.
ActiveMQ broker fails
This is the most impactful single-node failure. The broker is a central component: ingestion nodes cannot publish new indexing events, and ingestion control (pause/resume) commands cannot be delivered.

Indexing nodes that already have unconsumed messages in their local consumer buffer will continue processing until that buffer is exhausted. Once it empties, indexing stops.

The broker is deployed with restart: always in docker-compose.yml, so Docker will attempt to restart it automatically. After the broker recovers, producers and consumers re-establish their JMS connections and processing resumes. Durable subscribers on ingestion.control receive any control messages that were sent while they were disconnected.

For production workloads, consider running ActiveMQ in a highly-available configuration (e.g., using a shared store or network of brokers) to eliminate this as a single point of failure.
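The restart policy is a single line in the broker's service definition. A sketch, with an illustrative service and image name rather than the project's actual compose file:

```yaml
services:
  activemq:
    image: apache/activemq-classic  # illustrative image name
    restart: always                 # Docker restarts the broker after any exit
```

restart: always also restarts the container after a host reboot, which is what lets the cluster recover from a broker outage without operator action.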