GuancheData is designed to scale horizontally at each layer of the pipeline. Ingestion, indexing, and search nodes are independent Docker Compose services that can each be multiplied without changing the rest of the cluster. New nodes discover the Hazelcast cluster through a configured seed address, automatically receive their share of the distributed IMaps (`inverted-index`, `bookMetadata`, `datalake`), and begin serving traffic immediately. Scaling any tier requires updating one environment variable and, in the case of search nodes, one line in `nginx.conf`.
## How Hazelcast cluster membership works
Every service node (ingestion, indexing, search, and benchmark) joins the same Hazelcast cluster named `SearchEngine`. Cluster formation uses TCP/IP discovery — no multicast. Each node's `HZ_MEMBERS` environment variable points to the seed node, and from there Hazelcast builds a full member list automatically.
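The same discovery settings can be expressed as a declarative Hazelcast configuration. The fragment below is a sketch rather than the project's actual file, and the address `10.0.0.10` is a hypothetical value standing in for whatever `HZ_MEMBERS` contains:

```yaml
hazelcast:
  cluster-name: SearchEngine
  network:
    join:
      multicast:
        enabled: false      # no multicast, matching the TCP/IP discovery setup
      tcp-ip:
        enabled: true
        member-list:
          - 10.0.0.10       # hypothetical seed address (the value of HZ_MEMBERS)
```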
The three distributed data structures shared across all nodes are:
| IMap | Purpose |
|---|---|
| `inverted-index` | Token → book-ID posting lists |
| `bookMetadata` | Book metadata records |
| `datalake` | In-memory datalake references |
## `INDEXING_BUFFER_FACTOR` and back-pressure
The ingestion service uses `INDEXING_BUFFER_FACTOR` to avoid flooding indexing nodes with more work than they can handle. Before emitting a new batch of indexing events, the ingestion service checks that the number of in-flight batches is below `INDEXING_BUFFER_FACTOR × indexerCount`. A four-indexer cluster (with `INDEXING_BUFFER_FACTOR=2`) tolerates a datalake of up to eight concurrent batches before pausing, compared to four for a two-node cluster.
The default value in `docker-compose.yml` is 2. Increase it to allow deeper queuing at the cost of higher memory pressure.
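The check described above can be sketched in a few lines of Java. `canEmit`, `pendingBatches`, and `indexerCount` are illustrative names rather than the service's actual identifiers; only the formula itself comes from the description:

```java
public class BackPressure {
    // Hypothetical helper mirroring the described check: emit a new batch only
    // while fewer than INDEXING_BUFFER_FACTOR × indexerCount batches are in flight.
    static boolean canEmit(int pendingBatches, int bufferFactor, int indexerCount) {
        return pendingBatches < bufferFactor * indexerCount;
    }

    public static void main(String[] args) {
        // Four indexers, factor 2: the eighth in-flight batch saturates the buffer.
        System.out.println(canEmit(7, 2, 4)); // true  -> keep emitting
        System.out.println(canEmit(8, 2, 4)); // false -> pause ingestion
    }
}
```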
## Scale-out procedure

1. **Update environment variables for the new node.** On the machine that will host the new node, edit `docker-compose.yml` and replace every `xxx` placeholder with the actual IP addresses. Repeat the same substitution for `indexing-service` and `search-service` if you are starting all three on this node.
2. **Start backend services on the new node.** Run the backend profile to start the ingestion, indexing, and search services defined in `docker-compose.yml`. Each container will start, bind its Hazelcast port, and attempt to reach the seed member at `HZ_MEMBERS`. Once contact is established the node joins the cluster and partition rebalancing begins.
3. **Verify cluster membership.** Check the logs of any existing node to confirm the new member has joined; you should see the new node's address in the member list.
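The `HZ_MEMBERS` substitution can be sketched as a `docker-compose.yml` fragment. The layout and the address `10.0.0.10` are assumptions standing in for this deployment's real values:

```yaml
services:
  ingestion-service:
    environment:
      # Replace the xxx placeholder with the IP of an existing cluster member.
      HZ_MEMBERS: "10.0.0.10"
```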
Every new node must have `HZ_MEMBERS` set to the IP of an existing, reachable cluster member. If the seed node changes (for example because the original seed was shut down), update `HZ_MEMBERS` on all nodes before restarting them. Hazelcast does not use dynamic DNS — the seed address must be a stable IP or hostname.

## Scaling each service type independently
You are not required to run all three services on every node. Each service type can be scaled on its own:

- **Ingestion nodes** — Add nodes to increase parallel crawling throughput. Each ingestion node is independently capped by `INDEXING_BUFFER_FACTOR × indexerCount`, so adding ingestion nodes without adding indexing nodes provides diminishing returns once the buffer is saturated.
- **Indexing nodes** — Adding indexers directly raises the back-pressure threshold and speeds up token processing. The indexing service automatically receives the `role=indexer` member attribute, which lets `CoordinateRebuild` count active indexers when sizing the CP `CountDownLatch` during a full index rebuild.
- **Search nodes** — Search nodes read from the shared `inverted-index` IMap and are stateless with respect to write operations. Any number can be added; simply register each one in `nginx.conf`. Nginx distributes requests using `least_conn` and skips nodes that exceed `max_fails=10` within a `fail_timeout=30s` window.
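The rebuild sizing mentioned for indexing nodes boils down to counting `role=indexer` members. A minimal sketch of that counting logic, assuming member attributes are available as plain string maps (the real service would read them from Hazelcast's member API):

```java
import java.util.List;
import java.util.Map;

public class RebuildSizing {
    // Count cluster members tagged role=indexer — the figure CoordinateRebuild
    // is described as using to size the CP CountDownLatch for a full rebuild.
    static int countIndexers(List<Map<String, String>> memberAttributes) {
        return (int) memberAttributes.stream()
                .filter(attrs -> "indexer".equals(attrs.get("role")))
                .count();
    }

    public static void main(String[] args) {
        List<Map<String, String>> members = List.of(
                Map.of("role", "indexer"),
                Map.of("role", "indexer"),
                Map.of("role", "search"));
        System.out.println(countIndexers(members)); // two indexers -> latch count 2
    }
}
```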
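Registering a new search node can be sketched as an nginx upstream block. The upstream name, addresses, and port below are hypothetical; `least_conn`, `max_fails=10`, and `fail_timeout=30s` come from the description above:

```nginx
upstream search_backend {
    least_conn;                                           # route to the least-busy node
    server 10.0.0.11:8080 max_fails=10 fail_timeout=30s;
    server 10.0.0.12:8080 max_fails=10 fail_timeout=30s;  # newly added search node
}
```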