

GuancheData is designed to scale horizontally at every layer of the pipeline. Ingestion, indexing, and search nodes are independent Docker Compose services, and each can be scaled out without changing the rest of the cluster. New nodes discover the Hazelcast cluster through a configured seed address, automatically receive their share of the distributed IMaps (inverted-index, bookMetadata, datalake), and begin serving traffic immediately. Scaling any tier requires updating one environment variable and, for search nodes, one additional line in nginx.conf.

How Hazelcast cluster membership works

Every service node (ingestion, indexing, search, and benchmark) joins the same Hazelcast cluster named SearchEngine. Cluster formation uses TCP/IP discovery — no multicast. Each node’s HZ_MEMBERS environment variable points to the seed node, and from there Hazelcast builds a full member list automatically. The three distributed data structures shared across all nodes are:
IMap             Purpose
inverted-index   Token → book-ID posting lists
bookMetadata     Book metadata records
datalake         In-memory datalake references
All three are sharded and replicated automatically. Adding nodes causes Hazelcast to rebalance partitions so the new member holds its share of data.
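
For orientation, here is a minimal Java sketch of how a node that has already joined the SearchEngine cluster can obtain the three shared maps. The map names come from the table above; the class and method names are illustrative only, not the project's actual code:
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public final class SharedMapsSketch {
    // hz is a member of the SearchEngine cluster (see the bootstrap sketch in step 2 below).
    static void inspectSharedMaps(HazelcastInstance hz) {
        IMap<String, Object> invertedIndex = hz.getMap("inverted-index"); // token -> book-ID posting lists
        IMap<String, Object> bookMetadata  = hz.getMap("bookMetadata");   // book metadata records
        IMap<String, Object> datalake      = hz.getMap("datalake");       // in-memory datalake references

        // Partition rebalancing is transparent: the same getMap() calls work on any member.
        System.out.printf("tokens=%d, books=%d, datalake entries=%d%n",
                invertedIndex.size(), bookMetadata.size(), datalake.size());
    }
}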

INDEXING_BUFFER_FACTOR and back-pressure

The ingestion service uses INDEXING_BUFFER_FACTOR to avoid flooding indexing nodes with more work than they can handle. Before emitting a new batch of indexing events, the ingestion service checks:
datalakeSize < INDEXING_BUFFER_FACTOR × indexerCount
If the condition is false, ingestion pauses until indexers catch up. This means that adding more indexing nodes directly increases the buffer threshold — a cluster with four indexers (and INDEXING_BUFFER_FACTOR=2) tolerates a datalake of up to eight concurrent batches before pausing, compared to four for a two-node cluster. The default value in docker-compose.yml is 2. Increase it to allow deeper queuing at the cost of higher memory pressure.
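
As an illustration, the check could look roughly like the Java sketch below. It relies on the role=indexer member attribute described later; the method name and map typing are assumptions rather than the actual ingestion code:
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public final class BackPressureSketch {
    // Blocks until datalakeSize < INDEXING_BUFFER_FACTOR * indexerCount.
    static void awaitIndexerCapacity(HazelcastInstance hz, int bufferFactor) throws InterruptedException {
        IMap<String, Object> datalake = hz.getMap("datalake");
        while (true) {
            long indexerCount = hz.getCluster().getMembers().stream()
                    .filter(m -> "indexer".equals(m.getAttribute("role")))
                    .count();
            if (datalake.size() < bufferFactor * Math.max(indexerCount, 1)) {
                return; // room in the buffer: safe to emit the next indexing batch
            }
            Thread.sleep(500); // back off while indexers drain the datalake
        }
    }
}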

Scale-out procedure

1. Update environment variables for the new node

On the machine that will host the new node, edit docker-compose.yml and replace every xxx placeholder (shown below as <...> values) with the actual IP addresses:
ingestion-service:
  environment:
    HZ_PORT: "5701"
    HZ_PUBLIC_ADDRESS: <THIS_NODE_IP>:5701
    HZ_MEMBERS: <SEED_NODE_IP>:5701
    HAZELCAST_CLUSTER_NAME: SearchEngine
    BROKER_URL: tcp://<BROKER_IP>:61616
    REPLICATION_FACTOR: 2
    INDEXING_BUFFER_FACTOR: 2
Repeat the same substitution for indexing-service and search-service if you are starting all three on this node.
2. Start backend services on the new node

Run the backend profile. This starts the ingestion, indexing, and search services defined in docker-compose.yml:
docker compose --profile backend up -d
Each container will start, bind its Hazelcast port, and attempt to reach the seed member at HZ_MEMBERS. Once contact is established the node joins the cluster and partition rebalancing begins.
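
docker-compose.yml only supplies the variables; how each service turns them into a Hazelcast member is internal to the service code. A plausible sketch of that mapping, assuming a straightforward translation of HZ_PORT, HZ_PUBLIC_ADDRESS, HZ_MEMBERS, and HAZELCAST_CLUSTER_NAME into Hazelcast's network configuration, looks like this:
import com.hazelcast.config.Config;
import com.hazelcast.config.NetworkConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class NodeBootstrapSketch {
    public static void main(String[] args) {
        Config config = new Config();
        config.setClusterName(System.getenv().getOrDefault("HAZELCAST_CLUSTER_NAME", "SearchEngine"));

        NetworkConfig net = config.getNetworkConfig();
        net.setPort(Integer.parseInt(System.getenv().getOrDefault("HZ_PORT", "5701")));
        net.setPublicAddress(System.getenv("HZ_PUBLIC_ADDRESS"));   // <THIS_NODE_IP>:5701

        // TCP/IP discovery only: no multicast, seed address taken from HZ_MEMBERS.
        net.getJoin().getMulticastConfig().setEnabled(false);
        net.getJoin().getTcpIpConfig().setEnabled(true)
           .addMember(System.getenv("HZ_MEMBERS"));                 // <SEED_NODE_IP>:5701

        // Joining the cluster triggers partition rebalancing across all members.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        System.out.println("Cluster size: " + hz.getCluster().getMembers().size());
    }
}
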
3. Verify cluster membership

Check the logs of any existing node to confirm the new member has joined:
docker logs indexing-service | grep "Members"
You should see the new node’s address in the member list.
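
If you prefer a programmatic check over grepping logs, a membership listener registered on any member reports joins and departures; this is an illustrative sketch, not part of the shipped services:
import com.hazelcast.cluster.MembershipEvent;
import com.hazelcast.cluster.MembershipListener;
import com.hazelcast.core.HazelcastInstance;

public final class MembershipWatchSketch {
    // Prints the same information as the "Members" lines in the container logs.
    static void watchMembership(HazelcastInstance hz) {
        hz.getCluster().addMembershipListener(new MembershipListener() {
            @Override public void memberAdded(MembershipEvent event) {
                System.out.println("Member joined: " + event.getMember().getAddress());
            }
            @Override public void memberRemoved(MembershipEvent event) {
                System.out.println("Member left:   " + event.getMember().getAddress());
            }
        });
    }
}
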
4. Add the new search node to nginx.conf

If you started a search service on the new node, add it to the Nginx upstream block:
upstream search_backend {
    least_conn;

    server <EXISTING_NODE_IP>:7003 max_fails=10 fail_timeout=30s;
    server <NEW_NODE_IP>:7003     max_fails=10 fail_timeout=30s;

    keepalive 64;
}
Reload Nginx to apply the change:
docker exec nginx nginx -s reload
Every new node must have HZ_MEMBERS set to the IP of an existing, reachable cluster member. If the seed node changes (for example because the original seed was shut down), update HZ_MEMBERS on all nodes before restarting them. Hazelcast does not use dynamic DNS — the seed address must be a stable IP or hostname.

Scaling each service type independently

You are not required to run all three services on every node; each service type can be scaled on its own.
Ingestion nodes: add nodes to increase parallel crawling throughput. Each ingestion node is independently capped by INDEXING_BUFFER_FACTOR × indexerCount, so adding ingestion nodes without adding indexing nodes yields diminishing returns once the buffer is saturated.
Indexing nodes: adding indexers directly raises the back-pressure threshold and speeds up token processing. The indexing service automatically receives the role=indexer member attribute, which lets CoordinateRebuild count active indexers when sizing the CP CountDownLatch during a full index rebuild (a sketch follows below).
Search nodes: search nodes read from the shared inverted-index IMap and are stateless with respect to write operations. Any number can be added; simply register each one in nginx.conf. Nginx distributes requests using least_conn and skips nodes that exceed max_fails=10 within a fail_timeout=30s window.
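
For the rebuild coordination just mentioned, the following is a hedged sketch of how active indexers could be counted to size the CP CountDownLatch. Only the role=indexer attribute and the use of a CP CountDownLatch come from the text; the latch name and method are hypothetical:
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.ICountDownLatch;

public final class RebuildLatchSketch {
    // Sizes a CP CountDownLatch to the number of live indexer members.
    static ICountDownLatch prepareRebuildLatch(HazelcastInstance hz) {
        int indexerCount = (int) hz.getCluster().getMembers().stream()
                .filter(m -> "indexer".equals(m.getAttribute("role")))
                .count();

        // "index-rebuild" is a hypothetical latch name used here for illustration.
        ICountDownLatch latch = hz.getCPSubsystem().getCountDownLatch("index-rebuild");
        latch.trySetCount(indexerCount); // each indexer counts down when its share is rebuilt
        return latch;
    }
}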
