
Documentation Index

Fetch the complete documentation index at: https://mintlify.com/GuancheData/stage_3/llms.txt

Use this file to discover all available pages before exploring further.

GuancheData is built as three cooperating Java 17 microservices — ingestion, indexing, and search — that share state through a Hazelcast in-memory cluster and coordinate asynchronously over Apache ActiveMQ. An Nginx reverse proxy fronts the search tier. Every component is stateless at the process level; all shared data lives in the distributed Hazelcast cluster, so any node can fail and the remaining nodes continue serving requests without data loss.

Service topology

The cluster is divided into three logical tiers, each independently scalable:
| Tier | Service | HTTP port | Hazelcast port | Role |
| --- | --- | --- | --- | --- |
| Ingestion | ingestion-service | 7001 | 5701 | Downloads books, writes datalake, publishes events |
| Indexing | indexing-service | 7002 | 5702 | Consumes events, tokenizes text, updates inverted index |
| Search | search-service | 7003 | 5703 | Reads inverted index, executes queries, returns results |
| Load balancer | nginx | 8080 (→ 80) | – | Least-connections proxy across search instances |
| Message broker | activemq | 61616 (JMS) / 8161 (Web) | – | Async coordination between ingestion and indexing |
All services join the same Hazelcast cluster named SearchEngine. The ingestion-service acts as the seed node (default port 5701); other services point HZ_MEMBERS at this address to join.
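In Compose terms, the join wiring looks roughly like this. A sketch only: the HZ_MEMBERS variable and the port assignments come from the tables above, while the HZ_PORT variable name is an assumption about how per-service ports are configured.

```yaml
# Hypothetical docker-compose.yml excerpt: every service joins the
# "SearchEngine" cluster; non-seed services dial the seed node.
services:
  ingestion-service:
    environment:
      - HZ_PORT=5701               # seed node: others point at this address
  indexing-service:
    environment:
      - HZ_PORT=5702
      - HZ_MEMBERS=<NODE_IP>:5701  # replace <NODE_IP> with the seed host IP
  search-service:
    environment:
      - HZ_PORT=5703
      - HZ_MEMBERS=<NODE_IP>:5701
```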

Data flow

Step 1: Book IDs enter the work queue

Book IDs are pushed into the Hazelcast books IQueue, either via a direct HTTP POST /ingest/{book_id} call to the ingestion API or by the ingestion service’s internal periodic scheduler, which polls the queue every 100 ms.
Step 2: Ingestion service fetches from Project Gutenberg

The ingestion service dequeues a book ID, connects to Project Gutenberg, and downloads the full text. The content is written to both the local filesystem (/app/datalake) and the Hazelcast datalake IMap for in-memory access by indexers.
Step 3: Notification published to ActiveMQ

After a successful download, the ingestion service immediately publishes a {"bookId": N} JSON message to the documents.ingested queue on ActiveMQ. One message is sent per book — there is no batching. The INDEXING_BUFFER_FACTOR environment variable separately controls back-pressure: if the datalake grows beyond INDEXING_BUFFER_FACTOR × indexerCount entries, the periodic scheduler pauses before fetching the next book.
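The back-pressure rule reduces to a simple threshold check. A minimal sketch, assuming the scheduler can read the current datalake size and the number of live indexers; the method name here is hypothetical:

```java
// Hypothetical sketch of the INDEXING_BUFFER_FACTOR back-pressure check:
// pause fetching when the datalake holds more entries than the indexers
// can plausibly drain.
public class BackPressure {
    static boolean shouldPause(int datalakeSize, int bufferFactor, int indexerCount) {
        // Threshold: INDEXING_BUFFER_FACTOR × number of live indexer instances.
        return datalakeSize > bufferFactor * indexerCount;
    }

    public static void main(String[] args) {
        // With factor 10 and 2 indexers, the scheduler pauses past 20 entries.
        System.out.println(shouldPause(25, 10, 2)); // true
        System.out.println(shouldPause(15, 10, 2)); // false
    }
}
```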
Step 4: Indexing service consumes the message

One indexing service instance consumes the message from documents.ingested. It reads the book content from the datalake IMap, applies stop-word filtering and tokenization via TextTokenizer, and computes per-term frequencies using TermFrequencyAnalyzer.
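The tokenize-and-count step can be illustrated without the real TextTokenizer and TermFrequencyAnalyzer classes. A self-contained sketch, assuming lowercase word tokens and an illustrative stop-word list (the real list is not shown here):

```java
import java.util.*;
import java.util.regex.*;

public class TermFrequencyExample {
    // Illustrative stop words; the actual TextTokenizer list may differ.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of", "and");

    // Lowercase word tokens, stop words removed, term -> frequency.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        Matcher m = Pattern.compile("[a-z]+").matcher(text.toLowerCase());
        while (m.find()) {
            String term = m.group();
            if (!STOP_WORDS.contains(term)) freqs.merge(term, 1, Integer::sum);
        }
        return freqs;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("The modern Prometheus, a tale of the modern age"));
    }
}
```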
Step 5: Inverted index updated in Hazelcast

For each term, the indexing service writes a docId:frequency entry into the inverted-index IMap. An indexingRegistry ISet acts as an idempotency guard — if the book ID is already present, the indexing step is skipped. Book metadata (title, author, language, year) is written to the bookMetadata IMap.
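The posting format and the registry guard can be sketched with plain collections standing in for the Hazelcast IMap and ISet. A local illustration, not the distributed code:

```java
import java.util.*;

public class IndexWriterSketch {
    // Local stand-ins for the distributed inverted-index IMap and indexingRegistry ISet.
    final Map<String, Set<String>> invertedIndex = new HashMap<>();
    final Set<Integer> indexingRegistry = new HashSet<>();

    boolean indexBook(int bookId, Map<String, Integer> termFreqs) {
        // Idempotency guard: skip books that are already registered.
        if (!indexingRegistry.add(bookId)) return false;
        termFreqs.forEach((term, freq) ->
            // Each posting is stored as a "docId:frequency" string.
            invertedIndex.computeIfAbsent(term, t -> new HashSet<>())
                         .add(bookId + ":" + freq));
        return true;
    }

    public static void main(String[] args) {
        IndexWriterSketch w = new IndexWriterSketch();
        System.out.println(w.indexBook(84, Map.of("monster", 3))); // true: newly indexed
        System.out.println(w.indexBook(84, Map.of("monster", 3))); // false: guard skips repeat
        System.out.println(w.invertedIndex.get("monster"));        // [84:3]
    }
}
```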
Step 6: Search query resolved from the distributed index

A client sends a GET /search request to Nginx on port 8080. Nginx proxies it to one of the available search-service instances using least-connections routing. The search service reads inverted-index and bookMetadata from Hazelcast, intersects posting lists for all query terms (AND semantics), applies any metadata filters, and returns a ranked result list.
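The AND-semantics intersection over "docId:frequency" postings can be sketched with plain sets in place of the Hazelcast IMap; ranking and metadata filters are omitted here:

```java
import java.util.*;
import java.util.stream.*;

public class SearchSketch {
    // Intersect posting lists for all query terms (AND semantics),
    // returning the doc IDs that contain every term.
    static Set<Integer> search(Map<String, Set<String>> index, List<String> terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = index.getOrDefault(term, Set.of()).stream()
                    .map(p -> Integer.parseInt(p.split(":")[0])) // "docId:frequency" -> docId
                    .collect(Collectors.toSet());
            if (result == null) result = docs;
            else result.retainAll(docs);
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = Map.of(
                "monster", Set.of("84:3", "120:1"),
                "ship", Set.of("84:2"));
        System.out.println(search(index, List.of("monster", "ship"))); // [84]
    }
}
```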

Hazelcast distributed data structures

All Hazelcast data structures are configured with 2 synchronous backups and 1 asynchronous backup, providing redundancy against node failures. The cluster name is SearchEngine.
| Name | Type | Producer | Consumer | Purpose |
| --- | --- | --- | --- | --- |
| datalake | IMap&lt;Integer, BookContent&gt; | ingestion-service | indexing-service | Temporary in-memory storage of downloaded book content (header + body), keyed by book ID |
| inverted-index | IMap&lt;String, Set&lt;String&gt;&gt; | indexing-service | search-service | Term → set of "docId:frequency" strings; the core full-text search index |
| bookMetadata | IMap&lt;Integer, BookMetadata&gt; | indexing-service | search-service | Book metadata (title, author, language, year) for filtering and display |
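The backup policy above can be written as Hazelcast declarative configuration. A sketch of a hypothetical hazelcast.yaml fragment; the actual project may set this programmatically instead:

```yaml
hazelcast:
  cluster-name: SearchEngine
  map:
    # Applies the documented policy to every IMap (datalake,
    # inverted-index, bookMetadata): 2 sync + 1 async backup.
    default:
      backup-count: 2
      async-backup-count: 1
```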

ActiveMQ messaging channels

documents.ingested queue

- Type: point-to-point queue
- Producer: ingestion-service — publishes a text message containing the book ID after a successful download.
- Consumer: indexing-service — each message is consumed by exactly one indexer instance. On receipt, the indexer calls IndexBook.execute(documentId).
- Payload: JSON text message — {"bookId": N} (e.g., {"bookId": 84} for Frankenstein).
- Effect: triggers tokenization and inverted-index population for the downloaded book.
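The payload is small enough to extract with the standard library alone. A hedged sketch; the real consumer may well use a JSON library instead:

```java
import java.util.regex.*;

public class IngestedMessageParser {
    // Extract the book ID from a {"bookId": N} text message.
    static int parseBookId(String payload) {
        Matcher m = Pattern.compile("\"bookId\"\\s*:\\s*(\\d+)").matcher(payload);
        if (!m.find()) throw new IllegalArgumentException("no bookId in: " + payload);
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        System.out.println(parseBookId("{\"bookId\": 84}")); // 84
    }
}
```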
Ingestion control topic

- Type: publish/subscribe topic
- Producer: any service or operator that needs to pause or resume ingestion across the cluster.
- Consumer: ActiveMQIngestionControlConsumer in every ingestion-service instance.
- Payload: control command string (pause / resume).
- Effect: toggles the IngestionPauseController flag, which the periodic scheduler checks before dequeuing each book. Used during coordinated index rebuilds to halt new ingestion.
Index rebuild topic

- Type: publish/subscribe topic
- Producer: CoordinateRebuild in any indexing-service instance, triggered by POST /index/rebuild.
- Consumer: RebuildMessageListener in every indexing-service instance (each subscribes with a unique client ID).
- Payload: JSON rebuild command — {"epoch": <timestampMs>}.
- Effect: each indexing node waits 10 seconds for cluster sync, then calls ReindexingExecutor.rebuildIndex(), which clears all distributed state and re-indexes from the datalake filesystem. On completion, the node counts down the Hazelcast CP rebuild-latch.
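The final countdown step has the same await/countDown shape as a standard latch. A local sketch using java.util.concurrent in place of the distributed Hazelcast CP rebuild-latch:

```java
import java.util.concurrent.*;

public class RebuildLatchSketch {
    public static void main(String[] args) throws InterruptedException {
        int indexingNodes = 3;
        // Local stand-in for the Hazelcast CP "rebuild-latch":
        // one count per indexing node taking part in the rebuild.
        CountDownLatch rebuildLatch = new CountDownLatch(indexingNodes);

        ExecutorService pool = Executors.newFixedThreadPool(indexingNodes);
        for (int node = 0; node < indexingNodes; node++) {
            pool.submit(() -> {
                // ... ReindexingExecutor.rebuildIndex() would run here ...
                rebuildLatch.countDown(); // signal this node's rebuild is done
            });
        }
        // The coordinator unblocks only when every node has counted down.
        boolean allDone = rebuildLatch.await(5, TimeUnit.SECONDS);
        pool.shutdown();
        System.out.println(allDone);
    }
}
```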

Network port reference

| Service | Port | Protocol | Description |
| --- | --- | --- | --- |
| ingestion-service | 7001 | HTTP (Javalin) | Book ingestion API: trigger, status, list |
| indexing-service | 7002 | HTTP (Javalin) | Indexing API: index document, rebuild, health |
| search-service | 7003 | HTTP (Javalin) | Search API: query, health |
| nginx | 8080 | HTTP | Load-balanced entry point for all search traffic |

Nginx load balancer

Nginx listens on port 8080 and proxies /search and /health requests to the search_backend upstream group. The upstream uses least_conn scheduling — each request goes to the instance with the fewest active connections.
```nginx
upstream search_backend {
    least_conn;

    server <NODE_IP>:7003 max_fails=10 fail_timeout=30s;
    # server <NODE_IP>:7003 max_fails=10 fail_timeout=30s;

    keepalive 64;
}
```
A search-service instance is removed from rotation after 10 consecutive failures within a 30-second window and re-admitted automatically once it recovers. To add search replicas, append additional server lines to nginx.conf and reload Nginx.
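With a second replica added, the upstream group would look roughly like this. A sketch: <NODE_IP_2> is a placeholder for the new replica's host IP, to be replaced like the existing <NODE_IP> entries:

```nginx
upstream search_backend {
    least_conn;

    server <NODE_IP>:7003   max_fails=10 fail_timeout=30s;  # existing instance
    server <NODE_IP_2>:7003 max_fails=10 fail_timeout=30s;  # newly added replica

    keepalive 64;
}
```

After editing, reload the configuration with nginx -s reload (or restart the nginx container).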
Replace every <NODE_IP> placeholder in nginx.conf and docker-compose.yml with the actual host IP address of the machine running each service before starting the cluster. Leaving the xxx or <NODE_IP> placeholders unchanged will prevent services from joining the Hazelcast cluster and from connecting to the broker.

Docker Compose profiles

The deployment is split into three Docker Compose profiles so that different roles can be assigned to different physical machines:
| Profile | Services included | Typical node |
| --- | --- | --- |
| backend | ingestion-service, indexing-service, search-service | All compute nodes |
| broker | activemq | Dedicated broker node (or main node) |
| loadbalancer | nginx | Edge / main node |
| benchmark | benchmarking | Benchmarking node only |
```shell
# Main node: all roles
docker compose --profile backend --profile broker --profile loadbalancer up -d

# Additional compute node: backend only
docker compose --profile backend up -d
```
