
Documentation Index

Fetch the complete documentation index at: https://mintlify.com/GuancheData/stage_3/llms.txt

Use this file to discover all available pages before exploring further.

GuancheData is built as three cooperating Java 17 microservices — ingestion, indexing, and search — that share state through a Hazelcast in-memory cluster and coordinate asynchronously over Apache ActiveMQ. An Nginx reverse proxy fronts the search tier. Every component is stateless at the process level; all shared data lives in the distributed Hazelcast cluster, so any node can fail and the remaining nodes continue serving requests without data loss.

Service topology

The cluster is divided into three logical tiers, each independently scalable:
| Tier | Service | HTTP port | Hazelcast port | Role |
| --- | --- | --- | --- | --- |
| Ingestion | ingestion-service | 7001 | 5701 | Downloads books, writes datalake, publishes events |
| Indexing | indexing-service | 7002 | 5702 | Consumes events, tokenizes text, updates inverted index |
| Search | search-service | 7003 | 5703 | Reads inverted index, executes queries, returns results |
| Load balancer | nginx | 8080 (→ 80) | – | Least-connections proxy across search instances |
| Message broker | activemq | 61616 (JMS) / 8161 (Web) | – | Async coordination between ingestion and indexing |
All services join the same Hazelcast cluster named SearchEngine. The ingestion-service acts as the seed node (default port 5701); other services point HZ_MEMBERS at this address to join.
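In Compose terms, the join wiring looks roughly like this. A sketch only: the HZ_MEMBERS variable and the port assignments come from the tables above, while the HZ_PORT variable name is an assumption about how per-service ports are configured.

```yaml
# Hypothetical docker-compose.yml excerpt: every service joins the
# "SearchEngine" cluster; non-seed services dial the seed node.
services:
  ingestion-service:
    environment:
      - HZ_PORT=5701               # seed node: others point at this address
  indexing-service:
    environment:
      - HZ_PORT=5702
      - HZ_MEMBERS=<NODE_IP>:5701  # replace <NODE_IP> with the seed host IP
  search-service:
    environment:
      - HZ_PORT=5703
      - HZ_MEMBERS=<NODE_IP>:5701
```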

Data flow

Step 1: Book IDs enter the work queue

Book IDs are pushed into the Hazelcast books IQueue, either via a direct HTTP POST /ingest/{book_id} call to the ingestion API or by the ingestion service’s internal periodic scheduler, which polls the queue every 100 ms.
Step 2: Ingestion service fetches from Project Gutenberg

The ingestion service dequeues a book ID, connects to Project Gutenberg, and downloads the full text. The content is written to both the local filesystem (/app/datalake) and the Hazelcast datalake IMap for in-memory access by indexers.
Step 3: Notification published to ActiveMQ

After a successful download, the ingestion service immediately publishes a {"bookId": N} JSON message to the documents.ingested queue on ActiveMQ. One message is sent per book — there is no batching. The INDEXING_BUFFER_FACTOR environment variable separately controls back-pressure: if the datalake grows beyond INDEXING_BUFFER_FACTOR × indexerCount entries, the periodic scheduler pauses before fetching the next book.
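The back-pressure rule reduces to a simple threshold check. A minimal sketch, assuming the scheduler can read the current datalake size and the number of live indexers; the method name here is hypothetical:

```java
// Hypothetical sketch of the INDEXING_BUFFER_FACTOR back-pressure check:
// pause fetching when the datalake holds more entries than the indexers
// can plausibly drain.
public class BackPressure {
    static boolean shouldPause(int datalakeSize, int bufferFactor, int indexerCount) {
        // Threshold: INDEXING_BUFFER_FACTOR × number of live indexer instances.
        return datalakeSize > bufferFactor * indexerCount;
    }

    public static void main(String[] args) {
        // With factor 10 and 2 indexers, the scheduler pauses past 20 entries.
        System.out.println(shouldPause(25, 10, 2)); // true
        System.out.println(shouldPause(15, 10, 2)); // false
    }
}
```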
Step 4: Indexing service consumes the message

One indexing service instance consumes the message from documents.ingested. It reads the book content from the datalake IMap, applies stop-word filtering and tokenization via TextTokenizer, and computes per-term frequencies using TermFrequencyAnalyzer.
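The tokenize-and-count step can be illustrated without the real TextTokenizer and TermFrequencyAnalyzer classes. A self-contained sketch, assuming lowercase word tokens and an illustrative stop-word list (the real list is not shown here):

```java
import java.util.*;
import java.util.regex.*;

public class TermFrequencyExample {
    // Illustrative stop words; the actual TextTokenizer list may differ.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of", "and");

    // Lowercase word tokens, stop words removed, term -> frequency.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        Matcher m = Pattern.compile("[a-z]+").matcher(text.toLowerCase());
        while (m.find()) {
            String term = m.group();
            if (!STOP_WORDS.contains(term)) freqs.merge(term, 1, Integer::sum);
        }
        return freqs;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("The modern Prometheus, a tale of the modern age"));
    }
}
```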
Step 5: Inverted index updated in Hazelcast

For each term, the indexing service writes a docId:frequency entry into the inverted-index IMap. An indexingRegistry ISet acts as an idempotency guard — if the book ID is already present, the indexing step is skipped. Book metadata (title, author, language, year) is written to the bookMetadata IMap.
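The posting format and the registry guard can be sketched with plain collections standing in for the Hazelcast IMap and ISet. A local illustration, not the distributed code:

```java
import java.util.*;

public class IndexWriterSketch {
    // Local stand-ins for the distributed inverted-index IMap and indexingRegistry ISet.
    final Map<String, Set<String>> invertedIndex = new HashMap<>();
    final Set<Integer> indexingRegistry = new HashSet<>();

    boolean indexBook(int bookId, Map<String, Integer> termFreqs) {
        // Idempotency guard: skip books that are already registered.
        if (!indexingRegistry.add(bookId)) return false;
        termFreqs.forEach((term, freq) ->
            // Each posting is stored as a "docId:frequency" string.
            invertedIndex.computeIfAbsent(term, t -> new HashSet<>())
                         .add(bookId + ":" + freq));
        return true;
    }

    public static void main(String[] args) {
        IndexWriterSketch w = new IndexWriterSketch();
        System.out.println(w.indexBook(84, Map.of("monster", 3))); // true: newly indexed
        System.out.println(w.indexBook(84, Map.of("monster", 3))); // false: guard skips repeat
        System.out.println(w.invertedIndex.get("monster"));        // [84:3]
    }
}
```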
Step 6: Search query resolved from the distributed index

A client sends a GET /search request to Nginx on port 8080. Nginx proxies it to one of the available search-service instances using least-connections routing. The search service reads inverted-index and bookMetadata from Hazelcast, intersects posting lists for all query terms (AND semantics), applies any metadata filters, and returns a ranked result list.
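The AND-semantics intersection over "docId:frequency" postings can be sketched with plain sets in place of the Hazelcast IMap; ranking and metadata filters are omitted here:

```java
import java.util.*;
import java.util.stream.*;

public class SearchSketch {
    // Intersect posting lists for all query terms (AND semantics),
    // returning the doc IDs that contain every term.
    static Set<Integer> search(Map<String, Set<String>> index, List<String> terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = index.getOrDefault(term, Set.of()).stream()
                    .map(p -> Integer.parseInt(p.split(":")[0])) // "docId:frequency" -> docId
                    .collect(Collectors.toSet());
            if (result == null) result = docs;
            else result.retainAll(docs);
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = Map.of(
                "monster", Set.of("84:3", "120:1"),
                "ship", Set.of("84:2"));
        System.out.println(search(index, List.of("monster", "ship"))); // [84]
    }
}
```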

Hazelcast distributed data structures

All Hazelcast data structures are configured with 2 synchronous backups and 1 asynchronous backup, providing redundancy against node failures. The cluster name is SearchEngine.
| Name | Type | Producer | Consumer | Purpose |
| --- | --- | --- | --- | --- |
| datalake | IMap&lt;Integer, BookContent&gt; | ingestion-service | indexing-service | Temporary in-memory storage of downloaded book content (header + body), keyed by book ID |
| inverted-index | IMap&lt;String, Set&lt;String&gt;&gt; | indexing-service | search-service | Term → set of "docId:frequency" strings; the core full-text search index |
| bookMetadata | IMap&lt;Integer, BookMetadata&gt; | indexing-service | search-service | Book metadata (title, author, language, year) for filtering and display |
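The backup policy above can be written as Hazelcast declarative configuration. A sketch of a hypothetical hazelcast.yaml fragment; the actual project may set this programmatically instead:

```yaml
hazelcast:
  cluster-name: SearchEngine
  map:
    # Applies the documented policy to every IMap (datalake,
    # inverted-index, bookMetadata): 2 sync + 1 async backup.
    default:
      backup-count: 2
      async-backup-count: 1
```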

ActiveMQ messaging channels

documents.ingested queue

- Type: point-to-point queue
- Producer: ingestion-service — publishes a text message containing the book ID after a successful download.
- Consumer: indexing-service — each message is consumed by exactly one indexer instance. On receipt, the indexer calls IndexBook.execute(documentId).
- Payload: JSON text message — {"bookId": N} (e.g., {"bookId": 84} for Frankenstein).
- Effect: triggers tokenization and inverted-index population for the downloaded book.
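The payload is small enough to extract with the standard library alone. A hedged sketch; the real consumer may well use a JSON library instead:

```java
import java.util.regex.*;

public class IngestedMessageParser {
    // Extract the book ID from a {"bookId": N} text message.
    static int parseBookId(String payload) {
        Matcher m = Pattern.compile("\"bookId\"\\s*:\\s*(\\d+)").matcher(payload);
        if (!m.find()) throw new IllegalArgumentException("no bookId in: " + payload);
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        System.out.println(parseBookId("{\"bookId\": 84}")); // 84
    }
}
```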
Ingestion control topic

- Type: publish/subscribe topic
- Producer: any service or operator that needs to pause or resume ingestion across the cluster.
- Consumer: ActiveMQIngestionControlConsumer in every ingestion-service instance.
- Payload: control command string (pause / resume).
- Effect: toggles the IngestionPauseController flag, which the periodic scheduler checks before dequeuing each book. Used during coordinated index rebuilds to halt new ingestion.
Index rebuild topic

- Type: publish/subscribe topic
- Producer: CoordinateRebuild in any indexing-service instance, triggered by POST /index/rebuild.
- Consumer: RebuildMessageListener in every indexing-service instance (each subscribes with a unique client ID).
- Payload: JSON rebuild command — {"epoch": <timestampMs>}.
- Effect: each indexing node waits 10 seconds for cluster sync, then calls ReindexingExecutor.rebuildIndex(), which clears all distributed state and re-indexes from the datalake filesystem. On completion, the node counts down the Hazelcast CP rebuild-latch.
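The final countdown step has the same await/countDown shape as a standard latch. A local sketch using java.util.concurrent in place of the distributed Hazelcast CP rebuild-latch:

```java
import java.util.concurrent.*;

public class RebuildLatchSketch {
    public static void main(String[] args) throws InterruptedException {
        int indexingNodes = 3;
        // Local stand-in for the Hazelcast CP "rebuild-latch":
        // one count per indexing node taking part in the rebuild.
        CountDownLatch rebuildLatch = new CountDownLatch(indexingNodes);

        ExecutorService pool = Executors.newFixedThreadPool(indexingNodes);
        for (int node = 0; node < indexingNodes; node++) {
            pool.submit(() -> {
                // ... ReindexingExecutor.rebuildIndex() would run here ...
                rebuildLatch.countDown(); // signal this node's rebuild is done
            });
        }
        // The coordinator unblocks only when every node has counted down.
        boolean allDone = rebuildLatch.await(5, TimeUnit.SECONDS);
        pool.shutdown();
        System.out.println(allDone);
    }
}
```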

Network port reference

| Service | Port | Protocol | Description |
| --- | --- | --- | --- |
| ingestion-service | 7001 | HTTP (Javalin) | Book ingestion API: trigger, status, list |
| indexing-service | 7002 | HTTP (Javalin) | Indexing API: index document, rebuild, health |
| search-service | 7003 | HTTP (Javalin) | Search API: query, health |
| nginx | 8080 | HTTP | Load-balanced entry point for all search traffic |

Nginx load balancer

Nginx listens on port 8080 and proxies /search and /health requests to the search_backend upstream group. The upstream uses least_conn scheduling — each request goes to the instance with the fewest active connections.
```nginx
upstream search_backend {
    least_conn;

    server <NODE_IP>:7003 max_fails=10 fail_timeout=30s;
    # server <NODE_IP>:7003 max_fails=10 fail_timeout=30s;

    keepalive 64;
}
```
A search-service instance is removed from rotation after 10 consecutive failures within a 30-second window and re-admitted automatically once it recovers. To add search replicas, append additional server lines to nginx.conf and reload Nginx.
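With a second replica added, the upstream group would look roughly like this. A sketch: <NODE_IP_2> is a placeholder for the new replica's host IP, to be replaced like the existing <NODE_IP> entries:

```nginx
upstream search_backend {
    least_conn;

    server <NODE_IP>:7003   max_fails=10 fail_timeout=30s;  # existing instance
    server <NODE_IP_2>:7003 max_fails=10 fail_timeout=30s;  # newly added replica

    keepalive 64;
}
```

After editing, reload the configuration with nginx -s reload (or restart the nginx container).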
Replace every <NODE_IP> placeholder in nginx.conf and docker-compose.yml with the actual host IP address of the machine running each service before starting the cluster. Leaving the xxx or <NODE_IP> placeholders unchanged will prevent services from joining the Hazelcast cluster and from connecting to the broker.

Docker Compose profiles

The deployment is split into three Docker Compose profiles so that different roles can be assigned to different physical machines:
| Profile | Services included | Typical node |
| --- | --- | --- |
| backend | ingestion-service, indexing-service, search-service | All compute nodes |
| broker | activemq | Dedicated broker node (or main node) |
| loadbalancer | nginx | Edge / main node |
| benchmark | benchmarking | Benchmarking node only |
```shell
# Main node: all roles
docker compose --profile backend --profile broker --profile loadbalancer up -d

# Additional compute node: backend only
docker compose --profile backend up -d
```
