GuancheData is built as three cooperating Java 17 microservices (ingestion, indexing, and search) that share state through a Hazelcast in-memory cluster and coordinate asynchronously over Apache ActiveMQ. An Nginx reverse proxy fronts the search tier. Every component is stateless at the process level; all shared data lives in the distributed Hazelcast cluster, so any node can fail and the remaining nodes continue serving requests without data loss.

Documentation Index
Fetch the complete documentation index at: https://mintlify.com/GuancheData/stage_3/llms.txt
Use this file to discover all available pages before exploring further.
Service topology
The cluster is divided into three logical tiers, each independently scalable:

| Tier | Service | HTTP port | Hazelcast port | Role |
|---|---|---|---|---|
| Ingestion | ingestion-service | 7001 | 5701 | Downloads books, writes datalake, publishes events |
| Indexing | indexing-service | 7002 | 5702 | Consumes events, tokenizes text, updates inverted index |
| Search | search-service | 7003 | 5703 | Reads inverted index, executes queries, returns results |
| Load balancer | nginx | 8080 (→80) | — | Least-connections proxy across search instances |
| Message broker | activemq | 61616 (JMS) / 8161 (Web) | — | Async coordination between ingestion and indexing |
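The Hazelcast ports in the table above are the cluster-membership ports. As a sketch of how a non-seed member might be configured to join the SearchEngine cluster (YAML keys follow Hazelcast's standard hazelcast.yaml schema; the hostname ingestion-service is an assumption based on the Compose service names, and the 5702 port is the indexing tier's from the table):

```yaml
hazelcast:
  cluster-name: SearchEngine
  network:
    port:
      port: 5702                      # this member's cluster port (indexing tier)
    join:
      multicast:
        enabled: false
      tcp-ip:
        enabled: true
        member-list:
          - ingestion-service:5701    # seed node, i.e. the HZ_MEMBERS address
```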
All services join the same Hazelcast cluster, named SearchEngine. The ingestion-service acts as the seed node (default port 5701); other services point HZ_MEMBERS at this address to join.

Data flow
Book IDs enter the work queue
Book IDs are pushed into the Hazelcast books IQueue, either via a direct HTTP POST /ingest/{book_id} call to the ingestion API or by the ingestion service's internal periodic scheduler, which polls the queue every 100 ms.

Ingestion service fetches from Project Gutenberg
The ingestion service dequeues a book ID, connects to Project Gutenberg, and downloads the full text. The content is written to both the local filesystem (/app/datalake) and the Hazelcast datalake IMap for in-memory access by indexers.

Notification published to ActiveMQ
After a successful download, the ingestion service immediately publishes a {"bookId": N} JSON message to the documents.ingested queue on ActiveMQ. One message is sent per book; there is no batching. The INDEXING_BUFFER_FACTOR environment variable separately controls back-pressure: if the datalake grows beyond INDEXING_BUFFER_FACTOR × indexerCount entries, the periodic scheduler pauses before fetching the next book.

Indexing service consumes the message
One indexing service instance consumes the message from documents.ingested. It reads the book content from the datalake IMap, applies stop-word filtering and tokenization via TextTokenizer, and computes per-term frequencies using TermFrequencyAnalyzer.

Inverted index updated in Hazelcast
For each term, the indexing service writes a docId:frequency entry into the inverted-index IMap. An indexingRegistry ISet acts as an idempotency guard: if the book ID is already present, the indexing step is skipped. Book metadata (title, author, language, year) is written to the bookMetadata IMap.

Search query resolved from the distributed index
A client sends a GET /search request to Nginx on port 8080. Nginx proxies it to one of the available search-service instances using least-connections routing. The search service reads inverted-index and bookMetadata from Hazelcast, intersects posting lists for all query terms (AND semantics), applies any metadata filters, and returns a ranked result list.

Hazelcast distributed data structures
All Hazelcast data structures are configured with 2 synchronous backups and 1 asynchronous backup, providing redundancy against node failures. The cluster name is SearchEngine.
| Name | Type | Producer | Consumer | Purpose |
|---|---|---|---|---|
| datalake | IMap<Integer, BookContent> | ingestion-service | indexing-service | Temporary in-memory storage of downloaded book content (header + body), keyed by book ID |
| inverted-index | IMap<String, Set<String>> | indexing-service | search-service | Term → set of "docId:frequency" strings; the core full-text search index |
| bookMetadata | IMap<Integer, BookMetadata> | indexing-service | search-service | Book metadata (title, author, language, year) for filtering and display |
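The indexing flow described above (tokenize, filter stop words, count term frequencies, write docId:frequency postings, guard with a registry) can be sketched with plain Java collections standing in for the Hazelcast IMap and ISet. This is a minimal illustration, not the service's actual code; the class name, stop-word list, and tokenization regex are assumptions:

```java
import java.util.*;
import java.util.stream.*;

public class IndexSketch {
    // Plain collections standing in for the Hazelcast inverted-index IMap
    // and indexingRegistry ISet (assumption for illustration).
    static final Map<String, Set<String>> invertedIndex = new HashMap<>();
    static final Set<Integer> indexingRegistry = new HashSet<>();
    static final Set<String> STOP_WORDS = Set.of("the", "a", "and", "of");

    /** Tokenizes, filters stop words, counts frequencies, writes postings. */
    static void indexBook(int docId, String text) {
        if (!indexingRegistry.add(docId)) return; // idempotency guard: already indexed
        Map<String, Long> freqs = Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(t -> !t.isBlank() && !STOP_WORDS.contains(t))
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        // One "docId:frequency" posting per term, mirroring the IMap entry format.
        freqs.forEach((term, freq) -> invertedIndex
                .computeIfAbsent(term, k -> new HashSet<>())
                .add(docId + ":" + freq));
    }

    public static void main(String[] args) {
        indexBook(84, "the monster and the doctor");
        indexBook(84, "the monster and the doctor"); // skipped by the registry
        System.out.println(invertedIndex.get("monster")); // [84:1]
    }
}
```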
ActiveMQ messaging channels
documents.ingested (queue)
- Type: Point-to-point queue
- Producer: ingestion-service, which publishes a text message containing the book ID after a successful download.
- Consumer: indexing-service; each message is consumed by exactly one indexer instance. On receipt, the indexer calls IndexBook.execute(documentId).
- Payload: JSON text message: {"bookId": N} (e.g., {"bookId": 84} for Frankenstein).
- Effect: Triggers tokenization and inverted-index population for the downloaded book.
ingestion.control (topic)
- Type: Publish/subscribe topic
- Producer: Any service or operator that needs to pause or resume ingestion across the cluster.
- Consumer: ActiveMQIngestionControlConsumer in every ingestion-service instance.
- Payload: Control command string (pause / resume).
- Effect: Toggles the IngestionPauseController flag, which the periodic scheduler checks before dequeuing each book. Used during a coordinated index rebuild to halt new ingestion.
index.rebuild.command (ActiveMQ topic)
- Type: ActiveMQ publish/subscribe topic
- Producer: CoordinateRebuild in any indexing-service instance, triggered by POST /index/rebuild.
- Consumer: RebuildMessageListener in every indexing-service instance (each subscribes with a unique client ID).
- Payload: JSON rebuild command: {"epoch": <timestampMs>}.
- Effect: Each indexing node waits 10 seconds for cluster sync, then calls ReindexingExecutor.rebuildIndex(), which clears all distributed state and re-indexes from the datalake filesystem. On completion, the node counts down the Hazelcast CP rebuild latch.

Network port reference
| Service | Port | Protocol | Description |
|---|---|---|---|
| ingestion-service | 7001 | HTTP (Javalin) | Book ingestion API: trigger, status, list |
| indexing-service | 7002 | HTTP (Javalin) | Indexing API: index document, rebuild, health |
| search-service | 7003 | HTTP (Javalin) | Search API: query, health |
| nginx | 8080 | HTTP | Load-balanced entry point for all search traffic |
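The search-service row above is the query side of the pipeline: it intersects the posting lists of all query terms with AND semantics. A minimal sketch of that intersection, with a plain map standing in for the inverted-index IMap (method and class names are illustrative, not the service's actual API):

```java
import java.util.*;

public class SearchSketch {
    /** Parses "docId:frequency" postings and returns doc IDs matching ALL terms. */
    static Set<Integer> search(Map<String, Set<String>> index, List<String> terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = new HashSet<>();
            for (String posting : index.getOrDefault(term, Set.of())) {
                // Posting format from the inverted-index IMap: "docId:frequency".
                docs.add(Integer.parseInt(posting.split(":")[0]));
            }
            if (result == null) result = docs;          // first term seeds the result
            else result.retainAll(docs);                // AND semantics: intersect
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = Map.of(
                "monster", Set.of("84:3", "36:1"),
                "doctor", Set.of("84:2"));
        System.out.println(search(index, List.of("monster", "doctor"))); // [84]
    }
}
```

Ranking and metadata filtering would sit on top of this intersection; they are omitted here.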
Nginx load balancer
Nginx listens on port 8080 and proxies /search and /health requests to the search_backend upstream group. The upstream uses least_conn scheduling: each request goes to the instance with the fewest active connections. To scale out the search tier, add more server lines to nginx.conf and reload Nginx.
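A sketch of the corresponding nginx.conf (the upstream name search_backend and port 7003 come from this page; the search-service hostnames are assumptions based on the Compose service names):

```nginx
upstream search_backend {
    least_conn;                     # route to the instance with fewest active connections
    server search-service-1:7003;
    server search-service-2:7003;   # add more 'server' lines to scale out, then reload
}

server {
    listen 8080;
    location /search { proxy_pass http://search_backend; }
    location /health { proxy_pass http://search_backend; }
}
```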
Docker Compose profiles
The deployment is split into four Docker Compose profiles so that different roles can be assigned to different physical machines:

| Profile | Services included | Typical node |
|---|---|---|
| backend | ingestion-service, indexing-service, search-service | All compute nodes |
| broker | activemq | Dedicated broker node (or main node) |
| loadbalancer | nginx | Edge / main node |
| benchmark | benchmarking | Benchmarking node only |
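As a sketch, the profile assignments above map onto the compose file roughly like this (service definitions abbreviated; the exact file contents are an assumption):

```yaml
services:
  ingestion-service:
    profiles: ["backend"]
  indexing-service:
    profiles: ["backend"]
  search-service:
    profiles: ["backend"]
  activemq:
    profiles: ["broker"]
  nginx:
    profiles: ["loadbalancer"]
  benchmarking:
    profiles: ["benchmark"]
```

Each node then starts only its role, e.g. `docker compose --profile broker up -d` on the broker node brings up ActiveMQ alone.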