GuancheData Search Engine is a distributed full-text search platform that ingests, indexes, and queries books from Project Gutenberg across a cluster of cooperating nodes. The system achieves high throughput and fault tolerance through Hazelcast’s distributed in-memory inverted index, asynchronous messaging via ActiveMQ, and Nginx load balancing across search replicas.
Documentation Index
Fetch the complete documentation index at: https://mintlify.com/GuancheData/stage_3/llms.txt
Use this file to discover all available pages before exploring further.
Prerequisites
What you need before deploying the cluster: Java 17, Maven, Docker, and network configuration.
Build and Deploy
Compile all services with Maven and launch the full cluster using Docker Compose profiles.
Architecture
Understand how ingestion, indexing, and search services cooperate across multiple nodes.
Search API
Query the full-text search endpoint with filters for author, language, and publication year.
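As a rough illustration of the Search API card above, the sketch below assembles a query URL with optional author, language, and publication-year filters. The base URL, the `/search` path, and the parameter names `q`, `author`, `language`, and `year` are all assumptions made for this example; the Search API page is the authority on the real endpoint.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class SearchUrlSketch {
    // Hypothetical base URL: the real host, port, and path may differ.
    static final String BASE = "http://localhost:8080/search";

    /** Builds a query URL; null filters are left out. Parameter names are assumptions. */
    static URI build(String query, String author, String language, Integer year) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", query);
        if (author != null) params.put("author", author);
        if (language != null) params.put("language", language);
        if (year != null) params.put("year", String.valueOf(year));

        // Form-encode each value so spaces and punctuation survive the trip.
        String queryString = params.entrySet().stream()
                .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return URI.create(BASE + "?" + queryString);
    }

    public static void main(String[] args) {
        System.out.println(build("white whale", "Herman Melville", "en", 1851));
    }
}
```

The resulting `URI` could be passed to `java.net.http.HttpClient` to issue the request once the actual endpoint is known.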
How it works
The system is composed of three core microservices that form a data pipeline:
Ingest books
The ingestion service continuously fetches books from Project Gutenberg, stores them in a Hazelcast-replicated datalake, and notifies indexers via ActiveMQ.
Build the inverted index
Each indexing node consumes book events from the ActiveMQ queue, tokenizes the text, and writes term-frequency entries into a distributed Hazelcast IMap shared across all cluster members.
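The indexing step described above can be sketched in plain Java. This is a minimal illustration, not the project's implementation: a local `ConcurrentHashMap` stands in for the shared Hazelcast IMap, the tokenizer (lowercase, split on non-alphanumeric characters) is an assumption, and the class name, method names, and the `term -> (bookId -> frequency)` layout are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class InvertedIndexSketch {
    // Local stand-in for the distributed map: term -> (bookId -> term frequency).
    private final Map<String, Map<Integer, Integer>> index = new ConcurrentHashMap<>();

    /** Tokenizes a book's text and merges its term frequencies into the index. */
    void indexBook(int bookId, String text) {
        Map<String, Integer> frequencies = new HashMap<>();
        // Assumed tokenizer: lowercase, then split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                frequencies.merge(token, 1, Integer::sum);
            }
        }
        // One posting per (term, book) pair, holding that book's count for the term.
        frequencies.forEach((term, count) ->
                index.computeIfAbsent(term, t -> new ConcurrentHashMap<>()).put(bookId, count));
    }

    /** Returns bookId -> frequency for a term, or an empty map if the term is unknown. */
    Map<Integer, Integer> postings(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Map.of());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.indexBook(2701, "Call me Ishmael. Call me what you will.");
        System.out.println(idx.postings("call"));
    }
}
```

In the real cluster the book text would arrive as an ActiveMQ event and the postings would be written to the shared map, so every member sees the same index.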
Key features
Distributed inverted index
Hazelcast shards and replicates the inverted index across the cluster with configurable sync and async backups.
Horizontal scalability
Add backend nodes at runtime; each new node automatically joins the Hazelcast cluster and picks up work.
Fault tolerance
Data replication and automatic Hazelcast recovery keep the system operational after node failures.
Index rebuild
Trigger a coordinated cluster-wide rebuild that pauses ingestion, re-indexes all data, and resumes automatically.
Load-balanced search
Nginx least-connections routing distributes queries across search replicas with automatic failover.
Benchmarking
Built-in tools measure ingestion rate, indexing throughput, and cluster recovery time.
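To make the least-connections routing mentioned under Load-balanced search more concrete, here is a simplified Java sketch of the selection rule: each query goes to the replica with the fewest in-flight requests. It only models the idea and is not how Nginx is configured or implemented. Nginx also takes server weights into account and rotates among tied servers, whereas this sketch breaks ties by list order. The replica names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LeastConnectionsSketch {
    // Replica name -> number of requests currently in flight.
    private final Map<String, Integer> active = new LinkedHashMap<>();

    LeastConnectionsSketch(List<String> replicas) {
        replicas.forEach(r -> active.put(r, 0));
    }

    /** Picks the replica with the fewest in-flight requests; ties go to the earliest listed. */
    synchronized String acquire() {
        String best = null;
        for (Map.Entry<String, Integer> e : active.entrySet()) {
            if (best == null || e.getValue() < active.get(best)) {
                best = e.getKey();
            }
        }
        active.merge(best, 1, Integer::sum);
        return best;
    }

    /** Marks one request on the given replica as finished. */
    synchronized void release(String replica) {
        active.merge(replica, -1, Integer::sum);
    }

    public static void main(String[] args) {
        LeastConnectionsSketch lb = new LeastConnectionsSketch(List.of("search1", "search2"));
        String first = lb.acquire();   // both idle, so the first listed replica wins
        String second = lb.acquire();  // search1 is busy, so search2 is chosen
        lb.release(second);
        System.out.println(first + ", " + second + ", " + lb.acquire());
    }
}
```

A replica that finishes its requests quickly is picked more often, which is why this policy suits search queries whose cost varies widely.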