GuancheData Search Engine is an open-source distributed full-text search platform designed for researchers, educators, and developers who need to query large collections of public-domain books at scale. The system solves the fundamental challenge of indexing and searching millions of terms across thousands of documents without a single point of failure, distributing every layer of the pipeline across cooperating nodes using Hazelcast in-memory clustering, ActiveMQ asynchronous messaging, and Nginx load balancing.
Documentation Index
Fetch the complete documentation index at: https://mintlify.com/GuancheData/stage_3/llms.txt
Use this file to discover all available pages before exploring further.
GuancheData is licensed under the GPL-3.0 license. You are free to use, modify, and distribute it under the same terms.
What GuancheData does
The system continuously fetches books from Project Gutenberg, tokenizes their content into term-frequency pairs, stores a distributed inverted index in Hazelcast shared memory, and serves ranked full-text search results over HTTP. All three stages — ingestion, indexing, and search — run as independent Java 17 microservices that can be scaled horizontally without downtime.
Architecture
Understand the full service topology, Hazelcast data structures, ActiveMQ messaging channels, and network layout.
Build and deploy
Compile all services with a single Maven command and launch the cluster using Docker Compose profiles.
Services
Explore each microservice in detail: ingestion, indexing, search, and the Nginx load balancer.
Search API
Query the full-text search endpoint with support for multi-term AND queries and metadata filters.
Scaling
Add backend nodes at runtime — new nodes automatically join the Hazelcast cluster and begin serving traffic.
Benchmarking
Measure ingestion rate, indexing throughput, and cluster recovery time with the built-in benchmark service.
Core features
Distributed in-memory inverted index
The inverted index lives in a Hazelcast IMap (inverted-index) that is sharded and replicated across every cluster member. Entries map each term to a set of docId:frequency strings. With two synchronous backups and one asynchronous backup, the index survives node failures without data loss or manual intervention.
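The entry shape described above can be sketched with plain Java collections; here a ConcurrentHashMap stands in for the Hazelcast IMap (in the real service the map would be obtained from the cluster, e.g. via hazelcastInstance.getMap("inverted-index")). The term and document IDs are illustrative only.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch of the inverted-index entry shape: each term maps to
 * a set of "docId:frequency" postings. A local ConcurrentHashMap
 * stands in for the distributed, replicated Hazelcast IMap.
 */
public class InvertedIndexSketch {
    // term -> set of "docId:frequency" postings
    static final Map<String, Set<String>> index = new ConcurrentHashMap<>();

    static void addPosting(String term, String docId, int frequency) {
        index.computeIfAbsent(term, t -> ConcurrentHashMap.newKeySet())
             .add(docId + ":" + frequency);
    }

    public static void main(String[] args) {
        addPosting("island", "68283", 4);
        addPosting("island", "10021", 1);
        System.out.println(index.get("island"));
    }
}
```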
Asynchronous ingestion pipeline
Ingestion nodes fetch books from Project Gutenberg, write content to both the filesystem and the Hazelcast datalake IMap, then publish a notification to the ActiveMQ documents.ingested queue. Indexers consume these messages independently, decoupling download throughput from indexing throughput.
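The decoupling above can be modeled in a few lines; this sketch uses an in-process LinkedBlockingQueue in place of the ActiveMQ documents.ingested queue (a real consumer would block on the broker rather than poll a local queue, and the docId shown is illustrative).

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Sketch of the ingestion/indexing decoupling: ingestion publishes a
 * docId as soon as a book is stored, and indexers consume at their
 * own pace. A LinkedBlockingQueue stands in for the ActiveMQ
 * "documents.ingested" queue.
 */
public class IngestionPipelineSketch {
    static final BlockingQueue<String> documentsIngested = new LinkedBlockingQueue<>();

    /** Ingestion side: fire-and-forget notification after storing a book. */
    static void publishIngested(String docId) {
        documentsIngested.offer(docId);
    }

    /** Indexer side: pull the next pending document, or null if none. */
    static String consumeNext() {
        return documentsIngested.poll();
    }

    public static void main(String[] args) {
        publishIngested("68283");
        System.out.println(consumeNext()); // prints 68283
    }
}
```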
AND-semantics multi-term search
The search service reads the distributed inverted index, intersects posting lists for all query terms (AND semantics), applies optional metadata filters (author, language, publication year), and returns results ranked by term frequency or document ID.
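The intersection-and-rank step can be sketched as follows, assuming postings stored as "docId:frequency" strings as described above. Plain in-memory sets stand in for the distributed index, results are ranked by summed term frequency, and the metadata filters are omitted.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/**
 * Sketch of AND-semantics multi-term search over "docId:frequency"
 * posting sets: intersect docIds across all terms, then rank the
 * survivors by total term frequency.
 */
public class SearchSketch {

    /** Parse a posting set into a docId -> frequency map. */
    static Map<String, Integer> parse(Set<String> postings) {
        Map<String, Integer> m = new HashMap<>();
        for (String p : postings) {
            String[] parts = p.split(":");
            m.put(parts[0], Integer.parseInt(parts[1]));
        }
        return m;
    }

    /** Return docIds present in every term's postings, highest total frequency first. */
    static List<String> search(List<Set<String>> postingsPerTerm) {
        if (postingsPerTerm.isEmpty()) return List.of();
        Map<String, Integer> acc = parse(postingsPerTerm.get(0));
        for (Set<String> postings : postingsPerTerm.subList(1, postingsPerTerm.size())) {
            Map<String, Integer> next = parse(postings);
            acc.keySet().retainAll(next.keySet());          // AND semantics
            acc.replaceAll((doc, f) -> f + next.get(doc));  // accumulate frequency for ranking
        }
        return acc.entrySet().stream()
                  .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                  .map(Map.Entry::getKey)
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "1" appears for both terms; "2" only for the first, so it is dropped.
        System.out.println(search(List.of(Set.of("1:3", "2:1"), Set.of("1:2"))));
    }
}
```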
Least-connections load balancing
Nginx fronts all search traffic on port 8080, routing each request to the search-service instance with the fewest active connections. Failed nodes are automatically bypassed after 10 consecutive failures within a 30-second window.
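An Nginx configuration matching this behavior might look like the following sketch; the upstream name, backend hostnames, and ports are illustrative, not taken from the repository. Nginx's least_conn, max_fails, and fail_timeout directives map directly onto the behavior described: a server that fails 10 times within the 30-second window is taken out of rotation.

```nginx
upstream search_backends {
    least_conn;                                      # route to the instance with fewest active connections
    server search-service-1:8080 max_fails=10 fail_timeout=30s;
    server search-service-2:8080 max_fails=10 fail_timeout=30s;
}

server {
    listen 8080;
    location / {
        proxy_pass http://search_backends;
    }
}
```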
Coordinated index rebuild
A cluster-wide rebuild is triggered via POST /index/rebuild on any indexing node. The coordinating node pauses ingestion via ActiveMQ, broadcasts a rebuild command to all indexing nodes, waits for all nodes to complete (coordinated by a Hazelcast CP CountDownLatch), then automatically resumes ingestion.
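The barrier step of this flow can be sketched with a local java.util.concurrent.CountDownLatch standing in for the Hazelcast CP ICountDownLatch the coordinator would actually obtain from the CP subsystem; the ActiveMQ pause/resume commands and the per-node rebuild work are represented by comments only.

```java
import java.util.concurrent.CountDownLatch;

/**
 * Sketch of the coordinated rebuild barrier: the coordinator waits
 * until every indexing node has counted down before resuming
 * ingestion. A local CountDownLatch stands in for the distributed
 * Hazelcast CP latch.
 */
public class RebuildCoordinatorSketch {

    static String rebuild(int indexingNodes) {
        // 1. Pause ingestion (ActiveMQ command, omitted).
        CountDownLatch done = new CountDownLatch(indexingNodes);
        for (int i = 0; i < indexingNodes; i++) {
            new Thread(() -> {
                // 2. Each node rebuilds its shard of the index (omitted).
                done.countDown(); // 3. Signal completion.
            }).start();
        }
        try {
            done.await(); // 4. Coordinator blocks until every node is done.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // 5. Resume ingestion (ActiveMQ command, omitted).
        return "rebuild complete on " + indexingNodes + " nodes";
    }

    public static void main(String[] args) {
        System.out.println(rebuild(3));
    }
}
```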
Key facts
| Property | Value |
|---|---|
| Language | Java 17 |
| Build tool | Apache Maven 3.6+ (multi-module) |
| Deployment | Docker Compose with named profiles |
| In-memory cluster | Hazelcast 5.4.0 |
| Message broker | Apache ActiveMQ |
| Load balancer | Nginx (least-connections) |
| Data source | Project Gutenberg |
| License | GPL-3.0 |
Each service is packaged as a fat JAR using the Maven Shade plugin. Docker Compose builds and runs the JARs directly; no separate docker build step is required before docker compose up.
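A Compose file following this pattern might look like the sketch below; the service names, build contexts, profile names, and ports are illustrative assumptions, not taken from the repository.

```yaml
# Hypothetical docker-compose.yml excerpt. Compose builds each image
# (and the fat JAR inside it) itself, so `docker compose --profile
# search up --build` is the only command needed.
services:
  search-service:
    build: ./search-service
    profiles: ["search"]

  nginx:
    image: nginx:alpine
    profiles: ["search"]
    ports:
      - "8080:8080"
```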