

GuancheData Search Engine is an open-source distributed full-text search platform designed for researchers, educators, and developers who need to query large collections of public-domain books at scale. The system solves the fundamental challenge of indexing and searching millions of terms across thousands of documents without a single point of failure — distributing every layer of the pipeline across cooperating nodes using Hazelcast in-memory clustering, ActiveMQ asynchronous messaging, and Nginx load balancing.
GuancheData is released under the GPL-3.0 license; you are free to use, modify, and distribute it under the same terms.

What GuancheData does

The system continuously fetches books from Project Gutenberg, tokenizes their content into term-frequency pairs, stores a distributed inverted index in Hazelcast shared memory, and serves ranked full-text search results over HTTP. All three stages — ingestion, indexing, and search — run as independent Java 17 microservices that can be scaled horizontally without downtime.
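
As a rough sketch of the tokenization step, the snippet below splits raw book text into lowercase terms and counts occurrences. The splitting rule (and any stemming or stop-word handling) is an assumption; the indexing-service implementation is authoritative.

import java.util.HashMap;
import java.util.Map;

// Minimal tokenization sketch: split raw book text into lowercase terms
// and count how often each term appears in the document.
public final class Tokenizer {
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> frequencies = new HashMap<>();
        // Split on any run of non-letter characters; the real service may
        // apply a different rule (stemming, stop words, Unicode handling).
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                frequencies.merge(token, 1, Integer::sum);
            }
        }
        return frequencies;
    }
}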

Architecture

Understand the full service topology, Hazelcast data structures, ActiveMQ messaging channels, and network layout.

Build and deploy

Compile all services with a single Maven command and launch the cluster using Docker Compose profiles.

Services

Explore each microservice in detail: ingestion, indexing, search, and the Nginx load balancer.

Search API

Query the full-text search endpoint with support for multi-term AND queries and metadata filters; a sample request is sketched below, after these overview entries.

Scaling

Add backend nodes at runtime — new nodes automatically join the Hazelcast cluster and begin serving traffic.

Benchmarking

Measure ingestion rate, indexing throughput, and cluster recovery time with the built-in benchmark service.
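
To make the Search API entry above concrete, here is a hypothetical request sent through the Nginx front end on port 8080 using Java's built-in HTTP client. The /search path and the q and author parameter names are illustrative assumptions; consult the Search API page for the actual contract.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical query: terms are ANDed together, with an author filter.
// The endpoint path and parameter names are illustrative, not authoritative.
public final class SearchExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/search?q=whale+ship&author=melville"))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // ranked results as returned by the service
    }
}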

Core features

The inverted index lives in a Hazelcast IMap (inverted-index) that is sharded and replicated across every cluster member. Entries map each term to a set of docId:frequency strings. With two synchronous backups and one asynchronous backup, the index survives node failures without data loss or manual intervention.
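
A minimal sketch of that map configuration and entry format, assuming programmatic Hazelcast setup (the services may configure the map via XML or YAML instead); the term and posting values are hypothetical.

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import java.util.HashSet;
import java.util.Set;

public final class InvertedIndexSetup {
    public static void main(String[] args) {
        // Two synchronous backups plus one asynchronous backup, matching
        // the durability guarantees described above.
        Config config = new Config();
        config.getMapConfig("inverted-index")
              .setBackupCount(2)
              .setAsyncBackupCount(1);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        IMap<String, Set<String>> index = hz.getMap("inverted-index");

        // Each term maps to a set of "docId:frequency" postings.
        index.lock("whale");          // guard the read-modify-write cycle
        try {
            Set<String> postings = index.get("whale");
            if (postings == null) {
                postings = new HashSet<>();
            }
            postings.add("2701:87");  // hypothetical posting: doc 2701, 87 occurrences
            index.put("whale", postings);
        } finally {
            index.unlock("whale");
        }
    }
}
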
Ingestion nodes fetch books from Project Gutenberg, write content to both the filesystem and the Hazelcast datalake IMap, then publish a notification to the ActiveMQ documents.ingested queue. Indexers consume these messages independently, decoupling download throughput from indexing throughput.
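
A condensed sketch of the notification step using the standard JMS API; the broker URL and the message payload are illustrative assumptions.

import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

public final class IngestionNotifier {
    public static void main(String[] args) throws Exception {
        // Broker URL and payload format are illustrative assumptions.
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("documents.ingested");
            MessageProducer producer = session.createProducer(queue);

            // Tell the indexers a new book is available in the datalake.
            TextMessage message = session.createTextMessage("2701"); // hypothetical docId
            producer.send(message);
        } finally {
            connection.close();
        }
    }
}
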
Nginx fronts all search traffic on port 8080, routing each request to the search-service instance with the fewest active connections. Failed nodes are automatically bypassed after 10 consecutive failures within a 30-second window.
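
That behavior corresponds to nginx's least_conn balancing with passive health checks (max_fails=10, fail_timeout=30s). A sketch of the relevant upstream block follows, with hypothetical backend hostnames and ports; the shipped nginx.conf is authoritative.

upstream search_backend {
    least_conn;                    # pick the instance with the fewest active connections
    # Hostnames and internal ports below are hypothetical.
    server search-service-1:8080 max_fails=10 fail_timeout=30s;
    server search-service-2:8080 max_fails=10 fail_timeout=30s;
}

server {
    listen 8080;
    location / {
        proxy_pass http://search_backend;
    }
}
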
A cluster-wide rebuild is triggered via POST /index/rebuild on any indexing node. The coordinating node pauses ingestion via ActiveMQ, broadcasts a rebuild command to all indexing nodes, waits for all nodes to complete (coordinated by a Hazelcast CP CountDownLatch), then automatically resumes ingestion.
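
A sketch of the completion barrier, assuming a hypothetical latch name and a three-node indexing tier.

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.ICountDownLatch;
import java.util.concurrent.TimeUnit;

public final class RebuildCoordinator {
    public static void main(String[] args) throws InterruptedException {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Latch name and node count are illustrative assumptions.
        ICountDownLatch latch = hz.getCPSubsystem().getCountDownLatch("index-rebuild");
        latch.trySetCount(3); // one count per indexing node

        // Each indexing node calls countDown() when its rebuild finishes:
        // latch.countDown();

        // The coordinator blocks until every node has reported completion,
        // then resumes ingestion.
        boolean done = latch.await(10, TimeUnit.MINUTES);
        System.out.println(done ? "rebuild complete" : "rebuild timed out");
    }
}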

Key facts

Property             Value
Language             Java 17
Build tool           Apache Maven 3.6+ (multi-module)
Deployment           Docker Compose with named profiles
In-memory cluster    Hazelcast 5.4.0
Message broker       Apache ActiveMQ
Load balancer        Nginx (least-connections)
Data source          Project Gutenberg
License              GPL-3.0
Each service is packaged as a fat JAR using the Maven Shade plugin. Docker Compose builds and runs the JARs directly — no separate docker build step is required before docker compose up.
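
A typical build-and-launch sequence might look like this; the profile name is a placeholder, so check docker-compose.yml for the profiles actually defined.

mvn clean package                          # one Maven command builds every module's fat JAR
docker compose --profile <profile> up -d   # launches the services attached to that profile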

Project structure

stage_3/
├── ingestion-service/   # Fetches books, writes datalake, notifies broker
├── indexing-service/    # Consumes broker messages, builds inverted index
├── search-service/      # HTTP search API, reads distributed index
├── benchmarking/        # Ingestion rate, indexing throughput, recovery time
├── nginx.conf           # Least-connections upstream configuration
├── docker-compose.yml   # Service definitions and profiles
└── pom.xml              # Multi-module Maven root
