GuancheData Search Engine is an open-source distributed full-text search platform designed for researchers, educators, and developers who need to query large collections of public-domain books at scale. The system solves the fundamental challenge of indexing and searching millions of terms across thousands of documents without a single point of failure, distributing every layer of the pipeline across cooperating nodes using Hazelcast in-memory clustering, ActiveMQ asynchronous messaging, and Nginx load balancing.
Documentation Index
Fetch the complete documentation index at: https://mintlify.com/GuancheData/stage_3/llms.txt
Use this file to discover all available pages before exploring further.
GuancheData is licensed under the GPL-3.0 license. You are free to use, modify, and distribute it under the same terms.
What GuancheData does
The system continuously fetches books from Project Gutenberg, tokenizes their content into term-frequency pairs, stores a distributed inverted index in Hazelcast shared memory, and serves ranked full-text search results over HTTP. All three stages — ingestion, indexing, and search — run as independent Java 17 microservices that can be scaled horizontally without downtime.
Architecture
Understand the full service topology, Hazelcast data structures, ActiveMQ messaging channels, and network layout.
Build and deploy
Compile all services with a single Maven command and launch the cluster using Docker Compose profiles.
Services
Explore each microservice in detail: ingestion, indexing, search, and the Nginx load balancer.
Search API
Query the full-text search endpoint with support for multi-term AND queries and metadata filters.
Scaling
Add backend nodes at runtime — new nodes automatically join the Hazelcast cluster and begin serving traffic.
Benchmarking
Measure ingestion rate, indexing throughput, and cluster recovery time with the built-in benchmark service.
Core features
Distributed in-memory inverted index
The inverted index lives in a Hazelcast IMap (inverted-index) that is sharded and replicated across every cluster member. Entries map each term to a set of docId:frequency strings. With two synchronous backups and one asynchronous backup, the index survives node failures without data loss or manual intervention.
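The entry shape described above can be sketched with plain Java collections; here a ConcurrentHashMap stands in for the Hazelcast IMap (in the real service the map would be obtained from the cluster, e.g. via hazelcastInstance.getMap("inverted-index")). The term and document IDs are illustrative only.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch of the inverted-index entry shape: each term maps to
 * a set of "docId:frequency" postings. A local ConcurrentHashMap
 * stands in for the distributed, replicated Hazelcast IMap.
 */
public class InvertedIndexSketch {
    // term -> set of "docId:frequency" postings
    static final Map<String, Set<String>> index = new ConcurrentHashMap<>();

    static void addPosting(String term, String docId, int frequency) {
        index.computeIfAbsent(term, t -> ConcurrentHashMap.newKeySet())
             .add(docId + ":" + frequency);
    }

    public static void main(String[] args) {
        addPosting("island", "68283", 4);
        addPosting("island", "10021", 1);
        System.out.println(index.get("island"));
    }
}
```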
Asynchronous ingestion pipeline
Ingestion nodes fetch books from Project Gutenberg, write content to both the filesystem and the Hazelcast datalake IMap, then publish a notification to the ActiveMQ documents.ingested queue. Indexers consume these messages independently, decoupling download throughput from indexing throughput.
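The decoupling above can be modeled in a few lines; this sketch uses an in-process LinkedBlockingQueue in place of the ActiveMQ documents.ingested queue (a real consumer would block on the broker rather than poll a local queue, and the docId shown is illustrative).

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Sketch of the ingestion/indexing decoupling: ingestion publishes a
 * docId as soon as a book is stored, and indexers consume at their
 * own pace. A LinkedBlockingQueue stands in for the ActiveMQ
 * "documents.ingested" queue.
 */
public class IngestionPipelineSketch {
    static final BlockingQueue<String> documentsIngested = new LinkedBlockingQueue<>();

    /** Ingestion side: fire-and-forget notification after storing a book. */
    static void publishIngested(String docId) {
        documentsIngested.offer(docId);
    }

    /** Indexer side: pull the next pending document, or null if none. */
    static String consumeNext() {
        return documentsIngested.poll();
    }

    public static void main(String[] args) {
        publishIngested("68283");
        System.out.println(consumeNext()); // prints 68283
    }
}
```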
AND-semantics multi-term search
The search service reads the distributed inverted index, intersects posting lists for all query terms (AND semantics), applies optional metadata filters (author, language, publication year), and returns results ranked by term frequency or document ID.
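The intersection-and-rank step can be sketched as follows, assuming postings stored as "docId:frequency" strings as described above. Plain in-memory sets stand in for the distributed index, results are ranked by summed term frequency, and the metadata filters are omitted.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/**
 * Sketch of AND-semantics multi-term search over "docId:frequency"
 * posting sets: intersect docIds across all terms, then rank the
 * survivors by total term frequency.
 */
public class SearchSketch {

    /** Parse a posting set into a docId -> frequency map. */
    static Map<String, Integer> parse(Set<String> postings) {
        Map<String, Integer> m = new HashMap<>();
        for (String p : postings) {
            String[] parts = p.split(":");
            m.put(parts[0], Integer.parseInt(parts[1]));
        }
        return m;
    }

    /** Return docIds present in every term's postings, highest total frequency first. */
    static List<String> search(List<Set<String>> postingsPerTerm) {
        if (postingsPerTerm.isEmpty()) return List.of();
        Map<String, Integer> acc = parse(postingsPerTerm.get(0));
        for (Set<String> postings : postingsPerTerm.subList(1, postingsPerTerm.size())) {
            Map<String, Integer> next = parse(postings);
            acc.keySet().retainAll(next.keySet());          // AND semantics
            acc.replaceAll((doc, f) -> f + next.get(doc));  // accumulate frequency for ranking
        }
        return acc.entrySet().stream()
                  .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                  .map(Map.Entry::getKey)
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "1" appears for both terms; "2" only for the first, so it is dropped.
        System.out.println(search(List.of(Set.of("1:3", "2:1"), Set.of("1:2"))));
    }
}
```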
Least-connections load balancing
Nginx fronts all search traffic on port 8080, routing each request to the search-service instance with the fewest active connections. Failed nodes are automatically bypassed after 10 consecutive failures within a 30-second window.
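An Nginx configuration matching this behavior might look like the following sketch; the upstream name, backend hostnames, and ports are illustrative, not taken from the repository. Nginx's least_conn, max_fails, and fail_timeout directives map directly onto the behavior described: a server that fails 10 times within the 30-second window is taken out of rotation.

```nginx
upstream search_backends {
    least_conn;                                      # route to the instance with fewest active connections
    server search-service-1:8080 max_fails=10 fail_timeout=30s;
    server search-service-2:8080 max_fails=10 fail_timeout=30s;
}

server {
    listen 8080;
    location / {
        proxy_pass http://search_backends;
    }
}
```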
Coordinated index rebuild
A cluster-wide rebuild is triggered via POST /index/rebuild on any indexing node. The coordinating node pauses ingestion via ActiveMQ, broadcasts a rebuild command to all indexing nodes, waits for all nodes to complete (coordinated by a Hazelcast CP CountDownLatch), then automatically resumes ingestion.
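The barrier step of this flow can be sketched with a local java.util.concurrent.CountDownLatch standing in for the Hazelcast CP ICountDownLatch the coordinator would actually obtain from the CP subsystem; the ActiveMQ pause/resume commands and the per-node rebuild work are represented by comments only.

```java
import java.util.concurrent.CountDownLatch;

/**
 * Sketch of the coordinated rebuild barrier: the coordinator waits
 * until every indexing node has counted down before resuming
 * ingestion. A local CountDownLatch stands in for the distributed
 * Hazelcast CP latch.
 */
public class RebuildCoordinatorSketch {

    static String rebuild(int indexingNodes) {
        // 1. Pause ingestion (ActiveMQ command, omitted).
        CountDownLatch done = new CountDownLatch(indexingNodes);
        for (int i = 0; i < indexingNodes; i++) {
            new Thread(() -> {
                // 2. Each node rebuilds its shard of the index (omitted).
                done.countDown(); // 3. Signal completion.
            }).start();
        }
        try {
            done.await(); // 4. Coordinator blocks until every node is done.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // 5. Resume ingestion (ActiveMQ command, omitted).
        return "rebuild complete on " + indexingNodes + " nodes";
    }

    public static void main(String[] args) {
        System.out.println(rebuild(3));
    }
}
```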
Key facts
| Property | Value |
|---|---|
| Language | Java 17 |
| Build tool | Apache Maven 3.6+ (multi-module) |
| Deployment | Docker Compose with named profiles |
| In-memory cluster | Hazelcast 5.4.0 |
| Message broker | Apache ActiveMQ |
| Load balancer | Nginx (least-connections) |
| Data source | Project Gutenberg |
| License | GPL-3.0 |
Each service is packaged as a fat JAR using the Maven Shade plugin. Docker Compose builds and runs the JARs directly; no separate docker build step is required before docker compose up.
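A Compose file following this pattern might look like the sketch below; the service names, build contexts, profile names, and ports are illustrative assumptions, not taken from the repository.

```yaml
# Hypothetical docker-compose.yml excerpt. Compose builds each image
# (and the fat JAR inside it) itself, so `docker compose --profile
# search up --build` is the only command needed.
services:
  search-service:
    build: ./search-service
    profiles: ["search"]

  nginx:
    image: nginx:alpine
    profiles: ["search"]
    ports:
      - "8080:8080"
```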