GuancheData Search Engine: Distributed Full-Text Search

GuancheData Search Engine is a distributed full-text search platform that ingests, indexes, and queries books from Project Gutenberg across a cluster of cooperating nodes. It achieves high throughput and fault tolerance by combining a distributed in-memory inverted index held in Hazelcast, asynchronous messaging via ActiveMQ, and Nginx load balancing across search replicas.

Prerequisites

What you need before deploying the cluster: Java 17, Maven, Docker, and network configuration.

Build and Deploy

Compile all services with Maven and launch the full cluster using Docker Compose profiles.

Architecture

Understand how ingestion, indexing, and search services cooperate across multiple nodes.

Search API

Query the full-text search endpoint with filters for author, language, and publication year.

How it works

The system is composed of three core microservices that form a data pipeline:

Ingest books

The ingestion service continuously fetches books from Project Gutenberg, stores them in a Hazelcast-replicated datalake, and notifies the indexers via ActiveMQ.
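The "store, then notify" ordering of this step can be sketched with JDK classes only. This is an illustration under stated assumptions, not the service's actual code: a `ConcurrentHashMap` stands in for the Hazelcast-replicated datalake, a `LinkedBlockingQueue` stands in for the ActiveMQ queue, and the names `IngestSketch`, `BookEvent`, and `ingest` are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

final class IngestSketch {
    /** Hypothetical event payload: only the book id travels through the queue. */
    record BookEvent(int bookId) {}

    // Assumption: stand-in for the Hazelcast-replicated datalake (book id -> raw text).
    final Map<Integer, String> datalake = new ConcurrentHashMap<>();
    // Assumption: stand-in for the ActiveMQ queue that the indexers consume.
    final BlockingQueue<BookEvent> events = new LinkedBlockingQueue<>();

    /**
     * Stores the book first and notifies second, so a consumer never
     * receives an event for text that is not yet in the datalake.
     * Returns false when the book was already stored.
     */
    boolean ingest(int bookId, String text) {
        // putIfAbsent makes a repeated download of the same book a no-op.
        if (datalake.putIfAbsent(bookId, text) != null) {
            return false;
        }
        events.add(new BookEvent(bookId));
        return true;
    }

    public static void main(String[] args) {
        IngestSketch sketch = new IngestSketch();
        sketch.ingest(1342, "It is a truth universally acknowledged");
        sketch.ingest(1342, "duplicate download");
        System.out.println(sketch.events.size() + " event(s) queued"); // prints "1 event(s) queued"
    }
}
```

Sending only the book id keeps the message small; the indexer would then read the full text from the shared datalake.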

Build the inverted index

Each indexing node consumes book events from the ActiveMQ queue, tokenizes the text, and writes term-frequency entries into a distributed Hazelcast IMap shared across all cluster members.
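As a rough illustration of the tokenize-and-write step, the sketch below builds term-frequency postings in a `ConcurrentMap`. It assumes a simple layout of term to (book id to term frequency) and a naive tokenizer; the real key and value types are not given in this overview. Hazelcast's `IMap` extends `java.util.concurrent.ConcurrentMap`, so the same method signature could accept an `IMap`, but a `ConcurrentHashMap` is used here so the example runs without a cluster. The names `IndexerSketch`, `termFrequencies`, and `indexBook` are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class IndexerSketch {
    /** Splits text on non-alphanumeric characters and counts each lowercase token. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Adds one book's postings to the shared index: term -> (book id -> term frequency). */
    static void indexBook(ConcurrentMap<String, Map<Integer, Integer>> index, int bookId, String text) {
        termFrequencies(text).forEach((term, count) ->
            index.compute(term, (t, postings) -> {
                // Copy instead of mutating in place: with a distributed map, a value
                // read from the map is a copy, so the new value must be written back.
                Map<Integer, Integer> updated = postings == null ? new HashMap<>() : new HashMap<>(postings);
                updated.put(bookId, count);
                return updated;
            }));
    }

    public static void main(String[] args) {
        ConcurrentMap<String, Map<Integer, Integer>> index = new ConcurrentHashMap<>();
        indexBook(index, 1, "The whale, the sea");
        indexBook(index, 2, "The sea at night");
        System.out.println("the -> " + index.get("the"));
        System.out.println("sea -> " + index.get("sea"));
    }
}
```

Because every write goes through `compute`, two indexers that process different books containing the same term do not overwrite each other's postings in this local version.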

Search at low latency

The search service performs parallel lookups across the in-memory inverted index, applies metadata filters, and returns ranked results. Nginx distributes traffic across multiple search instances.
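The lookup, filter, and rank sequence might look like the following sketch. Several details are assumptions made for illustration, since this overview does not specify them: a result must contain every query term, the score is the sum of term frequencies, and metadata lives in a separate map. `SearchSketch`, `BookMeta`, and `search` are hypothetical names, and plain `Map` instances stand in for the distributed index.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class SearchSketch {
    /** Hypothetical metadata record covering the three filters named in this page. */
    record BookMeta(String author, String language, int year) {}

    /**
     * Looks up all query terms in parallel, keeps books that contain every term
     * and pass the metadata filter, and orders them by summed term frequency.
     */
    static List<Integer> search(Map<String, Map<Integer, Integer>> index,
                                Map<Integer, BookMeta> metadata,
                                List<String> terms,
                                Predicate<BookMeta> filter) {
        // One index lookup per term, executed on the common fork-join pool.
        List<Map<Integer, Integer>> postings = terms.parallelStream()
            .map(term -> index.getOrDefault(term, Map.of()))
            .collect(Collectors.toList());
        if (postings.isEmpty()) {
            return List.of();
        }
        return postings.get(0).keySet().stream()
            .filter(id -> postings.stream().allMatch(p -> p.containsKey(id)))
            .filter(id -> metadata.containsKey(id) && filter.test(metadata.get(id)))
            .sorted(Comparator
                .comparingInt((Integer id) -> postings.stream().mapToInt(p -> p.get(id)).sum())
                .reversed()
                .thenComparing(Comparator.naturalOrder())) // stable order for equal scores
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> index = Map.of(
            "sea", Map.of(1, 1, 2, 3),
            "whale", Map.of(1, 2, 2, 1));
        Map<Integer, BookMeta> metadata = Map.of(
            1, new BookMeta("Author A", "en", 1851),
            2, new BookMeta("Author B", "en", 1920));
        List<Integer> hits = search(index, metadata, List.of("sea", "whale"),
            meta -> meta.language().equals("en"));
        System.out.println(hits); // prints [2, 1]: book 2 scores 4, book 1 scores 3
    }
}
```

Passing the filter as a `Predicate` keeps author, language, and year checks composable, for example `meta -> meta.language().equals("en") && meta.year() < 1900`.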

Key features

Distributed inverted index

Hazelcast shards and replicates the inverted index across the cluster with configurable sync and async backups.

Horizontal scalability

Add backend nodes at runtime. Each new node automatically joins the Hazelcast cluster and picks up work.

Fault tolerance

Data replication and automatic Hazelcast recovery keep the system operational after node failures.

Index rebuild

Trigger a coordinated cluster-wide rebuild that pauses ingestion, re-indexes all data, and resumes automatically.
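The pause, re-index, resume sequence can be sketched on a single JVM as follows. This is a simplified model, not the cluster-wide mechanism: an `AtomicBoolean` stands in for whatever shared flag the nodes coordinate on, the index maps each term to the set of book ids containing it, and `RebuildSketch` and its members are hypothetical names.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

final class RebuildSketch {
    // Assumption: stand-in for a cluster-wide flag that the ingestion loop checks before each fetch.
    final AtomicBoolean ingestionPaused = new AtomicBoolean(false);
    // Assumption: stand-ins for the datalake (book id -> text) and the index (term -> book ids).
    final Map<Integer, String> datalake = new ConcurrentHashMap<>();
    final Map<String, Set<Integer>> index = new ConcurrentHashMap<>();

    /** Returns false if ingestion is already paused, for example by a rebuild in progress. */
    boolean rebuild() {
        // compareAndSet lets only one caller win the right to rebuild.
        if (!ingestionPaused.compareAndSet(false, true)) {
            return false;
        }
        try {
            index.clear();
            datalake.forEach((bookId, text) -> {
                for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                    if (!term.isEmpty()) {
                        index.computeIfAbsent(term, t -> ConcurrentHashMap.newKeySet()).add(bookId);
                    }
                }
            });
            return true;
        } finally {
            // Resume ingestion even if re-indexing throws.
            ingestionPaused.set(false);
        }
    }

    public static void main(String[] args) {
        RebuildSketch sketch = new RebuildSketch();
        sketch.datalake.put(1, "The sea");
        sketch.datalake.put(2, "The whale");
        System.out.println("rebuilt: " + sketch.rebuild() + ", terms: " + sketch.index.keySet().size());
    }
}
```

Resuming in a `finally` block is the key design point: a failed rebuild should not leave ingestion paused indefinitely.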

Load-balanced search

Nginx least-connections routing distributes queries across search replicas with automatic failover.
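To make the routing rule concrete, here is a small model of least-connections selection with failover. It illustrates the idea only and is not how Nginx is implemented; in Nginx itself the policy is enabled with the `least_conn` directive in an `upstream` block. `LeastConnSketch`, `Backend`, and `pick` are hypothetical names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

final class LeastConnSketch {
    /** One search replica with its current number of in-flight requests. */
    static final class Backend {
        final String name;
        final AtomicInteger active = new AtomicInteger();
        volatile boolean healthy = true;

        Backend(String name) {
            this.name = name;
        }
    }

    /** Picks the healthy backend with the fewest active connections, if any is healthy. */
    static Optional<Backend> pick(List<Backend> backends) {
        return backends.stream()
            .filter(b -> b.healthy)
            .min(Comparator.comparingInt((Backend b) -> b.active.get()));
    }

    public static void main(String[] args) {
        Backend a = new Backend("search-1");
        Backend b = new Backend("search-2");
        a.active.set(3);
        b.active.set(1);
        System.out.println(pick(List.of(a, b)).map(x -> x.name).orElse("none")); // prints search-2

        b.healthy = false; // simulate a failed replica
        System.out.println(pick(List.of(a, b)).map(x -> x.name).orElse("none")); // prints search-1
    }
}
```

Compared with round-robin, this rule steers new queries away from replicas that are still busy with slow searches.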

Benchmarking

Built-in tools measure ingestion rate, indexing throughput, and cluster recovery time.
