
The ingestion service is the entry point of the GuancheData pipeline. It runs a scheduled loop every 100 ms that, on each tick:

1. pulls a book from Project Gutenberg,
2. partitions the book into a metadata header and a text body,
3. writes both parts to a timestamped directory on the local filesystem,
4. stores the BookContent object in the Hazelcast "datalake" IMap and replicates it to peer nodes, and
5. publishes a {"bookId": N} JSON message to the ActiveMQ queue documents.ingested.

An HTTP API lets operators trigger ingestion manually and inspect status.

Book download

Each book is fetched from:
https://www.gutenberg.org/cache/epub/{id}/pg{id}.txt
GutenbergBookContentSeparator splits the raw text on two literal markers: the header is everything before *** START OF THE PROJECT GUTENBERG EBOOK, and the body is everything between that marker's line and *** END OF THE PROJECT GUTENBERG EBOOK. Content outside the two markers is discarded. If either marker is missing or they appear out of order, the separator throws IllegalArgumentException and the book is recorded as an error.
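The split logic can be sketched as plain string handling. This is a minimal illustration, not the service's actual code; the `separate` method name and the returned two-element array are assumptions.

```java
// Sketch of the marker-based split; only the marker strings and the
// IllegalArgumentException behaviour come from the documentation.
class GutenbergSeparatorSketch {
    static final String START = "*** START OF THE PROJECT GUTENBERG EBOOK";
    static final String END   = "*** END OF THE PROJECT GUTENBERG EBOOK";

    /** Returns {header, body}; throws if markers are missing or out of order. */
    static String[] separate(String raw) {
        int start = raw.indexOf(START);
        int end = raw.indexOf(END);
        if (start < 0 || end < 0 || end < start) {
            throw new IllegalArgumentException("Gutenberg markers missing or out of order");
        }
        String header = raw.substring(0, start);       // header ends just before the START marker
        int bodyStart = raw.indexOf('\n', start) + 1;  // body begins after the START marker line
        String body = raw.substring(bodyStart, end);   // body ends just before the END marker
        return new String[] { header, body };
    }
}
```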

Filesystem storage

After splitting, BookStorageDate writes two files under a date-and-hour directory rooted at the configured datalake path:
datalake/
└── {yyyyMMdd}/
    └── {HH}/
        ├── {bookId}_header.txt
        └── {bookId}_body.txt
The directory name components use GMT. For example, a book ingested on 15 May 2026 at 14:37 UTC would land in:
datalake/20260515/14/84_header.txt
datalake/20260515/14/84_body.txt
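The directory scheme can be reproduced with `java.time` formatters pinned to UTC. The `buildPaths` helper below is hypothetical; only the path layout itself comes from the service.

```java
import java.nio.file.Path;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch of the {yyyyMMdd}/{HH} datalake layout, using GMT as the doc specifies.
class DatalakePathsSketch {
    static final DateTimeFormatter DAY  = DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);
    static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("HH").withZone(ZoneOffset.UTC);

    /** {root}/{yyyyMMdd}/{HH}/{bookId}_header.txt and {bookId}_body.txt. */
    static Path[] buildPaths(Path root, int bookId, Instant when) {
        Path dir = root.resolve(DAY.format(when)).resolve(HOUR.format(when));
        return new Path[] {
            dir.resolve(bookId + "_header.txt"),
            dir.resolve(bookId + "_body.txt")
        };
    }
}
```

Formatting with a zone other than UTC would silently shift books into the wrong hour bucket, which is why the formatters are pinned with `withZone(ZoneOffset.UTC)`.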

Hazelcast datalake and replication

The BookContent object (header + body strings) is stored in the Hazelcast "datalake" IMap under the integer book ID key. Immediately after the local write, datalake.replicate(bookId) pushes the entry onto the "booksToBeReplicated" distributed queue. Worker threads on peer nodes drain this queue and save the content locally, achieving cross-node replication. The number of additional copies made is controlled by the REPLICATION_FACTOR environment variable (default 1).
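The store-then-enqueue pattern can be illustrated with local JDK collections standing in for Hazelcast's distributed structures. This is only a sketch: the real service uses the distributed "datalake" IMap and the "booksToBeReplicated" distributed queue, and the method names here are invented.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Local stand-ins for the Hazelcast IMap and distributed queue.
class ReplicationSketch {
    final Map<Integer, String> datalake = new ConcurrentHashMap<>();
    final BlockingQueue<Integer> booksToBeReplicated = new LinkedBlockingQueue<>();

    // Ingesting node: store the entry locally, then enqueue its ID for peers.
    void storeAndReplicate(int bookId, String content) {
        datalake.put(bookId, content);
        booksToBeReplicated.offer(bookId);
    }

    // Peer worker: drain one queued ID and save a local copy.
    void drainOnce(Map<Integer, String> peerStore) {
        Integer bookId = booksToBeReplicated.poll();
        if (bookId != null) {
            peerStore.put(bookId, datalake.get(bookId));
        }
    }
}
```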

Ingestion loop and buffer factor

The BookIngestionPeriodicExecutor fires every 100 ms via a ScheduledExecutorService. Before pulling the next book ID from the ingestion queue, it checks whether the datalake is too full relative to current indexing capacity:
return currentSize < (bufferFactor * nodes);
bufferFactor is read from INDEXING_BUFFER_FACTOR (default 10). nodes is the live count of indexer instances visible in the Hazelcast cluster. If the condition is false, the tick is skipped entirely and no new book is fetched that cycle.
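The check itself is a one-line predicate; the `hasCapacity` name below is illustrative, but the comparison matches the expression shown above.

```java
// Sketch of the per-tick capacity check.
class IngestionGateSketch {
    /** True when the datalake can accept another book this tick. */
    static boolean hasCapacity(int currentSize, int bufferFactor, int indexerNodes) {
        return currentSize < (bufferFactor * indexerNodes);
    }
}
```

With the default buffer factor of 10 and two live indexers, ingestion proceeds while the datalake holds fewer than 20 entries and pauses at 20.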
The loop also skips a tick whenever IngestionPauseController.isPaused() returns true. Pause state is driven by the ActiveMQ topic ingestion.control: an INGESTION_PAUSE event sets the flag and an INGESTION_RESUME event clears it. The consumer uses a durable subscription, so control events are not lost if the service restarts. This mechanism lets the indexing service throttle the ingestion service when it falls behind.
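The pause flag can be modelled as an atomic boolean toggled by control events. This sketch omits the durable ActiveMQ topic subscriber that actually delivers the events; the `onControlEvent` method is an assumption.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the pause flag driven by ingestion.control events.
class IngestionPauseControllerSketch {
    private final AtomicBoolean paused = new AtomicBoolean(false);

    void onControlEvent(String event) {
        switch (event) {
            case "INGESTION_PAUSE"  -> paused.set(true);
            case "INGESTION_RESUME" -> paused.set(false);
            default -> { /* ignore unknown control events */ }
        }
    }

    boolean isPaused() {
        return paused.get();
    }
}
```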

ActiveMQ notification

After a successful ingest, ActiveMQBookIngestedNotifier publishes the following payload to the documents.ingested queue:
{"bookId": 84}
The indexing service subscribes to this queue in CLIENT_ACKNOWLEDGE mode, acknowledging each message only after it has been fully processed, so an unprocessed message is redelivered rather than lost.
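The payload is small enough to build without a JSON library. The helper below is hypothetical; the documentation only specifies the message shape, not how ActiveMQBookIngestedNotifier constructs it.

```java
// Sketch of the {"bookId": N} payload construction.
class IngestedPayloadSketch {
    static String payload(int bookId) {
        return "{\"bookId\": " + bookId + "}";
    }
}
```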

HTTP endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /ingest/{book_id} | Immediately downloads and ingests the specified book ID. Returns already_downloaded if the book was previously ingested. |
| GET | /ingest/status/{book_id} | Returns the download status of a single book. |
| GET | /ingest/list | Returns all downloaded book IDs from the distributed "log" ISet (cluster-wide). |
All responses are JSON. The service listens on port 7001.
