The ingestion service is the entry point of the GuancheData pipeline. It runs a scheduled loop every 100 ms that pulls books from Project Gutenberg, partitions each book into a metadata header and a text body, writes both parts to a timestamped directory on the local filesystem, stores the BookContent object in the Hazelcast "datalake" IMap, replicates it to peer nodes, and finally publishes a {"bookId": N} JSON message to the ActiveMQ queue documents.ingested. An HTTP API lets operators trigger ingestion manually and inspect status.
Book download
Each book is fetched from Project Gutenberg. GutenbergBookContentSeparator splits the raw text at the literal markers *** START OF THE PROJECT GUTENBERG EBOOK (the header ends just before it) and *** END OF THE PROJECT GUTENBERG EBOOK (the body ends just before it). Content outside these two markers is discarded. If either marker is missing or out of order, the separator throws IllegalArgumentException and the book is recorded as an error.
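A minimal sketch of that marker-based split. The class and method names here are illustrative, and the exact handling of the text between the START marker and the end of its line is an assumption; only the two marker strings and the IllegalArgumentException behavior come from the description above.

```java
class GutenbergSplitSketch {
    static final String START = "*** START OF THE PROJECT GUTENBERG EBOOK";
    static final String END = "*** END OF THE PROJECT GUTENBERG EBOOK";

    /** Returns {header, body}; throws if the markers are missing or out of order. */
    static String[] split(String raw) {
        int start = raw.indexOf(START);
        int end = raw.indexOf(END);
        if (start < 0 || end < 0 || end < start) {
            throw new IllegalArgumentException("Gutenberg markers missing or out of order");
        }
        String header = raw.substring(0, start);      // everything before the START marker
        // Assumed: body begins on the line after the START marker line.
        int bodyFrom = raw.indexOf('\n', start);
        bodyFrom = (bodyFrom < 0) ? start + START.length() : bodyFrom + 1;
        String body = raw.substring(bodyFrom, end);   // everything before the END marker
        return new String[] { header, body };
    }
}
```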
Filesystem storage
After splitting, BookStorageDate writes two files (one for the metadata header, one for the text body) under a date-and-hour directory rooted at the configured datalake path.
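A sketch of that layout. The directory pattern (ISO date, then zero-padded hour) and the file names are assumptions; the docs only specify two files under a date-and-hour directory beneath the datalake root.

```java
import java.io.IOException;
import java.nio.file.*;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class BookStorageSketch {
    /** Writes <root>/<yyyy-MM-dd>/<HH>/<id>.header.txt and <id>.body.txt (assumed names). */
    static Path store(Path datalakeRoot, int bookId,
                      String header, String body) throws IOException {
        LocalDateTime now = LocalDateTime.now();
        Path dir = datalakeRoot
                .resolve(now.format(DateTimeFormatter.ISO_LOCAL_DATE)) // e.g. 2024-05-17
                .resolve(String.format("%02d", now.getHour()));        // e.g. 13
        Files.createDirectories(dir);
        Files.writeString(dir.resolve(bookId + ".header.txt"), header);
        Files.writeString(dir.resolve(bookId + ".body.txt"), body);
        return dir;
    }
}
```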
Hazelcast datalake and replication
The BookContent object (header + body strings) is stored in the Hazelcast "datalake" IMap under the integer book ID key. Immediately after the local write, datalake.replicate(bookId) pushes the entry onto the "booksToBeReplicated" distributed queue. Worker threads on peer nodes drain this queue and save the content locally, achieving cross-node replication. The number of additional copies made is controlled by the REPLICATION_FACTOR environment variable (default 1).
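The store-then-replicate flow, sketched with JDK stand-ins so it runs without a cluster: a ConcurrentHashMap plays the "datalake" IMap and a LinkedBlockingQueue plays the "booksToBeReplicated" distributed queue. In a real node these would come from hazelcastInstance.getMap("datalake") and hazelcastInstance.getQueue("booksToBeReplicated"); the BookContent field layout is an assumption.

```java
import java.util.Map;
import java.util.concurrent.*;

class ReplicationSketch {
    record BookContent(String header, String body) {}  // assumed shape: the two split strings

    final Map<Integer, BookContent> datalake = new ConcurrentHashMap<>();
    final BlockingQueue<Integer> booksToBeReplicated = new LinkedBlockingQueue<>();

    void save(int bookId, BookContent content) {
        datalake.put(bookId, content);  // local write into the "datalake" map
        replicate(bookId);              // then enqueue for peer replication
    }

    void replicate(int bookId) {
        // Worker threads on peer nodes drain this queue and save the entry locally.
        booksToBeReplicated.offer(bookId);
    }
}
```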
Ingestion loop and buffer factor
The BookIngestionPeriodicExecutor fires every 100 ms via a ScheduledExecutorService. Before pulling the next book ID from the ingestion queue, it checks whether the datalake is too full relative to current indexing capacity. bufferFactor is read from INDEXING_BUFFER_FACTOR (default 10); nodes is the live count of indexer instances visible in the Hazelcast cluster. If the check fails (the datalake is too full), the tick is skipped entirely and no new book is fetched that cycle.
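The per-tick decision can be factored into a pure function so it is testable without a scheduler. The fullness condition shown (datalake size below bufferFactor x nodes) is an assumption; the docs only say a tick is skipped when the datalake is too full relative to indexing capacity, or while ingestion is paused.

```java
class TickGate {
    // The real loop would run each tick via
    // scheduledExecutorService.scheduleAtFixedRate(tick, 0, 100, TimeUnit.MILLISECONDS).
    static boolean shouldIngest(int datalakeSize, int indexerNodes,
                                int bufferFactor, boolean paused) {
        if (paused) return false;  // mirrors IngestionPauseController.isPaused()
        // ASSUMED condition: room for bufferFactor books per live indexer node.
        return datalakeSize < (long) bufferFactor * indexerNodes;
    }
}
```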
The loop also skips a tick whenever IngestionPauseController.isPaused() returns true. Pause state is driven by the ActiveMQ topic ingestion.control: an INGESTION_PAUSE event sets the flag and INGESTION_RESUME clears it. The consumer uses a durable subscription so control events are not lost if the service restarts. This mechanism lets the indexing service throttle the ingestion service when it falls behind.
ActiveMQ notification
After a successful ingest, ActiveMQBookIngestedNotifier publishes the {"bookId": N} JSON payload to the documents.ingested queue. The consuming service acknowledges with CLIENT_ACKNOWLEDGE and processes each message exactly once.
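A sketch of the payload construction. The JSON shape comes from the docs; the actual send is ordinary JMS (create a producer for session.createQueue("documents.ingested"), then send a TextMessage), omitted here so the sketch stays self-contained and broker-free.

```java
class NotifierSketch {
    /** Builds the documents.ingested payload for one ingested book. */
    static String payload(int bookId) {
        return "{\"bookId\": " + bookId + "}";
    }
}
```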
HTTP endpoints
| Method | Path | Description |
|---|---|---|
| POST | /ingest/{book_id} | Immediately downloads and ingests the specified book ID. Returns already_downloaded if the book was previously ingested. |
| GET | /ingest/status/{book_id} | Returns the download status of a single book. |
| GET | /ingest/list | Returns all downloaded book IDs from the distributed "log" ISet (cluster-wide). |
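The endpoint behavior in the table, reduced to a hypothetical routing function with an in-memory set standing in for the distributed "log" ISet. Only "already_downloaded" is a documented response string; the others are illustrative placeholders.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class IngestApiSketch {
    final Set<Integer> downloaded = ConcurrentHashMap.newKeySet(); // stand-in for the "log" ISet

    String handle(String method, String path) {
        if (method.equals("GET") && path.equals("/ingest/list")) {
            return downloaded.toString();                // all downloaded book IDs
        }
        if (method.equals("GET") && path.startsWith("/ingest/status/")) {
            int id = Integer.parseInt(path.substring("/ingest/status/".length()));
            return downloaded.contains(id) ? "downloaded" : "not_downloaded"; // assumed strings
        }
        if (method.equals("POST") && path.startsWith("/ingest/")) {
            int id = Integer.parseInt(path.substring("/ingest/".length()));
            if (!downloaded.add(id)) return "already_downloaded";
            // A real handler would download and ingest the book here.
            return "ingested";                           // assumed success string
        }
        return "not_found";
    }
}
```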