This guide walks you through cloning OpenKnowledgeStream, building all three Maven modules, and launching the pipeline so that Wikipedia Recent Changes are streamed through Kafka and indexed into OpenSearch in real time. The entire process takes less than five minutes once the prerequisites are in place.
Confirm prerequisites
Make sure the following services and tools are installed and running before you proceed.
- Java 21+
- Apache Kafka must be running with a broker accessible at localhost:9092. If you are using a local Kafka installation, start ZooKeeper and the broker, then create the topic that the pipeline uses (recent_change_stream).
- OpenSearch must be running with a node accessible at localhost:9200.
Clone the repository
Clone the OpenKnowledgeStream source from GitHub. The repository root contains the parent pom.xml and the three module directories (one each for wiki-common, opensearch-wiki-indexer, and wiki-change-stream).
Build the project
Build all modules from the repository root using the included Maven Wrapper (or your local mvn installation). Maven resolves the inter-module dependencies in the correct order (wiki-common → opensearch-wiki-indexer → wiki-change-stream) and produces a fat JAR for each executable module under its target/ directory. A successful build ends with Maven reporting BUILD SUCCESS.
Start the application
Launch the wiki-change-stream fat JAR. Because wiki-change-stream declares opensearch-wiki-indexer as a compile-scope dependency, the single JAR contains both the Kafka producer and the Kafka consumer + OpenSearch indexer, so you only need to start one process.
On startup, Spring Boot will:
- Wire the WikipediaClient with a WebClient pointed at the Wikipedia Recent Changes API.
- Register the OpenStream scheduled task (fires every 5 seconds) to poll Wikipedia and publish Change events to the recent_change_stream Kafka topic.
- Register the KafkaConsume scheduled task (also fires every 5 seconds) to poll the same topic and forward each Change to OpensearchIndexer for indexing into the wiki-changes index.
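The data flow set up by these startup steps can be pictured with a small in-memory model. This is only an illustrative sketch in Python, not the project's Java/Spring code: the deque stands in for the recent_change_stream topic, the dict stands in for the wiki-changes index, and the `comment` field on `Change` is a hypothetical placeholder.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Change:
    title: str    # page title; the guide says this is used as the document _id
    comment: str  # hypothetical field, for illustration only


topic = deque()  # stand-in for the recent_change_stream Kafka topic
index = {}       # stand-in for the wiki-changes index, keyed by _id


def open_stream_tick(fetched):
    """Producer tick: publish every fetched change to the topic."""
    for change in fetched:
        topic.append(change)


def kafka_consume_tick():
    """Consumer tick: drain the topic and index each change by page title."""
    while topic:
        change = topic.popleft()
        index[change.title] = change  # same title overwrites, i.e. an upsert


open_stream_tick([Change("Kafka", "first edit"), Change("Java", "typo fix")])
kafka_consume_tick()
open_stream_tick([Change("Kafka", "second edit")])
kafka_consume_tick()
print(len(index), index["Kafka"].comment)  # prints: 2 second edit
```

In the real application both ticks run on their own 5-second schedules inside one process; the model calls them in sequence purely to keep the example deterministic.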
The application polls Wikipedia every 5 seconds and publishes up to 100 changes per poll. If Wikipedia returns HTTP 429 (rate limit), the producer logs a warning and automatically sleeps for 5 seconds before polling resumes on the next scheduled tick; no manual intervention is needed.
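The polling and rate-limit behaviour described here can be sketched as a single poll function. This is a simplified Python model rather than the project's implementation: `fetch` and `publish` are hypothetical callables, and the guide does not say whether the 100-change cap is applied in the API request or in code, so the sketch assumes a local truncation.

```python
import time

MAX_CHANGES_PER_POLL = 100  # cap stated in the guide
BACKOFF_SECONDS = 5         # sleep after HTTP 429, as stated in the guide


def poll_once(fetch, publish, sleep=time.sleep):
    """Run one poll and return how many changes were published.

    `fetch` is a hypothetical callable returning (status_code, changes).
    `publish` is a hypothetical callable that sends one change to Kafka.
    """
    status, changes = fetch()
    if status == 429:
        print("warning: rate limited by Wikipedia, backing off")
        sleep(BACKOFF_SECONDS)
        return 0
    batch = changes[:MAX_CHANGES_PER_POLL]  # assumed local cap
    for change in batch:
        publish(change)
    return len(batch)


# Demo with fakes: one normal poll, then one rate-limited poll.
published, naps = [], []
first = poll_once(lambda: (200, list(range(150))), published.append, naps.append)
second = poll_once(lambda: (429, []), published.append, naps.append)
print(first, second, naps)  # prints: 100 0 [5]
```

Passing `sleep` as a parameter keeps the demo instant; a real caller would leave the default `time.sleep` in place.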
Verify documents in OpenSearch
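Query the wiki-changes index to confirm that Wikipedia change documents are being indexed. A successful response returns the indexed change documents in its hits array.
Each document’s _id is the Wikipedia page title, so repeated edits to the same article upsert the existing document rather than creating duplicates. To check how many unique pages have been indexed, request the index's document count (for example, with the _count API); because each page maps to exactly one document, that count equals the number of unique pages.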
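Both checks can be scripted against the OpenSearch REST API. The sketch below is one possible approach using only the Python standard library; it assumes the node at localhost:9200 accepts plain HTTP without authentication, which may not hold if the security plugin is enabled. The helper names are illustrative and not part of OpenKnowledgeStream.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # OpenSearch node from the prerequisites
INDEX = "wiki-changes"              # index name used by the pipeline


def search_url(size=5):
    """URL that returns up to `size` documents from the index."""
    return f"{BASE_URL}/{INDEX}/_search?size={size}"


def count_url():
    """URL that returns the number of documents in the index."""
    return f"{BASE_URL}/{INDEX}/_count"


def get_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def hit_ids(search_response):
    """Return the _id (the page title) of every hit in a _search response."""
    return [hit["_id"] for hit in search_response["hits"]["hits"]]


if __name__ == "__main__":
    try:
        print("sample page titles:", hit_ids(get_json(search_url())))
        print("unique pages indexed:", get_json(count_url())["count"])
    except OSError as exc:  # URLError and HTTPError are OSError subclasses
        print(f"OpenSearch not reachable: {exc}")
```

Running the script a few seconds apart should show the count growing while the pipeline is active.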