OpenKnowledgeStream is a Spring Boot multi-module Maven project that continuously polls the Wikipedia Recent Changes API, publishes each change event to a Kafka topic (recent_change_stream), and consumes that topic to index documents into OpenSearch. This guide walks you through getting the entire pipeline running on your local machine in minutes.
Prerequisites
Make sure the following are available before you start:
- Java 21 or later: the root pom.xml sets java.version=21.
- Apache Kafka on localhost:9092: hardcoded as bootstrap.servers in both KafkaPublish (producer) and KafkaConsume (consumer).
- OpenSearch on localhost:9200: hardcoded in OpensearchConfig via HttpHost("localhost", 9200).
- Maven 3.8+: used to build and package all modules.
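Before moving on, it can help to confirm that both services accept connections on the hardcoded ports. The following is a minimal sketch and not part of the project: the class name PreflightCheck and its methods are hypothetical, and it only checks that a TCP connection can be opened, not that Kafka or OpenSearch are healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of OpenKnowledgeStream.
class PreflightCheck {

    /** Returns true if a TCP connection to host:port opens within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host not resolvable.
            return false;
        }
    }

    /** Formats a one-line status message for a service. */
    static String describe(String name, String host, int port, boolean reachable) {
        return name + " at " + host + ":" + port
                + (reachable ? " is reachable" : " is NOT reachable");
    }

    public static void main(String[] args) {
        // Ports taken from the prerequisites: Kafka on 9092, OpenSearch on 9200.
        boolean kafkaUp = isReachable("localhost", 9092, 1000);
        boolean openSearchUp = isReachable("localhost", 9200, 1000);
        System.out.println(describe("Kafka", "localhost", 9092, kafkaUp));
        System.out.println(describe("OpenSearch", "localhost", 9200, openSearchUp));
    }
}
```

If either line reports NOT reachable, start the corresponding service before running the application.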
Steps
Build all modules
- wiki-common: shared Change, Query, and RecentChanges model classes used by both other modules.
- opensearch-wiki-indexer: the Kafka consumer and OpenSearch indexer.
- wiki-change-stream: the Wikipedia poller and Kafka producer; this module depends on both of the above, and its executable fat JAR bundles the entire application.
The -DskipTests flag speeds up the initial build. Drop it when you want the full test suite to run.
Start Kafka (if not already running)
If you don’t have Kafka running locally, the quickest path is Docker. The container needs to expose the broker on localhost:9092 so that it matches the hardcoded bootstrap.servers value in KafkaPublish and KafkaConsume.
Start OpenSearch (if not already running)
Spin up a single-node OpenSearch instance with Docker. Port 9200 is the REST API port used by OpensearchConfig, and 9600 is the performance analyzer port. The discovery.type=single-node environment variable disables cluster bootstrapping for local use.
Run the application
The wiki-change-stream fat JAR is the single entry point for the whole pipeline. Its @ComponentScan annotation covers five Spring component namespaces: com.as, WikiIndexer, WikiIndexer.models, Wikicommon, and WikiChangeStream. As a result, both the Kafka producer (KafkaPublish) and the Kafka consumer/indexer (KafkaConsume, OpensearchIndexer) start together inside one Spring Boot application context.
OpenStream.stream() is annotated with @Scheduled(fixedRate = 5000), so it polls the Wikipedia Recent Changes API every 5 seconds and publishes new Change objects to the recent_change_stream topic. KafkaConsume.consume() carries its own @Scheduled(fixedRate = 5000) annotation; it polls Kafka and forwards each record to OpensearchIndexer.index().
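The two fixed-rate tasks described above can be pictured with plain JDK classes. This is an illustrative sketch only, not the project's code: PipelineSketch, its Change record, and its methods are hypothetical; an in-memory queue stands in for the recent_change_stream topic, a map stands in for the OpenSearch index, and a ScheduledExecutorService stands in for Spring's @Scheduled support.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical model of the pipeline; none of these names come from the project.
class PipelineSketch {

    // Simplified stand-in for the shared Change model in wiki-common.
    record Change(long id, String title) {}

    // Stand-in for the recent_change_stream Kafka topic.
    private final BlockingQueue<Change> topic = new LinkedBlockingQueue<>();
    // Stand-in for the OpenSearch index, keyed by change id.
    private final Map<Long, Change> index = new ConcurrentHashMap<>();
    // Stand-in for the Wikipedia Recent Changes API call.
    private final Supplier<List<Change>> source;

    PipelineSketch(Supplier<List<Change>> source) {
        this.source = source;
    }

    /** Producer tick: fetch changes and publish each one to the topic. */
    int publishOnce() {
        List<Change> changes = source.get();
        topic.addAll(changes);
        return changes.size();
    }

    /** Consumer tick: drain pending records and index each one. */
    int consumeOnce() {
        List<Change> batch = new ArrayList<>();
        topic.drainTo(batch);
        for (Change change : batch) {
            index.put(change.id(), change); // same id overwrites, so re-indexing is idempotent
        }
        return batch.size();
    }

    int indexedCount() {
        return index.size();
    }

    /** Runs both ticks at a fixed rate, mirroring the two fixedRate = 5000 tasks. */
    ScheduledExecutorService start(long periodMillis) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
        scheduler.scheduleAtFixedRate(this::publishOnce, 0, periodMillis, TimeUnit.MILLISECONDS);
        scheduler.scheduleAtFixedRate(this::consumeOnce, 0, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        PipelineSketch pipeline = new PipelineSketch(
                () -> List.of(new Change(1L, "Example page")));
        ScheduledExecutorService scheduler = pipeline.start(5000);
        Thread.sleep(500); // let the first ticks run
        scheduler.shutdownNow();
        System.out.println("Indexed so far: " + pipeline.indexedCount());
    }
}
```

Because the producer and consumer run on independent timers, a record published in one tick may not be indexed until the consumer's next tick, so a delay of up to one polling interval between publish and index is expected in a design like this.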