OpenKnowledgeStream is a Spring Boot multi-module Maven project that continuously polls the Wikipedia Recent Changes API, publishes each change event to a Kafka topic (recent_change_stream), and consumes that topic to index documents into OpenSearch. This guide walks you through getting the entire pipeline running on your local machine in minutes.

Prerequisites

Make sure the following are available before you start:
  • Java 21 or later — the root pom.xml sets java.version=21.
  • Apache Kafka on localhost:9092 — hardcoded as bootstrap.servers in both KafkaPublish (producer) and KafkaConsume (consumer).
  • OpenSearch on localhost:9200 — hardcoded in OpensearchConfig via HttpHost("localhost", 9200).
  • Maven 3.8+ — used to build and package all modules.
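Before building, it can help to confirm that the two network prerequisites are listening. The following is a minimal sketch using only the JDK; the class name PrereqCheck and the one-second timeout are illustrative assumptions and not part of the project. It only tests that a TCP connection can be opened, not that the service behind the port is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of OpenKnowledgeStream.
final class PrereqCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host not resolvable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Ports taken from the prerequisites above.
        System.out.println("Kafka      localhost:9092 reachable: " + isReachable("localhost", 9092, 1000));
        System.out.println("OpenSearch localhost:9200 reachable: " + isReachable("localhost", 9200, 1000));
    }
}
```

With Java 11 or later this can be run directly as a single source file, for example java PrereqCheck.java.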

Steps

1

Clone the repository

git clone https://github.com/amitsaxena098/OpenKnowledgeStream.git && cd OpenKnowledgeStream
2

Build all modules

mvn clean install -DskipTests
Maven resolves the multi-module build order automatically. It compiles and installs:
  1. wiki-common — shared Change, Query, and RecentChanges model classes used by both other modules.
  2. opensearch-wiki-indexer — the Kafka consumer and OpenSearch indexer.
  3. wiki-change-stream — the Wikipedia poller and Kafka producer; this module depends on both of the above, and its executable fat JAR bundles the entire application.
The -DskipTests flag speeds up the initial build. Drop it when you want the full test suite to run.
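The build order listed above follows from the module dependencies: a module is built only after everything it depends on. The sketch below illustrates that idea with a depth-first traversal. It is not Maven's actual reactor implementation, and the dependency map is an assumption transcribed from the descriptions in this guide rather than read from the POM files.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: shows dependency-first ordering, not Maven internals.
final class BuildOrderSketch {

    /** Module -> modules it depends on (assumed from this guide's module descriptions). */
    static Map<String, List<String>> projectModules() {
        Map<String, List<String>> dependsOn = new LinkedHashMap<>();
        dependsOn.put("wiki-change-stream", List.of("wiki-common", "opensearch-wiki-indexer"));
        dependsOn.put("opensearch-wiki-indexer", List.of("wiki-common"));
        dependsOn.put("wiki-common", List.of());
        return dependsOn;
    }

    /** Returns the modules ordered so that every module appears after its dependencies. */
    static List<String> buildOrder(Map<String, List<String>> dependsOn) {
        List<String> order = new ArrayList<>();
        Set<String> built = new HashSet<>();
        Set<String> visiting = new HashSet<>();
        for (String module : dependsOn.keySet()) {
            visit(module, dependsOn, built, visiting, order);
        }
        return order;
    }

    private static void visit(String module, Map<String, List<String>> dependsOn,
                              Set<String> built, Set<String> visiting, List<String> order) {
        if (built.contains(module)) {
            return;
        }
        if (!visiting.add(module)) {
            throw new IllegalStateException("Dependency cycle at " + module);
        }
        // Build all dependencies first, then the module itself.
        for (String dependency : dependsOn.getOrDefault(module, List.of())) {
            visit(dependency, dependsOn, built, visiting, order);
        }
        visiting.remove(module);
        built.add(module);
        order.add(module);
    }

    public static void main(String[] args) {
        // prints [wiki-common, opensearch-wiki-indexer, wiki-change-stream]
        System.out.println(buildOrder(projectModules()));
    }
}
```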
3

Start Kafka (if not already running)

If you don’t have Kafka running locally, the quickest path is Docker:
docker run -d -p 9092:9092 apache/kafka:latest
This exposes the broker on localhost:9092, matching the hardcoded bootstrap.servers value in KafkaPublish and KafkaConsume.
4

Start OpenSearch (if not already running)

Spin up a single-node OpenSearch instance with Docker:
docker run -d -p 9200:9200 -p 9600:9600 \
  -e "discovery.type=single-node" \
  opensearchproject/opensearch:latest
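Port 9200 is the REST API port used by OpensearchConfig, and 9600 is the Performance Analyzer port. Setting discovery.type=single-node runs OpenSearch as a one-node cluster and skips the multi-node bootstrap checks, which suits local use. Note that recent OpenSearch images (2.12 and later) enable the security plugin by default and expect an initial admin password at startup; for a local plain-HTTP setup like this one, you will likely also need to pass -e "DISABLE_SECURITY_PLUGIN=true".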
5

Run the application

java -jar wiki-change-stream/target/wiki-change-stream-0.0.1-SNAPSHOT.jar
The wiki-change-stream fat JAR is the single entry point for the whole pipeline. Its @ComponentScan annotation covers five base packages (com.as, WikiIndexer, WikiIndexer.models, Wikicommon, and WikiChangeStream), so both the Kafka producer (KafkaPublish) and the Kafka consumer/indexer (KafkaConsume, OpensearchIndexer) start together inside one Spring Boot application context.
Two scheduled methods drive the pipeline. @Scheduled(fixedRate = 5000) on OpenStream.stream() polls the Wikipedia Recent Changes API every 5 seconds and publishes new Change objects to the recent_change_stream topic. A separate @Scheduled(fixedRate = 5000) on KafkaConsume.consume() polls Kafka every 5 seconds and forwards each record to OpensearchIndexer.index().
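To picture how two fixed-rate tasks share one process, here is a plain-JDK analogue using ScheduledExecutorService. It is a sketch of the fixed-rate idea only, not the project's Spring code: the class name, the counters standing in for the poll-and-publish and consume-and-index work, and the two-thread pool are all assumptions. How the project's own scheduler is configured is not stated in this guide.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Plain-JDK analogue of two @Scheduled(fixedRate = ...) methods; illustrative only.
final class FixedRateSketch {

    /**
     * Runs two independent fixed-rate tasks for durationMillis and
     * returns {producerTicks, consumerTicks}.
     */
    static int[] runFor(long periodMillis, long durationMillis) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
        AtomicInteger producerTicks = new AtomicInteger(); // stands in for OpenStream.stream()
        AtomicInteger consumerTicks = new AtomicInteger(); // stands in for KafkaConsume.consume()

        // Each task fires every periodMillis, measured from its start time.
        scheduler.scheduleAtFixedRate(producerTicks::incrementAndGet, 0, periodMillis, TimeUnit.MILLISECONDS);
        scheduler.scheduleAtFixedRate(consumerTicks::incrementAndGet, 0, periodMillis, TimeUnit.MILLISECONDS);

        try {
            Thread.sleep(durationMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return new int[] { producerTicks.get(), consumerTicks.get() };
    }

    public static void main(String[] args) {
        // A short period keeps the demo quick; the application itself uses 5000 ms.
        int[] ticks = runFor(100, 1000);
        System.out.println("producer ticks: " + ticks[0] + ", consumer ticks: " + ticks[1]);
    }
}
```

Because both tasks run on the same period but independently, a change published in one cycle may not be indexed until the consumer's next cycle, so a delay of a few seconds between the two log lines is expected.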
6

Verify the pipeline is working

After a few seconds, Wikipedia changes should start flowing into the wiki-changes index. Use this command to check the document count:
curl -s http://localhost:9200/wiki-changes/_count | jq .
A healthy response looks like:
{
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "count": 42
}
Re-run the command every few seconds and watch count increase as new Wikipedia edits are indexed.
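If jq or curl is not at hand, the same check can be sketched in Java with the JDK's built-in HTTP client. The class name WikiChangesCount and the regex-based extraction of the count field are illustrative assumptions; a real client would normally use a JSON library. The URL is the one from the curl command above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenKnowledgeStream.
final class WikiChangesCount {

    // Matches the numeric "count" field of a _count response body.
    private static final Pattern COUNT = Pattern.compile("\"count\"\\s*:\\s*(\\d+)");

    /** Extracts the document count from a _count response, if present. */
    static OptionalLong parseCount(String json) {
        Matcher matcher = COUNT.matcher(json);
        return matcher.find()
                ? OptionalLong.of(Long.parseLong(matcher.group(1)))
                : OptionalLong.empty();
    }

    public static void main(String[] args) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:9200/wiki-changes/_count"))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            OptionalLong count = parseCount(response.body());
            System.out.println(count.isPresent()
                    ? "wiki-changes count: " + count.getAsLong()
                    : "No count in response: " + response.body());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (Exception e) {
            System.out.println("Could not reach OpenSearch on localhost:9200: " + e.getMessage());
        }
    }
}
```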
The application logs are your best real-time diagnostic tool. You will see two recurring patterns in the output:
  • Change published with title: <PageTitle> — emitted by KafkaPublish.publish() each time a change is successfully delivered to the Kafka topic.
  • Indexed Title: <PageTitle> — emitted by OpensearchIndexer.index() each time a document is written to OpenSearch.
If you see the first log line but not the second, the problem lies downstream of the producer: check the KafkaConsume consumer and confirm that OpenSearch is reachable on localhost:9200. If you see neither, check that Kafka is reachable on localhost:9092 and that the Wikipedia API is accessible from your network.
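The triage rule above can be written down as a small function over captured log lines. This is a sketch under stated assumptions: the class and enum names are hypothetical, and it matches on the two message fragments quoted in this guide with contains, since real log lines typically carry a timestamp and logger prefix.

```java
import java.util.List;

// Hypothetical triage helper, not part of OpenKnowledgeStream.
final class PipelineLogCheck {

    enum Diagnosis {
        HEALTHY,                 // both log patterns seen
        CONSUMER_OR_OPENSEARCH,  // published but never indexed
        PRODUCER_OR_KAFKA        // nothing published (Kafka or Wikipedia API unreachable)
    }

    /** Classifies a captured slice of application log output. */
    static Diagnosis diagnose(List<String> logLines) {
        boolean published = false;
        boolean indexed = false;
        for (String line : logLines) {
            if (line.contains("Change published with title:")) {
                published = true;
            }
            if (line.contains("Indexed Title:")) {
                indexed = true;
            }
        }
        if (!published) {
            return Diagnosis.PRODUCER_OR_KAFKA;
        }
        return indexed ? Diagnosis.HEALTHY : Diagnosis.CONSUMER_OR_OPENSEARCH;
    }

    public static void main(String[] args) {
        // Sample input: a publish line with no matching index line.
        List<String> sample = List.of("Change published with title: Example page");
        System.out.println(diagnose(sample)); // prints CONSUMER_OR_OPENSEARCH
    }
}
```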
