The default codebase hard-codes localhost:9092 for Kafka and localhost:9200 for OpenSearch directly in KafkaPublish, KafkaConsume, and OpensearchConfig. Before deploying to any environment beyond your laptop you must replace those values, rebuild the project, and choose a process supervision strategy. This guide covers the end-to-end path from local development to a production-ready deployment.

Pre-deployment checklist

1. Update the Kafka broker address in KafkaPublish

Open wiki-change-stream/src/main/java/WikiChangeStream/publish/KafkaPublish.java and replace the hardcoded value:
KafkaPublish.java
// Before
properties.put("bootstrap.servers", "localhost:9092");

// After
properties.put("bootstrap.servers", "broker1.prod.example.com:9092,broker2.prod.example.com:9092");
For a multi-broker cluster, provide a comma-separated list of host:port pairs. Not every broker needs to be listed — just enough for initial discovery.
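Because a malformed broker list only surfaces when the client first tries to connect, it can help to validate the string at startup. The following is a minimal sketch, assuming a hypothetical helper class named BootstrapServers that is not part of the project; it checks only the host:port shape and does not contact any broker.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not in the codebase): validates a bootstrap.servers string. */
public final class BootstrapServers {

    private BootstrapServers() {}

    /** Splits a comma-separated host:port list and rejects malformed entries. */
    public static List<String> parse(String value) {
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException("bootstrap.servers must not be empty");
        }
        List<String> servers = new ArrayList<>();
        for (String entry : value.split(",")) {
            String trimmed = entry.trim();
            int colon = trimmed.lastIndexOf(':');
            // Require a non-empty host before the colon and a non-empty port after it.
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException("Expected host:port but got: " + trimmed);
            }
            int port;
            try {
                port = Integer.parseInt(trimmed.substring(colon + 1));
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("Port is not a number in: " + trimmed, e);
            }
            if (port < 1 || port > 65535) {
                throw new IllegalArgumentException("Port out of range in: " + trimmed);
            }
            servers.add(trimmed);
        }
        return servers;
    }
}
```

One way to use it would be to call String.join(",", BootstrapServers.parse(value)) and pass the result to properties.put("bootstrap.servers", ...).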

2. Update the Kafka broker address in KafkaConsume

Make the same change in opensearch-wiki-indexer/src/main/java/WikiIndexer/consumer/KafkaConsume.java:
KafkaConsume.java
// Before
properties.put("bootstrap.servers", "localhost:9092");

// After
properties.put("bootstrap.servers", "broker1.prod.example.com:9092,broker2.prod.example.com:9092");

3. Update the OpenSearch host in OpensearchConfig

Edit opensearch-wiki-indexer/src/main/java/WikiIndexer/config/OpensearchConfig.java to point the RestClient at your production cluster:
OpensearchConfig.java
// Before
RestClient restClient =
    RestClient.builder(
        new HttpHost("localhost", 9200)
    ).build();

// After
RestClient restClient =
    RestClient.builder(
        new HttpHost("opensearch.prod.example.com", 443, "https")
    ).build();
Use "https" as the scheme and port 443 (or whichever your cluster exposes) when TLS is enabled.
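If you would rather carry the endpoint as a single URL than as three separate values, the host, port, and scheme can be derived from it. This is a sketch only: OpenSearchEndpoint is a hypothetical class, and the fallback ports (443 for https, 9200 for http) are assumptions that mirror the values used in this guide.

```java
import java.net.URI;

/** Hypothetical helper: splits one endpoint URL into the three HttpHost arguments. */
public final class OpenSearchEndpoint {
    public final String host;
    public final int port;
    public final String scheme;

    private OpenSearchEndpoint(String host, int port, String scheme) {
        this.host = host;
        this.port = port;
        this.scheme = scheme;
    }

    public static OpenSearchEndpoint parse(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on invalid syntax
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            throw new IllegalArgumentException("Expected scheme://host[:port] but got: " + url);
        }
        if (!scheme.equals("http") && !scheme.equals("https")) {
            throw new IllegalArgumentException("Unsupported scheme: " + scheme);
        }
        int port = uri.getPort(); // -1 when the URL carries no explicit port
        if (port == -1) {
            // Assumed defaults: 443 for TLS endpoints, 9200 for plain HTTP.
            port = scheme.equals("https") ? 443 : 9200;
        }
        return new OpenSearchEndpoint(host, port, scheme);
    }
}
```

The parsed fields map directly onto the constructor used above: new HttpHost(endpoint.host, endpoint.port, endpoint.scheme).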

4. Secure OpenSearch with authentication

Production OpenSearch clusters typically require authentication. Add a BasicCredentialsProvider to the RestClient builder in OpensearchConfig:
OpensearchConfig.java
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.impl.client.BasicCredentialsProvider;

BasicCredentialsProvider credentialsProvider = new BasicCredentialsProvider();
credentialsProvider.setCredentials(
    AuthScope.ANY,
    new UsernamePasswordCredentials("admin", System.getenv("OPENSEARCH_PASSWORD"))
);

RestClient restClient =
    RestClient.builder(
        new HttpHost("opensearch.prod.example.com", 443, "https")
    )
    .setHttpClientConfigCallback(httpClientBuilder ->
        httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider)
    )
    .build();
Read the password from an environment variable or a secrets manager — never commit credentials to source control.
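System.getenv returns null when a variable is unset, so a missing OPENSEARCH_PASSWORD would otherwise only show up later as an authentication failure. A small fail-fast lookup makes the problem visible at startup. The sketch below assumes a hypothetical Secrets helper; it takes the environment as a Map so that it can be exercised without real environment variables.

```java
import java.util.Map;

/** Hypothetical helper: fails fast when a required secret is missing. */
public final class Secrets {

    private Secrets() {}

    /** Returns the named variable, or throws if it is unset or blank. */
    public static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isBlank()) {
            // Report only the variable name, never the secret value itself.
            throw new IllegalStateException("Required environment variable " + name + " is not set");
        }
        return value;
    }
}
```

In OpensearchConfig the call would look like Secrets.require(System.getenv(), "OPENSEARCH_PASSWORD").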

5. Rebuild the project

mvn clean package -DskipTests
The executable fat JAR is produced at wiki-change-stream/target/wiki-change-stream-0.0.1-SNAPSHOT.jar. This is the only artifact you need to deploy.

Externalizing configuration with Spring Boot (suggested refactor)

The addresses for Kafka and OpenSearch are currently hardcoded in the Java source files; there are no application.properties entries for them today. As a recommended improvement, you can move these addresses into Spring Boot externalized configuration so they can be changed at startup without rebuilding.

Step 1: Inject values with @Value
Replace the hardcoded string literals in KafkaPublish and KafkaConsume with injected values:
KafkaPublish.java
import org.springframework.beans.factory.annotation.Value;

@Component
@Slf4j
public class KafkaPublish {

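    // Inject through the constructor: Spring populates @Value fields only after
    // the constructor returns, so a field would still be null at this point.
    KafkaPublish(@Value("${kafka.bootstrap.servers:localhost:9092}") String bootstrapServers) {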
        Properties properties = new Properties();
        properties.put("bootstrap.servers", bootstrapServers);
        properties.put("key.serializer", StringSerializer.class.getName());
        properties.put("value.serializer", JsonSerializer.class.getName());
        producer = new KafkaProducer<>(properties);
    }
    // ...
}
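The placeholder ${kafka.bootstrap.servers:localhost:9092} contains two colons, which can look ambiguous. Spring splits a placeholder at the first colon, so the key is kafka.bootstrap.servers and the fallback is localhost:9092. The following is a simplified model of that rule, not Spring's actual resolver; PlaceholderModel is a hypothetical name and the sketch ignores nesting and escaping.

```java
import java.util.Map;

/** Simplified, hypothetical model of ${key:default} resolution. */
public final class PlaceholderModel {

    private PlaceholderModel() {}

    public static String resolve(String placeholder, Map<String, String> properties) {
        if (!placeholder.startsWith("${") || !placeholder.endsWith("}")) {
            throw new IllegalArgumentException("Not a placeholder: " + placeholder);
        }
        String body = placeholder.substring(2, placeholder.length() - 1);
        // Only the first colon separates key and default; later colons
        // belong to the default value.
        int separator = body.indexOf(':');
        String key = separator < 0 ? body : body.substring(0, separator);
        String fallback = separator < 0 ? null : body.substring(separator + 1);

        String value = properties.get(key);
        if (value != null) {
            return value;
        }
        if (fallback != null) {
            return fallback;
        }
        throw new IllegalArgumentException("Could not resolve placeholder: " + key);
    }
}
```

With no property defined, the model returns the fallback, which keeps local development working unchanged; defining kafka.bootstrap.servers anywhere in the configuration takes precedence.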
Apply the same pattern in KafkaConsume and expose opensearch.host, opensearch.port, and opensearch.scheme in OpensearchConfig.

Step 2: Add the new properties to application.properties
Once the @Value annotations are in place, add the corresponding keys to wiki-change-stream/src/main/resources/application.properties:
application.properties
kafka.bootstrap.servers=your-broker:9092
opensearch.host=opensearch.prod.example.com
opensearch.port=443
opensearch.scheme=https
Step 3: Override at runtime with environment variables
Through relaxed binding, Spring Boot maps the environment variable KAFKA_BOOTSTRAP_SERVERS to the property kafka.bootstrap.servers. You can therefore pass values as environment variables without modifying the properties file:
export KAFKA_BOOTSTRAP_SERVERS=broker1.prod.example.com:9092
export OPENSEARCH_HOST=opensearch.prod.example.com
java -jar wiki-change-stream/target/wiki-change-stream-0.0.1-SNAPSHOT.jar
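To work out the environment variable name for any other property, Spring Boot's documented convention is to replace periods with underscores, remove dashes, and convert to uppercase. The sketch below applies that convention; EnvNames is a hypothetical helper written for illustration, not a Spring API.

```java
import java.util.Locale;

/** Hypothetical helper: derives the environment variable name for a property. */
public final class EnvNames {

    private EnvNames() {}

    public static String fromProperty(String propertyName) {
        return propertyName
                .replace('.', '_')          // periods become underscores
                .replace("-", "")           // dashes are removed
                .toUpperCase(Locale.ROOT);  // locale-independent uppercase
    }
}
```

For the properties suggested in this guide, opensearch.port and opensearch.scheme would therefore be set through OPENSEARCH_PORT and OPENSEARCH_SCHEME.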

Running as a systemd service

To keep the application running across reboots and have the OS restart it on failure, install it as a systemd unit.

Step 1: Copy the JAR to a stable location
sudo mkdir -p /opt/openknowledgestream
sudo cp wiki-change-stream/target/wiki-change-stream-0.0.1-SNAPSHOT.jar \
    /opt/openknowledgestream/openknowledgestream.jar
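Step 2: Create the unit file
The unit runs as the openknowledgestream user, which must already exist on the host before the service is started.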
/etc/systemd/system/openknowledgestream.service
[Unit]
Description=OpenKnowledgeStream Wikipedia Change Pipeline
After=network.target

[Service]
Type=simple
User=openknowledgestream
WorkingDirectory=/opt/openknowledgestream
ExecStart=/usr/bin/java -jar /opt/openknowledgestream/openknowledgestream.jar
EnvironmentFile=/etc/openknowledgestream/env
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=openknowledgestream

[Install]
WantedBy=multi-user.target
Step 3: Create the environment file
/etc/openknowledgestream/env
KAFKA_BOOTSTRAP_SERVERS=broker1.prod.example.com:9092
OPENSEARCH_HOST=opensearch.prod.example.com
OPENSEARCH_PORT=443
OPENSEARCH_SCHEME=https
OPENSEARCH_PASSWORD=changeme
Set restrictive permissions so only root and the service user can read it:
sudo chmod 640 /etc/openknowledgestream/env
sudo chown root:openknowledgestream /etc/openknowledgestream/env
Step 4: Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable openknowledgestream
sudo systemctl start openknowledgestream
sudo systemctl status openknowledgestream
View live logs with:
sudo journalctl -u openknowledgestream -f

KafkaConsume sets auto.offset.reset=earliest, so the consumer re-reads the topic from the earliest retained offset whenever the wiki-indexer consumer group has no valid committed offsets, for example because they expired or the group was deleted. In production, configure an appropriate retention policy (retention.ms or retention.bytes on the topic, or the broker-wide defaults log.retention.hours and log.retention.bytes) to limit how much data can be re-consumed. If your indexing pipeline cannot tolerate reprocessing, also back up the consumer group offsets; restoring them avoids a full replay but does not by itself provide exactly-once delivery.
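The interaction between committed offsets, retention, and auto.offset.reset can be summarized in a few lines. This is a simplified model for reasoning about replay size, not Kafka client code; OffsetResetModel and its parameters are hypothetical names.

```java
import java.util.OptionalLong;

/** Simplified, hypothetical model of where a consumer starts reading a partition. */
public final class OffsetResetModel {

    public enum Policy { EARLIEST, LATEST }

    private OffsetResetModel() {}

    /**
     * @param committed last committed offset for the group, if any
     * @param logStart  oldest offset still retained on the partition
     * @param logEnd    offset that the next produced record will receive
     */
    public static long startOffset(OptionalLong committed, long logStart, long logEnd, Policy policy) {
        boolean usable = committed.isPresent()
                && committed.getAsLong() >= logStart
                && committed.getAsLong() <= logEnd;
        if (usable) {
            return committed.getAsLong(); // normal case: resume where the group left off
        }
        // No usable committed offset: the reset policy decides.
        return policy == Policy.EARLIEST ? logStart : logEnd;
    }
}
```

In this model, losing the offsets under the earliest policy replays everything between logStart and logEnd, which is why a tighter retention policy (a higher logStart) bounds the amount of re-indexing.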
