BOOM exports OpenTelemetry metrics to Prometheus for monitoring pipeline performance and health. This guide covers available metrics and how to use them.

Accessing metrics

Prometheus runs at http://localhost:9090 when you start BOOM with Docker Compose. Open the Prometheus UI to query metrics and visualize pipeline performance.

Architecture

BOOM uses the OpenTelemetry SDK to record metrics in each binary and export them over OTLP; Prometheus stores the results and serves them for querying.

Metrics initialization

Each BOOM binary initializes metrics on startup:
src/bin/scheduler.rs
let meter_provider = init_metrics(
    String::from("scheduler"),
    instance_id,
    deployment_env.clone(),
)
.expect("failed to initialize metrics");

Metric labels

All metrics include resource attributes:
  • service.name: Binary name (scheduler, consumer, producer)
  • service.instance.id: Unique UUID for this instance
  • service.namespace: Always "boom"
  • service.version: BOOM version from Cargo.toml
  • deployment.environment.name: Deployment environment (dev, prod, etc.)
The instance_id distinguishes metrics from multiple instances of the same service running in parallel.
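
For example, to break a counter down per instance (assuming the common OpenTelemetry-to-Prometheus mapping, where service.instance.id becomes the instance label; your collector configuration may map attributes differently):

```promql
# Per-instance consumption rate; the `instance` label is assumed to come
# from service.instance.id under the default resource-attribute mapping.
sum by (instance) (rate(kafka_consumer_alert_processed_total[5m]))
```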

Kafka consumer metrics

kafka_consumer_alert_processed_total

Type: Counter
Unit: {alert}
Description: Total number of alerts consumed from Kafka

Example queries

kafka_consumer_alert_processed_total
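
The raw counter only ever grows, so it is usually more useful wrapped in rate(); for example, overall ingest throughput:

```promql
# Alerts consumed per second, averaged over the last 5 minutes.
sum(rate(kafka_consumer_alert_processed_total[5m]))
```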

Pre-built queries

BOOM includes pre-configured Prometheus queries for the Kafka consumer.

Alert worker metrics

alert_worker_active

Type: UpDownCounter
Unit: {alert}
Description: Number of alerts currently being processed by alert workers
This gauge increases when workers start processing an alert and decreases when they finish.
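
Because this is a gauge, a range function can expose short-lived spikes that an instant query would miss; one sketch using a PromQL subquery:

```promql
# Peak number of in-flight alerts over the last 15 minutes,
# sampled at 1-minute resolution.
max_over_time(sum(alert_worker_active)[15m:1m])
```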

alert_worker_alert_processed_total

Type: Counter
Unit: {alert}
Description: Total number of alerts processed by alert workers
Labels:
  • status: Processing outcome (success, error)

Example queries

sum by (status) (irate(alert_worker_alert_processed_total[5m]))

Pre-built queries

Access pre-configured alert worker queries in Prometheus.

Enrichment worker metrics

enrichment_worker_active

Type: UpDownCounter
Unit: {alert}
Description: Number of alerts currently being enriched

enrichment_worker_batch_processed_total

Type: Counter
Unit: {batch}
Description: Total number of enrichment batches processed

enrichment_worker_alert_processed_total

Type: Counter
Unit: {alert}
Description: Total number of alerts enriched
Labels:
  • status: Processing outcome (success, error)

Example queries

irate(enrichment_worker_alert_processed_total[5m])
/
irate(enrichment_worker_batch_processed_total[5m])
Monitor average batch size to optimize enrichment worker performance. Larger batches generally improve throughput.

Pre-built queries

View pre-configured enrichment worker queries.

Filter worker metrics

filter_worker_active

Type: UpDownCounter
Unit: {alert}
Description: Number of alerts currently being filtered

filter_worker_batch_processed_total

Type: Counter
Unit: {batch}
Description: Total number of filter batches executed

filter_worker_alert_processed_total

Type: Counter
Unit: {alert}
Description: Total number of alerts processed by filters
Labels:
  • reason: Filter outcome (passed, failed)

Example queries

sum(rate(filter_worker_alert_processed_total{reason="passed"}[5m]))
/
sum(rate(filter_worker_alert_processed_total[5m]))
* 100

Pre-built queries

Access pre-configured filter worker queries.

Global meters

BOOM defines separate meters for each binary:
src/utils/o11y/metrics.rs
/// Global OTel meter for the kafka consumer
pub static CONSUMER_METER: LazyLock<Meter> =
    LazyLock::new(|| opentelemetry::global::meter("boom-consumer-meter"));

/// Global OTel meter for the kafka producer
pub static PRODUCER_METER: LazyLock<Meter> =
    LazyLock::new(|| opentelemetry::global::meter("boom-producer-meter"));

/// Global OTel meter for the scheduler
pub static SCHEDULER_METER: LazyLock<Meter> =
    LazyLock::new(|| opentelemetry::global::meter("boom-scheduler-meter"));
Separate meters prevent metric collisions when multiple binaries run simultaneously.

Metric export configuration

Metrics are exported via OTLP over gRPC:
src/utils/o11y/metrics.rs
let exporter = opentelemetry_otlp::MetricExporter::builder()
    .with_tonic()
    .with_temporality(Temporality::Cumulative)
    .with_endpoint("http://localhost:4317")  // OTLP/gRPC endpoint takes no URL path
    .build()?;

let meter_provider = SdkMeterProvider::builder()
    .with_resource(resource)
    .with_periodic_exporter(exporter)  // Exports every 60 seconds
    .build();

Temporality

BOOM uses cumulative temporality, which is more natural for Prometheus:
  • Counters report cumulative totals since process start
  • Prometheus calculates rates using rate() or irate()
  • Better compatibility with Prometheus than delta temporality
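
Because the counters are cumulative, increase() recovers the total over any window without BOOM having to track it:

```promql
# Alerts consumed during the last hour, derived from the cumulative counter.
increase(kafka_consumer_alert_processed_total[1h])
```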

Dashboard examples

Pipeline throughput

Visualize end-to-end pipeline throughput:
sum(irate(kafka_consumer_alert_processed_total[5m])) by (job)

Worker health

Monitor active workers across all stages:
sum(alert_worker_active) +
sum(enrichment_worker_active) +
sum(filter_worker_active)

Error rates

Track processing errors:
sum(rate(alert_worker_alert_processed_total{status="error"}[5m])) +
sum(rate(enrichment_worker_alert_processed_total{status="error"}[5m]))

Filter effectiveness

Measure filter selectivity:
sum(rate(filter_worker_alert_processed_total{reason="passed"}[5m]))
/
sum(rate(filter_worker_alert_processed_total[5m]))

Alerting

Example Prometheus alerts

prometheus-alerts.yaml
groups:
  - name: boom
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(alert_worker_alert_processed_total{status="error"}[5m]))
          /
          sum(rate(alert_worker_alert_processed_total[5m]))
          > 0.05
        for: 5m
        annotations:
          summary: "Alert worker error rate above 5%"
          
      - alert: NoAlertsProcessed
        expr: |
          rate(kafka_consumer_alert_processed_total[5m]) == 0
        for: 10m
        annotations:
          summary: "No alerts consumed in 10 minutes"
          
      - alert: FilterWorkerStalled
        expr: |
          filter_worker_active > 0
          and
          rate(filter_worker_batch_processed_total[5m]) == 0
        for: 5m
        annotations:
          summary: "Filter workers stalled with active alerts"

Graceful shutdown

Metrics are flushed on graceful shutdown:
src/bin/scheduler.rs
if let Err(error) = meter_provider.shutdown() {
    log_error!(WARN, error, "failed to shut down the meter provider");
}
If a binary crashes or is killed with SIGKILL, final metrics may not be exported to Prometheus.
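
A restart also resets every cumulative counter to zero; resets() makes such restarts, and the potential metric gap around them, visible:

```promql
# Number of counter resets (process restarts) in the last hour.
resets(kafka_consumer_alert_processed_total[1h])
```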

Metric retention

Prometheus retention is configured in the Docker Compose setup. By default:
  • Retention time: 15 days
  • Storage location: Docker volume
Modify docker-compose.yaml to adjust retention settings.
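
A sketch of what the retention flags might look like (the service name and existing command flags here are assumptions; adjust them to match your actual docker-compose.yaml):

```yaml
# Hypothetical excerpt; adapt to the real compose file.
services:
  prometheus:
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"   # keep 30 days instead of the default 15
```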

Next steps

  • Logging: configure structured logging and tracing
  • Processing alerts: understand the alert processing pipeline
  • Prometheus docs: learn more about the Prometheus query language
