BOOM exports OpenTelemetry metrics to Prometheus for monitoring pipeline performance and health. This guide covers available metrics and how to use them.

Accessing metrics

Prometheus runs at http://localhost:9090 when you start BOOM with Docker Compose. Open the Prometheus UI to query metrics and visualize pipeline performance.

Architecture

BOOM uses the OpenTelemetry SDK to record metrics in each binary and export them over OTLP; Prometheus stores the results and serves them for querying.

Metrics initialization

Each BOOM binary initializes metrics on startup:
src/bin/scheduler.rs
let meter_provider = init_metrics(
    String::from("scheduler"),
    instance_id,
    deployment_env.clone(),
)
.expect("failed to initialize metrics");

Metric labels

All metrics include resource attributes:
  • service.name: Binary name (scheduler, consumer, producer)
  • service.instance.id: Unique UUID for this instance
  • service.namespace: Always "boom"
  • service.version: BOOM version from Cargo.toml
  • deployment.environment.name: Deployment environment (dev, prod, etc.)
The instance_id distinguishes metrics from multiple instances of the same service running in parallel.
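
For example, to break a counter down per instance (assuming the common OpenTelemetry-to-Prometheus mapping, where service.instance.id becomes the instance label; your collector configuration may map attributes differently):

```promql
# Per-instance consumption rate; the `instance` label is assumed to come
# from service.instance.id under the default resource-attribute mapping.
sum by (instance) (rate(kafka_consumer_alert_processed_total[5m]))
```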

Kafka consumer metrics

kafka_consumer_alert_processed_total

Type: Counter
Unit: {alert}
Description: Total number of alerts consumed from Kafka

Example queries

kafka_consumer_alert_processed_total
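
The raw counter only ever grows, so it is usually more useful wrapped in rate(); for example, overall ingest throughput:

```promql
# Alerts consumed per second, averaged over the last 5 minutes.
sum(rate(kafka_consumer_alert_processed_total[5m]))
```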

Pre-built queries

BOOM includes pre-configured Prometheus queries for the Kafka consumer.

Alert worker metrics

alert_worker_active

Type: UpDownCounter
Unit: {alert}
Description: Number of alerts currently being processed by alert workers
This gauge increases when workers start processing an alert and decreases when they finish.
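
Because this is a gauge, a range function can expose short-lived spikes that an instant query would miss; one sketch using a PromQL subquery:

```promql
# Peak number of in-flight alerts over the last 15 minutes,
# sampled at 1-minute resolution.
max_over_time(sum(alert_worker_active)[15m:1m])
```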

alert_worker_alert_processed_total

Type: Counter
Unit: {alert}
Description: Total number of alerts processed by alert workers
Labels:
  • status: Processing outcome (success, error)

Example queries

sum by (status) (irate(alert_worker_alert_processed_total[5m]))

Pre-built queries

Access pre-configured alert worker queries in Prometheus.

Enrichment worker metrics

enrichment_worker_active

Type: UpDownCounter
Unit: {alert}
Description: Number of alerts currently being enriched

enrichment_worker_batch_processed_total

Type: Counter
Unit: {batch}
Description: Total number of enrichment batches processed

enrichment_worker_alert_processed_total

Type: Counter
Unit: {alert}
Description: Total number of alerts enriched
Labels:
  • status: Processing outcome (success, error)

Example queries

irate(enrichment_worker_alert_processed_total[5m])
/
irate(enrichment_worker_batch_processed_total[5m])
Monitor average batch size to optimize enrichment worker performance. Larger batches generally improve throughput.

Pre-built queries

View pre-configured enrichment worker queries.

Filter worker metrics

filter_worker_active

Type: UpDownCounter
Unit: {alert}
Description: Number of alerts currently being filtered

filter_worker_batch_processed_total

Type: Counter
Unit: {batch}
Description: Total number of filter batches executed

filter_worker_alert_processed_total

Type: Counter
Unit: {alert}
Description: Total number of alerts processed by filters
Labels:
  • reason: Filter outcome (passed, failed)

Example queries

sum(rate(filter_worker_alert_processed_total{reason="passed"}[5m]))
/
sum(rate(filter_worker_alert_processed_total[5m]))
* 100

Pre-built queries

Access pre-configured filter worker queries.

Global meters

BOOM defines separate meters for each binary:
src/utils/o11y/metrics.rs
/// Global OTel meter for the kafka consumer
pub static CONSUMER_METER: LazyLock<Meter> =
    LazyLock::new(|| opentelemetry::global::meter("boom-consumer-meter"));

/// Global OTel meter for the kafka producer
pub static PRODUCER_METER: LazyLock<Meter> =
    LazyLock::new(|| opentelemetry::global::meter("boom-producer-meter"));

/// Global OTel meter for the scheduler
pub static SCHEDULER_METER: LazyLock<Meter> =
    LazyLock::new(|| opentelemetry::global::meter("boom-scheduler-meter"));
Separate meters prevent metric collisions when multiple binaries run simultaneously.

Metric export configuration

Metrics are exported via OTLP over gRPC:
src/utils/o11y/metrics.rs
let exporter = opentelemetry_otlp::MetricExporter::builder()
    .with_tonic()
    .with_temporality(Temporality::Cumulative)
    .with_endpoint("http://localhost:4317")  // OTLP/gRPC endpoint takes no URL path
    .build()?;

let meter_provider = SdkMeterProvider::builder()
    .with_resource(resource)
    .with_periodic_exporter(exporter)  // Exports every 60 seconds
    .build();

Temporality

BOOM uses cumulative temporality, which is more natural for Prometheus:
  • Counters report cumulative totals since process start
  • Prometheus calculates rates using rate() or irate()
  • Better compatibility with Prometheus than delta temporality
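
Because the counters are cumulative, increase() recovers the total over any window without BOOM having to track it:

```promql
# Alerts consumed during the last hour, derived from the cumulative counter.
increase(kafka_consumer_alert_processed_total[1h])
```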

Dashboard examples

Pipeline throughput

Visualize end-to-end pipeline throughput:
sum(irate(kafka_consumer_alert_processed_total[5m])) by (job)

Worker health

Monitor active workers across all stages:
sum(alert_worker_active) +
sum(enrichment_worker_active) +
sum(filter_worker_active)

Error rates

Track processing errors:
sum(rate(alert_worker_alert_processed_total{status="error"}[5m])) +
sum(rate(enrichment_worker_alert_processed_total{status="error"}[5m]))

Filter effectiveness

Measure filter selectivity:
sum(rate(filter_worker_alert_processed_total{reason="passed"}[5m]))
/
sum(rate(filter_worker_alert_processed_total[5m]))

Alerting

Example Prometheus alerts

prometheus-alerts.yaml
groups:
  - name: boom
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(alert_worker_alert_processed_total{status="error"}[5m]))
          /
          sum(rate(alert_worker_alert_processed_total[5m]))
          > 0.05
        for: 5m
        annotations:
          summary: "Alert worker error rate above 5%"
          
      - alert: NoAlertsProcessed
        expr: |
          rate(kafka_consumer_alert_processed_total[5m]) == 0
        for: 10m
        annotations:
          summary: "No alerts consumed in 10 minutes"
          
      - alert: FilterWorkerStalled
        expr: |
          filter_worker_active > 0
          and
          rate(filter_worker_batch_processed_total[5m]) == 0
        for: 5m
        annotations:
          summary: "Filter workers stalled with active alerts"

Graceful shutdown

Metrics are flushed on graceful shutdown:
src/bin/scheduler.rs
if let Err(error) = meter_provider.shutdown() {
    log_error!(WARN, error, "failed to shut down the meter provider");
}
If a binary crashes or is killed with SIGKILL, final metrics may not be exported to Prometheus.
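
A restart also resets every cumulative counter to zero; resets() makes such restarts, and the potential metric gap around them, visible:

```promql
# Number of counter resets (process restarts) in the last hour.
resets(kafka_consumer_alert_processed_total[1h])
```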

Metric retention

Prometheus retention is configured in the Docker Compose setup. By default:
  • Retention time: 15 days
  • Storage location: Docker volume
Modify docker-compose.yaml to adjust retention settings.
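
A sketch of what the retention flags might look like (the service name and existing command flags here are assumptions; adjust them to match your actual docker-compose.yaml):

```yaml
# Hypothetical excerpt; adapt to the real compose file.
services:
  prometheus:
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"   # keep 30 days instead of the default 15
```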

Next steps

  • Logging: configure structured logging and tracing
  • Processing alerts: understand the alert processing pipeline
  • Prometheus docs: learn more about the Prometheus query language
