Built-in Metrics
Queue Metrics (Bull 4.7.0+)
Bull includes built-in support for collecting queue metrics. Metrics are collected automatically once enabled in the queue configuration, and include:
- Completed jobs: Number of jobs completed per minute
- Failed jobs: Number of jobs failed per minute
- Historical data: Time-series data for trend analysis
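Enabling metrics is a queue-configuration option. The sketch below follows the `metrics.maxDataPoints` option described in Bull's README; treat the exact option names and the `getMetrics` call as version-dependent and verify them against your installed Bull release.

```javascript
// Sketch: options enabling Bull's built-in metrics (Bull >= 4.7.0).
// Bull samples counts once per minute, so data points map to minutes.
const TWO_WEEKS_OF_MINUTES = 60 * 24 * 14;

const queueOptions = {
  metrics: {
    maxDataPoints: TWO_WEEKS_OF_MINUTES, // retain two weeks of history
  },
};

// With a running Redis instance:
// const Queue = require('bull');
// const emailQueue = new Queue('emails', 'redis://127.0.0.1:6379', queueOptions);
// Later, read back the time series for trend analysis:
// const { meta, data, count } = await emailQueue.getMetrics('completed');

console.log(queueOptions.metrics.maxDataPoints); // 20160
```

The retention window is a trade-off: more data points mean richer trend analysis but more Redis memory per queue.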
Prometheus Integration
Bull Queue Exporter
Bull Queue Exporter is a Prometheus exporter for Bull queues that exposes queue metrics in Prometheus format. Once installed, the exporter exposes metrics such as:
- bull_queue_waiting: Number of jobs waiting to be processed
- bull_queue_active: Number of jobs currently being processed
- bull_queue_delayed: Number of delayed jobs
- bull_queue_failed: Number of failed jobs
- bull_queue_completed: Number of completed jobs
- bull_queue_paused: Whether the queue is paused (0 or 1)
Prometheus Configuration
Add a scrape target to your prometheus.yml:
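A minimal scrape configuration sketch; the job name and port are placeholders (check your exporter's configuration for the port it actually listens on):

```yaml
scrape_configs:
  - job_name: 'bull'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9538']  # exporter host:port (assumed; adjust to your setup)
```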
Example Prometheus Queries
Rate of completed jobs over the last five minutes (assuming bull_queue_completed is exported as a counter): rate(bull_queue_completed[5m]). The same pattern with bull_queue_failed gives the failure throughput.

Key Metrics to Monitor
Queue Health Indicators
1. Queue Length
Monitor the number of waiting and delayed jobs. Alert when:
- Queue length grows unbounded
- Backlog exceeds a threshold (e.g., 10,000 jobs)
- Queue growth rate is unusual
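As an illustration, the waiting-plus-delayed backlog can be classified against a threshold. The counts object below mirrors what Bull's queue.getJobCounts() resolves to (the exact shape is version-dependent); the thresholds are illustrative, not prescriptive.

```javascript
// Sketch: classify backlog health from Bull job counts.
const BACKLOG_CRITICAL = 10000; // matches the example threshold above

function backlogStatus(counts) {
  const backlog = (counts.waiting || 0) + (counts.delayed || 0);
  if (backlog >= BACKLOG_CRITICAL) return 'critical';
  if (backlog >= BACKLOG_CRITICAL / 2) return 'warning';
  return 'ok';
}

// With a real queue: backlogStatus(await queue.getJobCounts())
console.log(backlogStatus({ waiting: 12000, delayed: 500 })); // 'critical'
console.log(backlogStatus({ waiting: 120, delayed: 0 }));     // 'ok'
```

To catch "growth rate is unusual", compare successive samples of this backlog over time rather than a single snapshot.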
2. Processing Rate
Track jobs completed per time period. Alert when:
- Processing rate drops significantly
- Completion rate < arrival rate (queue growing)
- Zero throughput for extended period
3. Failure Rate
Monitor the ratio of failed to completed jobs. Alert when:
- Failure rate exceeds threshold (e.g., 5%)
- Sudden spike in failures
- Specific job types failing consistently
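For instance, the failure ratio over a recent window can be computed from completed and failed counts (an illustrative helper, not a Bull API):

```javascript
// Sketch: failure rate as failed / (failed + completed) over a window.
// Returns 0 when nothing finished, to avoid division by zero.
function failureRate(completed, failed) {
  const total = completed + failed;
  return total === 0 ? 0 : failed / total;
}

const rate = failureRate(950, 50);
console.log(rate);        // 0.05
console.log(rate > 0.05); // false: exactly at the 5% threshold, not over it
```

Tracking this ratio per job type (rather than queue-wide) makes "specific job types failing consistently" visible.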
4. Active Jobs
Track jobs currently being processed. Alert when:
- Active jobs stuck at zero (workers not processing)
- Active count exceeds expected concurrency
- Low utilization despite large backlog
5. Job Duration
Track how long jobs take to complete. Alert when:
- Jobs taking significantly longer than usual
- Timeouts increasing
- P95/P99 latency degrading
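Durations can be derived from the timestamps Bull stores on each job (e.g., finishedOn minus processedOn on completed jobs; verify the field names for your version), and percentiles computed from them. A minimal nearest-rank sketch:

```javascript
// Sketch: nearest-rank percentile over a list of job durations (ms).
function percentile(durations, p) {
  if (durations.length === 0) return NaN;
  const sorted = [...durations].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Durations might come from job.finishedOn - job.processedOn on completed jobs.
const durations = [120, 80, 95, 110, 3000, 100, 90, 105, 130, 85];
console.log(percentile(durations, 50)); // 100
console.log(percentile(durations, 95)); // 3000
```

Note how one slow outlier (3000 ms) dominates P95 while leaving P50 untouched, which is exactly why both are worth tracking.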
Alerting Strategies
Critical Alerts
These require immediate attention:

- Queue Not Processing
  - No jobs completed in last 5 minutes
  - All workers down
  - Redis connection lost
- High Failure Rate
  - >50% of jobs failing
  - Sudden spike in failures
  - Critical job type failing
- Queue Overflow
  - Queue length exceeding capacity
  - Memory pressure on Redis
  - Approaching rate limits
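As an illustration, critical conditions like these can be encoded as Prometheus alerting rules over the exporter metrics listed earlier. The thresholds and durations below are placeholders to tune, and the expressions assume the metrics behave as counters/gauges as described above:

```yaml
groups:
  - name: bull-critical
    rules:
      - alert: BullQueueNotProcessing
        # No completions for 5 minutes despite a waiting backlog
        expr: rate(bull_queue_completed[5m]) == 0 and bull_queue_waiting > 0
        for: 5m
        labels:
          severity: critical
      - alert: BullHighFailureRate
        # More than half of finished jobs are failing
        expr: >
          rate(bull_queue_failed[5m])
            / (rate(bull_queue_failed[5m]) + rate(bull_queue_completed[5m])) > 0.5
        for: 10m
        labels:
          severity: critical
```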
Warning Alerts
These indicate potential issues:

- Growing Backlog
  - Queue length increasing over time
  - Processing can’t keep up with additions
- Elevated Failure Rate
  - 5-50% of jobs failing
  - Retry exhaustion increasing
- Performance Degradation
  - Job duration increasing
  - Worker utilization low despite backlog
Custom Monitoring
Event-Based Monitoring
Listen to Bull events such as 'completed', 'failed', and 'stalled' for real-time monitoring.

Dashboard Examples
Grafana Dashboard Panels
- Queue Overview
  - Current queue length (gauge)
  - Jobs processed (counter)
  - Active workers (gauge)
  - Failure rate (graph)
- Performance Metrics
  - Job duration percentiles (P50, P95, P99)
  - Throughput over time
  - Worker utilization
- Error Tracking
  - Failed jobs by type
  - Stalled job count
  - Error rate trends
- Resource Usage
  - Redis memory usage
  - CPU usage per worker
  - Network I/O
Best Practices
- Set Baseline Metrics: Understand normal behavior before setting alerts
- Use Multiple Alert Levels: Critical, warning, and info
- Monitor Trends: Look for gradual degradation, not just absolute values
- Track Business Metrics: Monitor job outcomes, not just technical metrics
- Test Alerting: Regularly verify alerts fire correctly
- Document Runbooks: Create procedures for common alert scenarios
- Correlate Metrics: Look at multiple signals together
- Set Up Dashboards: Create visualizations for quick troubleshooting