Snuba provides extensive monitoring capabilities through metrics, health checks, and distributed tracing. This guide covers monitoring setup, key metrics, and operational alerts.
Metrics Architecture
Snuba emits metrics to DataDog (StatsD protocol) for monitoring and alerting.
Metrics Configuration
Metric Types
Snuba emits four types of metrics:
Counters
Incremental counters for event tracking.
Key counters:
- consumer.message_processed: Messages consumed from Kafka
- consumer.message_filtered: Messages filtered before processing
- healthcheck_failed: Failed health checks
- query.success: Successful queries
- query.error: Failed queries
Timers/Distributions
Timing and distribution metrics.
Key timers:
- query.duration: Query execution time
- consumer.batch_time: Consumer batch processing time
- clickhouse.query.duration: ClickHouse query execution time
- healthcheck.latency: Health check response time
Gauges
Point-in-time measurements.
Key gauges:
- consumer.lag: Kafka consumer lag in messages
- consumer.batch_size: Current batch size
- clickhouse.connections: Active ClickHouse connections
Sets
Unique value counting.
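As a rough illustration of how the metric types differ on the wire, the sketch below formats DogStatsD-style datagrams by hand. The `format_metric` helper and the `dataset` tag are hypothetical and not part of Snuba; only the type codes (`c`, `ms`, `g`, `s`) come from the StatsD protocol itself.

```python
# Hypothetical helper, not Snuba's API: builds a DogStatsD-style datagram.
STATSD_TYPES = {"counter": "c", "timer": "ms", "gauge": "g", "set": "s"}


def format_metric(name, value, kind, tags=None):
    """Return '<name>:<value>|<type>[|#k:v,...]' for one metric sample."""
    line = f"{name}:{value}|{STATSD_TYPES[kind]}"
    if tags:
        # Sort tags so the output is deterministic.
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line


print(format_metric("consumer.message_processed", 1, "counter"))
print(format_metric("query.duration", 125, "timer", {"dataset": "events"}))
print(format_metric("consumer.lag", 4200, "gauge"))
```

In practice a StatsD client library sends these datagrams over UDP; the formatting is shown only to make the difference between the types concrete.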
Health Monitoring
Health Check Endpoints
Snuba provides multiple health check endpoints:
Basic Health Check
Quick sanity check that at least one ClickHouse node is responsive:
- At least one ClickHouse cluster is reachable
- Can execute SHOW TABLES query
- Timeout: 500ms per cluster (configurable)
Thorough Health Check
Comprehensive check that verifies all required tables exist:
- All enabled storage tables exist in ClickHouse
- All query nodes are accessible
- All storage nodes are healthy
- Verifies table schema integrity
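The difference between the basic and thorough checks can be sketched as follows. This is a minimal model under stated assumptions, not Snuba's implementation: `basic_health`, `thorough_health`, and the probe callables are hypothetical names, and a real probe would enforce its own timeout instead of being measured after the fact.

```python
import time


def basic_health(probes, timeout_s=0.5):
    """Pass if at least one cluster probe succeeds within the timeout.

    `probes` maps a cluster name to a zero-argument callable that returns
    True when the cluster answered (for example, after SHOW TABLES).
    """
    for probe in probes.values():
        start = time.monotonic()
        try:
            ok = probe()
        except Exception:
            # An unreachable cluster counts as a failed probe, not a crash.
            ok = False
        if ok and time.monotonic() - start <= timeout_s:
            return True
    return False


def thorough_health(required_tables, existing_tables):
    """Pass only if every required table exists; also report what is missing."""
    missing = sorted(set(required_tables) - set(existing_tables))
    return (not missing, missing)
```

The basic check succeeds as soon as any one cluster responds, while the thorough check fails if even a single required table is absent.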
Envoy Health Check
Special endpoint for Envoy/load balancer integration.
CLI Health Check
Run health checks from the command line.
Health Check Configuration
Customize health check behavior via runtime config.
Key Metrics to Monitor
API Performance Metrics
Query Rate
Monitor query throughput.
Alert on:
- Sudden drops in query rate (> 50% decrease)
- High error rates (> 5% of total queries)
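The two query-rate conditions above can be expressed as a small check. This is only a sketch: the function name, its arguments, and the alert labels are assumptions, and a real monitor would evaluate these conditions inside the alerting system.

```python
def query_rate_alerts(previous_qps, current_qps, errors, total):
    """Return the alert names triggered by the two thresholds above."""
    alerts = []
    # Sudden drop: the rate fell by more than 50% versus the previous window.
    if previous_qps > 0 and (previous_qps - current_qps) / previous_qps > 0.5:
        alerts.append("query_rate_drop")
    # High error rate: more than 5% of all queries failed.
    if total > 0 and errors / total > 0.05:
        alerts.append("high_error_rate")
    return alerts


print(query_rate_alerts(previous_qps=100, current_qps=40, errors=1, total=100))
```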
Query Latency
Track query performance.
Alert on:
- P95 latency > 5 seconds
- P99 latency > 10 seconds
- Sudden latency spikes (> 3x baseline)
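To make the percentile thresholds concrete, here is a minimal sketch that computes nearest-rank percentiles over a non-empty list of latency samples (in seconds) and applies the three conditions above. The helper names and the choice of P95 as the baseline comparison are assumptions for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% * n) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alerts(samples_s, baseline_p95_s):
    """Return the alert names triggered for one window of latency samples."""
    p95 = percentile(samples_s, 95)
    p99 = percentile(samples_s, 99)
    alerts = []
    if p95 > 5:
        alerts.append("p95_over_5s")
    if p99 > 10:
        alerts.append("p99_over_10s")
    # Spike: current P95 is more than 3x the baseline P95.
    if baseline_p95_s > 0 and p95 > 3 * baseline_p95_s:
        alerts.append("latency_spike")
    return alerts
```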
ClickHouse Query Performance
Monitor database execution time.
Alert on:
- Average query time > 2 seconds
- P95 query time > 5 seconds
Consumer Pipeline Metrics
Consumer Lag
The most critical metric for data freshness.
Alert on:
- Lag > 1,000,000 messages (warning)
- Lag > 5,000,000 messages (critical)
- Lag growing consistently over 15 minutes
Consumer lag indicates how far behind Kafka the consumer is. High lag means delayed data visibility.
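The lag thresholds above translate directly into a severity function, and "growing consistently" can be approximated as a strictly increasing series of readings. Both helpers below are hypothetical sketches; `readings` is assumed to hold evenly spaced samples covering the 15-minute window.

```python
def lag_severity(lag_messages):
    """Map a lag reading to the warning/critical thresholds listed above."""
    if lag_messages > 5_000_000:
        return "critical"
    if lag_messages > 1_000_000:
        return "warning"
    return "ok"


def lag_is_growing(readings):
    """True if every reading in the window is larger than the one before it."""
    return len(readings) >= 2 and all(
        later > earlier for earlier, later in zip(readings, readings[1:])
    )
```

A strictly increasing test is deliberately strict; a production rule might instead compare the slope of the series against a tolerance.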
Message Processing Rate
Track ingestion throughput.
Alert on:
- Processing rate drops to 0 (consumer stuck)
- Processing rate < expected load
Batch Processing Time
Monitor consumer performance.
Alert on:
- Average batch time > 5 seconds
- P95 batch time > 10 seconds
ClickHouse Health Metrics
Connection Pool
Monitor ClickHouse connections.
Alert on:
- Connections at max pool size
- High connection wait times (> 100ms)
Query Failures
Track ClickHouse errors.
Common error codes:
- 241: Memory limit exceeded
- 159: Query timeout
- 202: Too many simultaneous queries
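When triaging failures it can help to translate numeric codes into names. The sketch below uses the codes as ClickHouse's error code list defines them (241, 159, 202); the `describe_error` helper and the advice strings are illustrative assumptions, not part of Snuba.

```python
# Codes and names follow ClickHouse's error code list; advice text is illustrative.
CLICKHOUSE_ERRORS = {
    241: ("MEMORY_LIMIT_EXCEEDED", "reduce query scope or raise memory limits"),
    159: ("TIMEOUT_EXCEEDED", "look for slow queries or overloaded nodes"),
    202: ("TOO_MANY_SIMULTANEOUS_QUERIES", "throttle callers or add capacity"),
}


def describe_error(code):
    """Return a short human-readable summary for a ClickHouse error code."""
    name, advice = CLICKHOUSE_ERRORS.get(code, ("UNKNOWN", "check ClickHouse logs"))
    return f"{code} {name}: {advice}"


print(describe_error(241))
```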
System Resource Metrics
Production Alerts
Critical Alerts
Alerts that require immediate response:
High API Error Rate
Alert: SLO - High API error rate
Threshold: Error rate > 5% for 15 minutes
Response:
- Check ClickHouse health
- Review error logs for common patterns
- Check for recent deployments
- Verify network connectivity
Consumer Lag Critical
Alert: Consumer lag exceeding threshold
Threshold: Lag > 5M messages for 15 minutes
Response:
- Scale consumer replicas
- Check consumer error rates
- Verify Kafka broker health
- Review ClickHouse insert performance
Pod Restart Loop
Alert: Too many restarts on Snuba pods
Threshold: > 3 restarts in 5 minutes
Response:
- Check pod logs for crash reason
- Review resource limits (OOM?)
- Check health check configuration
- Verify dependencies (ClickHouse, Redis, Kafka)
ClickHouse Connection Failures
Alert: Cannot connect to ClickHouse
Threshold: > 10 failed health checks in 5 minutes
Response:
- Verify ClickHouse is running
- Check network connectivity
- Review ClickHouse logs
- Check authentication credentials
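Thresholds of the form "more than N events in M minutes" (restarts, failed health checks) amount to a sliding-window count. The class below is a minimal sketch of that idea; `WindowedCounter` is a hypothetical helper, and in practice the monitoring system evaluates this rule for you.

```python
from collections import deque


class WindowedCounter:
    """Counts events inside a sliding time window (hypothetical helper)."""

    def __init__(self, window_s, threshold):
        self.window_s = window_s
        self.threshold = threshold
        self._events = deque()

    def record(self, timestamp_s):
        """Record one event and return True if the alert should fire."""
        self._events.append(timestamp_s)
        # Drop events that have fallen out of the window.
        while self._events and self._events[0] <= timestamp_s - self.window_s:
            self._events.popleft()
        return len(self._events) > self.threshold


# "> 3 restarts in 5 minutes": the fourth restart inside 300 seconds fires.
restarts = WindowedCounter(window_s=300, threshold=3)
print([restarts.record(t) for t in (0, 60, 120, 180)])
```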
Warning Alerts
Alerts requiring investigation but not immediate action.
DataDog Integration
Health Check Monitor
Create a DataDog monitor to check Snuba health.
Custom Dashboards
Key dashboard widgets:
Distributed Tracing
Sentry Integration
Snuba automatically sends traces to Sentry.
Trace Context
Snuba includes rich context in traces:
- Query: Full query text and parameters
- Dataset: Which dataset was queried
- Storage: Which ClickHouse storage was used
- Referrer: Query origin/caller
- Project IDs: Projects involved in query
- Timing Breakdown: Time spent in each phase
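A timing breakdown of the kind listed above can be collected with a small context manager that accumulates elapsed time per phase. This is a generic sketch using only the standard library; `PhaseTimer` and the phase names are hypothetical and do not reflect how Snuba records its own timings.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock seconds spent in each named phase."""

    def __init__(self):
        self.phases = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the phase raised an exception.
            elapsed = time.perf_counter() - start
            self.phases[name] = self.phases.get(name, 0.0) + elapsed


timer = PhaseTimer()
with timer.phase("parse"):
    pass
with timer.phase("execute"):
    time.sleep(0.01)
print(sorted(timer.phases))
```

The resulting dictionary could then be attached to a trace or log record as extra context.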
Performance Profiling
Enable profiling for performance debugging.
Logging
Log Levels
Configure logging verbosity.
Structured Logging
Snuba uses structured logging with key context.
Important Log Patterns
Query Performance Monitoring
Slow Query Logging
Snuba automatically logs slow queries.
Query Recording
Enable query recording for debugging.
Cost of Goods Sold (COGS) Tracking
Track query costs.
Operational Runbooks
High Consumer Lag
- Identify lag source: Check which storage has lag
- Scale consumers: Increase consumer replicas
- Check ClickHouse: Verify insert performance
- Review batch sizes: May need to increase batch size
- Check for slow queries: Blocking queries can delay inserts
API Error Rate Spike
- Check error types: Identify common error pattern
- Verify ClickHouse: Ensure database is healthy
- Review recent changes: Check for bad deployments
- Check rate limits: May need to adjust limits
- Monitor resource usage: CPU/memory exhaustion?
ClickHouse Connection Issues
- Verify connectivity: Test network path
- Check credentials: Ensure auth is correct
- Review connection pool: May need larger pool
- Check ClickHouse logs: Look for server errors
- Verify DNS resolution: Ensure hostname resolves
Best Practices
- Monitor consumer lag continuously: it is your most important metric
- Set up PagerDuty/OpsGenie integration for critical alerts
- Use log aggregation (Datadog Logs, Elasticsearch) for debugging
- Create custom dashboards for each team’s use cases
- Review slow query logs weekly to identify optimization opportunities
- Test alerts in staging before deploying to production
- Document runbooks for each critical alert
- Set up synthetic monitoring to test query paths proactively