Druid emits a comprehensive set of metrics that are essential for monitoring query execution, ingestion, coordination, and overall cluster health. Which metrics are emitted is controlled by the monitors you enable through druid.monitoring.monitors (see Monitoring Configuration below).

Metric Structure

All Druid metrics share a common set of fields:

| Field | Type | Description |
|---|---|---|
| timestamp | string | The time the metric was created |
| metric | string | The name of the metric |
| service | string | The service that emitted the metric (e.g., "druid/broker", "druid/historical") |
| host | string | The host that emitted the metric |
| value | number | The numeric value associated with the metric |
Most metric values reset each emission period, as specified in druid.monitoring.emissionPeriod.
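Put together, a single emitted metric event is a JSON object with exactly these fields. A minimal sketch of consuming one in Python; the sample values are illustrative, not from a real cluster:

```python
import json

# A hypothetical metric event with the common fields described above.
event_json = """
{
  "timestamp": "2024-01-15T10:00:00.000Z",
  "metric": "query/time",
  "service": "druid/broker",
  "host": "broker-1.example.com:8082",
  "value": 312
}
"""

event = json.loads(event_json)

# Route or aggregate on the fields every Druid metric shares.
print(f"{event['service']}@{event['host']} {event['metric']} = {event['value']}")
```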

Query Metrics

Router Metrics

| Metric | Type | Description | Dimensions | Normal Value |
|---|---|---|---|---|
| query/time | gauge | Milliseconds taken to complete a query | dataSource, type, interval, hasFilters, duration, context, remoteAddress, id, statusCode | < 1s |

Broker Metrics

The Broker emits detailed metrics about query processing and result merging.
| Metric | Description | Normal Value |
|---|---|---|
| query/time | Milliseconds taken to complete a query | < 1s |
| query/bytes | Total bytes returned to the client | Varies |
| query/node/time | Milliseconds to query individual Historical/Realtime processes | < 1s |
| query/node/bytes | Bytes returned from individual Historical/Realtime processes | Varies |
| query/node/ttfb | Time to first byte from Historical/Realtime processes | < 1s |
| query/count | Total number of queries | Varies |
| query/success/count | Number of successful queries | Varies |
| query/failed/count | Number of failed queries | Should be low |
| query/interrupted/count | Number of cancelled queries | Should be low |
| query/timeout/count | Number of timed-out queries | Should be low |
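Because these count metrics reset each emission period, dividing them within one period gives per-period rates suitable for alerting. A sketch with hypothetical sampled values:

```python
# Hypothetical per-period samples of the Broker count metrics above.
counts = {
    "query/count": 1000,
    "query/success/count": 990,
    "query/failed/count": 6,
    "query/interrupted/count": 3,
    "query/timeout/count": 1,
}

total = counts["query/count"]
failure_rate = counts["query/failed/count"] / total
timeout_rate = counts["query/timeout/count"] / total

# "Should be low": alert when more than 1% of queries fail or time out.
if failure_rate + timeout_rate > 0.01:
    print(f"query errors elevated: {failure_rate:.1%} failed, {timeout_rate:.1%} timed out")
```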
SQL query metrics emitted by the Broker:
{
  "sqlQuery/time": {
    "description": "Milliseconds taken to complete a SQL query",
    "dimensions": ["id", "nativeQueryIds", "dataSource", "remoteAddress", "success", "engine", "statusCode"],
    "normalValue": "< 1s"
  },
  "sqlQuery/planningTimeMs": {
    "description": "Milliseconds taken to plan a SQL to native query",
    "dimensions": ["id", "nativeQueryIds", "dataSource", "remoteAddress", "success", "engine"]
  },
  "sqlQuery/bytes": {
    "description": "Number of bytes returned in the SQL query response",
    "dimensions": ["id", "nativeQueryIds", "dataSource", "remoteAddress", "success", "engine"]
  }
}
| Metric | Description |
|---|---|
| query/cache/delta/* | Cache metrics since the last emission |
| query/cache/total/* | Total cache metrics |
| */numEntries | Number of cache entries |
| */sizeBytes | Size in bytes of cache entries |
| */hits | Number of cache hits |
| */misses | Number of cache misses |
| */evictions | Number of cache evictions |
| */hitRate | Cache hit rate (normal: ~40%) |
| */errors | Number of cache errors (should be 0) |
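The */hitRate metric is emitted directly, but the same quantity can be derived from hits and misses. For instance, with hypothetical snapshot values:

```python
# Hypothetical snapshot of query/cache/total/hits and query/cache/total/misses.
hits, misses = 400_000, 600_000

hit_rate = hits / (hits + misses)  # same quantity as */hitRate
print(f"cache hit rate: {hit_rate:.0%}")  # 40% here, in line with the ~40% normal value
```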

Historical Metrics

# Segment query time
query/segment/time         # Time to query individual segment
query/wait/time           # Time waiting for segment scan
segment/scan/pending      # Segments waiting to be scanned
segment/scan/active       # Segments currently being scanned
If segment/scan/pending is consistently high, you may need to increase druid.processing.numThreads or add more Historicals.
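A "consistently high" check can be automated. A sketch, where the samples and threshold are hypothetical and cluster-specific:

```python
# Hypothetical recent samples of segment/scan/pending, one per emission period.
pending_samples = [12, 15, 14, 18, 16]
THRESHOLD = 10  # pick a value appropriate for your cluster

# Alert only when every recent sample is high, not on a single spike.
if all(p > THRESHOLD for p in pending_samples):
    print("segment/scan/pending consistently high: raise "
          "druid.processing.numThreads or add Historicals")
```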

Ingestion Metrics

General Native Ingestion

| Metric | Description | Dimensions |
|---|---|---|
| ingest/count | Count of ingestion jobs | dataSource, taskId, taskType, taskIngestionMode |
| ingest/segments/count | Final segments created | dataSource, taskId, taskType |
| ingest/tombstones/count | Tombstones created | dataSource, taskId, taskType |
| ingest/events/processed | Events processed per emission period | dataSource, taskId, taskType |
| ingest/events/unparseable | Unparseable events rejected | dataSource, taskId, taskType |

Ingestion Performance Metrics

# Persistence metrics
ingest/persists/count       # Number of persist operations
ingest/persists/time        # Time spent on persist
ingest/persists/backPressure # Time blocked on persist

# Handoff metrics
ingest/handoff/count        # Successful handoffs
ingest/handoff/failed       # Failed handoffs (should be 0)
ingest/handoff/time         # Time to complete handoff

# Processing metrics
ingest/merge/time           # Time merging segments
ingest/rows/output          # Druid rows persisted
Healthy Ingestion Indicators:
  • ingest/persists/backPressure is 0 or very low
  • ingest/handoff/failed is 0
  • ingest/events/unparseable is 0 or minimal
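These indicators can be checked mechanically. A minimal sketch, assuming the latest values have already been collected into a dict:

```python
# Hypothetical latest values for the ingestion metrics above.
latest = {
    "ingest/persists/backPressure": 0,  # ms blocked on persist
    "ingest/handoff/failed": 0,
    "ingest/events/unparseable": 2,
}

# Any nonzero value here is worth a look.
problems = [name for name, value in latest.items() if value > 0]
print("healthy" if not problems else "check: " + ", ".join(problems))
```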

Coordination Metrics

Coordinator Metrics

| Metric | Type | Description | Dimensions |
|---|---|---|---|
| segment/assigned/count | counter | Number of segments assigned to be loaded in the cluster | dataSource, tier |
| segment/moved/count | counter | Number of segments moved in the cluster | dataSource, tier |
| segment/dropped/count | counter | Number of segments dropped due to being over-replicated | dataSource, tier |
| segment/deleted/count | counter | Number of segments marked as unused due to drop rules | dataSource |

Indexing Service Metrics

Task Metrics

  • task/run/time: Task execution time
  • task/pending/time: Time waiting to start
  • task/waiting/time: Time waiting for scheduling
  • task/success/count: Successful tasks
  • task/failed/count: Failed tasks

Slot Metrics

  • taskSlot/total/count: Total task slots
  • taskSlot/idle/count: Available slots
  • taskSlot/used/count: Busy slots
  • taskSlot/blacklisted/count: Blacklisted slots
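Slot utilization follows directly from these counters. A sketch with hypothetical values:

```python
# Hypothetical samples of taskSlot/total/count and taskSlot/used/count.
total_slots, used_slots = 40, 38

utilization = used_slots / total_slots
print(f"task slot utilization: {utilization:.0%}")

# Near-full slots mean new tasks will accumulate task/pending/time.
if utilization > 0.9:
    print("task slots nearly exhausted: consider adding worker capacity")
```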

System Metrics

JVM Metrics

Monitor JVM health with these critical metrics:
# Heap memory
jvm/mem/heap/used          # Heap memory in use
jvm/mem/heap/committed     # Committed heap memory
jvm/mem/heap/max           # Maximum heap memory

# Non-heap memory
jvm/mem/nonheap/used       # Non-heap memory in use
jvm/mem/nonheap/committed  # Committed non-heap memory

# Pools
jvm/pool/*/used            # Memory pool usage
jvm/pool/*/max             # Memory pool maximum
jvm/gc/count               # GC collection count
jvm/gc/time                # Time spent in GC
jvm/gc/cpu                 # CPU time spent in GC
If jvm/gc/time is consistently high (> 10% of total time), investigate heap sizing and GC tuning.
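The >10% rule of thumb can be computed from jvm/gc/time and the emission period. A sketch with hypothetical numbers:

```python
# jvm/gc/time: ms spent in GC during one emission period (PT1M = 60,000 ms).
gc_time_ms = 7_200
period_ms = 60_000

gc_fraction = gc_time_ms / period_ms
print(f"GC time fraction: {gc_fraction:.1%}")
if gc_fraction > 0.10:  # the > 10% threshold mentioned above
    print("GC pressure: revisit heap sizing and collector settings")
```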

Jetty Server Metrics

| Metric | Description | Normal Value |
|---|---|---|
| jetty/numOpenConnections | Open connections | Not much higher than the thread count |
| jetty/threadPool/utilized | Threads in use | < jetty/threadPool/ready |
| jetty/threadPool/utilizationRate | Thread pool utilization | 0.0 - 1.0 |
| jetty/threadPool/idle | Idle threads | > 0 means spare capacity |
| jetty/threadPool/queueSize | Queued requests | Should be low |
A jetty/threadPool/utilizationRate consistently near 1.0 indicates you should increase druid.server.http.numThreads.
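A saturation check can combine two of the metrics above. A sketch with hypothetical sample values:

```python
# Hypothetical samples of the Jetty thread pool metrics.
utilization_rate = 0.97  # jetty/threadPool/utilizationRate
queue_size = 12          # jetty/threadPool/queueSize

# Sustained near-1.0 utilization plus a growing queue means the HTTP
# thread pool is the bottleneck.
if utilization_rate > 0.95 and queue_size > 0:
    print("HTTP thread pool saturated: raise druid.server.http.numThreads")
```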

Monitoring Configuration

Enable specific monitors by configuring druid.monitoring.monitors in common.runtime.properties. The property is a single list and a later assignment overrides an earlier one, so combine all monitors you want into one entry:
# Enable JVM, system, cache, and query-count monitors in one list
druid.monitoring.monitors=["org.apache.druid.java.util.metrics.JvmMonitor","org.apache.druid.java.util.metrics.SysMonitor","org.apache.druid.client.cache.CacheMonitor","org.apache.druid.server.metrics.QueryCountStatsMonitor"]

# Emission period (default: PT1M)
druid.monitoring.emissionPeriod=PT1M

Commonly used monitors:
  • JvmMonitor: JVM memory, GC, and buffer pool metrics
  • SysMonitor: CPU, disk, network, and memory system metrics
  • QueryCountStatsMonitor: query success, failure, and timeout counts
  • CacheMonitor: cache hit/miss rates and performance

Metrics Export

Druid can emit metrics to various monitoring systems:
# Load Prometheus emitter extension
druid.extensions.loadList=["prometheus-emitter"]

# Configure emitter
druid.emitter=prometheus
druid.emitter.prometheus.strategy=exporter
druid.emitter.prometheus.port=8000
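With the exporter strategy, metrics are served in the Prometheus text exposition format on the configured port (here 8000). A minimal parsing sketch; the metric name druid_query_time and its label are illustrative, since the emitter's exact naming depends on its mapping configuration:

```python
# One hypothetical line scraped from the exporter's /metrics endpoint.
line = 'druid_query_time{service="druid/broker"} 312.0'

# Split "name{labels} value" into its three parts.
name, rest = line.split("{", 1)
labels, value = rest.rsplit("} ", 1)
print(name, labels, float(value))
```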

Best Practices

1. Start with Core Metrics: monitor query time, ingestion lag, and task success rates first.
2. Set Up Alerting: create alerts for failed tasks, high query times, and ingestion lag.
3. Track Trends: monitor metric trends over time to identify performance degradation.
4. Correlate Metrics: look at multiple metrics together (e.g., query time + JVM GC time).
