Druid’s Data Model Overview
Datasources
Similar to tables in a traditional RDBMS
Rollup
Optional partial aggregation during ingestion
Time-based
Every row must have a timestamp
OLAP
Columns are either dimensions or metrics
For general information, see the Schema Model documentation.
Coming from Other Systems
Relational Databases (RDBMS)
Key Differences
Denormalization is Recommended
Unlike relational databases, where normalization is best practice, Druid benefits from denormalized data. A normalized schema requires JOINs at query time; with Druid, the recommended approach is to join dimension data into a single flat datasource at ingestion time.
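As a sketch of the difference (table and column names are hypothetical), a normalized orders/products/customers schema would be flattened into single rows like this before ingestion:

```json
{
  "timestamp": "2024-01-01T00:00:00Z",
  "order_id": "o-1001",
  "product_name": "Widget",
  "product_category": "Hardware",
  "customer_country": "US",
  "revenue": 19.99
}
```

Each row carries the dimension attributes it needs, so no query-time JOIN is required.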
Why Denormalization Works
Storage Efficiency
Druid uses dictionary encoding to store string columns efficiently. Each unique value is stored once, and rows contain integer references.
Query Performance
- No JOIN operations needed at query time
- Direct access to all data
- Operates on compressed dictionary-encoded data
When to Use Lookups
Use lookups (similar to dimension tables) when:
- You need to update dimension values frequently
- Changes should reflect immediately in already-ingested data
- Memory footprint is acceptable (full copy on each server)
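A hedged illustration of querying through a lookup with Druid SQL's LOOKUP function (the `sales` datasource and `country_names` lookup are hypothetical):

```sql
-- Map a code to a human-readable value at query time;
-- updating the lookup changes results without reingesting data.
SELECT
  LOOKUP(country_code, 'country_names') AS country,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY 1
```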
Tips for Relational Data
Pre-join Large Tables
Pre-join two large distributed tables before loading into Druid, since Druid does not support query-time joins between large distributed datasources.
Time Series Databases
Key Differences
Data Point vs Series Model
Druid treats each point separately rather than as part of a "time series".
Tips for Time Series Data
Metric Name Dimension
Create a dimension for the series name:
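A minimal dimensionsSpec sketch with the series name as its own dimension (column names are hypothetical):

```json
{
  "dimensionsSpec": {
    "dimensions": ["metric_name", "host", "datacenter"]
  }
}
```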
Place the metric name first in the dimensions list for better locality and performance.
Tags as Dimensions
Configure Metrics
Define aggregations for the types of queries you’ll run:
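One possible metricsSpec for min/max/sum style time-series queries (the aggregator names and the `value` input field are assumptions):

```json
{
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "longSum", "name": "value_sum", "fieldName": "value" },
    { "type": "longMin", "name": "value_min", "fieldName": "value" },
    { "type": "longMax", "name": "value_max", "fieldName": "value" }
  ]
}
```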
Enable rollup to combine data points at different time granularities or to merge timeseries and non-timeseries data.
Log Aggregation Systems
Key Differences
Explicit Schema vs Schema-less
Unlike Elasticsearch or Splunk, Druid requires more explicit schema definition upfront.
Tips for Log Data
Schema Discovery
If you don’t know the columns ahead of time, use automatic schema discovery:
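A dimensionsSpec fragment enabling type-aware schema discovery (available in recent Druid versions); leaving `dimensions` empty lets Druid discover all columns:

```json
{
  "dimensionsSpec": {
    "useSchemaDiscovery": true,
    "dimensions": []
  }
}
```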
Nested Data
For nested JSON logs, you have two options:
- Nested Columns
- Flatten
Use Druid’s native nested column support:
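A sketch of a dimensionsSpec that stores a nested field as a `json`-typed column (the `request` field is hypothetical):

```json
{
  "dimensionsSpec": {
    "dimensions": [
      "host",
      { "type": "json", "name": "request" }
    ]
  }
}
```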
Rollup for Analytics
Enable rollup if you have analytical use cases:
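An example granularitySpec with rollup enabled; the granularities shown are illustrative choices, not defaults:

```json
{
  "granularitySpec": {
    "segmentGranularity": "day",
    "queryGranularity": "minute",
    "rollup": true
  }
}
```

With this spec, events are truncated to the minute and identical dimension combinations within a minute collapse into one stored row.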
General Best Practices
Partitioning and Sorting
Optimize Partitioning
Proper partitioning and sorting can substantially impact footprint and performance.
- Time-based
- Secondary Partitioning
Choose appropriate segment granularity:
- High-volume data: HOUR
- Medium-volume data: DAY
- Low-volume data: WEEK or MONTH
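For instance, a high-volume datasource might use hourly segments (values shown are illustrative):

```json
{
  "granularitySpec": {
    "segmentGranularity": "hour",
    "queryGranularity": "none",
    "rollup": false
  }
}
```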
High Cardinality Columns
Use Sketches
For high-cardinality columns like user IDs, use sketches for approximate analysis. For count-distinct queries, available sketch types include:
- HyperLogLog
- Theta Sketch
- Quantiles
Benefits:
- Improved rollup ratios (multiple distinct values collapse into a single sketch)
- Reduced memory footprint at query time
- Faster aggregation of approximate results
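A possible metricsSpec using sketch aggregators; these aggregator types are provided by the druid-datasketches extension, and the field names are hypothetical:

```json
{
  "metricsSpec": [
    { "type": "HLLSketchBuild", "name": "user_id_hll", "fieldName": "user_id" },
    { "type": "thetaSketch", "name": "user_id_theta", "fieldName": "user_id" },
    { "type": "quantilesDoublesSketch", "name": "latency_quantiles", "fieldName": "latency_ms" }
  ]
}
```

At query time, a column built with HLLSketchBuild can be queried with functions such as APPROX_COUNT_DISTINCT_DS_HLL in Druid SQL.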
String vs Numeric Dimensions
Numeric Dimensions (Long, Double, Float)
✅ Faster to group on
❌ Slower to filter on (no indexes)
✅ Less memory usage

String Dimensions
✅ Faster to filter on (with indexes)
❌ Slower to group on
❌ More memory usage
Secondary Timestamps
Store as Long Dimensions
If you have multiple timestamps, store additional ones as long-typed dimensions:
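A dimensionsSpec fragment storing a hypothetical `updated_at` column as a long-typed dimension:

```json
{
  "dimensionsSpec": {
    "dimensions": [
      { "type": "long", "name": "updated_at" }
    ]
  }
}
```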
Convert with Transform
Use transformSpec to convert timestamps to milliseconds:
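A transformSpec sketch using the `timestamp_parse` expression function to produce milliseconds (the `updated_at` input column is hypothetical):

```json
{
  "transformSpec": {
    "transforms": [
      {
        "type": "expression",
        "name": "updated_at",
        "expression": "timestamp_parse(\"updated_at\")"
      }
    ]
  }
}
```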
Query Secondary Timestamps
Use SQL time functions at query time:
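For example (the datasource and column names are hypothetical), bucketing a long-typed secondary timestamp by hour:

```sql
SELECT
  TIME_FLOOR(MILLIS_TO_TIMESTAMP(updated_at), 'PT1H') AS updated_hour,
  COUNT(*) AS row_count
FROM events
GROUP BY 1
```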
Nested Dimensions
- Native Support
- Flatten
Store nested dimensions as COMPLEX<json>, then query them with SQL JSON functions:
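A hedged example using JSON_VALUE to extract a nested field at query time (the `logs` datasource and `request` column are hypothetical):

```sql
SELECT
  JSON_VALUE(request, '$.method') AS http_method,
  COUNT(*) AS row_count
FROM logs
GROUP BY 1
```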
Counting Ingested Events
- At Ingestion: define a count metric. Because rollup can combine multiple events into one stored row, this metric records how many events each row represents.
- At Query Time: sum the count metric (for example, SUM("count") in SQL) rather than using COUNT(*), which counts stored rows after rollup.
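A minimal metricsSpec fragment for this pattern (the metric name `count` is a common convention, not a requirement):

```json
{
  "metricsSpec": [
    { "type": "count", "name": "count" }
  ]
}
```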
Schema Auto-Discovery
- Type-Aware (recommended for most use cases): Druid infers STRING, LONG, and DOUBLE columns, ARRAY<STRING>, ARRAY<LONG>, and ARRAY<DOUBLE> arrays, and COMPLEX<json> for nested data.
- String-Based: legacy discovery that ingests all discovered columns as strings.
Same Column as Dimension and Metric
Use Case: Unique IDs
To filter on an ID while also computing unique counts:
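A sketch combining both uses of the same input column, with `user_id` as a filterable dimension and as input to a sketch metric (names are hypothetical; the sketch aggregator requires the druid-datasketches extension):

```json
{
  "dimensionsSpec": {
    "dimensions": ["user_id"]
  },
  "metricsSpec": [
    { "type": "HLLSketchBuild", "name": "user_id_hll", "fieldName": "user_id" }
  ]
}
```

Dimensions and ingestion-time aggregators both read from the input row, so the same column can feed both.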
Complete Example
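A hedged, abbreviated dataSchema sketch pulling the preceding pieces together (datasource, column, and metric names are hypothetical; a complete ingestion spec also needs ioConfig and tuningConfig):

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": ["user_id", "country", "device"]
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "longSum", "name": "bytes_sum", "fieldName": "bytes" },
        { "type": "HLLSketchBuild", "name": "user_id_hll", "fieldName": "user_id" }
      ],
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "minute",
        "rollup": true
      }
    }
  }
}
```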
Next Steps
Schema Model
Learn about the core schema concepts
Partitioning
Optimize partitioning for your use case
Data Rollup
Understand rollup in detail
Nested Columns
Work with nested JSON data