Snuba’s storage layer is built on ClickHouse, a columnar database optimized for analytical queries on time-series data. This architecture provides the real-time performance and scalability required for Sentry’s query infrastructure.
Why ClickHouse?
ClickHouse was selected as Snuba’s backing storage for several key reasons:
- Real-Time Performance: Fast query execution on large datasets with columnar storage
- Distributed Architecture: Built-in sharding and replication for horizontal scaling
- Flexible Storage Engines: Multiple table engines for different consistency/performance tradeoffs
- SQL Interface: Familiar query language with powerful extensions
ClickHouse Cluster Architecture
Snuba manages connections to ClickHouse through cluster abstractions.
Cluster Types
Single Node Cluster
- One ClickHouse server instance
- Simplified operations and migrations
- No distributed tables required
- Suitable for development and small deployments
Multi-Node Cluster
- Multiple shards and replicas
- Requires cluster_name for local tables
- Requires distributed_cluster_name for distributed tables
- Enables horizontal scaling
A single proxy address is used for all read/write operations, but DDL operations must run on each individual node.
Cluster Configuration
Clusters are defined in settings.py.
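A minimal sketch of what such a definition might look like. The cluster_name and distributed_cluster_name keys come from the description above; the remaining key names, the host addresses, and the validate_cluster helper are illustrative assumptions and should be checked against the settings file of your Snuba version.

```python
# Illustrative cluster definitions; values and most key names are assumptions.
CLUSTERS = [
    {
        # Single-node cluster: one server, local tables only.
        "host": "127.0.0.1",
        "port": 9000,
        "storage_sets": {"events", "transactions"},
        "single_node": True,
    },
    {
        # Multi-node cluster: reached through a single proxy address.
        "host": "clickhouse-proxy.internal",  # hypothetical address
        "port": 9000,
        "storage_sets": {"metrics"},
        "single_node": False,
        "cluster_name": "metrics_local_cluster",
        "distributed_cluster_name": "metrics_dist_cluster",
    },
]


def validate_cluster(cluster: dict) -> list[str]:
    """Return a list of problems; multi-node clusters need both cluster names."""
    problems = []
    if not cluster.get("single_node", False):
        for key in ("cluster_name", "distributed_cluster_name"):
            if not cluster.get(key):
                problems.append(f"multi-node cluster is missing {key!r}")
    return problems
```

Running validate_cluster over every entry at startup is one way to catch a multi-node cluster that was declared without its two cluster names.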
Table Structure
Local vs Distributed Tables
In multi-node deployments, Snuba creates two types of tables:
Local Tables
- Suffix: _local
- Physical storage on each shard
- Data written directly here
- Contains actual data files
Distributed Tables
- Suffix: _dist
- Virtual table that routes queries to local tables
- Queries distributed across all shards
- No data stored locally
Single-node deployments create only local tables; distributed tables would add unnecessary overhead there.
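The naming rule can be sketched as a small helper. The function name and its parameters are hypothetical and not part of Snuba; only the _local and _dist suffixes come from the text above.

```python
def table_name(base: str, single_node: bool, for_query: bool) -> str:
    """Pick the table a storage should use (hypothetical helper).

    Single-node clusters only have local tables. On multi-node clusters,
    writes go to the local table and queries go through the distributed one.
    """
    if single_node or not for_query:
        return f"{base}_local"
    return f"{base}_dist"
```

For example, table_name("errors", single_node=False, for_query=True) yields "errors_dist", while the same call with for_query=False yields "errors_local".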
Table Engines
ClickHouse provides multiple table engines. Snuba primarily uses these:
ReplicatedReplacingMergeTree
Most common engine for Snuba tables:
- Replication: Automatic data replication via ZooKeeper
- Deduplication: Removes duplicate rows with the same sorting key (ORDER BY) value
- Replacement: Keeps the row with the highest version column value
- Eventual consistency: Deduplication happens during merges, so duplicates can be visible until a merge runs
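The replacement rule can be illustrated with a small simulation in plain Python. This is only a model of what a merge does to rows, not ClickHouse code; the column names in the usage note are made up for the example.

```python
from typing import Iterable

Row = dict  # one row, e.g. {"project_id": 1, "event_id": "a", "version": 2}


def merge_replacing(rows: Iterable[Row], sort_key: tuple[str, ...], version: str) -> list[Row]:
    """Keep, for each sorting-key value, the row with the highest version."""
    latest: dict[tuple, Row] = {}
    for row in rows:
        key = tuple(row[col] for col in sort_key)
        current = latest.get(key)
        # ">=" lets the most recently inserted row win a version tie.
        if current is None or row[version] >= current[version]:
            latest[key] = row
    # Return rows in sorting-key order, as they would sit in a merged part.
    return sorted(latest.values(), key=lambda r: tuple(r[c] for c in sort_key))
```

Given two rows with the same (project_id, event_id) and versions 1 and 2, only the version 2 row survives the merge. Before the merge runs, both rows are still stored, which is why reads that need exact results have to deduplicate at query time.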
ReplicatedMergeTree
Basic replicated storage without deduplication:
- Replication via ZooKeeper
- No automatic deduplication
- Faster inserts than ReplacingMergeTree
ReplicatedAggregatingMergeTree
Pre-aggregates data during merges:
- Stores intermediate aggregation states
- Combines states during merges
- Requires AggregateFunction column types
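To show what an intermediate aggregation state is, here is a plain-Python model of an average. It mirrors the idea behind ClickHouse's -State and -Merge combinators (for example avgState and avgMerge) but is not how ClickHouse stores the state internally; the class name is invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgState:
    """Intermediate state for an average: enough to merge, not yet a result."""

    total: float = 0.0
    count: int = 0

    def add(self, value: float) -> "AvgState":
        # Fold one raw value into the state (what happens at insert time).
        return AvgState(self.total + value, self.count + 1)

    def merge(self, other: "AvgState") -> "AvgState":
        # Combine two states (what happens when parts are merged).
        return AvgState(self.total + other.total, self.count + other.count)

    def finalize(self) -> float:
        # Turn the state into the final value (what happens at query time).
        return self.total / self.count if self.count else float("nan")
```

Storing (total, count) rather than the average itself is what makes merging correct: two averages cannot be combined without knowing how many values each one covers.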
Materialized Views
Creates derived tables that auto-update.
Storage Schema
Snuba storages are defined in YAML configuration.
Key Schema Elements
Partition Format
Defines how data is partitioned on disk. Partitioning provides:
- Efficient data deletion (drop entire partitions)
- Pruning irrelevant partitions during queries
- Parallel processing across partitions
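A sketch of how partition keys make deletion cheap. The scheme used here, partitioning by retention period and by the Monday of each row's week, is an assumption made for illustration, and both function names are hypothetical.

```python
from datetime import date, datetime, timedelta


def partition_key(retention_days: int, timestamp: datetime) -> tuple[int, date]:
    """Assumed scheme: partition by retention period and the Monday of the week."""
    day = timestamp.date()
    monday = day - timedelta(days=day.weekday())
    return (retention_days, monday)


def expired_partitions(
    partitions: set[tuple[int, date]], today: date
) -> set[tuple[int, date]]:
    """Partitions whose whole week is older than their retention period."""
    return {
        (retention, monday)
        for retention, monday in partitions
        # The last day of the week is monday + 6 days.
        if monday + timedelta(days=6 + retention) < today
    }
```

Because every row in an expired partition is past its retention period, the whole partition can be dropped in one operation instead of deleting rows individually.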
Order By / Primary Key
Defines sort order and primary key (from DDL):
- Determines data locality on disk
- Enables efficient range scans
- Critical for query performance
Choose ORDER BY columns based on most common query patterns. First columns should be most selective.
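The effect of sort order on range scans can be modeled with a sorted list and binary search. This is a simplification: ClickHouse uses a sparse index over blocks of rows rather than a per-row search, and the (project_id, timestamp) key is an assumed example.

```python
from bisect import bisect_left

# Rows kept sorted by (project_id, timestamp), as an ORDER BY would keep them.
rows = sorted([
    (1, 100, "a"), (1, 200, "b"), (2, 150, "c"), (2, 250, "d"), (3, 50, "e"),
])


def range_scan(sorted_rows, project_id, ts_from, ts_to):
    """Return rows for one project with ts_from <= timestamp < ts_to."""
    keys = [(r[0], r[1]) for r in sorted_rows]
    lo = bisect_left(keys, (project_id, ts_from))
    hi = bisect_left(keys, (project_id, ts_to))
    # Matching rows are contiguous, so one slice covers the whole range.
    return sorted_rows[lo:hi]
```

A filter on project_id plus a timestamp range maps to one contiguous slice here. A filter on timestamp alone would not, because rows for the same time window are scattered across projects.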
Data Types
Snuba uses ClickHouse’s rich type system.
Primitive Types
Complex Types
Nested
Stores arrays of structs, exposed as parallel array columns such as tags.key and tags.value.
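The parallel-array layout can be sketched in Python. Both helpers are hypothetical, and sorting the keys is a choice made for this example rather than something the source specifies.

```python
from typing import Optional


def to_nested_columns(tags: dict[str, str]) -> dict[str, list[str]]:
    """Split a tag mapping into two arrays of equal length."""
    keys = sorted(tags)  # assumption: a stable order keeps the output predictable
    return {"tags.key": keys, "tags.value": [tags[k] for k in keys]}


def tag_value(columns: dict[str, list[str]], key: str) -> Optional[str]:
    """Look up a value by finding the key's position in the key array."""
    try:
        index = columns["tags.key"].index(key)
    except ValueError:
        return None
    return columns["tags.value"][index]
```

The element at position i of tags.key belongs with the element at position i of tags.value, so the two arrays must always have the same length.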
Array
Homogeneous arrays.
Nullable
Allows NULL values (use sparingly).
Connection Management
Snuba maintains connection pools for efficiency.
Client Settings
Different operations use different ClickHouse settings.
Batch Writing
Snuba writes to ClickHouse in batches for efficiency:
- Chunk size: Number of rows per HTTP request
- Buffer size: Memory buffer for accumulating rows
- Max connections: Connection pool size
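How chunk size and buffer size interact can be sketched with a small buffering writer. The BatchWriter class and its send callback are hypothetical and stand in for whatever actually issues the HTTP request; this is not Snuba's writer.

```python
from typing import Callable


class BatchWriter:
    """Buffers rows and sends them in fixed-size chunks (illustrative only)."""

    def __init__(
        self,
        send: Callable[[list[dict]], None],
        chunk_size: int,
        buffer_size: int,
    ) -> None:
        if chunk_size <= 0 or buffer_size < chunk_size:
            raise ValueError("need 0 < chunk_size <= buffer_size")
        self._send = send
        self._chunk_size = chunk_size
        self._buffer_size = buffer_size
        self._buffer: list[dict] = []

    def write(self, row: dict) -> None:
        """Accumulate a row; flush automatically once the buffer is full."""
        self._buffer.append(row)
        if len(self._buffer) >= self._buffer_size:
            self.flush()

    def flush(self) -> None:
        """Send everything buffered, one chunk per request."""
        while self._buffer:
            chunk = self._buffer[: self._chunk_size]
            self._buffer = self._buffer[self._chunk_size :]
            self._send(chunk)
```

With chunk_size=2 and buffer_size=4, writing five rows and then calling flush() produces three requests carrying 2, 2, and 1 rows.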
Storage Sets
Storages are grouped into Storage Sets that share a cluster:
- Co-locate related storages on same cluster
- Share connection pools and resources
- Enable cross-storage optimizations
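The grouping can be pictured as a lookup from storage set to cluster. The cluster names, storage set names, and the cluster_for helper below are all made up for this sketch.

```python
# Hypothetical clusters, each owning a group of storage sets.
CLUSTERS = [
    {"name": "main", "storage_sets": {"events", "transactions"}},
    {"name": "metrics", "storage_sets": {"metrics"}},
]


def cluster_for(storage_set: str, clusters: list = CLUSTERS) -> dict:
    """Find the single cluster that owns a storage set."""
    matches = [c for c in clusters if storage_set in c["storage_sets"]]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one cluster for {storage_set!r}, found {len(matches)}"
        )
    return matches[0]
```

Storages in the "events" and "transactions" sets resolve to the same cluster object here, which is what lets them share one connection pool.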
Performance Considerations
Query Optimization
- PREWHERE clause: Filter before reading all columns
- Index usage: Leverage ORDER BY for range scans
- Partition pruning: Filter by partition key columns
- Sampling: Use SAMPLE clause for approximate queries
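A sketch of where these clauses sit in a ClickHouse query, written as a Python string builder. The build_query function and the project_id and timestamp column names are assumptions for illustration. String interpolation is used only to keep the example short; real code should bind parameters instead. SAMPLE also only works on tables declared with a sampling key.

```python
from typing import Optional, Sequence


def build_query(
    table: str,
    columns: Sequence[str],
    project_id: int,
    start: str,
    end: str,
    sample_rate: Optional[float] = None,
) -> str:
    """Assemble a SELECT with SAMPLE, PREWHERE, and WHERE in ClickHouse order."""
    # SAMPLE follows the table name and precedes PREWHERE.
    sample = f" SAMPLE {sample_rate}" if sample_rate is not None else ""
    return (
        f"SELECT {', '.join(columns)} FROM {table}{sample} "
        # PREWHERE filters on a cheap column before other columns are read.
        f"PREWHERE project_id = {int(project_id)} "
        # The time range lets ClickHouse skip partitions outside it.
        f"WHERE timestamp >= toDateTime('{start}') "
        f"AND timestamp < toDateTime('{end}')"
    )
```

Calling it with sample_rate=0.1 inserts SAMPLE 0.1 after the table name; leaving sample_rate unset omits the clause entirely.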
Write Optimization
- Batch inserts: Write thousands of rows per request
- Partition alignment: Respect partition boundaries
- Deduplication: Rely on ReplacingMergeTree rather than application-level logic
Storage Optimization
- Compression: ClickHouse compresses data automatically
- Materialized views: Pre-aggregate common queries
- TTL policies: Auto-delete old partitions
- Column selection: Only include necessary columns
ClickHouse’s columnar storage means queries read only the columns they reference, so adding a column has little effect on queries that do not use it.
Related Topics
- Data Model - Understanding Storages in the data model
- Ingestion - How data is written to storage
- Query Processing - How storage is queried
- Slicing - Multi-tenant storage configuration