Overview
The TSM (Time-Structured Merge) tree is Slung’s storage engine, combining a write-optimized in-memory cache with compressed disk-backed entries. The architecture is inspired by LSM trees but optimized for time series data.
Architecture
TsmTree Structure
The TSM tree implementation (src/tsm/tsm.zig:10-234) is parametric over configuration:
```zig
pub fn TsmTreeImpl(
    comptime max_level: u64,                  // Maximum tree depth
    comptime page_size: u32,                  // Page size in bytes
    comptime ts_encoding: TimestampEncoding,  // Encoding strategy
) type
```
Default configuration:
```zig
pub const TsmTree = TsmTreeImpl(
    100_000,  // max_level: 100K disk entries
    4096,     // page_size: 4KB pages
    .gorilla, // ts_encoding: Gorilla compression
);
```
Core Components
```zig
const Self = @This();
const MAX_CACHE_POINTS = 1_000_000;

allocator: Allocator,
name: []const u8,
entries: []*DiskEntry(page_size, ts_encoding),
entries_count: u64 = 0,
cache: *Cache(page_size, ts_encoding),
```
In-Memory Cache
Cache Architecture
The cache (src/tsm/cache.zig:13-109) uses skiplist-based storage for sorted time series data:
```zig
pub fn CacheImpl(
    comptime page_size: u32,
    comptime ts_encoding: TimestampEncoding,
) type {
    return struct {
        allocator: Allocator,
        index_series: HashMap(Skiplist(i64, Value, 16, std.Random.Pcg, ds.skiplist.compareI64)),
        bloom: Bloom(1024, Hasher),
        count: u64 = 0,
    };
}
```
Skiplist Storage
Each series gets its own skiplist with 16 levels:
- Key: i64 timestamp (microseconds)
- Value: union of Bool/Int/Float/Bytes
- Ordering: timestamps sorted in ascending order
- Insertion: O(log n) average case
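The behavior above can be modeled with a minimal Python sketch. This is an illustrative stand-in for the per-series skiplist, not the Zig implementation: it keeps points sorted by timestamp and finds positions in O(log n) via binary search (the list insert itself is O(n), which is exactly where a real skiplist improves on this sketch).

```python
import bisect

class SeriesBuffer:
    """Illustrative stand-in for the per-series skiplist:
    keeps (timestamp, value) pairs sorted by timestamp."""

    def __init__(self):
        self.timestamps = []  # sorted i64 microsecond timestamps
        self.values = []

    def insert(self, ts, value):
        # bisect finds the O(log n) position; list.insert is O(n),
        # which is what a real skiplist avoids
        i = bisect.bisect_left(self.timestamps, ts)
        self.timestamps.insert(i, ts)
        self.values.insert(i, value)

    def get_range(self, ts_start, ts_end):
        lo = bisect.bisect_left(self.timestamps, ts_start)
        hi = bisect.bisect_right(self.timestamps, ts_end)
        return self.values[lo:hi]

buf = SeriesBuffer()
for ts, v in [(30, 0.3), (10, 0.1), (20, 0.2)]:
    buf.insert(ts, v)
print(buf.get_range(10, 20))  # -> [0.1, 0.2]
```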
Cache Insert
```zig
pub fn insert(self: *Self, series_key: []const u8, data_point: DataPoint) !void {
    const series = try self.index_series.getOrPut(series_key);
    if (!series.found_existing) {
        const owned_key = try self.allocator.dupe(u8, series_key);
        series.key_ptr.* = owned_key;
        const skiplist = try Skiplist(...).init(
            self.allocator,
            @intCast(std.time.microTimestamp()),
        );
        series.value_ptr.* = skiplist;
        self.bloom.insert(series_key);
    }
    _ = try series.value_ptr.*.insert(data_point.timestamp, data_point.value);
    self.count += 1;
}
```
Bloom Filter
A 1024-bit Bloom filter tracks which series are present:
```zig
bloom: Bloom(1024, Hasher),

pub fn mayContainSeries(self: *Self, series_key: []const u8) bool {
    return self.bloom.contains(series_key);
}
```
This enables fast negative lookups (“series definitely not present”).
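A Bloom filter sets a few hash-derived bits per inserted key; a lookup answers "possibly present" or "definitely absent". The Python sketch below illustrates the idea over a 1024-bit set; the hash scheme and the number of probes are assumptions for illustration, not Slung's `Hasher`.

```python
import hashlib

class Bloom:
    """Illustrative 1024-bit Bloom filter. The blake2b-based hashing
    and 4 probes are assumptions, not the engine's actual Hasher."""

    def __init__(self, bits=1024, probes=4):
        self.bits = bits
        self.probes = probes
        self.bitset = 0  # big int used as a bit array

    def _positions(self, key: bytes):
        # Derive `probes` independent positions by personalizing the hash
        for i in range(self.probes):
            h = hashlib.blake2b(key, person=i.to_bytes(16, "little")).digest()
            yield int.from_bytes(h[:8], "little") % self.bits

    def insert(self, key: bytes):
        for pos in self._positions(key):
            self.bitset |= 1 << pos

    def contains(self, key: bytes):
        # May report a false positive, never a false negative
        return all(self.bitset >> pos & 1 for pos in self._positions(key))

b = Bloom()
b.insert(b"cpu.usage,host=server-01")
print(b.contains(b"cpu.usage,host=server-01"))  # -> True
```

Because a miss never produces a false negative, a `False` answer lets a query skip the series (or, on disk, a whole entry) without touching its data.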
Flush Process
Automatic Flush
The tree automatically flushes once the cache exceeds 1M data points:

```zig
pub fn insert(self: *Self, series_key: []const u8, data_point: DataPoint) !void {
    try self.cache.insert(series_key, data_point);
    if (self.cache.count > MAX_CACHE_POINTS) try self.flush();
}
```
Flush Implementation
```zig
pub fn flush(self: *Self) !void {
    if (self.cache.count == 0) return;

    // Create disk entry from cache
    const d_entry = try self.cache.flush(self.name, self.entries_count + 1);
    self.entries[self.entries_count] = d_entry;
    self.entries_count += 1;

    // Clear cache (reusing allocations)
    var iter = self.cache.index_series.iterator();
    while (iter.next()) |kv| {
        kv.value_ptr.deinit();
        self.allocator.free(kv.key_ptr.*);
    }
    self.cache.index_series.clearRetainingCapacity();
    self.cache.bloom.reset();
    self.cache.count = 0;
}
```
Disk Entries
Each disk entry consists of two files:
- `_{level}_{name}.dat`: columnar data (timestamps and values)
- `_{level}_{name}.idx`: indexes and metadata
DiskEntry Structure
```zig
const Self = @This();
const MAGIC = "SLZ01";
const VERSION_DELTA_COMPRESSED = 1;
const VERSION_GORILLA = 2;

allocator: Allocator,
file_path: []const u8,
file_data: fs.File,
file_index: fs.File,
index_row: HashMap(u64),
index_column: HashMap(u64),
index_series: HashMap([2]u64), // [start_row, end_row]
bloom: Bloom(1024, Hasher),
metadata: Metadata,
column_descriptors: []ColumnDescriptor,

pub const Metadata = struct {
    number_rows: u64,
    number_columns: u32,
    created_at: i64,
    page_size: u32,
    version: u32, // VERSION_DELTA_COMPRESSED or VERSION_GORILLA
    min_timestamp: i64,
    max_timestamp: i64,
};
```
Columnar Layout
Data is stored in two columns:
- time: timestamps (compressed)
- value: values (typed union)
Each series is a “page” within the column:
```zig
pub const PageDescriptor = struct {
    data_offset: u64,
    data_size: u32,
    start_row: u64, // First row ID for this series
    end_row: u64,   // Last row ID for this series
};
```
Series Index
Maps series keys to row ranges:
```zig
index_series: HashMap([2]u64) // Key: series_key, Value: [start_row, end_row]
```
Example:
```
"cpu.usage,host=server-01" -> [0, 999]
"cpu.usage,host=server-02" -> [1000, 1999]
```
Timestamp Encoding
Delta Encoding (VERSION_DELTA_COMPRESSED)
Uses zigzag varint encoding for deltas:
```zig
fn writeZigzagVarint(value: i64, buf: []u8) usize {
    const zigzag: u64 = @bitCast((value << 1) ^ (value >> 63));
    return writeVarint(zigzag, buf);
}
```
Encoding strategy:
- First timestamp: full 64 bits (with a 0xFF marker byte)
- Subsequent timestamps: zigzag-encoded delta from the previous timestamp
```zig
if (is_first_in_series) {
    serialize_buf[0] = 0xFF; // Marker byte
    std.mem.writeInt(i64, serialize_buf[1..9], node.key, .little);
    serialized_len = 9;
} else {
    const delta = node.key - prev_timestamp;
    serialized_len = writeZigzagVarint(delta, &serialize_buf);
}
```
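Zigzag folds signed deltas into small unsigned integers (0, -1, 1, -2, 2, ... become 0, 1, 2, 3, 4, ...), so a small delta of either sign yields a short varint. A Python sketch of the transform, mirroring the `i64` arithmetic above:

```python
MASK64 = (1 << 64) - 1

def zigzag_encode(value: int) -> int:
    # (value << 1) ^ (value >> 63); Python's arithmetic shift matches
    # i64 semantics, and masking mirrors the i64 -> u64 @bitCast
    return ((value << 1) ^ (value >> 63)) & MASK64

def zigzag_decode(encoded: int) -> int:
    # Undo the fold: even codes are non-negative, odd codes negative
    return (encoded >> 1) ^ -(encoded & 1)

# Small deltas of either sign become small codes, so the varint
# that follows usually needs only one byte:
print([zigzag_encode(d) for d in (0, -1, 1, -2, 2)])  # -> [0, 1, 2, 3, 4]
```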
Gorilla Encoding (VERSION_GORILLA)
Implements Facebook’s Gorilla compression algorithm (src/tsm/gorilla.zig).
Delta-of-Delta for Timestamps
Encoding scheme (gorilla.zig:188-266):
```zig
// First timestamp: 64 bits
// Second timestamp: full delta (64 bits)
// Subsequent timestamps: delta-of-delta with variable-length encoding
if (dod == 0) {
    // Case 1: dod == 0, write a single '0' bit
    writer.writeBit(0);
} else if (dod >= -63 and dod <= 64) {
    // Case 2: '10' prefix + 7 bits
    writer.writeBits(0b10, 2);
    const encoded: u64 = @bitCast(@as(i64, dod) + 63);
    writer.writeBits(encoded, 7);
} else if (dod >= -255 and dod <= 256) {
    // Case 3: '110' prefix + 9 bits
    writer.writeBits(0b110, 3);
    const encoded: u64 = @bitCast(@as(i64, dod) + 255);
    writer.writeBits(encoded, 9);
} else if (dod >= -2047 and dod <= 2048) {
    // Case 4: '1110' prefix + 12 bits
    writer.writeBits(0b1110, 4);
    const encoded: u64 = @bitCast(@as(i64, dod) + 2047);
    writer.writeBits(encoded, 12);
} else {
    // Case 5: '1111' prefix + full 64-bit value
    writer.writeBits(0b1111, 4);
    writer.writeI64(dod);
}
```
For regular time series (e.g., 60-second intervals), delta-of-delta is typically 0, compressing to a single bit per timestamp!
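The bit cost of the five cases can be modeled directly. The Python sketch below is an assumption-level cost model (not the Zig bit writer): it computes each delta-of-delta and sums the bits the scheme above would emit, showing how a perfectly regular series collapses to one bit per timestamp after the first two.

```python
def dod_bits(dod: int) -> int:
    """Bit cost of one delta-of-delta under the five cases above."""
    if dod == 0:
        return 1            # single '0' bit
    if -63 <= dod <= 64:
        return 2 + 7        # '10' prefix + 7 bits
    if -255 <= dod <= 256:
        return 3 + 9        # '110' prefix + 9 bits
    if -2047 <= dod <= 2048:
        return 4 + 12       # '1110' prefix + 12 bits
    return 4 + 64           # '1111' prefix + raw i64

def timestamp_bits(timestamps) -> int:
    bits = 64               # first timestamp: full 64 bits
    if len(timestamps) > 1:
        bits += 64          # second timestamp: full delta
        prev_delta = timestamps[1] - timestamps[0]
        for a, b in zip(timestamps[1:], timestamps[2:]):
            delta = b - a
            bits += dod_bits(delta - prev_delta)
            prev_delta = delta
    return bits

# 1000 points at exact 60s (60_000_000 us) intervals:
ts = [i * 60_000_000 for i in range(1000)]
print(timestamp_bits(ts))  # -> 64 + 64 + 998 = 1126 bits, vs 64_000 raw
```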
XOR Encoding for Floats
Exploits temporal locality in floating-point values (gorilla.zig:359-452):
```zig
pub fn encode(self: *FloatEncoder, writer: *BitWriter, value: f64) void {
    const bits: u64 = @bitCast(value);
    if (self.count == 0) {
        // First value: full 64 bits
        writer.writeF64(value);
        self.prev_value = bits;
        self.count = 1;
        return;
    }
    const xor = bits ^ self.prev_value;
    if (xor == 0) {
        // Values identical: single '0' bit
        writer.writeBit(0);
    } else {
        writer.writeBit(1);
        const leading: u6 = @intCast(@clz(xor));
        const trailing: u6 = @intCast(@ctz(xor));
        // Check if we can reuse the previous window
        if (self.count > 1 and
            leading >= self.prev_leading and
            trailing >= self.prev_trailing)
        {
            // Reuse window: '0' control bit
            writer.writeBit(0);
            const meaningful_bits: u7 = 64 - @as(u7, self.prev_leading) - @as(u7, self.prev_trailing);
            const shifted = xor >> self.prev_trailing;
            writer.writeBits(shifted, @intCast(meaningful_bits));
        } else {
            // New window: '1' control bit + metadata
            writer.writeBit(1);
            const meaningful_bits: u7 = 64 - @as(u7, leading) - @as(u7, trailing);
            const shifted = xor >> trailing;
            writer.writeBits(leading, 5); // 5 bits for leading zeros
            writer.writeBits(meaningful_bits - 1, 6); // 6 bits for length
            writer.writeBits(shifted, @intCast(meaningful_bits));
            self.prev_leading = leading;
            self.prev_trailing = trailing;
        }
    }
    // Track state for the next value
    self.prev_value = bits;
    self.count += 1;
}
```
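Why XOR compresses well: consecutive samples of a slowly changing metric share the sign, exponent, and high mantissa bits, so their XOR has long runs of leading (and often trailing) zeros and only a narrow "window" of meaningful bits needs to be stored. A Python model of that observation (an illustration, not the Zig encoder):

```python
import struct

def f64_bits(x: float) -> int:
    """Reinterpret a float's IEEE-754 representation as a u64."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def clz64(x: int) -> int:
    return 64 - x.bit_length()

def ctz64(x: int) -> int:
    return (x & -x).bit_length() - 1

def xor_stream(values):
    """Yield (xor, leading_zeros, trailing_zeros) for each value
    against its predecessor -- the Gorilla 'meaningful window'."""
    prev = f64_bits(values[0])
    for v in values[1:]:
        bits = f64_bits(v)
        xor = bits ^ prev
        prev = bits
        if xor == 0:
            yield (0, 64, 0)  # identical value: a single '0' bit
        else:
            yield (xor, clz64(xor), ctz64(xor))

# 12.0 -> 12.0 XORs to zero; 12.0 -> 24.0 differs in one exponent bit,
# leaving a single meaningful bit between 11 leading and 52 trailing zeros
for xor, lead, trail in xor_stream([12.0, 12.0, 24.0]):
    print(lead, trail)
```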
Compression Ratios
From the test suite (gorilla.zig:901-930):

```zig
// Regular 60s intervals, slowly changing values (1000 points)
const compressed_size = writer.bytesUsed();
const uncompressed_size = count * (8 + 8); // 16 bytes per point

// Gorilla achieves > 2x compression for regular data
try testing.expect(compression_ratio > 2.0);
```
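A back-of-envelope estimate shows why the 2.0x bound in the test is conservative. This Python arithmetic assumes the best case sketched earlier: perfectly regular timestamps (one bit each after the first two) and an unchanged value (one bit each after the first); real data with changing values sits between this and 2x.

```python
# Best-case estimate for 1000 points (an illustration, not measured output)
points = 1000
uncompressed = points * 16                  # 8-byte ts + 8-byte value

ts_bits = 64 + 64 + (points - 2) * 1        # dod == 0 after the 2nd point
val_bits = 64 + (points - 1) * 1            # identical values: '0' bit each
compressed = (ts_bits + val_bits) / 8       # total bytes

ratio = uncompressed / compressed
print(round(ratio, 1))  # -> 58.5, far above the 2.0x the test asserts
```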
Query Execution
Query Path
Queries execute against both cache and disk:
```zig
pub fn query(
    self: *Self,
    series_key: []const u8,
    timestamp_start: i64,
    timestamp_end: i64,
    op: QueryOp,
) !Value {
    // Query cache (hot path)
    const values_cache = self.queryCache(series_key, timestamp_start, timestamp_end) catch
        try self.allocator.alloc(Value, 0);
    defer self.allocator.free(values_cache);

    // Query disk (cold path)
    const values_disk = self.queryDisk(series_key, timestamp_start, timestamp_end) catch
        try self.allocator.alloc(Value, 0);
    defer self.allocator.free(values_disk);

    // Merge and aggregate
    const values = try std.mem.concat(self.allocator, Value, &.{ values_cache, values_disk });
    defer self.allocator.free(values);

    return switch (op) {
        .AVG => /* ... */,
        .MIN => /* ... */,
        .MAX => /* ... */,
        .SUM => /* ... */,
        .COUNT => /* ... */,
    };
}
```
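The merge-and-aggregate flow can be sketched in a few lines of Python. `query_cache` and `query_disk` here are hypothetical callables standing in for the tree's two lookup paths; either may find nothing, which mirrors the `catch`-to-empty-slice handling above.

```python
def run_query(query_cache, query_disk, series_key, ts_start, ts_end, op):
    """Combine cache (hot path) and disk (cold path) results, then
    aggregate. A lookup that fails contributes no values, mirroring
    the Zig code's catch-to-empty-slice behavior."""
    values = []
    for source in (query_cache, query_disk):
        try:
            values.extend(source(series_key, ts_start, ts_end))
        except KeyError:  # series absent in this source
            pass

    ops = {
        "AVG": lambda v: sum(v) / len(v),
        "MIN": min,
        "MAX": max,
        "SUM": sum,
        "COUNT": len,
    }
    return ops[op](values)

# Hypothetical sources: two points still in cache, one already flushed
cache = lambda *_: [1.0, 2.0]
disk = lambda *_: [3.0]
print(run_query(cache, disk, "cpu.usage,host=server-01", 0, 10, "AVG"))  # -> 2.0
```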
Cache Query
Direct skiplist range scan:
```zig
pub fn getRange(
    self: *Self,
    series_key: []const u8,
    timestamp_start: i64,
    timestamp_end: i64,
) ![]Value {
    if (!self.bloom.contains(series_key)) {
        return error.SeriesNotFound;
    }
    const skiplist = self.index_series.get(series_key) orelse
        return error.SeriesNotFound;

    var values: std.ArrayList(Value) = .empty;
    defer values.deinit(self.allocator);

    var iter = skiplist.head.next();
    while (iter) |node| : (iter = node.next()) {
        if (node.key >= timestamp_start and node.key <= timestamp_end) {
            try values.append(self.allocator, node.value);
        }
        if (node.key > timestamp_end) break;
    }
    return values.toOwnedSlice(self.allocator);
}
```
Disk Query
Optimized with metadata and Bloom filters (tsm.zig:86-122):
```zig
fn queryDisk(
    self: *Self,
    series_key: []const u8,
    timestamp_start: i64,
    timestamp_end: i64,
) ![]Value {
    var values_list: std.ArrayList(Value) = .empty;

    for (0..self.entries_count) |en_id| {
        const en = self.entries[en_id];

        // Skip entry if its timestamp range doesn't overlap the query
        if (en.metadata.max_timestamp < timestamp_start or
            en.metadata.min_timestamp > timestamp_end)
        {
            continue;
        }

        // Skip entry if the series is definitely not present (Bloom filter)
        if (!en.mayContainSeries(series_key)) {
            continue;
        }

        // Get the row range for this series
        const series_ids = en.index_series.get(series_key) orelse continue;

        // Read timestamps and filter by range
        const time_values = try en.getColumnRange(
            "time",
            series_ids[0],
            series_ids[1],
        );
        defer self.allocator.free(time_values);

        // Identify rows within the query range
        var timestamp_ids: [2]?u64 = .{ null, null };
        for (time_values, series_ids[0]..series_ids[1] + 1) |time, time_id| {
            if (time.Int >= timestamp_start and time.Int <= timestamp_end) {
                if (timestamp_ids[0] == null) timestamp_ids[0] = @intCast(time_id);
                timestamp_ids[1] = @intCast(time_id);
            }
        }

        // Read the corresponding values
        const values_entry = try en.getColumnRange(
            "value",
            timestamp_ids[0].?,
            timestamp_ids[1].?,
        );
        defer self.allocator.free(values_entry);

        try values_list.appendSlice(self.allocator, values_entry);
    }
    return values_list.toOwnedSlice(self.allocator);
}
```
Value Serialization
Values are serialized with a type tag (entry.zig:329-351):
Type tags:
- 0: Bool (1 byte: 0/1)
- 1: Int (8 bytes: i64, little-endian)
- 2: Float (8 bytes: f64)
- 3: Bytes (4-byte length + data)
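A Python sketch of tag-prefixed serialization following the table above. The little-endian byte order for the float and length fields is an assumption for illustration (the source only states it for Int):

```python
import struct

def serialize(value) -> bytes:
    """Prefix each value with its 1-byte type tag, per the table above."""
    if isinstance(value, bool):          # check bool before int (bool is an int subclass)
        return bytes([0, int(value)])
    if isinstance(value, int):
        return bytes([1]) + struct.pack("<q", value)
    if isinstance(value, float):
        return bytes([2]) + struct.pack("<d", value)
    if isinstance(value, bytes):
        return bytes([3]) + struct.pack("<I", len(value)) + value
    raise TypeError(f"unsupported value type: {type(value)}")

def deserialize(buf: bytes):
    tag, payload = buf[0], buf[1:]
    if tag == 0:
        return payload[0] == 1
    if tag == 1:
        return struct.unpack("<q", payload[:8])[0]
    if tag == 2:
        return struct.unpack("<d", payload[:8])[0]
    (n,) = struct.unpack("<I", payload[:4])
    return payload[4:4 + n]

print(deserialize(serialize(-42)))  # -> -42
```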
Page Descriptors Cache
Disk entries cache page descriptors to avoid repeated index reads:
```zig
cached_page_descriptors: []?[]PageDescriptor,

fn getOrLoadPageDescriptors(self: *Self, column_id: u64) ![]PageDescriptor {
    if (self.cached_page_descriptors[column_id]) |cached| {
        return cached;
    }
    // Load from disk and cache
    const page_descriptors = try self.loadPageDescriptors(column_id);
    self.cached_page_descriptors[column_id] = page_descriptors;
    return page_descriptors;
}
```
Sequential Decoding
Gorilla/delta encoding enables sequential decoding without random access:
```zig
if (is_time_column_gorilla) {
    const row_count = page_desc.end_row - page_desc.start_row + 1;
    const timestamps = try decodeGorillaTimestamps(
        self.allocator,
        page_data,
        @intCast(row_count),
    );
    for (timestamps, 0..) |ts, i| {
        const current_row = page_desc.start_row + i;
        if (current_row >= start_row and current_row <= end_row) {
            values[index_values] = Value{ .Int = ts };
            index_values += 1;
        }
    }
}
```
Configuration Trade-offs
Gorilla Encoding

Pros:
- Best compression ratio (>2x for regular data)
- Excellent for regular-interval time series
- Streaming decompression

Cons:
- Must decode sequentially from the start
- CPU overhead for encoding/decoding
- Less efficient for irregular data

Delta Encoding

Pros:
- Faster query execution
- Simpler implementation
- Works well with irregular intervals

Cons:
- Lower compression ratio
- Still requires sequential decoding of deltas
- More disk I/O
Next Steps
- Data Model: understand how series and tags work
- WASM Workflows: build real-time processing workflows