
Overview

The TSM (Time-Structured Merge) tree is Slung’s storage engine, combining a write-optimized in-memory cache with compressed disk-backed entries. The architecture is inspired by LSM trees but optimized for time series data.

Architecture

TsmTree Structure

The TSM tree implementation (src/tsm/tsm.zig:10-234) is parametric over configuration:
pub fn TsmTreeImpl(
    comptime max_level: u64,      // Maximum number of disk entries
    comptime page_size: u32,      // Page size in bytes
    comptime ts_encoding: TimestampEncoding  // Encoding strategy
) type
Default configuration:
pub const TsmTree = TsmTreeImpl(
    100_000,  // max_level: 100K disk entries
    4096,     // page_size: 4KB pages
    .gorilla  // ts_encoding: Gorilla compression
);

Core Components

const Self = @This();
const MAX_CACHE_POINTS = 1_000_000;

allocator: Allocator,
name: []const u8,
entries: []*DiskEntry(page_size, ts_encoding),
entries_count: u64 = 0,
cache: *Cache(page_size, ts_encoding),

In-Memory Cache

Cache Architecture

The cache (src/tsm/cache.zig:13-109) uses skiplist-based storage for sorted time series data:
pub fn CacheImpl(
    comptime page_size: u32, 
    comptime ts_encoding: TimestampEncoding
) type {
    return struct {
        allocator: Allocator,
        index_series: HashMap(Skiplist(i64, Value, 16, std.Random.Pcg, ds.skiplist.compareI64)),
        bloom: Bloom(1024, Hasher),
        count: u64 = 0,
    };
}

Skiplist Storage

Each series gets its own skiplist with 16 levels:
  • Key: i64 timestamp (microseconds)
  • Value: Union of Bool/Int/Float/Bytes
  • Ordering: Timestamps sorted in ascending order
  • Insertion: O(log n) average case
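The per-series layout above can be sketched in a few lines of Python, using a bisect-maintained sorted list in place of the 16-level skiplist (insertion here is O(n) rather than the skiplist's O(log n); the class and method names are illustrative, not Slung's API):

```python
import bisect

# Illustrative sketch: each series key maps to its own sorted
# collection of (timestamp, value) pairs, kept in ascending order.
class SeriesCache:
    def __init__(self):
        self.series = {}  # series_key -> sorted list of (timestamp, value)

    def insert(self, key, ts, value):
        points = self.series.setdefault(key, [])
        bisect.insort(points, (ts, value))  # keeps ascending timestamp order

    def get_range(self, key, ts_start, ts_end):
        points = self.series.get(key, [])
        lo = bisect.bisect_left(points, (ts_start,))
        hi = bisect.bisect_right(points, (ts_end, float("inf")))
        return points[lo:hi]

cache = SeriesCache()
cache.insert("cpu.usage,host=server-01", 30, 0.5)
cache.insert("cpu.usage,host=server-01", 10, 0.2)
cache.insert("cpu.usage,host=server-01", 20, 0.3)
```

Out-of-order inserts land in timestamp order, so a range scan is a contiguous slice.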

Cache Insert

pub fn insert(self: *Self, series_key: []const u8, data_point: DataPoint) !void {
    const series = try self.index_series.getOrPut(series_key);
    if (!series.found_existing) {
        const owned_key = try self.allocator.dupe(u8, series_key);
        series.key_ptr.* = owned_key;
        const skiplist = try Skiplist(...).init(
            self.allocator, 
            @intCast(std.time.microTimestamp())
        );
        series.value_ptr.* = skiplist;
        self.bloom.insert(series_key);
    }
    _ = try series.value_ptr.*.insert(data_point.timestamp, data_point.value);
    self.count += 1;
}

Bloom Filter

A 1024-bit Bloom filter tracks which series are present:
bloom: Bloom(1024, Hasher)

pub fn mayContainSeries(self: *Self, series_key: []const u8) bool {
    return self.bloom.contains(series_key);
}
This enables fast negative lookups (“series definitely not present”).
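A minimal Bloom filter sketch in Python shows why negative lookups are cheap; the bit width, probe count, and hash choice here are illustrative, not Slung's:

```python
import hashlib

# Minimal Bloom filter: 1024 bits, 3 hash probes per key.
class Bloom:
    def __init__(self, bits=1024, probes=3):
        self.bits = bits
        self.probes = probes
        self.array = 0  # bitset packed into a Python int

    def _positions(self, key: bytes):
        # Derive independent probe positions from salted hashes
        for i in range(self.probes):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "little") % self.bits

    def insert(self, key: bytes):
        for pos in self._positions(key):
            self.array |= 1 << pos

    def may_contain(self, key: bytes) -> bool:
        # False => definitely absent; True => possibly present
        return all(self.array >> pos & 1 for pos in self._positions(key))

bloom = Bloom()
bloom.insert(b"cpu.usage,host=server-01")
```

A `False` answer requires no disk or skiplist access at all, which is the whole point of keeping the filter in front of the series index.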

Flush Process

Automatic Flush

The tree flushes automatically once the cache holds more than MAX_CACHE_POINTS (1M) data points:
pub fn insert(self: *Self, series_key: []const u8, data_point: DataPoint) !void {
    try self.cache.insert(series_key, data_point);
    if (self.cache.count > MAX_CACHE_POINTS) try self.flush();
}

Flush Implementation

pub fn flush(self: *Self) !void {
    if (self.cache.count == 0) return;
    
    // Create disk entry from cache
    const d_entry = try self.cache.flush(self.name, self.entries_count + 1);
    self.entries[self.entries_count] = d_entry;
    self.entries_count += 1;
    
    // Clear cache (reusing allocations)
    var iter = self.cache.index_series.iterator();
    while (iter.next()) |kv| {
        kv.value_ptr.deinit();
        self.allocator.free(kv.key_ptr.*);
    }
    self.cache.index_series.clearRetainingCapacity();
    self.cache.bloom.reset();
    self.cache.count = 0;
}
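The insert-then-flush cycle can be sketched as follows (a toy model, with the threshold shrunk to 3 for illustration where Slung uses 1M, and a dict snapshot standing in for a real disk entry):

```python
MAX_CACHE_POINTS = 3  # tiny threshold for illustration (Slung uses 1_000_000)

class Tree:
    def __init__(self):
        self.cache = {}      # series_key -> list of (ts, value)
        self.count = 0
        self.entries = []    # flushed, immutable snapshots ("disk entries")

    def insert(self, key, ts, value):
        self.cache.setdefault(key, []).append((ts, value))
        self.count += 1
        if self.count > MAX_CACHE_POINTS:
            self.flush()

    def flush(self):
        if self.count == 0:
            return
        # Snapshot the cache as a new immutable entry, then clear for reuse
        self.entries.append({k: tuple(v) for k, v in self.cache.items()})
        self.cache.clear()   # analogous to clearRetainingCapacity
        self.count = 0

tree = Tree()
for ts in range(4):
    tree.insert("cpu.usage", ts, float(ts))
```

The fourth insert crosses the threshold, so the cache is flushed into one entry and emptied.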

Disk Entries

File Format

Each disk entry consists of two files:
  • _{level}_{name}.dat: Columnar data (timestamps and values)
  • _{level}_{name}.idx: Indexes and metadata

DiskEntry Structure

const Self = @This();
const MAGIC = "SLZ01";
const VERSION_DELTA_COMPRESSED = 1;
const VERSION_GORILLA = 2;

allocator: Allocator,
file_path: []const u8,
file_data: fs.File,
file_index: fs.File,
index_row: HashMap(u64),
index_column: HashMap(u64),
index_series: HashMap([2]u64),  // [start_row, end_row]
bloom: Bloom(1024, Hasher),
metadata: Metadata,
column_descriptors: []ColumnDescriptor,

Metadata

pub const Metadata = struct {
    number_rows: u64,
    number_columns: u32,
    created_at: i64,
    page_size: u32,
    version: u32,  // VERSION_DELTA_COMPRESSED or VERSION_GORILLA
    min_timestamp: i64,
    max_timestamp: i64,
};

Columnar Layout

Data is stored in two columns:
  1. time: Timestamps (compressed)
  2. value: Values (typed union)
Each series is a “page” within the column:
pub const PageDescriptor = struct {
    data_offset: u64,
    data_size: u32,
    start_row: u64,  // First row ID for this series
    end_row: u64,    // Last row ID for this series
};

Series Index

Maps series keys to row ranges:
index_series: HashMap([2]u64)  // Key: series_key, Value: [start_row, end_row]
Example:
"cpu.usage,host=server-01" -> [0, 999]
"cpu.usage,host=server-02" -> [1000, 1999]

Timestamp Encoding

Delta Encoding (VERSION_DELTA_COMPRESSED)

Uses zigzag varint encoding for deltas:
fn writeZigzagVarint(value: i64, buf: []u8) usize {
    const zigzag: u64 = @bitCast((value << 1) ^ (value >> 63));
    return writeVarint(zigzag, buf);
}
Encoding strategy:
  • First timestamp: Full 64 bits (with 0xFF marker)
  • Subsequent: Zigzag-encoded delta from previous
if (is_first_in_series) {
    serialize_buf[0] = 0xFF;  // Marker byte
    std.mem.writeInt(i64, serialize_buf[1..9], node.key, .little);
    serialized_len = 9;
} else {
    const delta = node.key - prev_timestamp;
    serialized_len = writeZigzagVarint(delta, &serialize_buf);
}
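An illustrative Python port of this scheme (function names are ours, not the codebase's) shows how zigzag maps small signed deltas to small unsigned integers, which then varint-encode compactly:

```python
def zigzag_encode(v: int) -> int:
    # Maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    return (v << 1) ^ (v >> 63)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

def write_varint(n: int) -> bytes:
    # 7 payload bits per byte; high bit signals continuation
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_timestamps(timestamps):
    # First timestamp in full (the 0xFF-marker path), then zigzag-varint deltas
    out = bytearray([0xFF])
    out += timestamps[0].to_bytes(8, "little", signed=True)
    for prev, cur in zip(timestamps, timestamps[1:]):
        out += write_varint(zigzag_encode(cur - prev))
    return bytes(out)

# 100 microsecond timestamps at regular 60s intervals
encoded = encode_timestamps([1_700_000_000_000_000 + 60_000_000 * i for i in range(100)])
```

Each 60,000,000 µs delta zigzags to 120,000,000, which fits in 4 varint bytes instead of 8 raw bytes per timestamp.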

Gorilla Encoding (VERSION_GORILLA)

Implements Facebook’s Gorilla compression algorithm (src/tsm/gorilla.zig).

Delta-of-Delta for Timestamps

Encoding scheme (gorilla.zig:188-266):
// First timestamp: 64 bits
// Second timestamp: full delta (64 bits)
// Subsequent: delta-of-delta with variable encoding:

if (dod == 0) {
    // Case 1: dod == 0, write single '0' bit
    writer.writeBit(0);
} else if (dod >= -63 and dod <= 64) {
    // Case 2: '10' prefix + 7 bits
    writer.writeBits(0b10, 2);
    const encoded: u64 = @bitCast(@as(i64, dod) + 63);
    writer.writeBits(encoded, 7);
} else if (dod >= -255 and dod <= 256) {
    // Case 3: '110' prefix + 9 bits
    writer.writeBits(0b110, 3);
    const encoded: u64 = @bitCast(@as(i64, dod) + 255);
    writer.writeBits(encoded, 9);
} else if (dod >= -2047 and dod <= 2048) {
    // Case 4: '1110' prefix + 12 bits
    writer.writeBits(0b1110, 4);
    const encoded: u64 = @bitCast(@as(i64, dod) + 2047);
    writer.writeBits(encoded, 12);
} else {
    // Case 5: '1111' prefix + full 64-bit value
    writer.writeBits(0b1111, 4);
    writer.writeI64(dod);
}
For regular time series (e.g., 60-second intervals), delta-of-delta is typically 0, compressing to a single bit per timestamp!
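The five cases above give a simple per-timestamp bit cost, which this Python cost model reproduces (a sketch of the sizes only, not the actual bit stream):

```python
def dod_bits(dod: int) -> int:
    """Encoded bits for one delta-of-delta, prefix included."""
    if dod == 0:
        return 1              # '0'
    if -63 <= dod <= 64:
        return 2 + 7          # '10' + 7 bits
    if -255 <= dod <= 256:
        return 3 + 9          # '110' + 9 bits
    if -2047 <= dod <= 2048:
        return 4 + 12         # '1110' + 12 bits
    return 4 + 64             # '1111' + raw i64

def timestamp_bits(timestamps):
    bits = 64                 # first timestamp in full
    if len(timestamps) > 1:
        bits += 64            # second timestamp: full delta
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    for d1, d2 in zip(deltas, deltas[1:]):
        bits += dod_bits(d2 - d1)
    return bits

# Perfectly regular 60s series: every delta-of-delta is 0
regular = [60 * i for i in range(1000)]
```

For the 1000-point regular series this gives 64 + 64 + 998 = 1126 bits, versus 64,000 bits uncompressed.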

XOR Encoding for Floats

Exploits temporal locality in floating-point values (gorilla.zig:359-452):
pub fn encode(self: *FloatEncoder, writer: *BitWriter, value: f64) void {
    const bits: u64 = @bitCast(value);
    
    if (self.count == 0) {
        // First value: full 64 bits
        writer.writeF64(value);
        self.prev_value = bits;
        self.count = 1;
        return;
    }
    
    const xor = bits ^ self.prev_value;
    
    if (xor == 0) {
        // Values identical: single '0' bit
        writer.writeBit(0);
    } else {
        writer.writeBit(1);
        
        const leading: u6 = @intCast(@clz(xor));
        const trailing: u6 = @intCast(@ctz(xor));
        
        // Check if we can reuse the previous window
        if (self.count > 1 and 
            leading >= self.prev_leading and 
            trailing >= self.prev_trailing) {
            // Reuse window: '0' control bit
            writer.writeBit(0);
            const meaningful_bits: u7 = 64 - @as(u7, self.prev_leading) - @as(u7, self.prev_trailing);
            const shifted = xor >> self.prev_trailing;
            writer.writeBits(shifted, @intCast(meaningful_bits));
        } else {
            // New window: '1' control bit + window metadata
            const meaningful_bits: u7 = 64 - @as(u7, leading) - @as(u7, trailing);
            const shifted = xor >> trailing;
            writer.writeBit(1);
            writer.writeBits(leading, 5);  // 5 bits for leading zeros
            writer.writeBits(meaningful_bits - 1, 6);  // 6 bits for length
            writer.writeBits(shifted, @intCast(meaningful_bits));
            self.prev_leading = leading;
            self.prev_trailing = trailing;
        }
    }

    self.prev_value = bits;
    self.count += 1;
}
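A Python cost model of the XOR path (sizes only, not the bit stream; names are ours) makes the win for slowly changing values concrete:

```python
import struct

def float_bits(value: float) -> int:
    # Reinterpret the f64 as its raw u64 bit pattern
    return struct.unpack("<Q", struct.pack("<d", value))[0]

def xor_cost(values):
    """Total encoded bits for a value stream under the Gorilla XOR scheme."""
    bits = 64                         # first value stored in full
    prev = float_bits(values[0])
    prev_lead = prev_trail = None     # no window yet
    for v in values[1:]:
        cur = float_bits(v)
        xor = cur ^ prev
        if xor == 0:
            bits += 1                 # identical value: single '0' bit
        else:
            lead = 64 - xor.bit_length()           # leading zero bits
            trail = (xor & -xor).bit_length() - 1  # trailing zero bits
            meaningful = 64 - lead - trail
            if prev_lead is not None and lead >= prev_lead and trail >= prev_trail:
                # Reuse previous window: 2 control bits + old window width
                bits += 2 + (64 - prev_lead - prev_trail)
            else:
                # New window: 2 control bits + 5 (leading) + 6 (length) + payload
                bits += 2 + 5 + 6 + meaningful
                prev_lead, prev_trail = lead, trail
        prev = cur
    return bits
```

A constant series of 100 points costs 64 + 99 bits total, about one bit per value after the first.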

Compression Ratios

From test suite (gorilla.zig:901-930):
// Regular 60s intervals, slowly changing values (1000 points)
const compressed_size = writer.bytesUsed();
const uncompressed_size = count * (8 + 8);  // 16 bytes per point

const compression_ratio = @as(f64, @floatFromInt(uncompressed_size)) /
    @as(f64, @floatFromInt(compressed_size));

// Gorilla achieves > 2x compression for regular data
try testing.expect(compression_ratio > 2.0);

Query Execution

Query Path

Queries execute against both cache and disk:
pub fn query(
    self: *Self, 
    series_key: []const u8, 
    timestamp_start: i64, 
    timestamp_end: i64, 
    op: QueryOp
) !Value {
    // Query cache (hot path)
    const values_cache = self.queryCache(series_key, timestamp_start, timestamp_end) 
        catch try self.allocator.alloc(Value, 0);
    defer self.allocator.free(values_cache);
    
    // Query disk (cold path)
    const values_disk = self.queryDisk(series_key, timestamp_start, timestamp_end) 
        catch try self.allocator.alloc(Value, 0);
    defer self.allocator.free(values_disk);
    
    // Merge and aggregate
    const values = try std.mem.concat(self.allocator, Value, &.{ values_cache, values_disk });
    defer self.allocator.free(values);
    
    return switch (op) {
        .AVG => /* ... */,
        .MIN => /* ... */,
        .MAX => /* ... */,
        .SUM => /* ... */,
        .COUNT => /* ... */,
    };
}
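The elided merge-and-aggregate step can be sketched in Python (the `op` strings mirror `QueryOp`; the empty-result handling is our assumption, not confirmed by the source):

```python
def aggregate(values_cache, values_disk, op):
    # Concatenate hot-path and cold-path results, then reduce
    values = values_cache + values_disk
    if op == "COUNT":
        return len(values)
    if not values:
        return None  # assumed behavior for an empty, non-COUNT result
    return {
        "AVG": sum(values) / len(values),
        "MIN": min(values),
        "MAX": max(values),
        "SUM": sum(values),
    }[op]
```

Because both paths already return values for the same series and time range, the reduction itself is a single pass over the concatenated slice.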

Cache Query

Direct skiplist range scan:
pub fn getRange(
    self: *Self, 
    series_key: []const u8, 
    timestamp_start: i64, 
    timestamp_end: i64
) ![]Value {
    if (!self.bloom.contains(series_key)) {
        return error.SeriesNotFound;
    }
    
    const skiplist = self.index_series.get(series_key) 
        orelse return error.SeriesNotFound;
    
    var values: std.ArrayList(Value) = .empty;
    defer values.deinit(self.allocator);
    
    var iter = skiplist.head.next();
    while (iter) |node| : (iter = node.next()) {
        if (node.key >= timestamp_start and node.key <= timestamp_end) {
            try values.append(self.allocator, node.value);
        }
        if (node.key > timestamp_end) break;
    }
    
    return values.toOwnedSlice(self.allocator);
}

Disk Query

Optimized with metadata and Bloom filters (tsm.zig:86-122):
fn queryDisk(
    self: *Self, 
    series_key: []const u8, 
    timestamp_start: i64, 
    timestamp_end: i64
) ![]Value {
    var values_list: std.ArrayList(Value) = .empty;
    
    for (0..self.entries_count) |en_id| {
        const en = self.entries[en_id];
        
        // Skip entry if timestamp range doesn't overlap
        if (en.metadata.max_timestamp < timestamp_start or 
            en.metadata.min_timestamp > timestamp_end) {
            continue;
        }
        
        // Skip entry if series definitely not present (Bloom filter)
        if (!en.mayContainSeries(series_key)) {
            continue;
        }
        
        // Get row range for series
        const series_ids = en.index_series.get(series_key) orelse continue;
        
        // Read timestamps and filter by range
        const time_values = try en.getColumnRange(
            "time", 
            series_ids[0], 
            series_ids[1]
        );
        defer self.allocator.free(time_values);
        
        // Identify rows in range
        var timestamp_ids: [2]?u64 = .{ null, null };
        for (time_values, series_ids[0]..series_ids[1] + 1) |time, time_id| {
            if (time.Int >= timestamp_start and time.Int <= timestamp_end) {
                if (timestamp_ids[0] == null) timestamp_ids[0] = @intCast(time_id);
                timestamp_ids[1] = @intCast(time_id);
            }
        }
        
        // Skip this entry if no timestamps fell in the query range
        if (timestamp_ids[0] == null) continue;

        // Read corresponding values
        const values_entry = try en.getColumnRange(
            "value", 
            timestamp_ids[0].?, 
            timestamp_ids[1].?
        );
        defer self.allocator.free(values_entry);
        try values_list.appendSlice(self.allocator, values_entry);
    }
    
    return values_list.toOwnedSlice(self.allocator);
}
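The two cheap pruning checks at the top of the loop can be sketched in isolation (a set stands in for the Bloom filter, so this sketch has no false positives; field names are illustrative):

```python
def overlaps(entry_min, entry_max, q_start, q_end):
    # Entry time range intersects the query range
    return not (entry_max < q_start or entry_min > q_end)

def entries_to_scan(entries, series_key, q_start, q_end):
    """Return indexes of disk entries that survive both pruning checks."""
    hits = []
    for i, en in enumerate(entries):
        if not overlaps(en["min_ts"], en["max_ts"], q_start, q_end):
            continue  # timestamp metadata rules the entry out
        if series_key not in en["series"]:
            continue  # "series definitely not present"
        hits.append(i)
    return hits

entries = [
    {"min_ts": 0,    "max_ts": 999,  "series": {"cpu.usage"}},
    {"min_ts": 1000, "max_ts": 1999, "series": {"mem.free"}},
    {"min_ts": 2000, "max_ts": 2999, "series": {"cpu.usage"}},
]
```

Only the surviving entries pay for index lookups and column reads.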

Value Serialization

Values are serialized with a type tag (entry.zig:329-351):
[type:u8][data:bytes]
Type tags:
  • 0: Bool (1 byte: 0/1)
  • 1: Int (8 bytes: i64 little-endian)
  • 2: Float (8 bytes: f64)
  • 3: Bytes (4 bytes length + data)
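A Python round-trip sketch of the tagged format (the source specifies little-endian for Int; little-endian for the float and the length prefix is our assumption):

```python
import struct

def serialize(value) -> bytes:
    # [type:u8][data:bytes]; bool must be checked before int in Python
    if isinstance(value, bool):                              # tag 0: Bool
        return bytes([0, int(value)])
    if isinstance(value, int):                               # tag 1: Int, i64 LE
        return bytes([1]) + struct.pack("<q", value)
    if isinstance(value, float):                             # tag 2: Float, f64
        return bytes([2]) + struct.pack("<d", value)
    if isinstance(value, bytes):                             # tag 3: length + data
        return bytes([3]) + struct.pack("<I", len(value)) + value
    raise TypeError(type(value))

def deserialize(buf: bytes):
    tag, body = buf[0], buf[1:]
    if tag == 0:
        return body[0] == 1
    if tag == 1:
        return struct.unpack("<q", body[:8])[0]
    if tag == 2:
        return struct.unpack("<d", body[:8])[0]
    if tag == 3:
        (n,) = struct.unpack("<I", body[:4])
        return body[4:4 + n]
    raise ValueError(tag)
```

Fixed-width tags keep decoding branch-cheap: one byte decides the layout of everything that follows.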

Performance Optimizations

Page Descriptors Cache

Disk entries cache page descriptors to avoid repeated index reads:
cached_page_descriptors: []?[]PageDescriptor,

fn getOrLoadPageDescriptors(self: *Self, column_id: u64) ![]PageDescriptor {
    if (self.cached_page_descriptors[column_id]) |cached| {
        return cached;
    }
    
    // Load from disk and cache
    const page_descriptors = try self.loadPageDescriptors(column_id);
    self.cached_page_descriptors[column_id] = page_descriptors;
    return page_descriptors;
}
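The memoization pattern is the familiar load-once-then-serve cache; a Python sketch with a load counter makes the saved disk reads visible (names are illustrative):

```python
class DescriptorCache:
    def __init__(self, load_fn, num_columns):
        self.load_fn = load_fn                   # stands in for loadPageDescriptors
        self.cached = [None] * num_columns       # one slot per column
        self.load_count = 0                      # simulated disk reads

    def get(self, column_id):
        if self.cached[column_id] is None:
            self.load_count += 1                 # only the first access hits "disk"
            self.cached[column_id] = self.load_fn(column_id)
        return self.cached[column_id]

cache = DescriptorCache(lambda col: [("page", col)], num_columns=2)
cache.get(0); cache.get(0); cache.get(1)
```

Three lookups across two columns cost exactly two loads; every repeat access is served from memory.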

Sequential Decoding

Gorilla/delta encoding enables sequential decoding without random access:
if (is_time_column_gorilla) {
    const row_count = page_desc.end_row - page_desc.start_row + 1;
    const timestamps = try decodeGorillaTimestamps(
        self.allocator, 
        page_data, 
        @intCast(row_count)
    );
    
    for (timestamps, 0..) |ts, i| {
        const current_row = page_desc.start_row + i;
        if (current_row >= start_row and current_row <= end_row) {
            values[index_values] = Value{ .Int = ts };
            index_values += 1;
        }
    }
}

Configuration Trade-offs

The default Gorilla encoding trades CPU time for storage:
Pros:
  • Best compression ratio (>2x for regular data)
  • Excellent for regular-interval time series
  • Streaming decompression
Cons:
  • Must decode sequentially from start
  • CPU overhead for encoding/decoding
  • Less efficient for irregular data

Next Steps

  • Data Model: Understand how series and tags work
  • WASM Workflows: Build real-time processing workflows
