
Overview

The TSM (Time-Structured Merge) tree is Slung’s storage engine, combining a write-optimized in-memory cache with compressed disk-backed entries. The architecture is inspired by LSM trees but optimized for time series data.

Architecture

TsmTree Structure

The TSM tree implementation (src/tsm/tsm.zig:10-234) is parametric over configuration:
pub fn TsmTreeImpl(
    comptime max_level: u64,      // Maximum number of disk entries
    comptime page_size: u32,      // Page size in bytes
    comptime ts_encoding: TimestampEncoding  // Encoding strategy
) type
Default configuration:
pub const TsmTree = TsmTreeImpl(
    100_000,  // max_level: 100K disk entries
    4096,     // page_size: 4KB pages
    .gorilla  // ts_encoding: Gorilla compression
);

Core Components

const Self = @This();
const MAX_CACHE_POINTS = 1_000_000;

allocator: Allocator,
name: []const u8,
entries: []*DiskEntry(page_size, ts_encoding),
entries_count: u64 = 0,
cache: *Cache(page_size, ts_encoding),

In-Memory Cache

Cache Architecture

The cache (src/tsm/cache.zig:13-109) uses skiplist-based storage for sorted time series data:
pub fn CacheImpl(
    comptime page_size: u32, 
    comptime ts_encoding: TimestampEncoding
) type {
    return struct {
        allocator: Allocator,
        index_series: HashMap(Skiplist(i64, Value, 16, std.Random.Pcg, ds.skiplist.compareI64)),
        bloom: Bloom(1024, Hasher),
        count: u64 = 0,
    };
}

Skiplist Storage

Each series gets its own skiplist with 16 levels:
  • Key: i64 timestamp (microseconds)
  • Value: Union of Bool/Int/Float/Bytes
  • Ordering: Timestamps sorted in ascending order
  • Insertion: O(log n) average case
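The per-series layout above can be sketched in a few lines of Python, using a bisect-maintained sorted list in place of the 16-level skiplist (insertion here is O(n) rather than the skiplist's O(log n); the class and method names are illustrative, not Slung's API):

```python
import bisect

# Illustrative sketch: each series key maps to its own sorted
# collection of (timestamp, value) pairs, kept in ascending order.
class SeriesCache:
    def __init__(self):
        self.series = {}  # series_key -> sorted list of (timestamp, value)

    def insert(self, key, ts, value):
        points = self.series.setdefault(key, [])
        bisect.insort(points, (ts, value))  # keeps ascending timestamp order

    def get_range(self, key, ts_start, ts_end):
        points = self.series.get(key, [])
        lo = bisect.bisect_left(points, (ts_start,))
        hi = bisect.bisect_right(points, (ts_end, float("inf")))
        return points[lo:hi]

cache = SeriesCache()
cache.insert("cpu.usage,host=server-01", 30, 0.5)
cache.insert("cpu.usage,host=server-01", 10, 0.2)
cache.insert("cpu.usage,host=server-01", 20, 0.3)
```

Out-of-order inserts land in timestamp order, so a range scan is a contiguous slice.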

Cache Insert

pub fn insert(self: *Self, series_key: []const u8, data_point: DataPoint) !void {
    const series = try self.index_series.getOrPut(series_key);
    if (!series.found_existing) {
        const owned_key = try self.allocator.dupe(u8, series_key);
        series.key_ptr.* = owned_key;
        const skiplist = try Skiplist(...).init(
            self.allocator, 
            @intCast(std.time.microTimestamp())
        );
        series.value_ptr.* = skiplist;
        self.bloom.insert(series_key);
    }
    _ = try series.value_ptr.*.insert(data_point.timestamp, data_point.value);
    self.count += 1;
}

Bloom Filter

A 1024-bit Bloom filter tracks which series are present:
bloom: Bloom(1024, Hasher)

pub fn mayContainSeries(self: *Self, series_key: []const u8) bool {
    return self.bloom.contains(series_key);
}
This enables fast negative lookups (“series definitely not present”).
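A minimal Bloom filter sketch in Python shows why negative lookups are cheap; the bit width, probe count, and hash choice here are illustrative, not Slung's:

```python
import hashlib

# Minimal Bloom filter: 1024 bits, 3 hash probes per key.
class Bloom:
    def __init__(self, bits=1024, probes=3):
        self.bits = bits
        self.probes = probes
        self.array = 0  # bitset packed into a Python int

    def _positions(self, key: bytes):
        # Derive independent probe positions from salted hashes
        for i in range(self.probes):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "little") % self.bits

    def insert(self, key: bytes):
        for pos in self._positions(key):
            self.array |= 1 << pos

    def may_contain(self, key: bytes) -> bool:
        # False => definitely absent; True => possibly present
        return all(self.array >> pos & 1 for pos in self._positions(key))

bloom = Bloom()
bloom.insert(b"cpu.usage,host=server-01")
```

A `False` answer requires no disk or skiplist access at all, which is the whole point of keeping the filter in front of the series index.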

Flush Process

Automatic Flush

The tree flushes automatically once the cache holds more than MAX_CACHE_POINTS (1M) data points:
pub fn insert(self: *Self, series_key: []const u8, data_point: DataPoint) !void {
    try self.cache.insert(series_key, data_point);
    if (self.cache.count > MAX_CACHE_POINTS) try self.flush();
}

Flush Implementation

pub fn flush(self: *Self) !void {
    if (self.cache.count == 0) return;
    
    // Create disk entry from cache
    const d_entry = try self.cache.flush(self.name, self.entries_count + 1);
    self.entries[self.entries_count] = d_entry;
    self.entries_count += 1;
    
    // Clear cache (reusing allocations)
    var iter = self.cache.index_series.iterator();
    while (iter.next()) |kv| {
        kv.value_ptr.deinit();
        self.allocator.free(kv.key_ptr.*);
    }
    self.cache.index_series.clearRetainingCapacity();
    self.cache.bloom.reset();
    self.cache.count = 0;
}
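The insert-then-flush cycle can be sketched as follows (a toy model, with the threshold shrunk to 3 for illustration where Slung uses 1M, and a dict snapshot standing in for a real disk entry):

```python
MAX_CACHE_POINTS = 3  # tiny threshold for illustration (Slung uses 1_000_000)

class Tree:
    def __init__(self):
        self.cache = {}      # series_key -> list of (ts, value)
        self.count = 0
        self.entries = []    # flushed, immutable snapshots ("disk entries")

    def insert(self, key, ts, value):
        self.cache.setdefault(key, []).append((ts, value))
        self.count += 1
        if self.count > MAX_CACHE_POINTS:
            self.flush()

    def flush(self):
        if self.count == 0:
            return
        # Snapshot the cache as a new immutable entry, then clear for reuse
        self.entries.append({k: tuple(v) for k, v in self.cache.items()})
        self.cache.clear()   # analogous to clearRetainingCapacity
        self.count = 0

tree = Tree()
for ts in range(4):
    tree.insert("cpu.usage", ts, float(ts))
```

The fourth insert crosses the threshold, so the cache is flushed into one entry and emptied.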

Disk Entries

File Format

Each disk entry consists of two files:
  • _{level}_{name}.dat: Columnar data (timestamps and values)
  • _{level}_{name}.idx: Indexes and metadata

DiskEntry Structure

const Self = @This();
const MAGIC = "SLZ01";
const VERSION_DELTA_COMPRESSED = 1;
const VERSION_GORILLA = 2;

allocator: Allocator,
file_path: []const u8,
file_data: fs.File,
file_index: fs.File,
index_row: HashMap(u64),
index_column: HashMap(u64),
index_series: HashMap([2]u64),  // [start_row, end_row]
bloom: Bloom(1024, Hasher),
metadata: Metadata,
column_descriptors: []ColumnDescriptor,

Metadata

pub const Metadata = struct {
    number_rows: u64,
    number_columns: u32,
    created_at: i64,
    page_size: u32,
    version: u32,  // VERSION_DELTA_COMPRESSED or VERSION_GORILLA
    min_timestamp: i64,
    max_timestamp: i64,
};

Columnar Layout

Data is stored in two columns:
  1. time: Timestamps (compressed)
  2. value: Values (typed union)
Each series is a “page” within the column:
pub const PageDescriptor = struct {
    data_offset: u64,
    data_size: u32,
    start_row: u64,  // First row ID for this series
    end_row: u64,    // Last row ID for this series
};

Series Index

Maps series keys to row ranges:
index_series: HashMap([2]u64)  // Key: series_key, Value: [start_row, end_row]
Example:
"cpu.usage,host=server-01" -> [0, 999]
"cpu.usage,host=server-02" -> [1000, 1999]

Timestamp Encoding

Delta Encoding (VERSION_DELTA_COMPRESSED)

Uses zigzag varint encoding for deltas:
fn writeZigzagVarint(value: i64, buf: []u8) usize {
    const zigzag: u64 = @bitCast((value << 1) ^ (value >> 63));
    return writeVarint(zigzag, buf);
}
Encoding strategy:
  • First timestamp: Full 64 bits (with 0xFF marker)
  • Subsequent: Zigzag-encoded delta from previous
if (is_first_in_series) {
    serialize_buf[0] = 0xFF;  // Marker byte
    std.mem.writeInt(i64, serialize_buf[1..9], node.key, .little);
    serialized_len = 9;
} else {
    const delta = node.key - prev_timestamp;
    serialized_len = writeZigzagVarint(delta, &serialize_buf);
}
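An illustrative Python port of this scheme (function names are ours, not the codebase's) shows how zigzag maps small signed deltas to small unsigned integers, which then varint-encode compactly:

```python
def zigzag_encode(v: int) -> int:
    # Maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    return (v << 1) ^ (v >> 63)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

def write_varint(n: int) -> bytes:
    # 7 payload bits per byte; high bit signals continuation
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_timestamps(timestamps):
    # First timestamp in full (the 0xFF-marker path), then zigzag-varint deltas
    out = bytearray([0xFF])
    out += timestamps[0].to_bytes(8, "little", signed=True)
    for prev, cur in zip(timestamps, timestamps[1:]):
        out += write_varint(zigzag_encode(cur - prev))
    return bytes(out)

# 100 microsecond timestamps at regular 60s intervals
encoded = encode_timestamps([1_700_000_000_000_000 + 60_000_000 * i for i in range(100)])
```

Each 60,000,000 µs delta zigzags to 120,000,000, which fits in 4 varint bytes instead of 8 raw bytes per timestamp.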

Gorilla Encoding (VERSION_GORILLA)

Implements Facebook’s Gorilla compression algorithm (src/tsm/gorilla.zig).

Delta-of-Delta for Timestamps

Encoding scheme (gorilla.zig:188-266):
// First timestamp: 64 bits
// Second timestamp: full delta (64 bits)
// Subsequent: delta-of-delta with variable encoding:

if (dod == 0) {
    // Case 1: dod == 0, write single '0' bit
    writer.writeBit(0);
} else if (dod >= -63 and dod <= 64) {
    // Case 2: '10' prefix + 7 bits
    writer.writeBits(0b10, 2);
    const encoded: u64 = @bitCast(@as(i64, dod) + 63);
    writer.writeBits(encoded, 7);
} else if (dod >= -255 and dod <= 256) {
    // Case 3: '110' prefix + 9 bits
    writer.writeBits(0b110, 3);
    const encoded: u64 = @bitCast(@as(i64, dod) + 255);
    writer.writeBits(encoded, 9);
} else if (dod >= -2047 and dod <= 2048) {
    // Case 4: '1110' prefix + 12 bits
    writer.writeBits(0b1110, 4);
    const encoded: u64 = @bitCast(@as(i64, dod) + 2047);
    writer.writeBits(encoded, 12);
} else {
    // Case 5: '1111' prefix + full 64-bit value
    writer.writeBits(0b1111, 4);
    writer.writeI64(dod);
}
For regular time series (e.g., 60-second intervals), delta-of-delta is typically 0, compressing to a single bit per timestamp!
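The five cases above give a simple per-timestamp bit cost, which this Python cost model reproduces (a sketch of the sizes only, not the actual bit stream):

```python
def dod_bits(dod: int) -> int:
    """Encoded bits for one delta-of-delta, prefix included."""
    if dod == 0:
        return 1              # '0'
    if -63 <= dod <= 64:
        return 2 + 7          # '10' + 7 bits
    if -255 <= dod <= 256:
        return 3 + 9          # '110' + 9 bits
    if -2047 <= dod <= 2048:
        return 4 + 12         # '1110' + 12 bits
    return 4 + 64             # '1111' + raw i64

def timestamp_bits(timestamps):
    bits = 64                 # first timestamp in full
    if len(timestamps) > 1:
        bits += 64            # second timestamp: full delta
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    for d1, d2 in zip(deltas, deltas[1:]):
        bits += dod_bits(d2 - d1)
    return bits

# Perfectly regular 60s series: every delta-of-delta is 0
regular = [60 * i for i in range(1000)]
```

For the 1000-point regular series this gives 64 + 64 + 998 = 1126 bits, versus 64,000 bits uncompressed.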

XOR Encoding for Floats

Exploits temporal locality in floating-point values (gorilla.zig:359-452):
pub fn encode(self: *FloatEncoder, writer: *BitWriter, value: f64) void {
    const bits: u64 = @bitCast(value);
    
    if (self.count == 0) {
        // First value: full 64 bits
        writer.writeF64(value);
        self.prev_value = bits;
        self.count = 1;
        return;
    }
    
    const xor = bits ^ self.prev_value;
    
    if (xor == 0) {
        // Values identical: single '0' bit
        writer.writeBit(0);
    } else {
        writer.writeBit(1);
        
        const leading: u6 = @intCast(@clz(xor));
        const trailing: u6 = @intCast(@ctz(xor));
        
        // Check if we can reuse the previous window
        if (self.count > 1 and 
            leading >= self.prev_leading and 
            trailing >= self.prev_trailing) {
            // Reuse window: '0' control bit
            writer.writeBit(0);
            const meaningful_bits: u7 = 64 - @as(u7, self.prev_leading) - @as(u7, self.prev_trailing);
            const shifted = xor >> self.prev_trailing;
            writer.writeBits(shifted, @intCast(meaningful_bits));
        } else {
            // New window: '1' control bit + window metadata
            const meaningful_bits: u7 = 64 - @as(u7, leading) - @as(u7, trailing);
            const shifted = xor >> trailing;
            writer.writeBit(1);
            writer.writeBits(leading, 5);  // 5 bits for leading zeros
            writer.writeBits(meaningful_bits - 1, 6);  // 6 bits for length
            writer.writeBits(shifted, @intCast(meaningful_bits));
            self.prev_leading = leading;
            self.prev_trailing = trailing;
        }
    }

    self.prev_value = bits;
    self.count += 1;
}
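A Python cost model of the XOR path (sizes only, not the bit stream; names are ours) makes the win for slowly changing values concrete:

```python
import struct

def float_bits(value: float) -> int:
    # Reinterpret the f64 as its raw u64 bit pattern
    return struct.unpack("<Q", struct.pack("<d", value))[0]

def xor_cost(values):
    """Total encoded bits for a value stream under the Gorilla XOR scheme."""
    bits = 64                         # first value stored in full
    prev = float_bits(values[0])
    prev_lead = prev_trail = None     # no window yet
    for v in values[1:]:
        cur = float_bits(v)
        xor = cur ^ prev
        if xor == 0:
            bits += 1                 # identical value: single '0' bit
        else:
            lead = 64 - xor.bit_length()           # leading zero bits
            trail = (xor & -xor).bit_length() - 1  # trailing zero bits
            meaningful = 64 - lead - trail
            if prev_lead is not None and lead >= prev_lead and trail >= prev_trail:
                # Reuse previous window: 2 control bits + old window width
                bits += 2 + (64 - prev_lead - prev_trail)
            else:
                # New window: 2 control bits + 5 (leading) + 6 (length) + payload
                bits += 2 + 5 + 6 + meaningful
                prev_lead, prev_trail = lead, trail
        prev = cur
    return bits
```

A constant series of 100 points costs 64 + 99 bits total, about one bit per value after the first.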

Compression Ratios

From test suite (gorilla.zig:901-930):
// Regular 60s intervals, slowly changing values (1000 points)
const compressed_size = writer.bytesUsed();
const uncompressed_size = count * (8 + 8);  // 16 bytes per point

const compression_ratio = @as(f64, @floatFromInt(uncompressed_size)) /
    @as(f64, @floatFromInt(compressed_size));

// Gorilla achieves > 2x compression for regular data
try testing.expect(compression_ratio > 2.0);

Query Execution

Query Path

Queries execute against both cache and disk:
pub fn query(
    self: *Self, 
    series_key: []const u8, 
    timestamp_start: i64, 
    timestamp_end: i64, 
    op: QueryOp
) !Value {
    // Query cache (hot path)
    const values_cache = self.queryCache(series_key, timestamp_start, timestamp_end) 
        catch try self.allocator.alloc(Value, 0);
    defer self.allocator.free(values_cache);
    
    // Query disk (cold path)
    const values_disk = self.queryDisk(series_key, timestamp_start, timestamp_end) 
        catch try self.allocator.alloc(Value, 0);
    defer self.allocator.free(values_disk);
    
    // Merge and aggregate
    const values = try std.mem.concat(self.allocator, Value, &.{ values_cache, values_disk });
    defer self.allocator.free(values);
    
    return switch (op) {
        .AVG => /* ... */,
        .MIN => /* ... */,
        .MAX => /* ... */,
        .SUM => /* ... */,
        .COUNT => /* ... */,
    };
}
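The elided merge-and-aggregate step can be sketched in Python (the `op` strings mirror `QueryOp`; the empty-result handling is our assumption, not confirmed by the source):

```python
def aggregate(values_cache, values_disk, op):
    # Concatenate hot-path and cold-path results, then reduce
    values = values_cache + values_disk
    if op == "COUNT":
        return len(values)
    if not values:
        return None  # assumed behavior for an empty, non-COUNT result
    return {
        "AVG": sum(values) / len(values),
        "MIN": min(values),
        "MAX": max(values),
        "SUM": sum(values),
    }[op]
```

Because both paths already return values for the same series and time range, the reduction itself is a single pass over the concatenated slice.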

Cache Query

Direct skiplist range scan:
pub fn getRange(
    self: *Self, 
    series_key: []const u8, 
    timestamp_start: i64, 
    timestamp_end: i64
) ![]Value {
    if (!self.bloom.contains(series_key)) {
        return error.SeriesNotFound;
    }
    
    const skiplist = self.index_series.get(series_key) 
        orelse return error.SeriesNotFound;
    
    var values: std.ArrayList(Value) = .empty;
    defer values.deinit(self.allocator);
    
    var iter = skiplist.head.next();
    while (iter) |node| : (iter = node.next()) {
        if (node.key >= timestamp_start and node.key <= timestamp_end) {
            try values.append(self.allocator, node.value);
        }
        if (node.key > timestamp_end) break;
    }
    
    return values.toOwnedSlice(self.allocator);
}

Disk Query

Optimized with metadata and Bloom filters (tsm.zig:86-122):
fn queryDisk(
    self: *Self, 
    series_key: []const u8, 
    timestamp_start: i64, 
    timestamp_end: i64
) ![]Value {
    var values_list: std.ArrayList(Value) = .empty;
    
    for (0..self.entries_count) |en_id| {
        const en = self.entries[en_id];
        
        // Skip entry if timestamp range doesn't overlap
        if (en.metadata.max_timestamp < timestamp_start or 
            en.metadata.min_timestamp > timestamp_end) {
            continue;
        }
        
        // Skip entry if series definitely not present (Bloom filter)
        if (!en.mayContainSeries(series_key)) {
            continue;
        }
        
        // Get row range for series
        const series_ids = en.index_series.get(series_key) orelse continue;
        
        // Read timestamps and filter by range
        const time_values = try en.getColumnRange(
            "time", 
            series_ids[0], 
            series_ids[1]
        );
        defer self.allocator.free(time_values);
        
        // Identify rows in range
        var timestamp_ids: [2]?u64 = .{ null, null };
        for (time_values, series_ids[0]..series_ids[1] + 1) |time, time_id| {
            if (time.Int >= timestamp_start and time.Int <= timestamp_end) {
                if (timestamp_ids[0] == null) timestamp_ids[0] = @intCast(time_id);
                timestamp_ids[1] = @intCast(time_id);
            }
        }
        
        // Skip this entry if no timestamps fell in the query range
        if (timestamp_ids[0] == null) continue;

        // Read corresponding values
        const values_entry = try en.getColumnRange(
            "value", 
            timestamp_ids[0].?, 
            timestamp_ids[1].?
        );
        defer self.allocator.free(values_entry);
        try values_list.appendSlice(self.allocator, values_entry);
    }
    
    return values_list.toOwnedSlice(self.allocator);
}
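The two cheap pruning checks at the top of the loop can be sketched in isolation (a set stands in for the Bloom filter, so this sketch has no false positives; field names are illustrative):

```python
def overlaps(entry_min, entry_max, q_start, q_end):
    # Entry time range intersects the query range
    return not (entry_max < q_start or entry_min > q_end)

def entries_to_scan(entries, series_key, q_start, q_end):
    """Return indexes of disk entries that survive both pruning checks."""
    hits = []
    for i, en in enumerate(entries):
        if not overlaps(en["min_ts"], en["max_ts"], q_start, q_end):
            continue  # timestamp metadata rules the entry out
        if series_key not in en["series"]:
            continue  # "series definitely not present"
        hits.append(i)
    return hits

entries = [
    {"min_ts": 0,    "max_ts": 999,  "series": {"cpu.usage"}},
    {"min_ts": 1000, "max_ts": 1999, "series": {"mem.free"}},
    {"min_ts": 2000, "max_ts": 2999, "series": {"cpu.usage"}},
]
```

Only the surviving entries pay for index lookups and column reads.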

Value Serialization

Values are serialized with a type tag (entry.zig:329-351):
[type:u8][data:bytes]
Type tags:
  • 0: Bool (1 byte: 0/1)
  • 1: Int (8 bytes: i64 little-endian)
  • 2: Float (8 bytes: f64)
  • 3: Bytes (4 bytes length + data)
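A Python round-trip sketch of the tagged format (the source specifies little-endian for Int; little-endian for the float and the length prefix is our assumption):

```python
import struct

def serialize(value) -> bytes:
    # [type:u8][data:bytes]; bool must be checked before int in Python
    if isinstance(value, bool):                              # tag 0: Bool
        return bytes([0, int(value)])
    if isinstance(value, int):                               # tag 1: Int, i64 LE
        return bytes([1]) + struct.pack("<q", value)
    if isinstance(value, float):                             # tag 2: Float, f64
        return bytes([2]) + struct.pack("<d", value)
    if isinstance(value, bytes):                             # tag 3: length + data
        return bytes([3]) + struct.pack("<I", len(value)) + value
    raise TypeError(type(value))

def deserialize(buf: bytes):
    tag, body = buf[0], buf[1:]
    if tag == 0:
        return body[0] == 1
    if tag == 1:
        return struct.unpack("<q", body[:8])[0]
    if tag == 2:
        return struct.unpack("<d", body[:8])[0]
    if tag == 3:
        (n,) = struct.unpack("<I", body[:4])
        return body[4:4 + n]
    raise ValueError(tag)
```

Fixed-width tags keep decoding branch-cheap: one byte decides the layout of everything that follows.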

Performance Optimizations

Page Descriptors Cache

Disk entries cache page descriptors to avoid repeated index reads:
cached_page_descriptors: []?[]PageDescriptor,

fn getOrLoadPageDescriptors(self: *Self, column_id: u64) ![]PageDescriptor {
    if (self.cached_page_descriptors[column_id]) |cached| {
        return cached;
    }
    
    // Load from disk and cache
    const page_descriptors = try self.loadPageDescriptors(column_id);
    self.cached_page_descriptors[column_id] = page_descriptors;
    return page_descriptors;
}
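The memoization pattern is the familiar load-once-then-serve cache; a Python sketch with a load counter makes the saved disk reads visible (names are illustrative):

```python
class DescriptorCache:
    def __init__(self, load_fn, num_columns):
        self.load_fn = load_fn                   # stands in for loadPageDescriptors
        self.cached = [None] * num_columns       # one slot per column
        self.load_count = 0                      # simulated disk reads

    def get(self, column_id):
        if self.cached[column_id] is None:
            self.load_count += 1                 # only the first access hits "disk"
            self.cached[column_id] = self.load_fn(column_id)
        return self.cached[column_id]

cache = DescriptorCache(lambda col: [("page", col)], num_columns=2)
cache.get(0); cache.get(0); cache.get(1)
```

Three lookups across two columns cost exactly two loads; every repeat access is served from memory.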

Sequential Decoding

Gorilla/delta encoding enables sequential decoding without random access:
if (is_time_column_gorilla) {
    const row_count = page_desc.end_row - page_desc.start_row + 1;
    const timestamps = try decodeGorillaTimestamps(
        self.allocator, 
        page_data, 
        @intCast(row_count)
    );
    
    for (timestamps, 0..) |ts, i| {
        const current_row = page_desc.start_row + i;
        if (current_row >= start_row and current_row <= end_row) {
            values[index_values] = Value{ .Int = ts };
            index_values += 1;
        }
    }
}

Configuration Trade-offs

The default Gorilla encoding trades CPU time for storage:
Pros:
  • Best compression ratio (>2x for regular data)
  • Excellent for regular-interval time series
  • Streaming decompression
Cons:
  • Must decode sequentially from start
  • CPU overhead for encoding/decoding
  • Less efficient for irregular data

Next Steps

  • Data Model: Understand how series and tags work
  • WASM Workflows: Build real-time processing workflows
