TSM Tree Structure

The TSM (Time-Structured Merge) tree is implemented in src/tsm/tsm.zig as a two-tier storage system:
pub fn TsmTreeImpl(comptime max_level: u64, comptime page_size: u32, comptime ts_encoding: TimestampEncoding) type {
    const MAX_CACHE_POINTS = 1_000_000;
    
    return struct {
        allocator: Allocator,
        name: []const u8,
        entries: []*DiskEntry(page_size, ts_encoding),
        entries_count: u64 = 0,
        cache: *Cache(page_size, ts_encoding),
    };
}
Default configuration:
pub const TsmTree = TsmTreeImpl(100_000, 4096, .gorilla);

In-Memory Cache

The cache layer (src/tsm/cache.zig) uses a skip list for ordered storage:

Cache Structure

const Self = struct {
    allocator: Allocator,
    index_series: HashMap(Skiplist(i64, Value, 16, std.Random.Pcg, ds.skiplist.compareI64)),
    bloom: Bloom(1024, Hasher),
    count: u64 = 0,
};
  • Skip list: 16 levels, O(log n) insertion and search
  • Series indexing: HashMap maps series keys to individual skip lists
  • Bloom filter: 1024-bit filter for series existence checks
  • Point counting: Tracks total points for flush threshold

Data Point Types

pub const Value = union(enum) {
    Bool: bool,
    Int: i64,
    Float: f64,
    Bytes: []const u8,
};

pub const DataPoint = struct {
    timestamp: i64,
    value: Value,
};

Cache Operations

Insert:
pub fn insert(self: *Self, series_key: []const u8, data_point: DataPoint) !void {
    const series = try self.index_series.getOrPut(series_key);
    if (!series.found_existing) {
        const owned_key = try self.allocator.dupe(u8, series_key);
        series.key_ptr.* = owned_key;
        const skiplist = try Skiplist(...).init(self.allocator, ...);
        series.value_ptr.* = skiplist;
        self.bloom.insert(series_key);
    }
    _ = try series.value_ptr.*.insert(data_point.timestamp, data_point.value);
    self.count += 1;
}
Range query:
pub fn getRange(self: *Self, series_key: []const u8, timestamp_start: i64, timestamp_end: i64) ![]Value {
    if (!self.bloom.contains(series_key)) {
        return error.SeriesNotFound;
    }
    
    const skiplist = self.index_series.get(series_key) orelse return error.SeriesNotFound;
    
    var values: std.ArrayList(Value) = .empty;
    var iter: ?*Node = skiplist.head.next();
    while (iter) |node| : (iter = node.next()) {
        if (node.key >= timestamp_start and node.key <= timestamp_end) {
            try values.append(self.allocator, node.value);
        }
        if (node.key > timestamp_end) break;
    }
    
    return values.toOwnedSlice(self.allocator);
}
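The insert and range-query paths above can be modeled in a few lines of Python, using a dict of sorted lists in place of the per-series skip lists (the names here are illustrative, not part of the Zig API):

```python
import bisect

class CacheModel:
    """Toy model of the TSM cache: one sorted (timestamp, value) list per series."""
    def __init__(self):
        self.index_series = {}  # series key -> sorted list of (timestamp, value)
        self.count = 0

    def insert(self, series_key, timestamp, value):
        points = self.index_series.setdefault(series_key, [])
        # A skip-list insert is O(log n); insort on a Python list is O(n) but keeps order.
        bisect.insort(points, (timestamp, value))
        self.count += 1

    def get_range(self, series_key, ts_start, ts_end):
        points = self.index_series.get(series_key)
        if points is None:
            raise KeyError("SeriesNotFound")
        # Walk points in timestamp order, keeping those inside [ts_start, ts_end].
        return [v for ts, v in points if ts_start <= ts <= ts_end]

cache = CacheModel()
for ts, v in [(30, 3.0), (10, 1.0), (20, 2.0)]:
    cache.insert("cpu.temp", ts, v)
print(cache.get_range("cpu.temp", 10, 20))  # -> [1.0, 2.0]
```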

Automatic Flushing

When the cache's point count exceeds MAX_CACHE_POINTS, data is flushed to disk:
pub fn insert(self: *Self, series_key: []const u8, data_point: DataPoint) !void {
    try self.cache.insert(series_key, data_point);
    if (self.cache.count > MAX_CACHE_POINTS) try self.flush();
}
Flush process:
  1. Write cache contents to new disk entry
  2. Increment entry count
  3. Clear cache and reset Bloom filter
  4. Reset point counter
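The insert-then-flush behaviour can be sketched in Python (the tiny threshold and `flush_count` field are illustrative stand-ins for the real constants and disk writes):

```python
MAX_CACHE_POINTS = 5  # tiny threshold for illustration; the real tree uses 1_000_000

class TreeModel:
    """Sketch of flush-on-threshold: inserts go to the cache until it overflows."""
    def __init__(self):
        self.cache = []        # stand-in for the skip-list cache
        self.flush_count = 0   # stand-in for entries_count

    def insert(self, timestamp, value):
        self.cache.append((timestamp, value))
        if len(self.cache) > MAX_CACHE_POINTS:
            self.flush()

    def flush(self):
        # 1. write cache contents to a new disk entry (elided here)
        self.flush_count += 1  # 2. increment entry count
        self.cache.clear()     # 3./4. clear cache, reset Bloom filter and counter

tree = TreeModel()
for i in range(12):
    tree.insert(i, float(i))
print(tree.flush_count, len(tree.cache))  # -> 2 0
```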

Disk Storage

Disk entries (src/tsm/entry.zig) use a columnar format with compression:

Entry Structure

pub const Metadata = struct {
    number_rows: u64,
    number_columns: u32,
    created_at: i64,
    page_size: u32,
    version: u32,
    min_timestamp: i64,
    max_timestamp: i64,
};

pub const ColumnDescriptor = struct {
    name: []const u8,
    data_offset: u64,
    data_size: u64,
    num_pages: u32,
    offset_index_page: u64,
    offset_offsets_row: u64,
};

File Format

Each disk entry creates two files:
  • _{level}_{name}.dat - compressed data
  • _{level}_{name}.idx - indexes and metadata
File structure:
.dat file:
  [time column data - compressed timestamps]
  [value column data - serialized values]

.idx file:
  [page descriptors]
  [Bloom filter]
  [row index]
  [series index]
  [footer with metadata]
  [footer offset + magic "SLZ01"]
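The footer-at-end layout of the .idx file can be sketched as follows; the little-endian u64 offset and the exact field layout are assumptions, only the trailing "SLZ01" magic comes from the format above:

```python
import io
import struct

MAGIC = b"SLZ01"

def write_idx(stream, index_blob, footer_blob):
    """Write index sections, then the footer, then [footer offset + magic] at the end."""
    stream.write(index_blob)                        # page descriptors, Bloom filter, indexes...
    footer_offset = stream.tell()
    stream.write(footer_blob)                       # footer with metadata
    stream.write(struct.pack("<Q", footer_offset))  # assumed: little-endian u64 offset
    stream.write(MAGIC)

def read_footer(stream):
    """Seek to the fixed-size trailer, validate the magic, then jump back to the footer."""
    stream.seek(-(8 + len(MAGIC)), io.SEEK_END)
    trailer_pos = stream.tell()
    footer_offset = struct.unpack("<Q", stream.read(8))[0]
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not a valid .idx file")
    stream.seek(footer_offset)
    return stream.read(trailer_pos - footer_offset)

buf = io.BytesIO()
write_idx(buf, b"index-sections", b"footer-metadata")
print(read_footer(buf))  # -> b'footer-metadata'
```

Putting the offset and magic last is what lets a reader locate the footer without scanning the variable-length sections before it.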

Timestamp Encoding

Two encoding options are supported.

1. Gorilla encoding (better compression):
if (ts_encoding == .gorilla) {
    var ts_encoder = gorilla.TimestampEncoder.init();
    var bit_writer = gorilla.BitWriter.init(gorilla_buf);
    
    var node_iter: ?*Node = series.value_ptr.head.next();
    while (node_iter) |node| : (node_iter = node.next()) {
        ts_encoder.encode(&bit_writer, node.key);
        row_id += 1;
    }
    
    const encoded_data = bit_writer.getWrittenData();
}
Uses delta-of-delta bit packing for maximum compression.

2. Delta encoding (faster queries):
else { // delta encoding
    var prev_timestamp: i64 = 0;
    var is_first_in_series = true;
    var node_iter: ?*Node = series.value_ptr.head.next();
    while (node_iter) |node| : (node_iter = node.next()) {
        if (is_first_in_series) {
            serialize_buf[0] = 0xFF;  // marker for a raw first timestamp
            std.mem.writeInt(i64, serialize_buf[1..9], node.key, .little);
            serialized_len = 9;
            is_first_in_series = false;
        } else {
            const delta = node.key - prev_timestamp;
            serialized_len = writeZigzagVarint(delta, &serialize_buf);
        }
        prev_timestamp = node.key;
    }
}
Stores first timestamp raw, then zigzag-encoded deltas.
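The delta path reduces to three small pieces: zigzag mapping (so negative deltas stay small), varint byte packing, and the raw-first-timestamp marker. A Python sketch of the same scheme (the Zig `writeZigzagVarint` helper is modeled here, not reproduced):

```python
def zigzag_encode(n: int) -> int:
    # Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3.
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

def write_varint(z: int, out: bytearray) -> None:
    # 7 payload bits per byte; high bit set on continuation bytes.
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)

def read_varint(buf: bytes, pos: int):
    z, shift = 0, 0
    while True:
        b = buf[pos]; pos += 1
        z |= (b & 0x7F) << shift
        if b < 0x80:
            return z, pos
        shift += 7

def encode_series(timestamps):
    """First timestamp raw (8 bytes after a 0xFF marker), then zigzag varint deltas."""
    out = bytearray([0xFF])
    out += timestamps[0].to_bytes(8, "little", signed=True)
    prev = timestamps[0]
    for ts in timestamps[1:]:
        write_varint(zigzag_encode(ts - prev), out)
        prev = ts
    return bytes(out)

def decode_series(buf):
    assert buf[0] == 0xFF
    first = int.from_bytes(buf[1:9], "little", signed=True)
    out, prev, pos = [first], first, 9
    while pos < len(buf):
        z, pos = read_varint(buf, pos)
        prev += zigzag_decode(z)
        out.append(prev)
    return out

ts = [1_700_000_000, 1_700_000_010, 1_700_000_020, 1_700_000_019]
assert decode_series(encode_series(ts)) == ts
```

For regularly spaced timestamps each delta fits in one byte, which is where the sub-8-byte per-point cost comes from.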

Value Serialization

fn serializeValueStatic(value: Value, buf: []u8) []u8 {
    buf[0] = @intFromEnum(value);  // type tag
    
    switch (value) {
        .Bool => |v| {
            buf[1] = @intFromBool(v);
            return buf[0..2];
        },
        .Int => |v| {
            std.mem.writeInt(i64, buf[1..9], v, .little);
            return buf[0..9];
        },
        .Float => |v| {
            @memcpy(buf[1..9], std.mem.asBytes(&v));
            return buf[0..9];
        },
        .Bytes => |v| {
            std.mem.writeInt(u32, buf[1..5], @intCast(v.len), .little);
            @memcpy(buf[5..][0..v.len], v);
            return buf[0 .. 5 + v.len];
        },
    }
}
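The same tag-then-payload layout can be mirrored in Python with struct; the tag values 0-3 matching the union's declaration order are an assumption inferred from `@intFromEnum`:

```python
import struct

# Assumed tags, matching the declaration order of the Zig Value union.
TAG_BOOL, TAG_INT, TAG_FLOAT, TAG_BYTES = range(4)

def serialize_value(value) -> bytes:
    if isinstance(value, bool):  # bool check must precede int (bool is a subclass)
        return bytes([TAG_BOOL, int(value)])
    if isinstance(value, int):
        return bytes([TAG_INT]) + struct.pack("<q", value)
    if isinstance(value, float):
        return bytes([TAG_FLOAT]) + struct.pack("<d", value)
    if isinstance(value, (bytes, bytearray)):
        return bytes([TAG_BYTES]) + struct.pack("<I", len(value)) + bytes(value)
    raise TypeError(type(value))

def deserialize_value(buf: bytes):
    tag = buf[0]
    if tag == TAG_BOOL:
        return bool(buf[1])
    if tag == TAG_INT:
        return struct.unpack("<q", buf[1:9])[0]
    if tag == TAG_FLOAT:
        return struct.unpack("<d", buf[1:9])[0]
    if tag == TAG_BYTES:
        (n,) = struct.unpack("<I", buf[1:5])
        return bytes(buf[5:5 + n])
    raise ValueError(tag)

for v in [True, -42, 3.14, b"hello"]:
    assert deserialize_value(serialize_value(v)) == v
```

Note the fixed costs this layout implies: 2 bytes per bool, 9 per int or float, and 5 + len per byte string.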

Bloom Filters

Bloom filters provide fast series existence checks:
bloom: Bloom(1024, Hasher)

pub fn mayContainSeries(self: *Self, series_key: []const u8) bool {
    return self.bloom.contains(series_key);
}
  • 1024-bit filter size
  • Inserted on first write to series
  • Persisted to disk with entry
  • Checked before expensive disk reads
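A minimal 1024-bit Bloom filter in Python shows the guarantee these checks rely on: no false negatives, only occasional false positives. The double-hashing scheme and k = 4 here are illustrative choices, not the Zig Hasher:

```python
import hashlib

class Bloom1024:
    SIZE = 1024  # bits
    K = 4        # number of hash functions (illustrative)

    def __init__(self):
        self.bits = 0

    def _positions(self, key: bytes):
        # Derive K positions from two halves of one SHA-256 digest (Kirsch-Mitzenmacher).
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little")
        return [(h1 + i * h2) % self.SIZE for i in range(self.K)]

    def insert(self, key: bytes):
        for p in self._positions(key):
            self.bits |= 1 << p

    def contains(self, key: bytes) -> bool:
        # May return True for a key never inserted; never False for one that was.
        return all(self.bits & (1 << p) for p in self._positions(key))

bloom = Bloom1024()
bloom.insert(b"cpu.temp")
print(bloom.contains(b"cpu.temp"))  # -> True (guaranteed: no false negatives)
```

A false positive only costs a wasted series-index lookup; a false negative would silently drop data, which is why the filter errs in one direction.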

Query Execution

Queries span cache and disk transparently:
pub fn query(self: *Self, series_key: []const u8, timestamp_start: i64, timestamp_end: i64, op: QueryOp) !Value {
    const values_cache = self.queryCache(series_key, timestamp_start, timestamp_end) catch try self.allocator.alloc(Value, 0);
    defer self.allocator.free(values_cache);
    
    const values_disk = self.queryDisk(series_key, timestamp_start, timestamp_end) catch try self.allocator.alloc(Value, 0);
    defer self.allocator.free(values_disk);
    
    const values = try std.mem.concat(self.allocator, Value, &.{ values_cache, values_disk });
    defer self.allocator.free(values);
    
    return switch (op) {
        .AVG => /* calculate average */,
        .MIN => /* find minimum */,
        .MAX => /* find maximum */,
        .SUM => /* sum values */,
        .COUNT => Value{ .Float = @floatFromInt(values.len) },
    };
}
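The aggregation arms elided above reduce to simple folds over the merged slice. A Python sketch (the empty-range error is an assumption; the excerpt does not show how the Zig code handles zero values):

```python
def aggregate(op: str, values: list) -> float:
    """Fold the merged cache + disk values for one query; COUNT mirrors the Zig arm."""
    if op == "COUNT":
        return float(len(values))
    if not values:
        raise ValueError("EmptyRange")  # assumed; the Zig source elides this case
    if op == "AVG":
        return sum(values) / len(values)
    if op == "MIN":
        return min(values)
    if op == "MAX":
        return max(values)
    if op == "SUM":
        return sum(values)
    raise ValueError(op)

vals = [1.0, 4.0, 2.0, 3.0]
print(aggregate("AVG", vals))    # -> 2.5
print(aggregate("COUNT", vals))  # -> 4.0
```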
Disk query optimization:
  1. Check timestamp range against entry metadata
  2. Check Bloom filter for series existence
  3. Look up series in series index
  4. Read only relevant pages
  5. Decompress timestamps for range filtering
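Step 1 — pruning entries whose [min_timestamp, max_timestamp] window cannot intersect the query — is an interval-overlap test against the Metadata struct:

```python
def entry_may_match(entry_min: int, entry_max: int, q_start: int, q_end: int) -> bool:
    """True iff the entry's timestamp window overlaps the queried range."""
    return entry_min <= q_end and entry_max >= q_start

# Entries outside the range are skipped without touching the .dat file.
entries = [(0, 99), (100, 199), (200, 299)]
hits = [e for e in entries if entry_may_match(*e, 150, 250)]
print(hits)  # -> [(100, 199), (200, 299)]
```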

Skip List Performance

The 16-level skip list provides:
  • Average O(log n) insertion: on the order of log₂(10⁶) ≈ 20 comparisons for 1M points
  • Average O(log n) search: similar performance
  • O(n) range scan: efficient due to the linked base level
  • Memory overhead: ~32 bytes per node on average (tower heights are geometric, so most nodes carry only 1-2 of the 16 possible forward pointers)
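The per-node cost follows from tower heights being geometrically distributed: with promotion probability 1/2 a node carries 2 forward pointers on average, not 16. A quick simulation (p = 1/2 is an assumption about the skip list's promotion rule):

```python
import random

def random_height(rng, max_level=16, p=0.5):
    """Coin-flip tower height, capped at the skip list's 16 levels."""
    h = 1
    while h < max_level and rng.random() < p:
        h += 1
    return h

rng = random.Random(42)
heights = [random_height(rng) for _ in range(100_000)]
avg = sum(heights) / len(heights)
print(round(avg, 2))  # close to 2.0: expected pointers per node is 1/(1-p)
```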

Storage Efficiency

From billion-point benchmark:
  • Total: 9.12 bytes per point
  • Breakdown:
    • Gorilla-encoded timestamps: ~1-2 bytes/point
    • Float64 values: ~8 bytes/point (with type tag)
    • Indexes and metadata: Less than 1 byte/point amortized
    • Bloom filters: negligible