TSM Tree Structure

The TSM (Time-Structured Merge) tree is implemented in src/tsm/tsm.zig as a two-tier storage system:
pub fn TsmTreeImpl(comptime max_level: u64, comptime page_size: u32, comptime ts_encoding: TimestampEncoding) type {
    const MAX_CACHE_POINTS = 1_000_000;
    
    return struct {
        allocator: Allocator,
        name: []const u8,
        entries: []*DiskEntry(page_size, ts_encoding),
        entries_count: u64 = 0,
        cache: *Cache(page_size, ts_encoding),
    };
}
Default configuration:
pub const TsmTree = TsmTreeImpl(100_000, 4096, .gorilla);

In-Memory Cache

The cache layer (src/tsm/cache.zig) uses a skip list for ordered storage:

Cache Structure

const Self = struct {
    allocator: Allocator,
    index_series: HashMap(Skiplist(i64, Value, 16, std.Random.Pcg, ds.skiplist.compareI64)),
    bloom: Bloom(1024, Hasher),
    count: u64 = 0,
};
  • Skip list: 16 levels, O(log n) insertion and search
  • Series indexing: HashMap maps series keys to individual skip lists
  • Bloom filter: 1024-bit filter for series existence checks
  • Point counting: Tracks total points for flush threshold

Data Point Types

pub const Value = union(enum) {
    Bool: bool,
    Int: i64,
    Float: f64,
    Bytes: []const u8,
};

pub const DataPoint = struct {
    timestamp: i64,
    value: Value,
};

Cache Operations

Insert:
pub fn insert(self: *Self, series_key: []const u8, data_point: DataPoint) !void {
    const series = try self.index_series.getOrPut(series_key);
    if (!series.found_existing) {
        const owned_key = try self.allocator.dupe(u8, series_key);
        series.key_ptr.* = owned_key;
        const skiplist = try Skiplist(...).init(self.allocator, ...);
        series.value_ptr.* = skiplist;
        self.bloom.insert(series_key);
    }
    _ = try series.value_ptr.*.insert(data_point.timestamp, data_point.value);
    self.count += 1;
}
Range query:
pub fn getRange(self: *Self, series_key: []const u8, timestamp_start: i64, timestamp_end: i64) ![]Value {
    if (!self.bloom.contains(series_key)) {
        return error.SeriesNotFound;
    }
    
    const skiplist = self.index_series.get(series_key) orelse return error.SeriesNotFound;
    
    var values: std.ArrayList(Value) = .empty;
    var iter: ?*Node = skiplist.head.next();
    while (iter) |node| : (iter = node.next()) {
        if (node.key >= timestamp_start and node.key <= timestamp_end) {
            try values.append(self.allocator, node.value);
        }
        if (node.key > timestamp_end) break;
    }
    
    return values.toOwnedSlice(self.allocator);
}
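The insert and range-query paths above can be modeled in a few lines of Python, using a dict of sorted lists in place of the per-series skip lists (the names here are illustrative, not part of the Zig API):

```python
import bisect

class CacheModel:
    """Toy model of the TSM cache: one sorted (timestamp, value) list per series."""
    def __init__(self):
        self.index_series = {}  # series key -> sorted list of (timestamp, value)
        self.count = 0

    def insert(self, series_key, timestamp, value):
        points = self.index_series.setdefault(series_key, [])
        # A skip-list insert is O(log n); insort on a Python list is O(n) but keeps order.
        bisect.insort(points, (timestamp, value))
        self.count += 1

    def get_range(self, series_key, ts_start, ts_end):
        points = self.index_series.get(series_key)
        if points is None:
            raise KeyError("SeriesNotFound")
        # Walk points in timestamp order, keeping those inside [ts_start, ts_end].
        return [v for ts, v in points if ts_start <= ts <= ts_end]

cache = CacheModel()
for ts, v in [(30, 3.0), (10, 1.0), (20, 2.0)]:
    cache.insert("cpu.temp", ts, v)
print(cache.get_range("cpu.temp", 10, 20))  # -> [1.0, 2.0]
```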

Automatic Flushing

When the cache's point count exceeds MAX_CACHE_POINTS, data is flushed to disk:
pub fn insert(self: *Self, series_key: []const u8, data_point: DataPoint) !void {
    try self.cache.insert(series_key, data_point);
    if (self.cache.count > MAX_CACHE_POINTS) try self.flush();
}
Flush process:
  1. Write cache contents to new disk entry
  2. Increment entry count
  3. Clear cache and reset Bloom filter
  4. Reset point counter
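The insert-then-flush behaviour can be sketched in Python (the tiny threshold and `flush_count` field are illustrative stand-ins for the real constants and disk writes):

```python
MAX_CACHE_POINTS = 5  # tiny threshold for illustration; the real tree uses 1_000_000

class TreeModel:
    """Sketch of flush-on-threshold: inserts go to the cache until it overflows."""
    def __init__(self):
        self.cache = []        # stand-in for the skip-list cache
        self.flush_count = 0   # stand-in for entries_count

    def insert(self, timestamp, value):
        self.cache.append((timestamp, value))
        if len(self.cache) > MAX_CACHE_POINTS:
            self.flush()

    def flush(self):
        # 1. write cache contents to a new disk entry (elided here)
        self.flush_count += 1  # 2. increment entry count
        self.cache.clear()     # 3./4. clear cache, reset Bloom filter and counter

tree = TreeModel()
for i in range(12):
    tree.insert(i, float(i))
print(tree.flush_count, len(tree.cache))  # -> 2 0
```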

Disk Storage

Disk entries (src/tsm/entry.zig) use a columnar format with compression:

Entry Structure

pub const Metadata = struct {
    number_rows: u64,
    number_columns: u32,
    created_at: i64,
    page_size: u32,
    version: u32,
    min_timestamp: i64,
    max_timestamp: i64,
};

pub const ColumnDescriptor = struct {
    name: []const u8,
    data_offset: u64,
    data_size: u64,
    num_pages: u32,
    offset_index_page: u64,
    offset_offsets_row: u64,
};

File Format

Each disk entry creates two files:
  • _{level}_{name}.dat - compressed data
  • _{level}_{name}.idx - indexes and metadata
File structure:
.dat file:
  [time column data - compressed timestamps]
  [value column data - serialized values]

.idx file:
  [page descriptors]
  [Bloom filter]
  [row index]
  [series index]
  [footer with metadata]
  [footer offset + magic "SLZ01"]
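The footer-at-end layout of the .idx file can be sketched as follows; the little-endian u64 offset and the exact field layout are assumptions, only the trailing "SLZ01" magic comes from the format above:

```python
import io
import struct

MAGIC = b"SLZ01"

def write_idx(stream, index_blob, footer_blob):
    """Write index sections, then the footer, then [footer offset + magic] at the end."""
    stream.write(index_blob)                        # page descriptors, Bloom filter, indexes...
    footer_offset = stream.tell()
    stream.write(footer_blob)                       # footer with metadata
    stream.write(struct.pack("<Q", footer_offset))  # assumed: little-endian u64 offset
    stream.write(MAGIC)

def read_footer(stream):
    """Seek to the fixed-size trailer, validate the magic, then jump back to the footer."""
    stream.seek(-(8 + len(MAGIC)), io.SEEK_END)
    trailer_pos = stream.tell()
    footer_offset = struct.unpack("<Q", stream.read(8))[0]
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not a valid .idx file")
    stream.seek(footer_offset)
    return stream.read(trailer_pos - footer_offset)

buf = io.BytesIO()
write_idx(buf, b"index-sections", b"footer-metadata")
print(read_footer(buf))  # -> b'footer-metadata'
```

Putting the offset and magic last is what lets a reader locate the footer without scanning the variable-length sections before it.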

Timestamp Encoding

Two encoding options are supported.

1. Gorilla encoding (better compression):
if (ts_encoding == .gorilla) {
    var ts_encoder = gorilla.TimestampEncoder.init();
    var bit_writer = gorilla.BitWriter.init(gorilla_buf);
    
    var node_iter: ?*Node = series.value_ptr.head.next();
    while (node_iter) |node| : (node_iter = node.next()) {
        ts_encoder.encode(&bit_writer, node.key);
        row_id += 1;
    }
    
    const encoded_data = bit_writer.getWrittenData();
}
Uses delta-of-delta bit packing for maximum compression.

2. Delta encoding (faster queries):
else { // delta encoding
    var prev_timestamp: i64 = 0;
    var is_first_in_series = true;
    var node_iter: ?*Node = series.value_ptr.head.next();
    while (node_iter) |node| : (node_iter = node.next()) {
        if (is_first_in_series) {
            serialize_buf[0] = 0xFF;  // marker for a raw first timestamp
            std.mem.writeInt(i64, serialize_buf[1..9], node.key, .little);
            serialized_len = 9;
            is_first_in_series = false;
        } else {
            const delta = node.key - prev_timestamp;
            serialized_len = writeZigzagVarint(delta, &serialize_buf);
        }
        prev_timestamp = node.key;
    }
}
Stores first timestamp raw, then zigzag-encoded deltas.
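The delta path reduces to three small pieces: zigzag mapping (so negative deltas stay small), varint byte packing, and the raw-first-timestamp marker. A Python sketch of the same scheme (the Zig `writeZigzagVarint` helper is modeled here, not reproduced):

```python
def zigzag_encode(n: int) -> int:
    # Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3.
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

def write_varint(z: int, out: bytearray) -> None:
    # 7 payload bits per byte; high bit set on continuation bytes.
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)

def read_varint(buf: bytes, pos: int):
    z, shift = 0, 0
    while True:
        b = buf[pos]; pos += 1
        z |= (b & 0x7F) << shift
        if b < 0x80:
            return z, pos
        shift += 7

def encode_series(timestamps):
    """First timestamp raw (8 bytes after a 0xFF marker), then zigzag varint deltas."""
    out = bytearray([0xFF])
    out += timestamps[0].to_bytes(8, "little", signed=True)
    prev = timestamps[0]
    for ts in timestamps[1:]:
        write_varint(zigzag_encode(ts - prev), out)
        prev = ts
    return bytes(out)

def decode_series(buf):
    assert buf[0] == 0xFF
    first = int.from_bytes(buf[1:9], "little", signed=True)
    out, prev, pos = [first], first, 9
    while pos < len(buf):
        z, pos = read_varint(buf, pos)
        prev += zigzag_decode(z)
        out.append(prev)
    return out

ts = [1_700_000_000, 1_700_000_010, 1_700_000_020, 1_700_000_019]
assert decode_series(encode_series(ts)) == ts
```

For regularly spaced timestamps each delta fits in one byte, which is where the sub-8-byte per-point cost comes from.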

Value Serialization

fn serializeValueStatic(value: Value, buf: []u8) []u8 {
    buf[0] = @intFromEnum(value);  // type tag
    
    switch (value) {
        .Bool => |v| {
            buf[1] = @intFromBool(v);
            return buf[0..2];
        },
        .Int => |v| {
            std.mem.writeInt(i64, buf[1..9], v, .little);
            return buf[0..9];
        },
        .Float => |v| {
            @memcpy(buf[1..9], std.mem.asBytes(&v));
            return buf[0..9];
        },
        .Bytes => |v| {
            std.mem.writeInt(u32, buf[1..5], @intCast(v.len), .little);
            @memcpy(buf[5..][0..v.len], v);
            return buf[0 .. 5 + v.len];
        },
    }
}
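The same tag-then-payload layout can be mirrored in Python with struct; the tag values 0-3 matching the union's declaration order are an assumption inferred from `@intFromEnum`:

```python
import struct

# Assumed tags, matching the declaration order of the Zig Value union.
TAG_BOOL, TAG_INT, TAG_FLOAT, TAG_BYTES = range(4)

def serialize_value(value) -> bytes:
    if isinstance(value, bool):  # bool check must precede int (bool is a subclass)
        return bytes([TAG_BOOL, int(value)])
    if isinstance(value, int):
        return bytes([TAG_INT]) + struct.pack("<q", value)
    if isinstance(value, float):
        return bytes([TAG_FLOAT]) + struct.pack("<d", value)
    if isinstance(value, (bytes, bytearray)):
        return bytes([TAG_BYTES]) + struct.pack("<I", len(value)) + bytes(value)
    raise TypeError(type(value))

def deserialize_value(buf: bytes):
    tag = buf[0]
    if tag == TAG_BOOL:
        return bool(buf[1])
    if tag == TAG_INT:
        return struct.unpack("<q", buf[1:9])[0]
    if tag == TAG_FLOAT:
        return struct.unpack("<d", buf[1:9])[0]
    if tag == TAG_BYTES:
        (n,) = struct.unpack("<I", buf[1:5])
        return bytes(buf[5:5 + n])
    raise ValueError(tag)

for v in [True, -42, 3.14, b"hello"]:
    assert deserialize_value(serialize_value(v)) == v
```

Note the fixed costs this layout implies: 2 bytes per bool, 9 per int or float, and 5 + len per byte string.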

Bloom Filters

Bloom filters provide fast series existence checks:
bloom: Bloom(1024, Hasher)

pub fn mayContainSeries(self: *Self, series_key: []const u8) bool {
    return self.bloom.contains(series_key);
}
  • 1024-bit filter size
  • Inserted on first write to series
  • Persisted to disk with entry
  • Checked before expensive disk reads
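A minimal 1024-bit Bloom filter in Python shows the guarantee these checks rely on: no false negatives, only occasional false positives. The double-hashing scheme and k = 4 here are illustrative choices, not the Zig Hasher:

```python
import hashlib

class Bloom1024:
    SIZE = 1024  # bits
    K = 4        # number of hash functions (illustrative)

    def __init__(self):
        self.bits = 0

    def _positions(self, key: bytes):
        # Derive K positions from two halves of one SHA-256 digest (Kirsch-Mitzenmacher).
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little")
        return [(h1 + i * h2) % self.SIZE for i in range(self.K)]

    def insert(self, key: bytes):
        for p in self._positions(key):
            self.bits |= 1 << p

    def contains(self, key: bytes) -> bool:
        # May return True for a key never inserted; never False for one that was.
        return all(self.bits & (1 << p) for p in self._positions(key))

bloom = Bloom1024()
bloom.insert(b"cpu.temp")
print(bloom.contains(b"cpu.temp"))  # -> True (guaranteed: no false negatives)
```

A false positive only costs a wasted series-index lookup; a false negative would silently drop data, which is why the filter errs in one direction.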

Query Execution

Queries span cache and disk transparently:
pub fn query(self: *Self, series_key: []const u8, timestamp_start: i64, timestamp_end: i64, op: QueryOp) !Value {
    const values_cache = self.queryCache(series_key, timestamp_start, timestamp_end) catch try self.allocator.alloc(Value, 0);
    defer self.allocator.free(values_cache);
    
    const values_disk = self.queryDisk(series_key, timestamp_start, timestamp_end) catch try self.allocator.alloc(Value, 0);
    defer self.allocator.free(values_disk);
    
    const values = try std.mem.concat(self.allocator, Value, &.{ values_cache, values_disk });
    defer self.allocator.free(values);
    
    return switch (op) {
        .AVG => /* calculate average */,
        .MIN => /* find minimum */,
        .MAX => /* find maximum */,
        .SUM => /* sum values */,
        .COUNT => Value{ .Float = @floatFromInt(values.len) },
    };
}
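The aggregation arms elided above reduce to simple folds over the merged slice. A Python sketch (the empty-range error is an assumption; the excerpt does not show how the Zig code handles zero values):

```python
def aggregate(op: str, values: list) -> float:
    """Fold the merged cache + disk values for one query; COUNT mirrors the Zig arm."""
    if op == "COUNT":
        return float(len(values))
    if not values:
        raise ValueError("EmptyRange")  # assumed; the Zig source elides this case
    if op == "AVG":
        return sum(values) / len(values)
    if op == "MIN":
        return min(values)
    if op == "MAX":
        return max(values)
    if op == "SUM":
        return sum(values)
    raise ValueError(op)

vals = [1.0, 4.0, 2.0, 3.0]
print(aggregate("AVG", vals))    # -> 2.5
print(aggregate("COUNT", vals))  # -> 4.0
```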
Disk query optimization:
  1. Check timestamp range against entry metadata
  2. Check Bloom filter for series existence
  3. Look up series in series index
  4. Read only relevant pages
  5. Decompress timestamps for range filtering
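Step 1 — pruning entries whose [min_timestamp, max_timestamp] window cannot intersect the query — is an interval-overlap test against the Metadata struct:

```python
def entry_may_match(entry_min: int, entry_max: int, q_start: int, q_end: int) -> bool:
    """True iff the entry's timestamp window overlaps the queried range."""
    return entry_min <= q_end and entry_max >= q_start

# Entries outside the range are skipped without touching the .dat file.
entries = [(0, 99), (100, 199), (200, 299)]
hits = [e for e in entries if entry_may_match(*e, 150, 250)]
print(hits)  # -> [(100, 199), (200, 299)]
```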

Skip List Performance

The 16-level skip list provides:
  • Average O(log n) insertion: on the order of log₂(10⁶) ≈ 20 comparisons for 1M points
  • Average O(log n) search: similar performance
  • O(n) range scan: efficient due to the linked base level
  • Memory overhead: ~32 bytes per node on average (tower heights are geometric, so most nodes carry only 1-2 of the 16 possible forward pointers)
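The per-node cost follows from tower heights being geometrically distributed: with promotion probability 1/2 a node carries 2 forward pointers on average, not 16. A quick simulation (p = 1/2 is an assumption about the skip list's promotion rule):

```python
import random

def random_height(rng, max_level=16, p=0.5):
    """Coin-flip tower height, capped at the skip list's 16 levels."""
    h = 1
    while h < max_level and rng.random() < p:
        h += 1
    return h

rng = random.Random(42)
heights = [random_height(rng) for _ in range(100_000)]
avg = sum(heights) / len(heights)
print(round(avg, 2))  # close to 2.0: expected pointers per node is 1/(1-p)
```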

Storage Efficiency

From billion-point benchmark:
  • Total: 9.12 bytes per point
  • Breakdown:
    • Gorilla-encoded timestamps: ~1-2 bytes/point
    • Float64 values: ~8 bytes/point (with type tag)
    • Indexes and metadata: Less than 1 byte/point amortized
    • Bloom filters: negligible