Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/Neumenon/cowrie/llms.txt

Use this file to discover all available pages before exploring further.

Format Structure

Cowrie Gen2 uses a header-dictionary-value structure:
┌─────────────────────────────────────────────────────────┐
│ Header (4+ bytes)                                       │
├─────────────────────────────────────────────────────────┤
│ Magic: "SJ" (0x53 0x4A)                                 │
│ Version: 0x02                                           │
│ Flags: 0xXX                                             │
├─────────────────────────────────────────────────────────┤
│ [Column Hints] (optional, if FlagHasColumnHints set)   │
├─────────────────────────────────────────────────────────┤
│ Dictionary                                              │
├─────────────────────────────────────────────────────────┤
│ DictLen: varint                                         │
│ Keys: DictLen × [len:varint][UTF-8 bytes]              │
├─────────────────────────────────────────────────────────┤
│ Root Value (recursive encoding)                         │
└─────────────────────────────────────────────────────────┘

Magic Bytes

Cowrie Gen2 files start with the ASCII characters "SJ" (Structured JSON):
53 4A  // "SJ" magic header
Gen1 format has no magic header and starts directly with the root value type tag. Always check the first two bytes to distinguish formats.

Header Layout

Byte-by-Byte Breakdown

OffsetFieldTypeDescription
0-1Magic2 bytes0x53 0x4A (“SJ”)
2Versionu8Format version (0x02)
3Flagsu8Bitfield (see below)

Header Flags

Flags are bit-packed in byte 3:
Bit(s)NameValueDescription
0Compressed0x01Payload is compressed
1-2Compression Type0x02/0x040=none, 1=gzip, 2=zstd
3Has Column Hints0x08Column hints present after flags
4-7Reserved-Must be zero
Example Flag Values:
0x00  // No compression, no hints
0x01  // Compressed (type in bits 1-2)
0x03  // Compressed with gzip (0x01 | (1 << 1))
0x05  // Compressed with zstd (0x01 | (2 << 1))
0x08  // Column hints present
0x09  // Column hints + compressed

Varint Encoding

Cowrie uses Protocol Buffers-style unsigned varint encoding:
  • 7 bits per byte for data
  • MSB (bit 7) as continuation flag:
    • 1 = more bytes follow
    • 0 = last byte

Encoding Algorithm

fn encode_uvarint(n: u64) -> Vec<u8> {
    let mut buf = Vec::new();
    let mut val = n;
    loop {
        let mut byte = (val & 0x7F) as u8;
        val >>= 7;
        if val != 0 {
            byte |= 0x80;  // Set continuation bit
        }
        buf.push(byte);
        if val == 0 {
            break;
        }
    }
    buf
}

Decoding Algorithm

fn decode_uvarint(data: &[u8]) -> Result<(u64, usize), Error> {
    let mut result = 0u64;
    let mut shift = 0;
    for (i, &byte) in data.iter().enumerate() {
        result |= ((byte & 0x7F) as u64) << shift;
        if (byte & 0x80) == 0 {
            return Ok((result, i + 1));  // Return (value, bytes_read)
        }
        shift += 7;
        if shift >= 64 {
            return Err(Error::VarIntOverflow);
        }
    }
    Err(Error::UnexpectedEof)
}

Encoding Examples

ValueBytesExplanation
000Single byte, no continuation
1277FMax single-byte value
12880 010b10000000 → 0b0000001 0000000
300AC 020b10101100 0b00000010 → 0b00000010 0101100
16,38480 80 013 bytes
Varints are optimized for small numbers. Values 0-127 take 1 byte, 128-16,383 take 2 bytes, etc.

Zigzag Encoding

Signed integers use zigzag encoding before varint encoding:
fn zigzag_encode(n: i64) -> u64 {
    ((n << 1) ^ (n >> 63)) as u64
}

fn zigzag_decode(z: u64) -> i64 {
    ((z >> 1) ^ (-(z & 1) as u64)) as i64
}

Zigzag Mapping

SignedZigzagExplanation
00Unchanged
-11Maps to 1
12Maps to 2
-23Maps to 3
24Maps to 4
-64127Fits in 1 byte
Zigzag encoding ensures small negative numbers (like -1, -2) encode as small varints, rather than large 64-bit values.

Type Tags

Every value is prefixed with a type tag (1 byte):

Core Types (0x00-0x0F)

TagTypeEncoding
0x00NullTag only
0x01FalseTag only
0x02TrueTag only
0x03Int64Tag + zigzag varint
0x04Float64Tag + 8 bytes LE
0x05StringTag + len:varint + UTF-8
0x06ArrayTag + count:varint + elements
0x07ObjectTag + count:varint + (dictIdx:varint + value)*
0x08BytesTag + len:varint + raw bytes
0x09Uint64Tag + varint
0x0ADecimal128Tag + scale:i8 + coef:16 bytes
0x0BDatetime64Tag + nanos:i64 LE
0x0CUUID128Tag + 16 bytes
0x0DBigIntTag + len:varint + two’s complement bytes
0x0EExtensionTag + extType:varint + len:varint + payload

ML Types (0x20-0x2F)

TagTypeEncoding
0x20Tensordtype:u8 + rank:u8 + dims* + dataLen:varint + data
0x21TensorRefstoreId:u8 + keyLen:varint + key
0x22Imageformat:u8 + width:u16 + height:u16 + dataLen:varint + data
0x23Audioencoding:u8 + sampleRate:u32 + channels:u8 + dataLen:varint + data

Graph Types (0x30-0x39)

TagTypeEncoding
0x30AdjListidWidth:u8 + nodeCount + edgeCount + rowOffsets + colIndices
0x31RichTexttext + flags:u8 + tokens + spans
0x32DeltabaseId:varint + opCount:varint + ops
0x35Nodeid:string + labels* + props (dict-coded)
0x36EdgesrcId + dstId + type + props (dict-coded)
0x37NodeBatchcount:varint + Node[count]
0x38EdgeBatchcount:varint + Edge[count]
0x39GraphShardnodes + edges + metadata (dict-coded)

Encoding Examples

Null

00  // Tag only

Boolean

01  // False
02  // True

Int64

Value: 42
Zigzag: 84 (42 << 1)
Varint: 0x54

Wire format: 03 54
Value: -1
Zigzag: 1 ((-1 << 1) ^ (-1 >> 63))
Varint: 0x01

Wire format: 03 01

Float64

Value: 3.14159
IEEE 754: 0x400921FB54442D18 (little-endian)

Wire format: 04 18 2D 44 54 FB 21 09 40

String

Value: "hello"
Length: 5
UTF-8: 68 65 6C 6C 6F

Wire format: 05 05 68 65 6C 6C 6F
             │  │  └──────┬──────┘
             │  │      UTF-8 bytes
             │  └─ Length (varint)
             └─ Tag

Array

[1, 2, 3]
06        // Array tag
03        // Count = 3
03 02     // Int64: 1 (zigzag: 2)
03 04     // Int64: 2 (zigzag: 4)
03 06     // Int64: 3 (zigzag: 6)

Object (Dictionary-Coded)

Given dictionary ["name", "age"]:
{"name": "Alice", "age": 30}
07              // Object tag
02              // Field count = 2
00              // Dict index 0 ("name")
05 05 41 6C 69 63 65  // String: "Alice"
01              // Dict index 1 ("age")
03 3C           // Int64: 30 (zigzag: 60)
Object keys are encoded as dictionary indices (usually 1 byte) instead of full strings, providing massive size savings.

Compression Framing

When the Compressed flag (0x01) is set:
┌─────────────────────────────────────────┐
│ Header (4 bytes)                        │
│  - Magic: "SJ"                          │
│  - Version: 0x02                        │
│  - Flags: 0x03 (compressed + gzip)      │
├─────────────────────────────────────────┤
│ OrigLen: varint                         │
│  (uncompressed payload size)            │
├─────────────────────────────────────────┤
│ Compressed Payload                      │
│  (dictionary + root value, compressed)  │
└─────────────────────────────────────────┘

Compression Algorithm

fn encode_compressed(value: &Value, compression: Compression) -> Vec<u8> {
    // 1. Encode uncompressed payload
    let mut payload = Vec::new();
    encode_dictionary(&mut payload);
    encode_value(&mut payload, value);
    
    // 2. Compress payload
    let compressed = match compression {
        Compression::Gzip => gzip_compress(&payload),
        Compression::Zstd => zstd_compress(&payload),
        _ => panic!("Invalid compression type"),
    };
    
    // 3. Build result
    let mut result = Vec::new();
    result.extend_from_slice(b"SJ");  // Magic
    result.push(VERSION);              // Version
    result.push(compression_flags(compression));  // Flags
    write_uvarint(&mut result, payload.len() as u64);  // Original length
    result.extend_from_slice(&compressed);             // Compressed data
    result
}

Security Limits

Decoders must enforce limits to prevent memory exhaustion:
LimitDefaultPurpose
MaxDepth1,000Prevent stack overflow
MaxArrayLen100,000,000Limit array size
MaxObjectLen10,000,000Limit object fields
MaxStringLen500,000,000Limit string size (500 MB)
MaxBytesLen1,000,000,000Limit binary data (1 GB)
MaxDictLen10,000,000Limit dictionary entries
MaxExtLen100,000,000Limit extension payload
MaxRank32Limit tensor dimensions
Decompression Bomb Protection: Always check OrigLen against MaxDecompressedSize before allocating memory. Reject payloads that exceed limits.

Column Hints

Optional metadata for columnar readers (appears after header if FlagHasColumnHints is set):
HintCount: varint
Repeat HintCount times:
  Field: [len:varint][UTF-8 bytes]
  Type: u8
  ShapeLen: varint
  ShapeDims: ShapeLen × varint
  Flags: u8
Example:
// Hint: field "embeddings" is a float32 tensor with shape [100, 768]
Field: "embeddings" (10 bytes)
Type: 0x01 (float32)
ShapeLen: 2
ShapeDims: [100, 768]
Flags: 0x00
Column hints enable zero-copy reading for columnar formats like Apache Arrow. Decoders that don’t support column hints must skip this block.

Error Handling

Canonical Error Codes

CodeCondition
ERR_INVALID_MAGICFirst 2 bytes ≠ “SJ”
ERR_INVALID_VERSIONVersion byte ≠ 0x02
ERR_TRUNCATEDUnexpected end of data
ERR_INVALID_TAGUnknown type tag
ERR_INVALID_UTF8String contains invalid UTF-8
ERR_INVALID_VARINTVarint overflow (>64 bits)
ERR_TOO_DEEPNesting depth > MaxDepth
ERR_TOO_LARGEArray/object/string exceeds limits
ERR_DICT_TOO_LARGEDictionary > MaxDictLen
ERR_UNSUPPORTED_COMPRESSIONUnknown compression type
ERR_DECOMPRESSED_MISMATCHDecompressed size ≠ OrigLen

Validation Example

fn validate_header(data: &[u8]) -> Result<(), Error> {
    if data.len() < 4 {
        return Err(Error::Truncated);
    }
    if data[0] != 0x53 || data[1] != 0x4A {
        return Err(Error::InvalidMagic);
    }
    if data[2] != 0x02 {
        return Err(Error::InvalidVersion(data[2]));
    }
    Ok(())
}

Summary

Cowrie’s wire format provides: Compact encoding with varint and zigzag for small numbers
Dictionary coding for 70-80% size reduction on repeated keys
Type safety with explicit type tags
Compression with gzip/zstd support
Security with configurable limits
Extensibility via TagExt envelope
The format balances encoding efficiency with decoding speed, making it ideal for high-throughput ML/AI workloads.

Build docs developers (and LLMs) love