An array is the in-memory representation of data in Vortex. Unlike Apache Arrow arrays, which always store data as flat, uncompressed buffers, Vortex arrays can remain in their compressed encoding in memory. An integer array might be bit-packed; a string column might stay FSST-encoded. Compute operates on these compressed representations directly where encoding-specific kernels exist, falling back to decompression only when necessary.

Structurally, each array is a tree. Every node in the tree has a length, a dtype, child arrays, raw data buffers, statistics, and a vtable that encapsulates encoding-specific behavior. This tree structure is how cascading encodings (such as dictionary codes that are themselves bit-packed) are represented.
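To make the tree shape concrete, here is a minimal sketch of such a node. The `Node` type, its field names, and `example_tree` are hypothetical and are not Vortex's real types: the vtable is reduced to an encoding name, dtypes are plain strings, and statistics are left out.

```rust
// Hypothetical, simplified model of an array tree node; not Vortex's actual API.
#[derive(Debug)]
struct Node {
    encoding: &'static str, // stands in for the vtable that selects encoding-specific behavior
    dtype: &'static str,    // logical type of the values this node represents
    len: usize,             // number of logical elements
    buffers: Vec<Vec<u8>>,  // raw data held directly by this node
    children: Vec<Node>,    // nested arrays, each with its own encoding
}

impl Node {
    /// Total bytes held in buffers by this node and all of its descendants.
    fn nbytes(&self) -> usize {
        let own: usize = self.buffers.iter().map(|b| b.len()).sum();
        own + self.children.iter().map(Node::nbytes).sum::<usize>()
    }

    /// Number of levels in the tree rooted at this node.
    fn depth(&self) -> usize {
        1 + self.children.iter().map(Node::depth).max().unwrap_or(0)
    }
}

/// A dictionary array whose codes are bit-packed: a two-level cascade.
fn example_tree() -> Node {
    Node {
        encoding: "dict",
        dtype: "utf8",
        len: 4,
        buffers: vec![], // the dictionary node itself holds no data
        children: vec![
            Node {
                encoding: "bitpacked",
                dtype: "u8",
                len: 4,
                buffers: vec![vec![0b0001_1011]], // four 2-bit codes in one byte
                children: vec![],
            },
            Node {
                encoding: "varbin",
                dtype: "utf8",
                len: 3,
                buffers: vec![b"abc".to_vec()], // three one-byte strings
                children: vec![],
            },
        ],
    }
}
```

In this model the parent's size is the sum over its subtree, which mirrors how a tree display can report each child's share of the parent's bytes.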

Array tree example

The following output from Array::display_tree() shows a real compressed array tree. A single column of nullable UTF-8 strings is represented as a dictionary-encoded array. Its codes are a slice of a run-end encoded array whose run ends are frame-of-reference encoded and then bit-packed, and whose run values are bit-packed, all using FastLanes:
vortex.dict(utf8?, len=1112) nbytes=11.89 kB (0.10%) [all_valid]
  metadata: DictMetadata { values_len: 224, codes_ptype: U16 }
  codes: vortex.slice(u16, len=1112) nbytes=3.46 kB (29.07%)
    metadata: 474600..475712
    child: vortex.runend(u16, len=475712) nbytes=3.46 kB (100.00%)
      metadata: RunEndMetadata { ends_ptype: U32, num_runs: 981, offset: 0 }
      ends: fastlanes.for(u32, len=981) nbytes=2.43 kB (70.37%) [nulls=0, min=62353u32, max=475712u32, strict]
        metadata: 62353u32
        encoded: fastlanes.bitpacked(u32, len=981) nbytes=2.43 kB (100.00%) [nulls=0, min=0u32, max=413359u32]
          metadata: BitPackedMetadata { bit_width: 19, offset: 0, patches: None }
          buffer: packed host 2.43 kB (align=4) (100.00%)
      values: fastlanes.bitpacked(u16, len=981) nbytes=1.02 kB (29.63%) [nulls=0, min=0u16, max=223u16]
        metadata: BitPackedMetadata { bit_width: 8, offset: 0, patches: None }
        buffer: packed host 1.02 kB (align=2) (100.00%)
  values: vortex.varbinview(utf8?, len=224) nbytes=8.43 kB (70.93%) [all_valid]
    metadata: EmptyMetadata
    buffer: buffer_0 host 4.85 kB (align=1) (57.49%)
    buffer: views host 3.58 kB (align=16) (42.51%)
The original 1,112-element string column is stored in just 11.89 kB, a fraction of its uncompressed size, while remaining directly queryable. Of that total, the 224 distinct dictionary values account for 8.43 kB and the codes for 3.46 kB.
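Reading the tree from the leaves upward, the logical column can be recovered by expanding the runs, taking the slice, and looking each code up in the dictionary. The sketch below illustrates that order of operations on plain vectors; `runend_decode` and `dict_decode` are hypothetical helper names, the inputs are assumed to be already unpacked, and nothing here is Vortex's own decoding path.

```rust
/// Expand run-end encoded data. `ends[i]` is the exclusive end position of run `i`,
/// so the ends are assumed to be strictly increasing.
fn runend_decode(ends: &[u32], values: &[u16]) -> Vec<u16> {
    let mut out = Vec::new();
    let mut start = 0u32;
    for (&end, &value) in ends.iter().zip(values) {
        out.extend(std::iter::repeat(value).take((end - start) as usize));
        start = end;
    }
    out
}

/// Resolve dictionary codes against the dictionary values.
/// Every code is assumed to be a valid index into `dictionary`.
fn dict_decode<'a>(codes: &[u16], dictionary: &[&'a str]) -> Vec<&'a str> {
    codes.iter().map(|&c| dictionary[c as usize]).collect()
}
```

For instance, `runend_decode(&[2, 5, 6], &[1, 0, 2])` expands to six codes, and passing a sub-slice of those codes to `dict_decode` plays the role of the slice node in the tree above.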

Canonical arrays

To avoid implementing compute for every possible combination of encodings, Vortex defines one canonical encoding per logical dtype. Any array can be fully decompressed into its canonical form. Note that canonical form applies only to the root array; child arrays may still be compressed.

| DType | Canonical array |
| --- | --- |
| DType::Null | NullArray |
| DType::Bool | BoolArray |
| DType::Primitive | PrimitiveArray |
| DType::Utf8 | VarBinViewArray |
| DType::Binary | VarBinViewArray |
| DType::Struct | StructArray |
| DType::List | ListViewArray |
| DType::FixedSizeList | FixedSizeListArray |
| DType::Extension | ExtensionArray |

Built-in arrays

Alongside the canonical encodings, Vortex ships with additional built-in array types that handle common patterns and provide full zero-copy compatibility with Apache Arrow.

| Array | Description |
| --- | --- |
| ChunkedArray | A concatenation of multiple arrays |
| ConstantArray | An array where all values are the same constant |
| DictionaryArray | Dictionary encoding for any data type |
| FilterArray | An array filtered by a boolean mask |
| SliceArray | A sliced view over another array |
| ListArray | Variable-length list of elements (Arrow-compatible) |
| VarBinArray | Variable-length binary array (Arrow-compatible) |

Compressed arrays

The encodings/ directory of the Vortex repository contains compressed array implementations maintained by the Vortex project. These are the encodings a compressor can choose from when writing data.

| Encoding | Description |
| --- | --- |
| ALP | Adaptive Lossless Floating Point |
| ALPrd | Adaptive Lossless Floating Point for real doubles |
| ByteBool | Byte-sized boolean arrays |
| DateTimeParts | Decomposed date-time encoding for timestamps |
| DecimalByteParts | Decomposed decimal encoding |
| FastLanes BitPacking | SIMD-optimized bit-packed integer encoding |
| FastLanes Delta | SIMD-optimized delta encoding |
| FastLanes FoR | SIMD-optimized frame-of-reference encoding |
| FastLanes RLE | SIMD-optimized run-length encoding |
| FSST | Fast Static Symbol Table for string compression |
| PCodec | Compression-optimized integer and float compression |
| RunEnd | Run-end encoding (Arrow-compatible) |
| Sequence | Sequence encoding for fixed-interval runs |
| Sparse | Fill-value plus patches for sparse data |
| ZigZag | Zig-zag encoding to remove negative integers |
| ZStd | Binary compression using zstd |
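As an illustration of the ideas behind three of these encodings, the following sketch shows scalar versions of frame-of-reference, bit-width selection, and zig-zag. These are textbook formulations with hypothetical function names, not the SIMD FastLanes kernels or Vortex's implementations.

```rust
/// Frame of reference: subtract the minimum so the residuals are small.
/// Returns `(reference, residuals)`; an empty input uses reference 0.
fn for_encode(values: &[u32]) -> (u32, Vec<u32>) {
    let reference = values.iter().copied().min().unwrap_or(0);
    (reference, values.iter().map(|v| v - reference).collect())
}

/// Number of bits needed to bit-pack every residual (0 if all are zero).
fn bit_width(residuals: &[u32]) -> u32 {
    let max = residuals.iter().copied().max().unwrap_or(0);
    32 - max.leading_zeros()
}

/// Zig-zag maps signed integers to unsigned ones so that values of small
/// magnitude, positive or negative, become small unsigned codes.
fn zigzag_encode(v: i32) -> u32 {
    ((v << 1) ^ (v >> 31)) as u32
}

/// Inverse of `zigzag_encode`.
fn zigzag_decode(u: u32) -> i32 {
    ((u >> 1) as i32) ^ -((u & 1) as i32)
}
```

Applied to the `ends` array in the tree example earlier, a minimum of 62353 and a maximum of 475712 leave a largest residual of 413359, which needs 19 bits; that agrees with the `bit_width: 19` reported in the tree.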

Statistics

Every array carries statistics that allow compute functions to short-circuit or optimize their behavior. Statistics are computed lazily and propagated through the array tree.

| Statistic | Description |
| --- | --- |
| null_count | Number of null values |
| true_count | Number of true values (boolean arrays only) |
| run_count | Number of consecutive runs |
| is_constant | Whether all values are the same |
| is_sorted | Whether values are in sorted order |
| is_strict_sorted | Whether values are sorted and unique |
| min | Minimum value |
| max | Maximum value |
| uncompressed_size | Size in memory before compression |

These statistics are used heavily during scanning. For example, a zone's min/max values let the scan engine skip entire row ranges without reading the underlying data.
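The pruning idea can be sketched in a few lines. `ZoneStats`, `zones_for_gt`, and `zones_for_eq` are hypothetical names chosen for illustration, and the zone layout is an assumption; the real scan engine is not shown in this document.

```rust
/// Hypothetical per-zone summary: min/max of the values in rows `row_start..row_end`.
struct ZoneStats {
    min: i64,
    max: i64,
    row_start: usize,
    row_end: usize,
}

/// Row ranges that may contain a value greater than `threshold`.
/// A zone whose max is at most `threshold` cannot match and is skipped.
fn zones_for_gt(zones: &[ZoneStats], threshold: i64) -> Vec<(usize, usize)> {
    zones
        .iter()
        .filter(|z| z.max > threshold)
        .map(|z| (z.row_start, z.row_end))
        .collect()
}

/// Row ranges that may contain a value equal to `target`.
/// A zone is kept only if `target` lies within its [min, max] interval.
fn zones_for_eq(zones: &[ZoneStats], target: i64) -> Vec<(usize, usize)> {
    zones
        .iter()
        .filter(|z| z.min <= target && target <= z.max)
        .map(|z| (z.row_start, z.row_end))
        .collect()
}
```

Statistics can only rule zones out: a zone that survives the filter still has to be read and evaluated row by row.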

Execution

The primary operation over Vortex arrays is execution: taking an arbitrary (possibly compressed) array tree and producing another tree that is closer to canonical form. When executing, Vortex first looks for encoding-specific kernels that can operate directly on the compressed data. If none is found, the array is decompressed to its canonical form and the operation is performed from there. This dispatch model means common operations on well-known encodings (such as comparing a bit-packed integer array against a scalar) are fast without any decompression.
Canonical form applies to the root of the tree only. Child arrays in a canonical array may still be in any compressed encoding.
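The dispatch model described here (try an encoding-specific kernel, otherwise canonicalize and compute) can be sketched as follows. The `Array` enum, `gt_kernel`, and `gt` are hypothetical and heavily simplified; they only illustrate the control flow, using a frame-of-reference array compared against a scalar, and are not Vortex's execution API.

```rust
/// Hypothetical array with one canonical and two compressed encodings.
enum Array {
    Primitive(Vec<u32>),                         // canonical form
    For { reference: u32, residuals: Vec<u32> }, // value[i] = reference + residuals[i]
    Constant { value: u32, len: usize },         // the same value repeated `len` times
}

impl Array {
    /// Fully decompress into the canonical representation.
    fn canonicalize(&self) -> Vec<u32> {
        match self {
            Array::Primitive(v) => v.clone(),
            Array::For { reference, residuals } => {
                residuals.iter().map(|&r| *reference + r).collect()
            }
            Array::Constant { value, len } => vec![*value; *len],
        }
    }
}

/// Encoding-specific kernel for `array > scalar`; `None` means no kernel exists.
fn gt_kernel(array: &Array, scalar: u32) -> Option<Vec<bool>> {
    match array {
        Array::For { reference, residuals } => Some(match scalar.checked_sub(*reference) {
            // reference + r > scalar  is equivalent to  r > scalar - reference,
            // so the residuals are compared without rebuilding the values.
            Some(shifted) => residuals.iter().map(|&r| r > shifted).collect(),
            // scalar < reference: every value is at least `reference`, so all match.
            None => vec![true; residuals.len()],
        }),
        _ => None,
    }
}

/// Dispatch: prefer the kernel, otherwise fall back to the canonical form.
fn gt(array: &Array, scalar: u32) -> Vec<bool> {
    gt_kernel(array, scalar)
        .unwrap_or_else(|| array.canonicalize().iter().map(|&v| v > scalar).collect())
}
```

In this sketch the `Constant` variant has no kernel, so it exercises the fallback path, while the `For` variant never materializes its values.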

Buffer handles

Arrays hold their raw physical data in buffer handles — opaque objects that reference an underlying allocation. Buffers are device-agnostic by design, allowing Vortex arrays to live in CPU host memory or on other devices such as GPUs without changing the array interface.
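One way to picture a device-agnostic handle is an enum whose variants differ in where the bytes live while exposing the same interface. The `BufferHandle` type below is a hypothetical sketch under that assumption, with the device side reduced to an identifier and a length; it does not reflect Vortex's actual buffer types.

```rust
use std::sync::Arc;

/// Hypothetical handle to a buffer that may live on the host or on a device.
#[derive(Clone)]
enum BufferHandle {
    /// Bytes in CPU memory, shared by reference counting so clones do not copy.
    Host(Arc<[u8]>),
    /// Placeholder for an allocation on another device, such as a GPU.
    Device { device_id: u32, len: usize },
}

impl BufferHandle {
    /// Length in bytes, available regardless of where the data lives.
    fn len(&self) -> usize {
        match self {
            BufferHandle::Host(bytes) => bytes.len(),
            BufferHandle::Device { len, .. } => *len,
        }
    }

    /// Borrow the bytes if they are directly addressable from the CPU.
    fn as_host(&self) -> Option<&[u8]> {
        match self {
            BufferHandle::Host(bytes) => Some(&bytes[..]),
            BufferHandle::Device { .. } => None,
        }
    }
}
```

Code that only needs sizes or wants to pass the buffer along can work with either variant, and only code that reads bytes on the CPU has to handle the case where `as_host` returns `None`.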

Arrow compatibility

Vortex is designed for zero-copy interoperability with Apache Arrow. The canonical array types are structurally compatible with their Arrow counterparts, and Arrow arrays can be imported into Vortex and exported back without copying data. This means existing Arrow-based pipelines can adopt Vortex incrementally.
