
Documentation Index

Fetch the complete documentation index at: https://mintlify.com/vortex-data/vortex/llms.txt

Use this file to discover all available pages before exploring further.

Vortex is a next-generation columnar file format and toolkit designed for high-performance data processing on object storage. It provides a clean separation between logical types and physical encodings, allowing query engines and storage systems to apply optimal compression schemes per column—without sacrificing read speed. Vortex integrates natively with Apache Arrow, DataFusion, DuckDB, Spark, Pandas, and Polars.
The Vortex file format has been stable since v0.36.0. All future releases guarantee backwards compatibility—any file written by Vortex 0.36.0 or later will be readable by newer versions. Library APIs may still evolve between releases.

Key Features

100x Faster Random Access

Vortex delivers up to 100x faster random access reads compared to modern Apache Parquet, thanks to efficient support for wide tables with zero-copy, zero-parse metadata.

10–20x Faster Scans

Full-column scans run 10–20x faster than Parquet, enabled by optimized compute kernels that operate directly on compressed data without full decompression.

5x Faster Writes

Writing data to Vortex is up to 5x faster than Parquet while achieving similar compression ratios, making it practical as a hot-path storage format.

Similar Compression Ratios

Vortex matches Parquet’s compression ratios using a pluggable cascading compression system, including BtrBlocks, RLE, dictionary encoding, ALP, FSST, and more.

Zero-Copy Arrow Integration

Built-in encodings are fully compatible with Apache Arrow’s memory format. Convert to and from Arrow arrays with zero copies using vx.array() and .to_arrow().

Extensible Architecture

Modeled after Apache DataFusion’s plugin system: encodings, type systems, compression strategies, and layout strategies are all swappable without forking the library.

Architecture: Logical vs. Physical Layers

Vortex strictly separates logical concerns (what the data means) from physical concerns (how the data is stored). This design enables engines to choose the best encoding for each column independently.

Logical Layer

The logical layer defines data types and schema. Vortex’s type system (DType) covers primitives, structs, lists, strings, timestamps, and extension types. The logical type is what query engines and users interact with; it never changes regardless of how the data is physically compressed.

Physical Layer

The physical layer handles encoding and storage. Built-in encodings match Apache Arrow’s in-memory format for zero-copy interoperability. Extension encodings implement compressed schemes such as:
  • RLE — Run-length encoding for repeated values
  • Dictionary — Dictionary encoding for low-cardinality columns
  • FastLanes — High-throughput bit-packing and frame-of-reference for integers
  • ALP / G-ALP — Adaptive lossless floating-point compression
  • FSST — Fast random-access string compression
  • BtrBlocks — Cascading columnar compression, the default for file writes
Because compute kernels operate on encoded arrays directly, many operations avoid a full decompression step—this is the source of Vortex’s scan and random-access speed advantage.
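As an illustration of this point, here is a plain-Python sketch (not the Vortex API) of why a kernel can run on encoded data: summing a run-length-encoded column touches one (value, run_length) pair per run instead of every row.

```python
# Illustrative RLE compute: toy code, not Vortex's implementation.
def rle_encode(values):
    """Compress a sequence into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_sum(runs):
    """Sum the column without materializing the decoded rows."""
    return sum(value * length for value, length in runs)

column = [7] * 1_000 + [3] * 500 + [9] * 250
runs = rle_encode(column)

assert len(runs) == 3               # 1,750 rows collapse to 3 runs
assert rle_sum(runs) == sum(column)  # same answer, ~3 operations
```

The real system generalizes this idea: each encoding ships compute kernels that exploit its structure, falling back to decoding only when no specialized kernel applies.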

Performance Benchmarks

The following numbers compare Vortex against modern Apache Parquet across representative workloads:
| Operation | Vortex vs. Parquet |
| --- | --- |
| Random access reads | Up to 100x faster |
| Full column scans | 10–20x faster |
| Writes | Up to 5x faster |
| Compression ratio | Similar |
Live, continuously updated benchmarks are published at bench.vortex.dev.

Integrations

Vortex works with the tools already in your stack:
  • Query engines: Apache DataFusion, DuckDB
  • Runtimes: Apache Spark (via JNI connector)
  • DataFrame libraries: Pandas, Polars
  • Memory format: Apache Arrow (zero-copy)
  • Coming soon: Apache Iceberg

Open Source and Governance

Vortex is a Linux Foundation AI & Data sub-project, licensed under Apache-2.0. It is not controlled by any single company. The governance model is documented in CONTRIBUTING.md and governed by the Technical Charter.

Get Started

Choose your language to start reading and writing Vortex files in minutes:

Python Quickstart

Install vortex-data, write arrays to .vortex files, and query them with filter and projection pushdown.

Rust Quickstart

Add the vortex crate, create a VortexSession, and read/write compressed files with async Tokio.

Java Quickstart

Use the Spark connector or standalone JNI library to access Vortex files from the JVM.
