H2O-3 organizes tabular data into three nested structures: Frame, Vec, and Chunk. Understanding this hierarchy helps you write efficient code and reason about how data is distributed across a cluster.
The Frame / Vec / Chunk hierarchy
H2OFrame
H2OFrame is the primary 2D data structure in H2O-3. It is analogous to a pandas DataFrame or an R data.frame, but the data lives in the H2O cluster, not in client memory. An H2OFrame object in Python or R is a lightweight handle to that remote data.
Data is not held in the Python or R process. The H2OFrame object contains a reference (key) to the data stored in the cluster’s distributed key-value store (DKV).
Vec
A Vec is a single distributed column. Conceptually it is a database column, but the data is split into Chunks and spread across nodes. All Vecs in a Frame share a VectorGroup, which guarantees that same-numbered chunks across all columns cover the same row ranges.
You generally do not interact with Vec directly in Python or R — it is an internal Java class (water.fvec.Vec).
Chunk
A Chunk is a contiguous block of rows within a single Vec, stored entirely on one node. Chunks typically hold between 1,000 and 1,000,000 rows. MRTask computations operate on one Chunk at a time, on the node where the Chunk lives, avoiding network data movement.
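The hierarchy above can be sketched in plain Python. This is a toy model, not H2O-3 code: the `Chunk` dataclass, `split_into_chunks`, and `map_reduce` are hypothetical names, and a fixed chunk size stands in for the shared row layout that a VectorGroup provides. It shows two columns cut at identical row boundaries and a column sum computed chunk by chunk, in the spirit of an MRTask.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """A contiguous block of rows from one column (toy stand-in, not water.fvec.Chunk)."""
    start_row: int        # global index of the first row in this chunk
    values: List[float]   # the rows held by this chunk


def split_into_chunks(column, chunk_rows):
    """Cut a column into chunks of at most chunk_rows rows."""
    return [
        Chunk(start, column[start:start + chunk_rows])
        for start in range(0, len(column), chunk_rows)
    ]


def map_reduce(chunks, map_fn, reduce_fn):
    """Apply map_fn to each chunk independently, then fold the partial results."""
    partials = [map_fn(c) for c in chunks]   # each call touches exactly one chunk
    result = partials[0]
    for partial in partials[1:]:
        result = reduce_fn(result, partial)
    return result


# Two columns of one frame, cut with the same chunk size so their chunks align.
age = [float(v) for v in range(10)]          # 0.0 .. 9.0
income = [v * 100.0 for v in range(10)]      # 0.0 .. 900.0
age_chunks = split_into_chunks(age, chunk_rows=4)
income_chunks = split_into_chunks(income, chunk_rows=4)

# Same-numbered chunks cover the same row range in both columns.
aligned = all(
    a.start_row == b.start_row and len(a.values) == len(b.values)
    for a, b in zip(age_chunks, income_chunks)
)

# Column sum computed as per-chunk partial sums followed by a reduce.
total_age = map_reduce(age_chunks, lambda c: sum(c.values), lambda x, y: x + y)
```

Because each `map_fn` call sees only one chunk, the same pattern can run on whichever node holds that chunk, and only the small partial results need to travel.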
Supported column types
H2O-3 supports the following column types:

| Type | Description | Python alias | R alias |
|---|---|---|---|
| numeric | 64-bit floating point (covers int and real) | "numeric" | numeric |
| categorical / enum | Factor with an internal integer encoding | "enum" or "factor" | factor |
| string | Variable-length text | "string" | character |
| time | Unix timestamp in milliseconds | "time" | POSIXct |
| uuid | 128-bit identifier | "uuid" | — |
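To make two of the rows above concrete, here is a minimal sketch of what "factor with an internal integer encoding" and "Unix timestamp in milliseconds" mean. The helpers `encode_categorical` and `to_epoch_millis` are hypothetical and are not part of the H2O-3 API; ordering the category labels alphabetically is an assumption of this sketch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_categorical(values):
    """Map each distinct label to an integer code; return (domain, codes)."""
    domain = sorted(set(values))   # distinct labels (sorted order is an assumption)
    index = {label: code for code, label in enumerate(domain)}
    return domain, [index[v] for v in values]


def to_epoch_millis(dt):
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


domain, codes = encode_categorical(["red", "green", "red", "blue"])
millis = to_epoch_millis(datetime(2021, 1, 1, tzinfo=timezone.utc))
```

Storing the small integer codes plus one copy of the label list is what lets a categorical column stay compact even when labels repeat many times.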
How data is distributed
When you import a file or create a Frame, H2O-3 divides the data into Chunks and distributes them across nodes. The distribution is determined by consistent hashing of the Vec’s Key. The chunk size is chosen automatically based on available memory and cluster size.
Each node holds a roughly equal share of each column. Because chunk alignment is guaranteed by the VectorGroup, processing a row requires reading the same-indexed chunk from each Vec — all on the same node.
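The co-location property described above can be illustrated with a toy placement function. This is not H2O-3's actual scheme: it uses a plain hash-modulo rule rather than consistent hashing, and `home_node`, `placement`, and the `group_id` string are hypothetical. The only point it demonstrates is that when placement depends on the shared group and the chunk index, and not on the column, same-indexed chunks of every column land on the same node.

```python
import hashlib


def home_node(group_id, chunk_index, n_nodes):
    """Pick a node from a hash of the group id and chunk index (toy scheme)."""
    digest = hashlib.sha256(f"{group_id}:{chunk_index}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_nodes


def placement(columns, group_id, n_chunks, n_nodes):
    """Return {column: [node for each chunk]} for columns that share one group."""
    return {
        col: [home_node(group_id, i, n_nodes) for i in range(n_chunks)]
        for col in columns
    }


layout = placement(["age", "income", "zip"], group_id="frame-1", n_chunks=8, n_nodes=4)

# Every column maps chunk i to the same node, so a row can be read locally.
colocated = layout["age"] == layout["income"] == layout["zip"]
```

A real consistent-hashing scheme would additionally limit how many chunks move when nodes join or leave; the modulo rule here does not have that property.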
Key operations
Dimensions and structure
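A minimal sketch of reading a frame's dimensions from the Python client. The helper `frame_overview` is hypothetical; it relies only on the `nrow`, `ncol`, and `names` attributes of an H2OFrame. So that the sketch runs without a cluster, it is exercised on a stand-in object with the same attribute names and made-up values.

```python
from types import SimpleNamespace


def frame_overview(frame):
    """Collect basic structural facts from an H2OFrame-like object."""
    return {
        "rows": frame.nrow,          # number of rows
        "cols": frame.ncol,          # number of columns
        "names": list(frame.names),  # column names, in order
    }


# Stand-in with the same attribute names (illustrative values only).
# With a live cluster, pass a real frame instead, e.g. one returned by h2o.import_file(...).
demo = SimpleNamespace(nrow=150, ncol=3, names=["sepal_len", "sepal_wid", "class"])
overview = frame_overview(demo)
```

On a real H2OFrame these attributes are answered by the cluster, so reading them does not pull the data into the client process.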
Summary statistics
Column types
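A minimal sketch of inspecting column types from the Python client, assuming the frame exposes a `types` attribute that maps each column name to a type string, as H2OFrame does. The helper `columns_by_type` is hypothetical, and the stand-in object and its type strings are illustrative so the sketch runs without a cluster.

```python
from collections import defaultdict
from types import SimpleNamespace


def columns_by_type(frame):
    """Group column names by the type string reported in frame.types."""
    grouped = defaultdict(list)
    for name, col_type in frame.types.items():
        grouped[col_type].append(name)
    return dict(grouped)


# Stand-in with a types mapping shaped like the one on an H2OFrame (illustrative values).
demo = SimpleNamespace(types={"sepal_len": "real", "sepal_wid": "real", "class": "enum"})
by_type = columns_by_type(demo)
```

Grouping like this is a quick way to spot a column that was parsed as a string or number when a categorical was intended.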
Converting to and from pandas / R data.frame