Enums
MessageType
Value: "start_of_episode". Marks the beginning of an episode.
Value: "end_of_episode". Marks the end of an episode.
Value: "episode_step". Contains VLAStepData for a single timestep.
Value: "image". Contains image data.
Value: "text". Contains text/language data.
ActionRepresentation
Value: "relative". Actions relative to current state.
Value: "delta". Change in state.
Value: "absolute". Absolute target state.
ActionType
Value: "eef". End-effector actions (Cartesian space).
Value: "non_eef". Non-end-effector actions (joint space).
ActionFormat
Value: "default". Default action format.
Value: "xyz+rot6d". 3D position + 6D rotation representation.
Value: "xyz+rotvec". 3D position + rotation vector.
Data structures
VLAStepData
Dictionary mapping view names to lists of image arrays for temporal stacking.
Example: {"front_cam": [np.ndarray], "wrist_cam": [np.ndarray]}
Dictionary mapping state names to numpy arrays. For single step: shape (dim,). For trajectory: shape (horizon, dim).
Example: {"joint_positions": np.ndarray, "gripper_state": np.ndarray}
Dictionary mapping action names to numpy arrays with shape (horizon, dim) for action chunking.
Example: {"joint_velocities": np.ndarray}
Optional task description or instruction.
Embodiment tag for cross-embodiment training.
Whether the step is a demonstration. If True, no loss should be computed for this step.
Flexible metadata that can be extended by users.
Usage example
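A minimal sketch of assembling one step, following the field descriptions above. The class here is a local stand-in, not the library's own definition, and the field names (`images`, `states`, `actions`, `text`, `embodiment`, `is_demonstration`, `metadata`), the embodiment tag value, the image size, and the 7-joint arm are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

import numpy as np


@dataclass
class VLAStepData:
    """Local stand-in; field names are assumed, not confirmed library API."""

    images: dict[str, list[np.ndarray]]   # view name -> list of frames
    states: dict[str, np.ndarray]         # state name -> (dim,) or (horizon, dim)
    actions: dict[str, np.ndarray]        # action name -> (horizon, dim)
    text: Optional[str] = None            # task description or instruction
    embodiment: str = "new_embodiment"    # hypothetical tag value
    is_demonstration: bool = False        # if True, no loss for this step
    metadata: dict[str, Any] = field(default_factory=dict)


horizon = 16  # length of the action chunk (assumed)

step = VLAStepData(
    images={
        "front_cam": [np.zeros((224, 224, 3), dtype=np.uint8)],
        "wrist_cam": [np.zeros((224, 224, 3), dtype=np.uint8)],
    },
    states={
        "joint_positions": np.zeros(7, dtype=np.float32),
        "gripper_state": np.zeros(1, dtype=np.float32),
    },
    actions={"joint_velocities": np.zeros((horizon, 7), dtype=np.float32)},
    text="pick up the red block",
)
```

Each image list holds one frame here; a longer list would carry the stacked history for that view.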
ActionConfig
Action representation type (relative, delta, or absolute).
Action type (end-effector or non-end-effector).
Action format (default, xyz+rot6d, or xyz+rotvec).
Optional state key for computing relative actions.
Usage example
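A minimal sketch of how the three action enums and an ActionConfig might fit together. The string values come from the Enums section; the enum member names, the field names (`rep`, `type`, `format`, `state_key`), and the classes themselves are local stand-ins assumed for illustration, not the library's confirmed definitions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionRepresentation(Enum):
    RELATIVE = "relative"
    DELTA = "delta"
    ABSOLUTE = "absolute"


class ActionType(Enum):
    EEF = "eef"
    NON_EEF = "non_eef"


class ActionFormat(Enum):
    DEFAULT = "default"
    XYZ_ROT6D = "xyz+rot6d"
    XYZ_ROTVEC = "xyz+rotvec"


@dataclass
class ActionConfig:
    """Local stand-in; field names are assumed."""

    rep: ActionRepresentation
    type: ActionType
    format: ActionFormat
    state_key: Optional[str] = None  # state used as reference for relative actions


# Joint-space actions expressed relative to the current joint positions.
arm_config = ActionConfig(
    rep=ActionRepresentation.RELATIVE,
    type=ActionType.NON_EEF,
    format=ActionFormat.DEFAULT,
    state_key="joint_positions",
)

# Absolute end-effector targets as position plus 6D rotation; no state key needed.
eef_config = ActionConfig(
    rep=ActionRepresentation.ABSOLUTE,
    type=ActionType.EEF,
    format=ActionFormat.XYZ_ROT6D,
)
```

Because the enums are built on string values, a value read from a config file can be turned back into a member with a lookup such as `ActionFormat("xyz+rot6d")`.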
ModalityConfig
Delta indices to sample relative to the current index. The returned data will correspond to the original data at a sampled base index + delta indices.
Example: [0] for single timestep, [-1, 0] for current and previous, list(range(16)) for 16-step action chunk.
The keys to load for the modality in the dataset.
Example: ["front_cam", "wrist_cam"] for video, ["joint_positions", "gripper_state"] for state.
Optional list of keys to apply sin/cos encoding. If None or empty, use min/max normalization for all keys.
Optional list of keys to apply mean/std normalization. If None or empty, use min/max normalization for all keys.
Optional list of ActionConfig objects, one per modality key. Must match the number of modality_keys if provided.
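The delta-index sampling described for ModalityConfig can be sketched as follows. The helper name `sample_with_delta_indices` is hypothetical, and clamping out-of-range indices to the episode boundaries is an assumption; the source does not say how the library handles indices that fall before the first or after the last timestep.

```python
def sample_with_delta_indices(data, base_index, delta_indices):
    """Return data[base_index + d] for each d in delta_indices.

    Out-of-range indices are clamped to the first/last element
    (an assumed policy, not confirmed by the source).
    """
    last = len(data) - 1
    return [data[min(max(base_index + d, 0), last)] for d in delta_indices]


# Dummy episode with 20 timesteps; values 100..119 stand in for real data.
episode = list(range(100, 120))

single = sample_with_delta_indices(episode, 5, [0])             # [105]
history = sample_with_delta_indices(episode, 5, [-1, 0])        # [104, 105]
at_start = sample_with_delta_indices(episode, 0, [-1, 0])       # [100, 100]
chunk = sample_with_delta_indices(episode, 5, list(range(16)))  # 16 entries
```

In `chunk`, the base index plus the largest delta runs one step past the end of the episode, so under the clamping assumption the final entry repeats the last timestep.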