The data types module defines core data structures, enums, and configurations used throughout the GR00T data pipeline.

Enums

MessageType

from gr00t.data.types import MessageType
Defines the type of message in the VLA data pipeline.
START_OF_EPISODE (str): Value "start_of_episode". Marks the beginning of an episode.
END_OF_EPISODE (str): Value "end_of_episode". Marks the end of an episode.
EPISODE_STEP (str): Value "episode_step". Contains VLAStepData for a single timestep.
IMAGE (str): Value "image". Contains image data.
TEXT (str): Value "text". Contains text/language data.
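Since each member is documented as a str with a string value, MessageType is presumably a str-valued enum. A minimal stand-in (not the library source) illustrates the comparison semantics that follow from that assumption:

```python
from enum import Enum

# Hypothetical stand-in mirroring the documented values; the real class
# lives in gr00t.data.types and may differ in implementation detail.
class MessageType(str, Enum):
    START_OF_EPISODE = "start_of_episode"
    END_OF_EPISODE = "end_of_episode"
    EPISODE_STEP = "episode_step"
    IMAGE = "image"
    TEXT = "text"

# str-valued enum members compare equal to their raw strings, so pipeline
# code can dispatch on either the enum member or the plain string.
assert MessageType.EPISODE_STEP == "episode_step"
assert MessageType("text") is MessageType.TEXT
```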

ActionRepresentation

from gr00t.data.types import ActionRepresentation
Defines how actions are represented.
RELATIVE (str): Value "relative". Actions relative to current state.
DELTA (str): Value "delta". Change in state.
ABSOLUTE (str): Value "absolute". Absolute target state.
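The three representations can be sketched on a short trajectory. In one common convention (the library's exact semantics may differ), relative actions are offsets from the state at the start of the chunk, while delta actions are per-step changes:

```python
import numpy as np

# A 4-step joint-position trajectory (horizon=4, dim=2)
traj = np.array([[0.0, 0.0],
                 [0.1, 0.0],
                 [0.2, 0.1],
                 [0.3, 0.1]])

absolute = traj            # ABSOLUTE: the target states themselves
relative = traj - traj[0]  # RELATIVE: offsets from the chunk-start state
# DELTA: change from one step to the next (first step diffs against itself)
delta = np.diff(traj, axis=0, prepend=traj[:1])
```

Note that cumulatively summing the delta actions recovers the relative ones, which is why the two are interchangeable in principle but normalize very differently in practice.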

ActionType

from gr00t.data.types import ActionType
Defines whether actions are end-effector or non-end-effector.
EEF (str): Value "eef". End-effector actions (Cartesian space).
NON_EEF (str): Value "non_eef". Non-end-effector actions (joint space).

ActionFormat

from gr00t.data.types import ActionFormat
Defines the format of action representation.
DEFAULT (str): Value "default". Default action format.
XYZ_ROT6D (str): Value "xyz+rot6d". 3D position + 6D rotation representation.
XYZ_ROTVEC (str): Value "xyz+rotvec". 3D position + rotation vector.
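xyz+rot6d pairs a 3D position with the continuous 6D rotation parameterization (commonly the first two columns of the rotation matrix), for a 9D action; xyz+rotvec pairs position with a 3D axis-angle vector, for a 6D action. A sketch of one plausible encoding, assuming column-major ordering (the library's convention may differ):

```python
import numpy as np

def to_xyz_rot6d(position: np.ndarray, rotation_matrix: np.ndarray) -> np.ndarray:
    """Encode a pose as 3D position + 6D rotation (first two matrix columns)."""
    rot6d = rotation_matrix[:, :2].flatten(order="F")  # column-major, shape (6,)
    return np.concatenate([position, rot6d])           # shape (9,)

# Identity rotation encodes to the first two basis vectors: [1,0,0, 0,1,0]
pose = to_xyz_rot6d(np.zeros(3), np.eye(3))
```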

Data structures

VLAStepData

from gr00t.data.types import VLAStepData
Represents a single step of VLA (Vision-Language-Action) data. This is the core data structure returned by datasets, containing raw observation and action data that will be processed by the SequenceVLAProcessor.
images (dict[str, list[np.ndarray]], required): Dictionary mapping view names to lists of image arrays for temporal stacking. Example: {"front_cam": [np.ndarray], "wrist_cam": [np.ndarray]}
states (dict[str, np.ndarray], required): Dictionary mapping state names to numpy arrays. For a single step: shape (dim,). For a trajectory: shape (horizon, dim). Example: {"joint_positions": np.ndarray, "gripper_state": np.ndarray}
actions (dict[str, np.ndarray], required): Dictionary mapping action names to numpy arrays with shape (horizon, dim) for action chunking. Example: {"joint_velocities": np.ndarray}
text (str | None, default: None): Optional task description or instruction.
embodiment (EmbodimentTag, default: EmbodimentTag.NEW_EMBODIMENT): Embodiment tag for cross-embodiment training.
is_demonstration (bool, default: False): Whether the step is a demonstration. If True, no loss should be computed for this step.
metadata (dict[str, Any], default: {}): Flexible metadata that can be extended by users.

Usage example

from gr00t.data.types import VLAStepData
from gr00t.data.embodiment_tags import EmbodimentTag
import numpy as np

# Create a VLA step data instance
vla_step = VLAStepData(
    images={
        "front_cam": [np.random.rand(224, 224, 3)],
        "wrist_cam": [np.random.rand(224, 224, 3)],
    },
    states={
        "joint_positions": np.random.rand(7),
        "gripper_state": np.random.rand(1),
    },
    actions={
        "joint_velocities": np.random.rand(16, 7),  # 16-step action chunk
    },
    text="Pick up the apple",
    embodiment=EmbodimentTag.UNITREE_G1,
)

ActionConfig

from gr00t.data.types import ActionConfig
Configuration for action representation and control.
rep (ActionRepresentation, required): Action representation type (relative, delta, or absolute).
type (ActionType, required): Action type (end-effector or non-end-effector).
format (ActionFormat, required): Action format (default, xyz+rot6d, or xyz+rotvec).
state_key (str | None, default: None): Optional state key for computing relative actions.

Usage example

from gr00t.data.types import (
    ActionConfig,
    ActionRepresentation,
    ActionType,
    ActionFormat,
)

# Configure relative joint actions
action_config = ActionConfig(
    rep=ActionRepresentation.RELATIVE,
    type=ActionType.NON_EEF,
    format=ActionFormat.DEFAULT,
)

# Configure absolute end-effector actions
eef_config = ActionConfig(
    rep=ActionRepresentation.ABSOLUTE,
    type=ActionType.EEF,
    format=ActionFormat.XYZ_ROT6D,
    state_key="ee_pose",
)

ModalityConfig

from gr00t.data.types import ModalityConfig
Configuration for a modality defining how data should be sampled and loaded. This class specifies which indices to sample relative to a base index and which keys to load for a particular modality (e.g., video, state, action).
delta_indices (list[int], required): Delta indices to sample relative to the current index. The returned data corresponds to the original data at a sampled base index + delta indices. Example: [0] for a single timestep, [-1, 0] for the previous and current steps, list(range(16)) for a 16-step action chunk.
modality_keys (list[str], required): The keys to load for the modality in the dataset. Example: ["front_cam", "wrist_cam"] for video, ["joint_positions", "gripper_state"] for state.
sin_cos_embedding_keys (list[str] | None, default: None): Optional list of keys to apply sin/cos encoding to. If None or empty, min/max normalization is used for all keys.
mean_std_embedding_keys (list[str] | None, default: None): Optional list of keys to apply mean/std normalization to. If None or empty, min/max normalization is used for all keys.
action_configs (list[ActionConfig] | None, default: None): Optional list of ActionConfig objects, one per modality key. Must match the number of modality_keys if provided.

Usage example

from gr00t.data.types import (
    ModalityConfig,
    ActionConfig,
    ActionRepresentation,
    ActionType,
    ActionFormat,
)

# Video modality: single timestep, multiple cameras
video_config = ModalityConfig(
    delta_indices=[0],
    modality_keys=["front_cam", "wrist_cam"],
)

# State modality: single timestep, multiple state components
state_config = ModalityConfig(
    delta_indices=[0],
    modality_keys=["left_arm", "right_arm", "gripper"],
)

# Action modality: 30-step chunk with action configs
action_config = ModalityConfig(
    delta_indices=list(range(30)),
    modality_keys=["left_arm", "right_arm", "gripper"],
    action_configs=[
        ActionConfig(
            rep=ActionRepresentation.RELATIVE,
            type=ActionType.NON_EEF,
            format=ActionFormat.DEFAULT,
        ),
        ActionConfig(
            rep=ActionRepresentation.RELATIVE,
            type=ActionType.NON_EEF,
            format=ActionFormat.DEFAULT,
        ),
        ActionConfig(
            rep=ActionRepresentation.ABSOLUTE,
            type=ActionType.NON_EEF,
            format=ActionFormat.DEFAULT,
        ),
    ],
)
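How delta_indices translate to loaded frames can be sketched directly: given a sampled base index t, the data returned corresponds to t + d for each delta index d. A simplified illustration (the actual loader may additionally handle episode boundaries):

```python
# Simplified sketch of delta-index sampling; not the library's loader.
def resolve_indices(base_index: int, delta_indices: list[int]) -> list[int]:
    """Return the frame indices sampled for one base index."""
    return [base_index + d for d in delta_indices]

resolve_indices(10, [0])              # single timestep -> [10]
resolve_indices(10, [-1, 0])          # previous and current -> [9, 10]
resolve_indices(10, list(range(30)))  # 30-step action chunk -> [10, ..., 39]
```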
