The data types module defines core data structures, enums, and configurations used throughout the GR00T data pipeline.

Enums

MessageType

from gr00t.data.types import MessageType
Defines the type of message in the VLA data pipeline.
START_OF_EPISODE (str): Value "start_of_episode". Marks the beginning of an episode.
END_OF_EPISODE (str): Value "end_of_episode". Marks the end of an episode.
EPISODE_STEP (str): Value "episode_step". Contains VLAStepData for a single timestep.
IMAGE (str): Value "image". Contains image data.
TEXT (str): Value "text". Contains text/language data.
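Since each member is documented as a str with a string value, MessageType is presumably a str-valued enum. A minimal stand-in (not the library source) illustrates the comparison semantics that follow from that assumption:

```python
from enum import Enum

# Hypothetical stand-in mirroring the documented values; the real class
# lives in gr00t.data.types and may differ in implementation detail.
class MessageType(str, Enum):
    START_OF_EPISODE = "start_of_episode"
    END_OF_EPISODE = "end_of_episode"
    EPISODE_STEP = "episode_step"
    IMAGE = "image"
    TEXT = "text"

# str-valued enum members compare equal to their raw strings, so pipeline
# code can dispatch on either the enum member or the plain string.
assert MessageType.EPISODE_STEP == "episode_step"
assert MessageType("text") is MessageType.TEXT
```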

ActionRepresentation

from gr00t.data.types import ActionRepresentation
Defines how actions are represented.
RELATIVE (str): Value "relative". Actions relative to current state.
DELTA (str): Value "delta". Change in state.
ABSOLUTE (str): Value "absolute". Absolute target state.
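The three representations can be sketched on a short trajectory. In one common convention (the library's exact semantics may differ), relative actions are offsets from the state at the start of the chunk, while delta actions are per-step changes:

```python
import numpy as np

# A 4-step joint-position trajectory (horizon=4, dim=2)
traj = np.array([[0.0, 0.0],
                 [0.1, 0.0],
                 [0.2, 0.1],
                 [0.3, 0.1]])

absolute = traj            # ABSOLUTE: the target states themselves
relative = traj - traj[0]  # RELATIVE: offsets from the chunk-start state
# DELTA: change from one step to the next (first step diffs against itself)
delta = np.diff(traj, axis=0, prepend=traj[:1])
```

Note that cumulatively summing the delta actions recovers the relative ones, which is why the two are interchangeable in principle but normalize very differently in practice.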

ActionType

from gr00t.data.types import ActionType
Defines whether actions are end-effector or non-end-effector.
EEF (str): Value "eef". End-effector actions (Cartesian space).
NON_EEF (str): Value "non_eef". Non-end-effector actions (joint space).

ActionFormat

from gr00t.data.types import ActionFormat
Defines the format of action representation.
DEFAULT (str): Value "default". Default action format.
XYZ_ROT6D (str): Value "xyz+rot6d". 3D position + 6D rotation representation.
XYZ_ROTVEC (str): Value "xyz+rotvec". 3D position + rotation vector.
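xyz+rot6d pairs a 3D position with the continuous 6D rotation parameterization (commonly the first two columns of the rotation matrix), for a 9D action; xyz+rotvec pairs position with a 3D axis-angle vector, for a 6D action. A sketch of one plausible encoding, assuming column-major ordering (the library's convention may differ):

```python
import numpy as np

def to_xyz_rot6d(position: np.ndarray, rotation_matrix: np.ndarray) -> np.ndarray:
    """Encode a pose as 3D position + 6D rotation (first two matrix columns)."""
    rot6d = rotation_matrix[:, :2].flatten(order="F")  # column-major, shape (6,)
    return np.concatenate([position, rot6d])           # shape (9,)

# Identity rotation encodes to the first two basis vectors: [1,0,0, 0,1,0]
pose = to_xyz_rot6d(np.zeros(3), np.eye(3))
```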

Data structures

VLAStepData

from gr00t.data.types import VLAStepData
Represents a single step of VLA (Vision-Language-Action) data. This is the core data structure returned by datasets, containing raw observation and action data that will be processed by the SequenceVLAProcessor.
images (dict[str, list[np.ndarray]], required): Dictionary mapping view names to lists of image arrays for temporal stacking. Example: {"front_cam": [np.ndarray], "wrist_cam": [np.ndarray]}
states (dict[str, np.ndarray], required): Dictionary mapping state names to numpy arrays. For a single step: shape (dim,). For a trajectory: shape (horizon, dim). Example: {"joint_positions": np.ndarray, "gripper_state": np.ndarray}
actions (dict[str, np.ndarray], required): Dictionary mapping action names to numpy arrays with shape (horizon, dim) for action chunking. Example: {"joint_velocities": np.ndarray}
text (str | None, default: None): Optional task description or instruction.
embodiment (EmbodimentTag, default: EmbodimentTag.NEW_EMBODIMENT): Embodiment tag for cross-embodiment training.
is_demonstration (bool, default: False): Whether the step is a demonstration. If True, no loss should be computed for this step.
metadata (dict[str, Any], default: {}): Flexible metadata that can be extended by users.

Usage example

from gr00t.data.types import VLAStepData
from gr00t.data.embodiment_tags import EmbodimentTag
import numpy as np

# Create a VLA step data instance
vla_step = VLAStepData(
    images={
        "front_cam": [np.random.rand(224, 224, 3)],
        "wrist_cam": [np.random.rand(224, 224, 3)],
    },
    states={
        "joint_positions": np.random.rand(7),
        "gripper_state": np.random.rand(1),
    },
    actions={
        "joint_velocities": np.random.rand(16, 7),  # 16-step action chunk
    },
    text="Pick up the apple",
    embodiment=EmbodimentTag.UNITREE_G1,
)

ActionConfig

from gr00t.data.types import ActionConfig
Configuration for action representation and control.
rep (ActionRepresentation, required): Action representation type (relative, delta, or absolute).
type (ActionType, required): Action type (end-effector or non-end-effector).
format (ActionFormat, required): Action format (default, xyz+rot6d, or xyz+rotvec).
state_key (str | None, default: None): Optional state key for computing relative actions.

Usage example

from gr00t.data.types import (
    ActionConfig,
    ActionRepresentation,
    ActionType,
    ActionFormat,
)

# Configure relative joint actions
action_config = ActionConfig(
    rep=ActionRepresentation.RELATIVE,
    type=ActionType.NON_EEF,
    format=ActionFormat.DEFAULT,
)

# Configure absolute end-effector actions
eef_config = ActionConfig(
    rep=ActionRepresentation.ABSOLUTE,
    type=ActionType.EEF,
    format=ActionFormat.XYZ_ROT6D,
    state_key="ee_pose",
)

ModalityConfig

from gr00t.data.types import ModalityConfig
Configuration for a modality defining how data should be sampled and loaded. This class specifies which indices to sample relative to a base index and which keys to load for a particular modality (e.g., video, state, action).
delta_indices (list[int], required): Delta indices to sample relative to the current index. The returned data corresponds to the original data at a sampled base index + delta indices. Example: [0] for a single timestep, [-1, 0] for the previous and current steps, list(range(16)) for a 16-step action chunk.
modality_keys (list[str], required): The keys to load for the modality in the dataset. Example: ["front_cam", "wrist_cam"] for video, ["joint_positions", "gripper_state"] for state.
sin_cos_embedding_keys (list[str] | None, default: None): Optional list of keys to apply sin/cos encoding to. If None or empty, min/max normalization is used for all keys.
mean_std_embedding_keys (list[str] | None, default: None): Optional list of keys to apply mean/std normalization to. If None or empty, min/max normalization is used for all keys.
action_configs (list[ActionConfig] | None, default: None): Optional list of ActionConfig objects, one per modality key. Must match the number of modality_keys if provided.

Usage example

from gr00t.data.types import (
    ModalityConfig,
    ActionConfig,
    ActionRepresentation,
    ActionType,
    ActionFormat,
)

# Video modality: single timestep, multiple cameras
video_config = ModalityConfig(
    delta_indices=[0],
    modality_keys=["front_cam", "wrist_cam"],
)

# State modality: single timestep, multiple state components
state_config = ModalityConfig(
    delta_indices=[0],
    modality_keys=["left_arm", "right_arm", "gripper"],
)

# Action modality: 30-step chunk with action configs
action_config = ModalityConfig(
    delta_indices=list(range(30)),
    modality_keys=["left_arm", "right_arm", "gripper"],
    action_configs=[
        ActionConfig(
            rep=ActionRepresentation.RELATIVE,
            type=ActionType.NON_EEF,
            format=ActionFormat.DEFAULT,
        ),
        ActionConfig(
            rep=ActionRepresentation.RELATIVE,
            type=ActionType.NON_EEF,
            format=ActionFormat.DEFAULT,
        ),
        ActionConfig(
            rep=ActionRepresentation.ABSOLUTE,
            type=ActionType.NON_EEF,
            format=ActionFormat.DEFAULT,
        ),
    ],
)
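How delta_indices translate to loaded frames can be sketched directly: given a sampled base index t, the data returned corresponds to t + d for each delta index d. A simplified illustration (the actual loader may additionally handle episode boundaries):

```python
# Simplified sketch of delta-index sampling; not the library's loader.
def resolve_indices(base_index: int, delta_indices: list[int]) -> list[int]:
    """Return the frame indices sampled for one base index."""
    return [base_index + d for d in delta_indices]

resolve_indices(10, [0])              # single timestep -> [10]
resolve_indices(10, [-1, 0])          # previous and current -> [9, 10]
resolve_indices(10, list(range(30)))  # 30-step action chunk -> [10, ..., 39]
```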
