meta/modality.json) and the model’s data processing pipeline.
Configuration structure
A modality configuration is a Python dictionary containing four top-level keys:
- Video: Defines which camera views to use and how to sample video frames temporally.
- State: Defines proprioceptive observations (joint positions, gripper states, etc.) and normalization.
- Action: Defines the action space, prediction horizon, and action representation format.
- Language: Defines which language annotations to use for task conditioning.
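These four keys can be sketched as a dictionary skeleton. The lowercase key spellings below are an assumption based on the descriptions above:

```python
# Skeleton of a modality configuration dictionary.
# The key names mirror the four modalities described above;
# exact spelling is assumed, so treat this as illustrative.
modality_config = {
    "video": None,     # camera views and temporal frame sampling
    "state": None,     # proprioceptive observations and normalization
    "action": None,    # action space, horizon, and representation
    "language": None,  # language annotations for task conditioning
}
```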
ModalityConfig class
Each modality is configured using the ModalityConfig dataclass from gr00t/data/types.py:69-103:
Required fields
delta_indices
Defines which temporal offsets to sample relative to the current timestep. Examples:
- [0] - Current frame only
- [-2, -1, 0] - Last 3 frames for temporal stacking
- list(range(0, 16)) - 16-step action prediction horizon
modality_keys
Specifies which keys to load from your dataset. These keys must match the keys defined in your meta/modality.json file. Examples:
- Video: ["front", "wrist"]
- State: ["single_arm", "gripper"]
- Action: ["single_arm", "gripper"]
- Language: ["annotation.human.action.task_description"]
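Combining the two required fields, instantiating a config might look like the sketch below. A minimal stand-in dataclass is defined so the example is self-contained; the real class lives in gr00t/data/types.py:

```python
from dataclasses import dataclass

@dataclass
class ModalityConfig:  # stand-in mirroring the fields described above
    delta_indices: list  # temporal offsets relative to the current timestep
    modality_keys: list  # keys to load; must match meta/modality.json

# Current-frame proprioceptive observation over two keys.
state_config = ModalityConfig(
    delta_indices=[0],
    modality_keys=["single_arm", "gripper"],
)

# 16-step action prediction horizon over the same keys.
action_config = ModalityConfig(
    delta_indices=list(range(0, 16)),
    modality_keys=["single_arm", "gripper"],
)
```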
Optional fields
Specifies which state keys should use sine/cosine encoding. Best for dimensions measured in radians (e.g., joint angles). Each encoded dimension is duplicated (one sine and one cosine component), doubling the dimensionality, so this is only recommended for proprioceptive states.
Specifies which keys should use mean/standard deviation normalization instead of min-max normalization.
Required for the action modality. Defines how each action modality should be interpreted and transformed. The list must have the same length as modality_keys.
Action configuration
The ActionConfig class (gr00t/data/types.py:61-65) defines how actions should be interpreted:
ActionConfig fields
rep
Defines how actions should be interpreted:
- RELATIVE - Actions are deltas from the current state
- DELTA - Alternative name for relative actions
- ABSOLUTE - Actions are target positions
Specifies the control space:
- EEF - End-effector/Cartesian-space control (expects a 9-dimensional vector: x, y, z position + 6D rotation)
- NON_EEF - Joint-space control and other non-EEF control spaces
Defines the action representation format:
- DEFAULT - Standard format (e.g., joint angles, gripper positions)
- XYZ_ROT6D - 3D position + 6D rotation representation for end-effector control
- XYZ_ROTVEC - 3D position + rotation vector for end-effector control
Specifies the corresponding reference state key for computing relative actions when rep=RELATIVE. If not provided, the system uses the action key as the reference state key.
Complete example: SO-100
Here’s a complete configuration example for the SO-100 robot:
Configuring each modality
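The sections below walk through each modality in turn; as an overview, a complete SO-100-style configuration might look like this sketch (stand-in ModalityConfig dataclass; camera key names are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class ModalityConfig:  # stand-in for the class in gr00t/data/types.py
    delta_indices: list
    modality_keys: list

# SO-100-style configuration; camera key names are illustrative.
so100_modality_config = {
    "video": ModalityConfig(
        delta_indices=[0],
        modality_keys=["front", "wrist"],
    ),
    "state": ModalityConfig(
        delta_indices=[0],
        modality_keys=["single_arm", "gripper"],
    ),
    "action": ModalityConfig(
        delta_indices=list(range(0, 16)),  # 16-step prediction horizon
        modality_keys=["single_arm", "gripper"],
    ),
    "language": ModalityConfig(
        delta_indices=[0],
        modality_keys=["annotation.human.action.task_description"],
    ),
}
```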
Video modality
Defines which camera views to use. For multiple cameras, list every camera key in modality_keys.
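A sketch of single- and multi-camera video configs, using a stand-in ModalityConfig dataclass (camera key names such as "front" and "wrist" must match your meta/modality.json):

```python
from dataclasses import dataclass

@dataclass
class ModalityConfig:  # stand-in for the class in gr00t/data/types.py
    delta_indices: list
    modality_keys: list

# Single camera, current frame only.
video_single = ModalityConfig(delta_indices=[0], modality_keys=["front"])

# Multiple cameras: list every view key defined in meta/modality.json.
video_multi = ModalityConfig(delta_indices=[0], modality_keys=["front", "wrist"])
```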
State modality
Defines proprioceptive observations; sin/cos encoding can be enabled for joint-angle keys via the optional fields described above.
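A sketch of a state config with sin/cos encoding enabled for the arm joints. The optional field name sin_cos_keys is hypothetical (the source describes the behavior but not the field name); only delta_indices and modality_keys are documented above:

```python
from dataclasses import dataclass, field

@dataclass
class ModalityConfig:  # stand-in; sin_cos_keys is a HYPOTHETICAL field name
    delta_indices: list
    modality_keys: list
    sin_cos_keys: list = field(default_factory=list)

state_config = ModalityConfig(
    delta_indices=[0],
    modality_keys=["single_arm", "gripper"],
    sin_cos_keys=["single_arm"],  # joint angles in radians suit sin/cos encoding
)
```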
Action modality
Defines the action space and prediction horizon:
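A sketch of an action config with a 16-step horizon and one per-key ActionConfig. The ActionRep enum and the action_configs field name are hypothetical stand-ins; only the rep field and its values (RELATIVE, DELTA, ABSOLUTE) are documented above:

```python
from dataclasses import dataclass, field
from enum import Enum

class ActionRep(Enum):  # enum name is hypothetical; values are from the text
    RELATIVE = "relative"
    DELTA = "delta"
    ABSOLUTE = "absolute"

@dataclass
class ActionConfig:  # stand-in; only the documented `rep` field is shown
    rep: ActionRep

@dataclass
class ModalityConfig:  # stand-in; `action_configs` field name is hypothetical
    delta_indices: list
    modality_keys: list
    action_configs: list = field(default_factory=list)

# 16-step horizon; the action_configs list must match modality_keys in length.
action_config = ModalityConfig(
    delta_indices=list(range(0, 16)),
    modality_keys=["single_arm", "gripper"],
    action_configs=[
        ActionConfig(rep=ActionRep.ABSOLUTE),  # arm joints as target positions
        ActionConfig(rep=ActionRep.ABSOLUTE),  # gripper as target position
    ],
)
```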
Language modality
Defines which language annotations to use:
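A sketch of the language config, loading the current annotation only (stand-in ModalityConfig dataclass as before; the annotation key must exist in your dataset):

```python
from dataclasses import dataclass

@dataclass
class ModalityConfig:  # stand-in for the class in gr00t/data/types.py
    delta_indices: list
    modality_keys: list

language_config = ModalityConfig(
    delta_indices=[0],  # the instruction for the current timestep
    modality_keys=["annotation.human.action.task_description"],
)
```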
Relationship with meta/modality.json
The modality configuration's modality_keys must reference keys that exist in your dataset's meta/modality.json. During loading, the pipeline will:
- Use modality_keys to look up the corresponding entries in meta/modality.json
- Extract the correct slices from the concatenated state/action arrays
- Apply the specified transformations (normalization, action representation conversion)
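The lookup-and-slice step can be illustrated with a toy example. The start/end layout of meta/modality.json entries shown here is an assumption for illustration:

```python
# Toy "state" entries from meta/modality.json (start/end layout assumed).
state_layout = {
    "single_arm": {"start": 0, "end": 5},
    "gripper": {"start": 5, "end": 6},
}

# One concatenated 6-dimensional state frame.
concatenated_state = [0.1, 0.2, 0.3, 0.4, 0.5, 1.0]

# Extract the slice for each requested modality key.
slices = {
    key: concatenated_state[spec["start"]:spec["end"]]
    for key, spec in state_layout.items()
}
print(slices["gripper"])  # prints [1.0]
```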
Registering your configuration
After defining your configuration, register it for use in training and inference by passing its path via the modality_config_path argument when running the finetuning script.
Next steps
- Data format: Understand the LeRobot v2 data format
- Fine-tuning guide: Start fine-tuning with your config
- Embodiment tags: Learn about supported robots