meta/modality.json) and the model’s data processing pipeline.
Configuration structure
A modality configuration is a Python dictionary containing four top-level keys:
- Video: Defines which camera views to use and how to sample video frames temporally.
- State: Defines proprioceptive observations (joint positions, gripper states, etc.) and normalization.
- Action: Defines the action space, prediction horizon, and action representation format.
- Language: Defines which language annotations to use for task conditioning.
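These four keys can be sketched as a dictionary skeleton. The lowercase key spellings below are an assumption based on the descriptions above:

```python
# Skeleton of a modality configuration dictionary.
# The key names mirror the four modalities described above;
# exact spelling is assumed, so treat this as illustrative.
modality_config = {
    "video": None,     # camera views and temporal frame sampling
    "state": None,     # proprioceptive observations and normalization
    "action": None,    # action space, horizon, and representation
    "language": None,  # language annotations for task conditioning
}
```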
ModalityConfig class
Each modality is configured using the ModalityConfig dataclass from gr00t/data/types.py:69-103:
Required fields
delta_indices
Defines which temporal offsets to sample relative to the current timestep. Examples:
- [0] - Current frame only
- [-2, -1, 0] - Last 3 frames for temporal stacking
- list(range(0, 16)) - 16-step action prediction horizon
modality_keys
Specifies which keys to load from your dataset. These keys must match the keys defined in your meta/modality.json file. Examples:
- Video: ["front", "wrist"]
- State: ["single_arm", "gripper"]
- Action: ["single_arm", "gripper"]
- Language: ["annotation.human.action.task_description"]
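Combining the two required fields, instantiating a config might look like the sketch below. A minimal stand-in dataclass is defined so the example is self-contained; the real class lives in gr00t/data/types.py:

```python
from dataclasses import dataclass

@dataclass
class ModalityConfig:  # stand-in mirroring the fields described above
    delta_indices: list  # temporal offsets relative to the current timestep
    modality_keys: list  # keys to load; must match meta/modality.json

# Current-frame proprioceptive observation over two keys.
state_config = ModalityConfig(
    delta_indices=[0],
    modality_keys=["single_arm", "gripper"],
)

# 16-step action prediction horizon over the same keys.
action_config = ModalityConfig(
    delta_indices=list(range(0, 16)),
    modality_keys=["single_arm", "gripper"],
)
```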
Optional fields
Specifies which state keys should use sine/cosine encoding. Best for dimensions measured in radians (e.g., joint angles). Each encoded dimension is duplicated (one sine and one cosine component), doubling the dimensionality, so this is only recommended for proprioceptive states.
Specifies which keys should use mean/standard deviation normalization instead of min-max normalization.
Required for the action modality. Defines how each action modality should be interpreted and transformed. The list must have the same length as modality_keys.
Action configuration
The ActionConfig class (gr00t/data/types.py:61-65) defines how actions should be interpreted:
ActionConfig fields
rep
Defines how actions should be interpreted:
- RELATIVE - Actions are deltas from the current state
- DELTA - Alternative name for relative actions
- ABSOLUTE - Actions are target positions
Specifies the control space:
- EEF - End-effector/Cartesian-space control (expects a 9-dimensional vector: x, y, z position + 6D rotation)
- NON_EEF - Joint-space control and other non-EEF control spaces
Defines the action representation format:
- DEFAULT - Standard format (e.g., joint angles, gripper positions)
- XYZ_ROT6D - 3D position + 6D rotation representation for end-effector control
- XYZ_ROTVEC - 3D position + rotation vector for end-effector control
Specifies the corresponding reference state key for computing relative actions when rep=RELATIVE. If not provided, the system uses the action key as the reference state key.
Complete example: SO-100
Here’s a complete configuration example for the SO-100 robot:
Configuring each modality
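The sections below walk through each modality in turn; as an overview, a complete SO-100-style configuration might look like this sketch (stand-in ModalityConfig dataclass; camera key names are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class ModalityConfig:  # stand-in for the class in gr00t/data/types.py
    delta_indices: list
    modality_keys: list

# SO-100-style configuration; camera key names are illustrative.
so100_modality_config = {
    "video": ModalityConfig(
        delta_indices=[0],
        modality_keys=["front", "wrist"],
    ),
    "state": ModalityConfig(
        delta_indices=[0],
        modality_keys=["single_arm", "gripper"],
    ),
    "action": ModalityConfig(
        delta_indices=list(range(0, 16)),  # 16-step prediction horizon
        modality_keys=["single_arm", "gripper"],
    ),
    "language": ModalityConfig(
        delta_indices=[0],
        modality_keys=["annotation.human.action.task_description"],
    ),
}
```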
Video modality
Defines which camera views to use. For multiple cameras, list every camera key in modality_keys.
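A sketch of single- and multi-camera video configs, using a stand-in ModalityConfig dataclass (camera key names such as "front" and "wrist" must match your meta/modality.json):

```python
from dataclasses import dataclass

@dataclass
class ModalityConfig:  # stand-in for the class in gr00t/data/types.py
    delta_indices: list
    modality_keys: list

# Single camera, current frame only.
video_single = ModalityConfig(delta_indices=[0], modality_keys=["front"])

# Multiple cameras: list every view key defined in meta/modality.json.
video_multi = ModalityConfig(delta_indices=[0], modality_keys=["front", "wrist"])
```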
State modality
Defines proprioceptive observations; sin/cos encoding can be enabled for joint-angle keys via the optional fields described above.
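A sketch of a state config with sin/cos encoding enabled for the arm joints. The optional field name sin_cos_keys is hypothetical (the source describes the behavior but not the field name); only delta_indices and modality_keys are documented above:

```python
from dataclasses import dataclass, field

@dataclass
class ModalityConfig:  # stand-in; sin_cos_keys is a HYPOTHETICAL field name
    delta_indices: list
    modality_keys: list
    sin_cos_keys: list = field(default_factory=list)

state_config = ModalityConfig(
    delta_indices=[0],
    modality_keys=["single_arm", "gripper"],
    sin_cos_keys=["single_arm"],  # joint angles in radians suit sin/cos encoding
)
```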
Action modality
Defines the action space and prediction horizon:
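A sketch of an action config with a 16-step horizon and one per-key ActionConfig. The ActionRep enum and the action_configs field name are hypothetical stand-ins; only the rep field and its values (RELATIVE, DELTA, ABSOLUTE) are documented above:

```python
from dataclasses import dataclass, field
from enum import Enum

class ActionRep(Enum):  # enum name is hypothetical; values are from the text
    RELATIVE = "relative"
    DELTA = "delta"
    ABSOLUTE = "absolute"

@dataclass
class ActionConfig:  # stand-in; only the documented `rep` field is shown
    rep: ActionRep

@dataclass
class ModalityConfig:  # stand-in; `action_configs` field name is hypothetical
    delta_indices: list
    modality_keys: list
    action_configs: list = field(default_factory=list)

# 16-step horizon; the action_configs list must match modality_keys in length.
action_config = ModalityConfig(
    delta_indices=list(range(0, 16)),
    modality_keys=["single_arm", "gripper"],
    action_configs=[
        ActionConfig(rep=ActionRep.ABSOLUTE),  # arm joints as target positions
        ActionConfig(rep=ActionRep.ABSOLUTE),  # gripper as target position
    ],
)
```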
Language modality
Defines which language annotations to use:
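A sketch of the language config, loading the current annotation only (stand-in ModalityConfig dataclass as before; the annotation key must exist in your dataset):

```python
from dataclasses import dataclass

@dataclass
class ModalityConfig:  # stand-in for the class in gr00t/data/types.py
    delta_indices: list
    modality_keys: list

language_config = ModalityConfig(
    delta_indices=[0],  # the instruction for the current timestep
    modality_keys=["annotation.human.action.task_description"],
)
```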
Relationship with meta/modality.json
The modality configuration's modality_keys must reference keys that exist in your dataset's meta/modality.json. During loading, the pipeline will:
- Use modality_keys to look up the corresponding entries in meta/modality.json
- Extract the correct slices from the concatenated state/action arrays
- Apply the specified transformations (normalization, action representation conversion)
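The lookup-and-slice step can be illustrated with a toy example. The start/end layout of meta/modality.json entries shown here is an assumption for illustration:

```python
# Toy "state" entries from meta/modality.json (start/end layout assumed).
state_layout = {
    "single_arm": {"start": 0, "end": 5},
    "gripper": {"start": 5, "end": 6},
}

# One concatenated 6-dimensional state frame.
concatenated_state = [0.1, 0.2, 0.3, 0.4, 0.5, 1.0]

# Extract the slice for each requested modality key.
slices = {
    key: concatenated_state[spec["start"]:spec["end"]]
    for key, spec in state_layout.items()
}
print(slices["gripper"])  # prints [1.0]
```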
Registering your configuration
After defining your configuration, register it for use in training and inference by passing its path via the modality_config_path argument when running the finetuning script.
Next steps
- Data format: Understand the LeRobot v2 data format
- Fine-tuning guide: Start fine-tuning with your config
- Embodiment tags: Learn about supported robots