Overview
Each embodiment requires a Python configuration file that specifies:
- Which observations to use (video cameras, proprioceptive states)
- How to sample data temporally (current frame, historical frames, future action horizons)
- How actions should be interpreted and transformed
- Which language annotations to use
Configuration structure
A modality configuration is a Python dictionary containing four top-level keys: "video", "state", "action", and "language". Each key maps to a ModalityConfig object.
See the SO-100 example in examples/SO100/so100_config.py; its pieces are explained below.
Understanding ModalityConfig
Each ModalityConfig specifies two required fields and several optional ones.
Required fields
delta_indices (list[int])
Defines which temporal offsets to sample relative to the current timestep. This enables:
- Historical context: Use negative indices (e.g., [-2, -1, 0]) to include past observations
- Current observation: Use [0] for the current timestep
- Future actions: Use positive indices (e.g., list(range(0, 16))) for action prediction horizons
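Concretely, each delta index is added to the current timestep to pick which frames to load. A minimal sketch of that arithmetic (illustrative only, not the framework's actual loader):

```python
def sample_timesteps(t: int, delta_indices: list[int]) -> list[int]:
    """Return the dataset timesteps sampled for current timestep t."""
    return [t + d for d in delta_indices]

# Two past frames plus the current one for observations...
print(sample_timesteps(10, [-2, -1, 0]))  # [8, 9, 10]
# ...and a 16-step future horizon for action prediction.
horizon = sample_timesteps(10, list(range(0, 16)))
print(horizon[0], horizon[-1])  # 10 25
```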
modality_keys (list[str])
Specifies which keys to load from your dataset. These keys must match the keys defined in your meta/modality.json file.
For the SO-100 example:
- Video keys: Must match keys in meta/modality.json under "video" (e.g., "front", "wrist")
- State keys: Must match keys in meta/modality.json under "state" (e.g., "single_arm", "gripper")
- Action keys: Must match keys in meta/modality.json under "action" (e.g., "single_arm", "gripper")
- Language keys: Must match keys in meta/modality.json under "annotation" (e.g., "annotation.human.action.task_description")
Optional fields
sin_cos_embedding_keys (list[str] | None)
Specifies which state keys should use sine/cosine encoding. Best for dimensions that are in radians (e.g., joint angles). If not specified, min-max normalization is used. Sine/cosine embedding doubles the number of dimensions and is recommended only for proprioceptive states.
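For intuition, sine/cosine encoding maps each angle to a (sin, cos) pair, which is why the dimensionality doubles and why it suits periodic quantities like joint angles. A hypothetical sketch (not the framework's implementation):

```python
import math

def sin_cos_encode(state: list[float]) -> list[float]:
    """Encode each radian-valued dimension as (sin, cos) — doubles the size."""
    out = []
    for x in state:
        out.extend([math.sin(x), math.cos(x)])
    return out

joints = [0.0, math.pi / 2, math.pi]  # 3 joint angles in radians
encoded = sin_cos_encode(joints)
print(len(encoded))  # 6 — twice the input dimensionality
```

Note that an angle of 0 and an angle of 2π map to the same (sin, cos) pair, which is exactly the wrap-around behavior min-max normalization cannot capture.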
mean_std_embedding_keys (list[str] | None)
Specifies which keys should use mean/standard deviation normalization instead of min-max normalization.
action_configs (list[ActionConfig] | None)
Required for the "action" modality. Defines how each action modality should be interpreted and transformed. The list must have the same length as modality_keys.
Configuring each modality
Video modality
Defines which camera views to use.
State modality
Defines proprioceptive observations (joint positions, gripper states, etc.).
Action modality
Defines the action space and prediction horizon.
Language modality
Defines which language annotations to use.
Understanding ActionConfig
Each ActionConfig has three required fields and one optional field.
rep (ActionRepresentation)
Defines how actions should be interpreted:
- RELATIVE: Actions are deltas from the current state (introduced in the UMI paper)
- ABSOLUTE: Actions are target positions
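The distinction can be sketched numerically (values are made up for illustration): a RELATIVE action is added to the reference state at execution time, while an ABSOLUTE action is the target itself.

```python
current_state = [1.0, 2.0]       # e.g., two joint positions

relative_action = [0.1, -0.2]    # RELATIVE: a delta from the current state
absolute_target = [c + a for c, a in zip(current_state, relative_action)]

# An ABSOLUTE action would simply be the target [1.1, 1.8] directly.
print([round(x, 3) for x in absolute_target])  # [1.1, 1.8]
```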
type (ActionType)
Specifies the control space:
- EEF: End-effector/Cartesian space control (expects a 9-dimensional vector: x, y, z position + 6D rotation)
- NON_EEF: Joint space control and other non-EEF control spaces (joint angles, positions, gripper positions, etc.)
format (ActionFormat)
Defines the action representation format:
- DEFAULT: Standard format (e.g., joint angles, gripper positions)
- XYZ_ROT6D: 3D position + 6D rotation representation for end-effector control
- XYZ_ROTVEC: 3D position + rotation vector for end-effector control
state_key (str | None)
Optional. Specifies the corresponding reference state key for computing relative actions when rep=RELATIVE. If not provided, the system will use the action key as the reference state key.
Example with state_key:
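The original snippet is not reproduced here; the following is an illustrative sketch only, where import paths, enum members, and the "eef_pose" key are assumptions based on the field names above:

```python
# Hypothetical sketch — names follow this page, not verified library API.
ActionConfig(
    rep=ActionRepresentation.RELATIVE,  # actions are deltas from a reference state
    type=ActionType.EEF,                # end-effector/Cartesian control
    format=ActionFormat.XYZ_ROT6D,      # 3D position + 6D rotation
    state_key="eef_pose",               # hypothetical reference state key
)
```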
Complete example: SO-100
The complete SO-100 configuration lives in examples/SO100/so100_config.py.
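Since the file is not reproduced on this page, here is a hedged sketch of the overall shape such a configuration takes. The structure follows the four top-level keys and the SO-100 key names described above; exact imports, delta indices, and ActionConfig values are assumptions:

```python
# Hypothetical sketch of a modality configuration file — not the actual
# contents of examples/SO100/so100_config.py.
SO100_CONFIG = {
    "video": ModalityConfig(
        delta_indices=[0],                 # current frame only
        modality_keys=["front", "wrist"],
    ),
    "state": ModalityConfig(
        delta_indices=[0],
        modality_keys=["single_arm", "gripper"],
    ),
    "action": ModalityConfig(
        delta_indices=list(range(0, 16)),  # 16-step action horizon
        modality_keys=["single_arm", "gripper"],
        action_configs=[...],              # one ActionConfig per action key
    ),
    "language": ModalityConfig(
        delta_indices=[0],
        modality_keys=["annotation.human.action.task_description"],
    ),
}
```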
Key relationships with meta/modality.json
The modality configuration’s modality_keys must reference keys that exist in your dataset’s meta/modality.json.
Example meta/modality.json:
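The original file is not reproduced here; a hypothetical meta/modality.json consistent with the SO-100 keys above might look like the following (the start/end index fields and the original_key field are assumptions about how slices into the concatenated arrays and camera streams are recorded):

```json
{
  "state": {
    "single_arm": {"start": 0, "end": 5},
    "gripper": {"start": 5, "end": 6}
  },
  "action": {
    "single_arm": {"start": 0, "end": 5},
    "gripper": {"start": 5, "end": 6}
  },
  "video": {
    "front": {"original_key": "observation.images.front"},
    "wrist": {"original_key": "observation.images.wrist"}
  },
  "annotation": {
    "human.action.task_description": {}
  }
}
```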
When loading data, the pipeline will:
- Use modality_keys to look up the corresponding entries in meta/modality.json
- Extract the correct slices from the concatenated state/action arrays
- Apply the specified transformations (normalization, action representation conversion)
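A minimal sketch of that lookup-and-slice step, assuming modality.json records start/end indices into the concatenated arrays (hypothetical helper with made-up index values, not the framework's loader):

```python
# Hypothetical metadata: [start, end) ranges into the concatenated state vector.
modality_meta = {
    "single_arm": {"start": 0, "end": 5},
    "gripper": {"start": 5, "end": 6},
}

def extract(concat_state: list[float], modality_keys: list[str]) -> dict:
    """Slice each requested key's range out of the concatenated array."""
    return {
        k: concat_state[modality_meta[k]["start"]:modality_meta[k]["end"]]
        for k in modality_keys
    }

state = [0.1, 0.2, 0.3, 0.4, 0.5, 0.9]  # 5 joint values + 1 gripper value
parts = extract(state, ["single_arm", "gripper"])
print(parts["gripper"])  # [0.9]
```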
Registering your configuration
After defining your configuration, register it with the training and inference pipelines by passing its path via the --modality-config-path argument when running the fine-tuning script.