
Overview

The Gr00tPolicy class handles end-to-end inference with GR00T models. It loads a pre-trained vision-language-action model, processes observations, and generates robot actions.

Class definition

gr00t/policy/gr00t_policy.py
class Gr00tPolicy(BasePolicy):
    """Core policy class for Gr00t model inference.
    
    This policy handles the end-to-end inference pipeline:
    1. Validates input observations
    2. Processes observations with pretrained VLA processor
    3. Runs model inference
    4. Decodes and returns actions
    """

Constructor

• embodiment_tag (EmbodimentTag, required): The embodiment tag defining the robot/environment type (e.g., EmbodimentTag.GR1, EmbodimentTag.UNITREE_G1)
• model_path (str, required): Path to the pretrained model checkpoint directory or HuggingFace model ID (e.g., "nvidia/GR00T-N1.6-3B")
• device (int | str, required): Device to run the model on (e.g., 'cuda:0', 0, 'cpu')
• strict (bool, default: True): Whether to enforce strict input validation

Methods

get_action

Generate actions from observations.
def get_action(
    self, 
    observation: dict[str, Any], 
    options: dict[str, Any] | None = None
) -> tuple[dict[str, Any], dict[str, Any]]
Parameters:
• observation (dict[str, Any], required): Observation dictionary with the following structure:
  • video: dict[str, np.ndarray[np.uint8, (B, T, H, W, C)]]
  • state: dict[str, np.ndarray[np.float32, (B, T, D)]]
  • language: dict[str, list[list[str]]]
• options (dict[str, Any] | None): Optional parameters (currently unused)

Returns:
• actions (dict[str, np.ndarray]): Dictionary of action arrays with shape (B, T, D), where B is the batch size, T the action horizon, and D the action dimension
• info (dict[str, Any]): Additional information dictionary (currently empty)

get_modality_config

Get the modality configuration for the current embodiment.
def get_modality_config(self) -> dict[str, ModalityConfig]
Returns:
• modality_configs (dict[str, ModalityConfig]): Dictionary mapping modality names ("video", "state", "action", "language") to their configurations

reset

Reset the policy to its initial state.
def reset(self, options: dict[str, Any] | None = None) -> dict[str, Any]
Parameters:
• options (dict[str, Any] | None): Optional reset parameters

Returns:
• info (dict[str, Any]): Information dictionary after reset (currently empty)

check_observation

Validate observation structure and types.
def check_observation(self, observation: dict[str, Any]) -> None
Parameters:
• observation (dict[str, Any], required): Observation to validate
This method raises AssertionError if validation fails. It checks:
  • All required modalities are present (video, state, language)
  • Data types match expectations (uint8 for video, float32 for state)
  • Shapes match the modality configuration
  • Temporal dimensions are consistent
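The checks above can be sketched with plain NumPy. This is an illustrative, self-contained version, not the actual implementation; the EXPECTED table below is a hypothetical stand-in for the embodiment's modality configuration:

```python
import numpy as np

# Hypothetical stand-in for the embodiment's modality configuration.
EXPECTED = {
    "video": {"dtype": np.uint8, "ndim": 5},    # (B, T, H, W, C)
    "state": {"dtype": np.float32, "ndim": 3},  # (B, T, D)
}

def check_observation_sketch(observation: dict) -> None:
    """Illustrative version of the validation performed by check_observation."""
    # All required modalities must be present.
    for modality in ("video", "state", "language"):
        assert modality in observation, f"missing modality: {modality}"

    batch_sizes = set()
    for modality, spec in EXPECTED.items():
        for name, arr in observation[modality].items():
            # Data types must match expectations.
            assert arr.dtype == spec["dtype"], f"{name}: expected dtype {spec['dtype']}"
            # Shapes must match the modality configuration.
            assert arr.ndim == spec["ndim"], f"{name}: expected {spec['ndim']}D array"
            batch_sizes.add(arr.shape[0])

    # Leading (batch) dimensions must be consistent across arrays.
    assert len(batch_sizes) <= 1, "inconsistent batch sizes"

obs = {
    "video": {"head_camera": np.zeros((1, 1, 224, 224, 3), dtype=np.uint8)},
    "state": {"joint_positions": np.zeros((1, 1, 14), dtype=np.float32)},
    "language": {"task": [["pick up the apple"]]},
}
check_observation_sketch(obs)  # passes silently
```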

check_action

Validate action structure and types.
def check_action(self, action: dict[str, Any]) -> None
Parameters:
• action (dict[str, Any], required): Action dictionary to validate

Usage example

from gr00t.policy.gr00t_policy import Gr00tPolicy
from gr00t.data.embodiment_tags import EmbodimentTag
import numpy as np

# Initialize policy
policy = Gr00tPolicy(
    embodiment_tag=EmbodimentTag.GR1,
    model_path="nvidia/GR00T-N1.6-3B",
    device="cuda:0",
    strict=True
)

# Get modality configuration
modality_configs = policy.get_modality_config()
video_config = modality_configs["video"]
state_config = modality_configs["state"]

# Prepare observation
observation = {
    "video": {
        "head_camera": np.zeros((1, 1, 224, 224, 3), dtype=np.uint8),
    },
    "state": {
        "joint_positions": np.zeros((1, 1, 14), dtype=np.float32),
    },
    "language": {
        "task": [["pick up the apple"]]
    }
}

# Generate action
action, info = policy.get_action(observation)
print(f"Action shape: {action['joint_positions'].shape}")
# Output: Action shape: (1, 8, 14)

Observation format

The policy expects observations in the following nested dictionary format:
observation = {
    "video": {
        "camera_name": np.ndarray[np.uint8, (B, T, H, W, C)],
        # ... more cameras
    },
    "state": {
        "state_name": np.ndarray[np.float32, (B, T, D)],
        # ... more states
    },
    "language": {
        "task": [["instruction string"]],  # Shape: (B, T)
    }
}
Where:
  • B: Batch size
  • T: Temporal horizon (number of frames/timesteps)
  • H, W: Image height and width
  • C: Number of channels (must be 3 for RGB)
  • D: State dimension
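As a concrete illustration of this layout (the camera, state, and instruction names here are arbitrary), a batched two-camera observation with B=2 and T=1 would look like:

```python
import numpy as np

B, T = 2, 1  # batch of 2, single observed timestep

observation = {
    "video": {
        # Two cameras; each array is (B, T, H, W, C) with C=3 for RGB.
        "head_camera": np.zeros((B, T, 224, 224, 3), dtype=np.uint8),
        "wrist_camera": np.zeros((B, T, 224, 224, 3), dtype=np.uint8),
    },
    "state": {
        # One state stream; (B, T, D).
        "joint_positions": np.zeros((B, T, 14), dtype=np.float32),
    },
    "language": {
        # One instruction per batch element per timestep: nested lists of shape (B, T).
        "task": [["pick up the apple"], ["open the drawer"]],
    },
}

# The leading (B, T) dimensions must agree across all modalities.
assert observation["video"]["head_camera"].shape[:2] == (B, T)
assert observation["state"]["joint_positions"].shape[:2] == (B, T)
assert len(observation["language"]["task"]) == B
assert all(len(row) == T for row in observation["language"]["task"])
```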

Action format

Actions are returned in a similar nested format:
action = {
    "action_name": np.ndarray[np.float32, (B, T, D)],
    # ... more actions
}
Where:
  • B: Batch size
  • T: Action horizon (number of future timesteps)
  • D: Action dimension
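A common way to consume this chunked output is receding-horizon execution: run only the first few of the T predicted steps, then query the policy again. A minimal sketch using a dummy action dictionary (the key name, sizes, and EXEC_STEPS are illustrative, not prescribed by the API):

```python
import numpy as np

# Dummy action dictionary mimicking get_action output: (B, T, D).
action = {"joint_positions": np.zeros((1, 8, 14), dtype=np.float32)}

EXEC_STEPS = 4  # execute only the first 4 of the 8 predicted steps

chunk = action["joint_positions"][0]  # drop the batch dim -> (T, D)
for t in range(EXEC_STEPS):
    target = chunk[t]  # (D,) joint target for step t
    # send `target` to the robot controller here

print(chunk.shape)  # (8, 14)
```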

Properties

• model (Gr00tN1d6): The loaded GR00T model in evaluation mode with bfloat16 precision
• processor (BaseProcessor): The processor for input/output transformation
• embodiment_tag (EmbodimentTag): The embodiment tag for this policy
• modality_configs (dict[str, ModalityConfig]): Modality configurations for the current embodiment
• collate_fn (Callable): Collation function for batching observations
• language_key (str): The language modality key (currently only one language key is supported)

See also

• Gr00tSimPolicyWrapper: Wrapper for GR00T simulation environments
• PolicyClient: Client for remote inference
• Policy API guide: Complete guide to using the policy API
• EmbodimentTag: Available embodiment tags
