Gr00tN1d6
The main GR00T model class that combines a vision-language backbone with an action head for flow-matching diffusion policy learning.

Initialization
Model configuration object containing all hyperparameters
Dictionary with transformers loading parameters:
- trust_remote_code: Whether to trust remote code when loading from HuggingFace Hub
- local_files_only: Whether to only use local files
- revision: Specific model revision to use
- cache_dir: Directory to cache downloaded models
- token: HuggingFace access token for gated models
Loading from pretrained
The recommended way to load a GR00T model is using the AutoModel.from_pretrained method:
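A minimal loading sketch. The checkpoint path below is a placeholder, not an actual model id; substitute a local directory or a HuggingFace Hub id:

```python
from transformers import AutoModel

# Placeholder path; replace with a real GR00T checkpoint location or Hub id.
model = AutoModel.from_pretrained(
    "<path-or-hub-id>",
    trust_remote_code=True,  # needed because the model class is defined in the repo code
)
```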
During training, transformers parameters are passed from the training config. During inference (e.g., from_pretrained), defaults are used.
Methods
forward
Forward pass through the complete model for training.

Dictionary containing:
- Vision-language inputs (images, text, attention masks)
- Action inputs (state, action, embodiment_id, action_mask)
BatchFeature containing:
- loss: Combined action prediction loss
- action_loss: Per-timestep action loss
- action_mask: Mask for valid actions
- backbone_features: Vision-language embeddings
- state_features: Encoded state features
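The relationship between the per-timestep action_loss, the action_mask, and the scalar loss can be sketched numerically. The mean-over-valid-entries reduction and the shapes below are illustrative assumptions, not necessarily the model's exact reduction:

```python
import numpy as np

# Hypothetical shapes: batch of 2, action horizon of 4, action dim of 3.
rng = np.random.default_rng(0)
action_loss = rng.random((2, 4, 3))   # per-timestep squared error
action_mask = np.zeros((2, 4, 3))
action_mask[:, :3, :] = 1.0           # only the first 3 timesteps are valid

# Combined scalar loss: average per-timestep loss over valid entries only.
loss = (action_loss * action_mask).sum() / action_mask.sum()
```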
get_action
Generate actions using the flow matching diffusion process for inference.

Dictionary containing:
- Vision-language inputs (images, text, attention masks)
- State and embodiment_id (no ground truth actions needed)
BatchFeature containing:
- action_pred: Predicted actions tensor of shape (B, action_horizon, action_dim)
- backbone_features: Vision-language embeddings
- state_features: Encoded state features
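A short usage sketch, assuming `model` and `inputs` have been prepared as described above:

```python
# Inference: no ground-truth actions are needed in the inputs.
outputs = model.get_action(inputs)
actions = outputs["action_pred"]  # shape (B, action_horizon, action_dim)
```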
prepare_input
Prepare inputs for the backbone and action head.

Raw input dictionary containing vision-language and action data
Tuple of (backbone_inputs, action_inputs) ready for model forward pass
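A usage fragment, assuming a raw `batch` dictionary is available:

```python
# Split a raw batch into the two input streams consumed by the sub-modules.
backbone_inputs, action_inputs = model.prepare_input(batch)
```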
Properties
Device where model parameters are located
Data type of model parameters
Model configuration object
Architecture
The Gr00tN1d6 model consists of two main components:

- Backbone (EagleBackbone): Vision-language encoder that processes images and text
  - Uses the NVIDIA Eagle vision-language model
  - Configurable layer selection and fine-tuning
  - Optional flash attention for efficiency
- Action Head (Gr00tN1d6ActionHead): Flow matching diffusion model for action generation
  - DiT or AlternateVLDiT transformer architecture
  - Embodiment-conditioned MLPs for state/action encoding
  - Beta distribution noise schedule for flow matching
  - Supports multi-embodiment learning (up to 32 embodiments)
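The embodiment-conditioned encoding above can be sketched as an index into per-embodiment projector weights. The simple linear projector and all shapes here are illustrative assumptions, not the model's actual architecture:

```python
import numpy as np

MAX_EMBODIMENTS = 32
STATE_DIM, HIDDEN_DIM = 64, 128

# One linear projector per embodiment (bias omitted for brevity).
rng = np.random.default_rng(0)
projector_weights = rng.standard_normal((MAX_EMBODIMENTS, STATE_DIM, HIDDEN_DIM))

def encode_state(state: np.ndarray, embodiment_id: np.ndarray) -> np.ndarray:
    """Project each sample's state with its own embodiment's weights."""
    w = projector_weights[embodiment_id]      # (B, STATE_DIM, HIDDEN_DIM)
    return np.einsum("bs,bsh->bh", state, w)  # (B, HIDDEN_DIM)

state = rng.standard_normal((4, STATE_DIM))
embodiment_id = np.array([0, 5, 5, 31])  # each sample selects its projector
features = encode_state(state, embodiment_id)
```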
Supported embodiments
The model supports multiple robot embodiments through embodiment-specific projectors.

Training configuration
Control which parts of the model are trainable.

Flow matching diffusion
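A hedged sketch of such trainability flags; the flag names below are assumptions and the actual training config may use different ones:

```python
# Hypothetical flag names; check the actual training config for the real ones.
tune_visual = False          # freeze the Eagle vision tower
tune_llm = False             # freeze the language model layers
tune_projector = True        # train the embodiment-specific projectors
tune_diffusion_model = True  # train the flow matching action head (DiT)
```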
The action head uses flow matching for action generation.

Training:
- Sample time t ~ Beta(α=1.5, β=1.0)
- Create noisy trajectory: x_t = (1-t) * noise + t * action
- Predict velocity: v = action - noise
- Minimize MSE: L = ||v_pred - v||²
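The four training steps can be sketched in numpy; the fake velocity prediction below stands in for the DiT, and the shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, D = 8, 16, 7                       # batch, action horizon, action dim
action = rng.standard_normal((B, H, D))  # ground-truth action chunk
noise = rng.standard_normal((B, H, D))   # x_0 ~ N(0, I)

# 1. Sample interpolation time t ~ Beta(alpha=1.5, beta=1.0), one per sample.
t = rng.beta(1.5, 1.0, size=(B, 1, 1))

# 2. Noisy trajectory: interpolate between noise (t=0) and action (t=1).
x_t = (1 - t) * noise + t * action

# 3. Velocity target is constant along the straight path.
v_target = action - noise

# 4. The real model predicts v from (x_t, t, context); fake it with added noise.
v_pred = v_target + 0.1 * rng.standard_normal((B, H, D))
loss = np.mean((v_pred - v_target) ** 2)
```

Note that following the velocity target from x_t for the remaining time (1 - t) lands exactly on the action, which is what inference exploits.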
Inference:
- Start with random noise
- Iteratively denoise using Euler integration over num_inference_timesteps steps
- Return final denoised actions
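The Euler loop can be sketched with an oracle velocity function standing in for the action head. Because the target velocity field is constant along the straight path, Euler integration recovers the action exactly in this toy setting:

```python
import numpy as np

rng = np.random.default_rng(1)
B, H, D = 2, 16, 7
action = rng.standard_normal((B, H, D))  # what the model "wants" to produce
noise = rng.standard_normal((B, H, D))

num_inference_timesteps = 4
dt = 1.0 / num_inference_timesteps

def predict_velocity(x_t, t):
    # Oracle stand-in for the DiT: returns the true constant velocity field.
    return action - noise

# Start from random noise and integrate dx = v * dt from t = 0 to t = 1.
x = noise.copy()
t = 0.0
for _ in range(num_inference_timesteps):
    x = x + predict_velocity(x, t) * dt
    t += dt

action_pred = x  # final denoised actions
```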
Model registration
The model is automatically registered with HuggingFace Transformers, so it can be loaded through the Auto classes.

See also
- Model configuration - Configuration classes for the model
- Processor - Data preprocessing and collation
- Policy - High-level policy interface for inference