GR00T N1.6 is a vision-language-action (VLA) model that combines a 2B parameter vision-language foundation model with a 32-layer diffusion transformer to generate continuous robot actions.
## Architecture overview
The complete model consists of three main components:
- **Vision-language backbone**: processes multimodal observations (images, text, robot state)
- **Diffusion transformer**: denoises action trajectories via flow matching
- **Embodiment projectors**: encode/decode robot-specific states and actions
## Vision-language backbone
The backbone is based on NVIDIA's Cosmos-Reason-2B VLM variant, implemented in the `EagleBackbone` class.
### Key features
- **Flexible resolution**: processes images in their native aspect ratio without padding
- **Embodied reasoning**: pre-trained on next-action prediction and embodied tasks
- **Layer selection**: uses intermediate layers for optimal feature extraction
- **Flash attention**: efficient attention via the flash-attention-2 implementation
### Implementation
From `gr00t/model/modules/eagle_backbone.py`:
```python
class EagleBackbone(torch.nn.Module):
    def __init__(
        self,
        model_name: str = "nvidia/Eagle-Block2A-2B-v2",
        tune_llm: bool = False,
        tune_visual: bool = False,
        select_layer: int = -1,
        tune_top_llm_layers: int = 0,
        use_flash_attention: bool = False,
        load_bf16: bool = False,
    ):
        # Load the pre-trained VLM, optionally with flash attention and bf16
        extra_kwargs = {}
        if use_flash_attention:
            extra_kwargs["attn_implementation"] = "flash_attention_2"
        if load_bf16:
            extra_kwargs["torch_dtype"] = torch.bfloat16

        # Trim layers for efficiency (keep only up to select_layer)
        while len(self.model.language_model.model.layers) > select_layer:
            self.model.language_model.model.layers.pop(-1)
```
### Forward pass
```python
def forward(self, vl_input: BatchFeature) -> BatchFeature:
    keys_to_use = ["input_ids", "attention_mask", "pixel_values"]
    vl_input = {k: vl_input[k] for k in keys_to_use}

    # Get hidden states from the selected layer
    outputs = self.model(**vl_input, output_hidden_states=True)
    outputs = outputs["hidden_states"][-1]

    # Track image token positions
    image_mask = vl_input["input_ids"] == self.model.config.image_token_index
    attention_mask = vl_input["attention_mask"] == 1

    return BatchFeature(
        data={
            "backbone_features": outputs,  # [B, seq_len, 2048]
            "backbone_attention_mask": attention_mask,
            "image_mask": image_mask,
        }
    )
```
**Output shape**: `[batch_size, sequence_length, 2048]`, where `sequence_length` includes both image tokens and text tokens.
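The returned masks let downstream code treat image and text tokens differently. As a small sketch (hypothetical consumer code, not from the repository; NumPy stands in for torch tensors, and the feature dimension is shrunk for illustration), mean-pooling only the valid image-token features:

```python
import numpy as np

# Hypothetical shapes: batch of 2 sequences, 6 tokens, 4-dim features
# (the real backbone emits 2048-dim features).
B, S, D = 2, 6, 4
features = np.random.randn(B, S, D)

# image_mask marks which positions hold image tokens;
# attention_mask marks valid (non-padded) positions.
image_mask = np.array([[1, 1, 1, 0, 0, 0],
                       [1, 1, 0, 0, 0, 0]], dtype=bool)
attention_mask = np.array([[1, 1, 1, 1, 1, 0],
                           [1, 1, 1, 1, 0, 0]], dtype=bool)

# Mean-pool image features per sequence, ignoring padding
valid_image = image_mask & attention_mask
img_pooled = (features * valid_image[..., None]).sum(axis=1) / \
             valid_image.sum(axis=1, keepdims=True)
print(img_pooled.shape)  # (2, 4)
```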
## Action head architecture
The `Gr00tN1d6ActionHead` generates actions through flow-matching diffusion, implemented with a 32-layer transformer.
### Components

**State encoder** (`CategorySpecificMLP`): encodes proprioceptive state per embodiment.

```python
self.state_encoder = CategorySpecificMLP(
    num_categories=config.max_num_embodiments,
    input_dim=config.max_state_dim,
    hidden_dim=self.hidden_size,
    output_dim=self.input_embedding_dim,
)
```

Supports multi-embodiment training with separate weights per robot type.

**Action encoder** (`MultiEmbodimentActionEncoder`): embeds noisy action trajectories with timestep conditioning.

```python
self.action_encoder = MultiEmbodimentActionEncoder(
    action_dim=self.action_dim,
    hidden_size=self.input_embedding_dim,
    num_embodiments=config.max_num_embodiments,
)
```

Uses sinusoidal positional encoding for diffusion timesteps.

**Diffusion model** (`AlternateVLDiT`): 32-layer transformer with alternating self- and cross-attention.

```python
self.model = AlternateVLDiT(
    **config.diffusion_model_cfg,
    cross_attention_dim=config.backbone_embedding_dim,
    attend_text_every_n_blocks=config.attend_text_every_n_blocks,
)
```

Separates image and text tokens for efficient cross-attention.

**Action decoder** (`CategorySpecificMLP`): decodes hidden states to action space per embodiment.

```python
self.action_decoder = CategorySpecificMLP(
    num_categories=config.max_num_embodiments,
    input_dim=self.hidden_size,
    hidden_dim=self.hidden_size,
    output_dim=self.action_dim,
)
```

Projects DiT outputs to robot-specific action dimensions.
The core of the action head is a 32-layer diffusion transformer with cross-attention to VLM features.
### Architecture details
From `gr00t/model/modules/dit.py:172-196`:
```python
class DiT(ModelMixin, ConfigMixin):
    def __init__(
        self,
        num_attention_heads: int = 8,
        attention_head_dim: int = 64,
        output_dim: int = 26,
        num_layers: int = 32,  # N1.6 uses 32 layers (vs. 16 in N1.5)
        dropout: float = 0.1,
        norm_type: str = "ada_norm",  # adaptive layer norm with timestep conditioning
        cross_attention_dim: Optional[int] = None,
    ):
        # Timestep encoder using sinusoidal embeddings
        self.timestep_encoder = TimestepEncoder(embedding_dim=self.inner_dim)

        # 32 transformer blocks with alternating attention patterns
        all_blocks = []
        for idx in range(num_layers):
            all_blocks.append(
                BasicTransformerBlock(
                    self.inner_dim,
                    num_attention_heads,
                    attention_head_dim,
                    cross_attention_dim=cross_attention_dim,
                    norm_type=norm_type,
                )
            )
        self.transformer_blocks = nn.ModuleList(all_blocks)
```
Each `BasicTransformerBlock` contains:
- **Adaptive layer norm**: conditioned on the diffusion timestep
- **Cross-attention**: attends to VLM features (vision + language)
- **Feed-forward**: MLP with GELU activation
- **Residual connections**: skip connections for gradient flow
```python
class BasicTransformerBlock(nn.Module):
    def forward(
        self,
        hidden_states: torch.Tensor,          # [B, T, D] - action embeddings
        encoder_hidden_states: torch.Tensor,  # [B, S, D] - VLM features
        temb: torch.Tensor,                   # [B, D] - timestep embedding
    ) -> torch.Tensor:
        # Adaptive layer norm with timestep conditioning
        norm_hidden_states = self.norm1(hidden_states, temb)

        # Cross-attention to VLM features
        attn_output = self.attn1(
            norm_hidden_states,
            encoder_hidden_states=encoder_hidden_states,
        )
        hidden_states = attn_output + hidden_states

        # Feed-forward network
        ff_output = self.ff(self.norm3(hidden_states))
        hidden_states = ff_output + hidden_states
        return hidden_states
```
### AlternateVLDiT variant
N1.6 uses an enhanced DiT that separates image and text tokens during cross-attention:
```python
class AlternateVLDiT(DiT):
    def forward(
        self,
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor,
        image_mask: torch.Tensor,  # indicates which tokens are images
        backbone_attention_mask: torch.Tensor,
    ):
        # Create separate masks for image and text tokens
        image_attention_mask = image_mask & backbone_attention_mask
        non_image_attention_mask = (~image_mask) & backbone_attention_mask

        for idx, block in enumerate(self.transformer_blocks):
            if idx % 2 == 1:
                # Self-attention blocks
                hidden_states = block(hidden_states, encoder_hidden_states=None)
            else:
                # Alternate between text and image cross-attention
                if idx % (2 * self.attend_text_every_n_blocks) == 0:
                    mask = non_image_attention_mask  # attend to text
                else:
                    mask = image_attention_mask  # attend to images
                hidden_states = block(
                    hidden_states,
                    encoder_hidden_states=encoder_hidden_states,
                    encoder_attention_mask=mask,
                )
```
Alternating between image and text attention improves efficiency and allows the model to specialize different layers for visual vs. linguistic reasoning.
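The interleaving is easier to see as a schedule. A small sketch that mirrors the branch logic above (with an assumed `attend_text_every_n_blocks = 2` and a reduced layer count for readability):

```python
def attention_schedule(num_layers: int, attend_text_every_n_blocks: int) -> list[str]:
    """Mirror the block-selection logic: odd-indexed blocks self-attend;
    even-indexed blocks cross-attend to text every n-th time, else to images."""
    schedule = []
    for idx in range(num_layers):
        if idx % 2 == 1:
            schedule.append("self")
        elif idx % (2 * attend_text_every_n_blocks) == 0:
            schedule.append("cross-text")
        else:
            schedule.append("cross-image")
    return schedule

print(attention_schedule(8, 2))
# ['cross-text', 'self', 'cross-image', 'self', 'cross-text', 'self', 'cross-image', 'self']
```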
## Flow matching diffusion
GR00T uses flow matching, a continuous-time diffusion formulation, for action generation.
### Training objective
From `gr00t/model/gr00t_n1d6/gr00t_n1d6.py:197-248`:
```python
def forward(self, backbone_output: BatchFeature, action_input: BatchFeature):
    # Sample a timestep from a beta distribution
    t = self.sample_time(batch_size, device, dtype)

    # Create a noisy action trajectory
    noise = torch.randn_like(actions)
    noisy_trajectory = (1 - t) * noise + t * actions

    # Ground-truth velocity field
    velocity = actions - noise

    # Encode noisy actions with the timestep
    action_features = self.action_encoder(noisy_trajectory, t_discretized, embodiment_id)

    # Concatenate with state features
    sa_embs = torch.cat((state_features, action_features), dim=1)

    # Forward through the DiT
    model_output = self.model(
        hidden_states=sa_embs,
        encoder_hidden_states=vl_embeds,
        timestep=t_discretized,
    )

    # Decode to action space
    pred_velocity = self.action_decoder(model_output, embodiment_id)

    # Masked MSE loss on the velocity field (per-element, then averaged over the mask)
    action_loss = F.mse_loss(pred_velocity, velocity, reduction="none") * action_mask
    loss = action_loss.sum() / (action_mask.sum() + 1e-6)
```
Training process:

1. Sample a timestep `t ~ Beta(α, β)`.
2. Interpolate between noise and the ground truth: `x_t = (1 - t) * noise + t * action`.
3. Compute the target velocity: `v = action - noise`.
4. Train the model to predict `v` at timestep `t`.
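These steps can be checked numerically. A NumPy sketch (illustrative shapes; the `Beta(1.5, 1.0)` parameters are an assumption) showing that the interpolant plus its velocity reaches the data exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, D = 2, 16, 7                     # batch, action horizon, action dim (illustrative)
actions = rng.normal(size=(B, T, D))   # ground-truth action chunk
noise = rng.normal(size=(B, T, D))

# 1. sample a timestep per batch item (Beta(1.5, 1.0) is an assumption)
t = rng.beta(1.5, 1.0, size=(B, 1, 1))

# 2. interpolate between noise (t = 0) and data (t = 1)
x_t = (1 - t) * noise + t * actions

# 3. the ground-truth velocity field is constant along the straight path
velocity = actions - noise

# 4. the model regresses `velocity` from (x_t, t); following the true
# velocity for the remaining time (1 - t) lands exactly on the data:
recovered = x_t + (1 - t) * velocity
assert np.allclose(recovered, actions)
```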
### Inference (denoising)
```python
@torch.no_grad()
def get_action(self, backbone_output, action_input):
    # Initialize with random noise
    actions = torch.randn(size=(B, action_horizon, action_dim))
    dt = 1.0 / num_inference_timesteps  # e.g., 1/4 = 0.25

    # Iterative denoising (default: 4 steps)
    for t in range(num_inference_timesteps):
        t_cont = t / num_inference_timesteps  # 0, 0.25, 0.5, 0.75

        # Encode the current action estimate at this timestep
        action_features = self.action_encoder(actions, t_cont, embodiment_id)
        sa_embs = torch.cat((state_features, action_features), dim=1)

        # Predict the velocity
        model_output = self.model(sa_embs, vl_embeds, t_cont)
        pred_velocity = self.action_decoder(model_output, embodiment_id)

        # Euler integration: x_{t+1} = x_t + dt * v_t
        actions = actions + dt * pred_velocity

    return actions  # final denoised actions
```
GR00T uses 4 denoising steps by default, balancing quality against inference speed (16 ms action-head latency on an RTX 5090).
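Why do so few steps suffice? The flow-matching velocity field traces straight lines from noise to data, so Euler integration incurs no discretization error when the field is exact. A toy sketch with an oracle velocity in place of the learned model:

```python
import numpy as np

rng = np.random.default_rng(1)
target = rng.normal(size=(16, 7))  # pretend "true" action chunk
x = rng.normal(size=(16, 7))       # start from pure noise (t = 0)
num_steps = 4
dt = 1.0 / num_steps

for step in range(num_steps):
    t = step * dt
    # Oracle velocity: on the true straight path, (target - x_t) / (1 - t)
    # equals the constant field (target - noise), so Euler steps are exact.
    velocity = (target - x) / (1 - t)
    x = x + dt * velocity

assert np.allclose(x, target)
```

In practice the learned field is only approximately straight, so a few steps trade a small amount of accuracy for a large latency win.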
## Embodiment-specific projectors
GR00T supports multiple robot embodiments through category-specific linear layers.
### CategorySpecificLinear
From `gr00t/model/modules/embodiment_conditioned_mlp.py:44-64`:
```python
class CategorySpecificLinear(nn.Module):
    def __init__(self, num_categories, input_dim, hidden_dim):
        super().__init__()
        # Separate weights and biases for each embodiment
        self.W = nn.Parameter(0.02 * torch.randn(num_categories, input_dim, hidden_dim))
        self.b = nn.Parameter(torch.zeros(num_categories, hidden_dim))

    def forward(self, x, cat_ids):
        # Select the weights for each batch item's embodiment
        selected_W = self.W[cat_ids]  # [B, input_dim, hidden_dim]
        selected_b = self.b[cat_ids]  # [B, hidden_dim]

        # Batched matrix multiply
        return torch.bmm(x, selected_W) + selected_b.unsqueeze(1)
```
**Example**: with 10 embodiments, a single `CategorySpecificLinear(10, 14, 512)` layer stores 10 separate weight matrices of shape `[14, 512]`.
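To make the per-embodiment weight gather concrete, here is a NumPy sketch of the same batched selection (shapes from the example above; `einsum` plays the role of `torch.bmm`):

```python
import numpy as np

num_categories, input_dim, hidden_dim = 10, 14, 512
rng = np.random.default_rng(0)
W = 0.02 * rng.normal(size=(num_categories, input_dim, hidden_dim))
b = np.zeros((num_categories, hidden_dim))

B, T = 4, 16                      # batch size, action horizon
x = rng.normal(size=(B, T, input_dim))
cat_ids = np.array([0, 3, 3, 7])  # embodiment id per batch item

# Gather each item's weight matrix, then batched matmul (torch.bmm equivalent)
selected_W = W[cat_ids]           # [B, input_dim, hidden_dim]
selected_b = b[cat_ids]           # [B, hidden_dim]
out = np.einsum("bti,bih->bth", x, selected_W) + selected_b[:, None, :]
print(out.shape)  # (4, 16, 512)
```

Items 1 and 2 share embodiment 3 and therefore pass through the same weight matrix, while items 0 and 3 use their own.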
### Multi-embodiment action encoder
```python
class MultiEmbodimentActionEncoder(nn.Module):
    def __init__(self, action_dim, hidden_size, num_embodiments):
        super().__init__()
        self.W1 = CategorySpecificLinear(num_embodiments, action_dim, hidden_size)
        self.W2 = CategorySpecificLinear(num_embodiments, 2 * hidden_size, hidden_size)
        self.W3 = CategorySpecificLinear(num_embodiments, hidden_size, hidden_size)
        self.pos_encoding = SinusoidalPositionalEncoding(hidden_size)

    def forward(self, actions, timesteps, cat_ids):
        # Embed actions with an embodiment-specific projection
        a_emb = self.W1(actions, cat_ids)  # [B, T, hidden_size]

        # Add sinusoidal timestep encoding
        tau_emb = self.pos_encoding(timesteps)  # [B, T, hidden_size]

        # Combine with embodiment-specific MLPs
        x = torch.cat([a_emb, tau_emb], dim=-1)  # [B, T, 2*hidden_size]
        x = swish(self.W2(x, cat_ids))
        x = self.W3(x, cat_ids)
        return x  # [B, T, hidden_size]
```
This architecture allows a single model to handle different robot configurations (7-DOF arms, 14-DOF humanoids, etc.) by learning embodiment-specific projections.
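In practice, heterogeneous robots can share the fixed maximum dimensions via zero-padding plus a loss mask. This is a sketch of one plausible convention, not necessarily the repository's:

```python
import numpy as np

max_action_dim = 32                   # shared maximum across embodiments
action_7dof = np.random.randn(16, 7)  # one chunk from a 7-DOF arm

# Zero-pad the robot's 7 real dimensions up to max_action_dim
padded = np.zeros((16, max_action_dim))
padded[:, :7] = action_7dof

# Mask so the loss only covers the real dimensions
mask = np.zeros(max_action_dim, dtype=bool)
mask[:7] = True

assert padded[:, mask].shape == (16, 7)
```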
## Model configuration
Key hyperparameters for the full architecture:
| Parameter | Value | Description |
|---|---|---|
| `backbone_embedding_dim` | 2048 | VLM hidden size |
| `input_embedding_dim` | 512 | Action head embedding size |
| `hidden_size` | 512 | DiT hidden dimension |
| `num_layers` | 32 | DiT transformer layers |
| `num_attention_heads` | 8 | Attention heads per layer |
| `attention_head_dim` | 64 | Dimension per attention head |
| `action_horizon` | 16 | Number of action steps predicted |
| `num_inference_timesteps` | 4 | Denoising steps during inference |
| `max_action_dim` | 32 | Maximum action dimension across embodiments |
| `max_state_dim` | 32 | Maximum state dimension |
| `max_num_embodiments` | 10 | Number of supported robot types |
## Memory and computation
### Model size

- **Total parameters**: ~3B (2B VLM + 1B action head)
- **GPU memory**: ~12 GB for inference (bfloat16)
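A rough sanity check on that figure (assumes bf16 weights only and ignores activations, KV caches, and allocator overhead):

```python
params = 3e9         # ~3B parameters total
bytes_per_param = 2  # bfloat16
weights_gb = params * bytes_per_param / 1e9
print(f"{weights_gb:.0f} GB of weights")
# ~6 GB of weights; the rest of the quoted ~12 GB budget goes to
# activations, KV cache, and framework overhead
```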
### Inference breakdown
On an RTX 5090 with `torch.compile`:
```text
# From the README.md inference timing table
Data Processing:   2 ms   (image preprocessing, tokenization)
Backbone:         18 ms   (VLM forward pass)
Action Head:      16 ms   (4 DiT denoising steps)
─────────────────────────
End-to-End:       37 ms   (27.3 Hz throughput)
```
## Training configuration
Recommended settings:
```bash
# From gr00t/experiment/launch_finetune.py
CUDA_VISIBLE_DEVICES=0 uv run python gr00t/experiment/launch_finetune.py \
    --base-model-path nvidia/GR00T-N1.6-3B \
    --global-batch-size 32 \
    --max-steps 2000 \
    --save-steps 2000 \
    --dataloader-num-workers 4
```
Memory requirements:
- 1x H100 (80 GB): batch size 32-64
- 1x L40 (48 GB): batch size 16-32
- 1x RTX 4090 (24 GB): batch size 8-16
## Summary
The GR00T N1.6 architecture combines:
- **Foundation model**: 2B-parameter VLM with flexible vision encoding
- **Scalable diffusion**: 32-layer transformer for high-capacity action modeling
- **Multi-embodiment**: category-specific projectors for cross-robot learning
- **Efficient inference**: 27 Hz throughput with 4-step denoising
This design enables few-shot adaptation to new tasks while maintaining strong zero-shot generalization across diverse robot embodiments.