
Gr00tN1d6Config

Unified configuration dataclass for the Gr00tN1d6 model, combining backbone and action head parameters.
```python
from gr00t.configs.model.gr00t_n1d6 import Gr00tN1d6Config

config = Gr00tN1d6Config(
    model_name="nvidia/Eagle-Block2A-2B-v2",
    action_horizon=16,
    max_action_dim=29,
    max_state_dim=29,
    tune_llm=False,
    tune_visual=False,
    tune_projector=True,
    tune_diffusion_model=True,
)
```

Model identification

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_type` | `str` | `"Gr00tN1d6"` | HuggingFace model type identifier |
| `model_dtype` | `str` | `"bfloat16"` | Model data type (use `bfloat16` for Flash Attention compatibility) |

Backbone configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_name` | `str` | `"nvidia/Eagle-Block2A-2B-v2"` | HuggingFace model name or path for the vision-language backbone |
| `backbone_model_type` | `str` | `"eagle"` | Type of backbone model architecture |
| `model_revision` | `str \| None` | `None` | Specific model revision to use from the HuggingFace Hub |
| `backbone_embedding_dim` | `int` | `2048` | Dimension of backbone output embeddings (`project_to_dim`) |
| `tune_llm` | `bool` | `False` | Whether to fine-tune the LLM component of the backbone |
| `tune_visual` | `bool` | `False` | Whether to fine-tune the visual encoder of the backbone |
| `tune_top_llm_layers` | `int` | `4` | Number of top LLM layers to tune (when `tune_llm` is `True`) |
| `select_layer` | `int` | `16` | Which backbone layer to extract features from |
| `reproject_vision` | `bool` | `False` | Whether to reproject vision features to a different dimension |
| `use_flash_attention` | `bool` | `True` | Enable Flash Attention for efficient attention computation |
| `load_bf16` | `bool` | `True` | Load backbone weights in bfloat16 precision |
| `backbone_trainable_params_fp32` | `bool` | `True` | Keep trainable backbone parameters in FP32 for numerical stability |
| `eagle_collator` | `bool` | `False` | Use the Eagle-specific collator that allows dynamic image size changes (needed for any-resolution input) |
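For example, to unfreeze only the top of the language model while keeping the visual encoder frozen, the tuning flags can be combined like this (illustrative values, not a recommendation):

```python
from gr00t.configs.model.gr00t_n1d6 import Gr00tN1d6Config

# Illustrative: fine-tune only the top 4 LLM layers; the visual encoder
# stays frozen, and trainable backbone parameters are kept in FP32.
config = Gr00tN1d6Config(
    tune_llm=True,
    tune_top_llm_layers=4,
    tune_visual=False,
    backbone_trainable_params_fp32=True,
)
```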

Processing parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `image_crop_size` | `tuple[int, int] \| None` | `None` | Target crop size for images (height, width) |
| `image_target_size` | `tuple[int, int] \| None` | `None` | Target resize for images before cropping (height, width) |
| `shortest_image_edge` | `int \| None` | `256` | Resize the shortest edge of each image to this size |
| `crop_fraction` | `float \| None` | `0.95` | Fraction of the image to keep when center cropping |
| `random_rotation_angle` | `int \| None` | `None` | Maximum rotation angle (in degrees) for data augmentation |
| `color_jitter_params` | `dict[str, float] \| None` | `None` | Parameters for color jitter augmentation (brightness, contrast, saturation, hue) |
| `use_albumentations_transforms` | `bool` | `True` | Use the Albumentations library for image augmentation (vs. torchvision) |
| `formalize_language` | `bool` | `True` | Lowercase and remove punctuation from language instructions |
| `apply_sincos_state_encoding` | `bool` | `False` | Apply sin/cos encoding to state features per embodiment |
| `use_relative_action` | `bool` | `False` | Use relative actions instead of absolute actions |
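The `formalize_language` flag normalizes instructions before tokenization. A minimal sketch of the described behavior (the real implementation may differ, e.g. in exactly which punctuation it strips):

```python
import string

def formalize(text: str) -> str:
    # Sketch of formalize_language=True: lowercase the instruction,
    # strip punctuation, and collapse whitespace.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())
```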

Action head dimensions

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `max_state_dim` | `int` | `29` | Maximum state dimension across all embodiments (for padding) |
| `max_action_dim` | `int` | `29` | Maximum action dimension across all embodiments (for padding) |
| `action_horizon` | `int` | `16` | Number of future action steps to predict |
| `hidden_size` | `int` | `1024` | Hidden dimension for the action head MLPs |
| `input_embedding_dim` | `int` | `1536` | Embedding dimension for state and action inputs to the DiT |
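The `max_state_dim` and `max_action_dim` fields exist so that embodiments with different state and action dimensionalities can share one model: shorter vectors are zero-padded up to the shared maximum. A minimal sketch of that padding (hypothetical helper, not the library's API):

```python
def pad_to_max(vec: list[float], max_dim: int) -> list[float]:
    # Zero-pad an embodiment's state/action vector up to the shared
    # max_state_dim / max_action_dim so batches can be stacked.
    if len(vec) > max_dim:
        raise ValueError(f"vector dim {len(vec)} exceeds max_dim {max_dim}")
    return vec + [0.0] * (max_dim - len(vec))
```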

Diffusion model architecture

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `use_alternate_vl_dit` | `bool` | `True` | Use AlternateVLDiT (`True`) or the standard DiT (`False`) |
| `attend_text_every_n_blocks` | `int` | `2` | Attend to text features every N transformer blocks (AlternateVLDiT only) |
diffusion_model_cfg
dict
Configuration for the DiT transformer:
  • positional_embeddings: Type of positional embeddings (None for learned)
  • num_layers: Number of transformer layers (32 for N1D6)
  • num_attention_heads: Number of attention heads (32)
  • attention_head_dim: Dimension per attention head (48)
  • norm_type: Normalization type ("ada_norm" for adaptive layer norm)
  • dropout: Dropout probability (0.2)
  • final_dropout: Apply dropout before final layer (True)
  • output_dim: Output dimension (1024)
  • interleave_self_attention: Interleave self-attention and cross-attention (True)
Default `diffusion_model_cfg`:

```python
{
    "positional_embeddings": None,
    "num_layers": 32,
    "num_attention_heads": 32,
    "attention_head_dim": 48,
    "norm_type": "ada_norm",
    "dropout": 0.2,
    "final_dropout": True,
    "output_dim": 1024,
    "interleave_self_attention": True,
}
```
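To customize the DiT, pass a modified dict for `diffusion_model_cfg` (illustrative values; whether partial dicts are merged with the defaults depends on the dataclass implementation, so the full dict is restated here):

```python
from gr00t.configs.model.gr00t_n1d6 import Gr00tN1d6Config

# Illustrative override: a shallower DiT with 16 layers instead of 32.
custom_dit_cfg = {
    "positional_embeddings": None,
    "num_layers": 16,
    "num_attention_heads": 32,
    "attention_head_dim": 48,
    "norm_type": "ada_norm",
    "dropout": 0.2,
    "final_dropout": True,
    "output_dim": 1024,
    "interleave_self_attention": True,
}
config = Gr00tN1d6Config(diffusion_model_cfg=custom_dit_cfg)
```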

Global architecture parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `add_pos_embed` | `bool` | `True` | Add learned positional embeddings to action sequences |
| `attn_dropout` | `float` | `0.2` | Dropout probability for attention layers |
| `use_vlln` | `bool` | `True` | Apply layer normalization to vision-language features |
| `max_seq_len` | `int` | `1024` | Maximum sequence length for positional embeddings |

Flow matching parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `num_inference_timesteps` | `int` | `4` | Number of denoising steps during inference |
| `noise_beta_alpha` | `float` | `1.5` | Alpha parameter of the Beta-distribution noise schedule |
| `noise_beta_beta` | `float` | `1.0` | Beta parameter of the Beta-distribution noise schedule |
| `noise_s` | `float` | `0.999` | Noise scaling factor: `t = (1 - beta_sample) * noise_s` |
| `num_timestep_buckets` | `int` | `1000` | Number of discrete timestep buckets for diffusion |
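Putting these parameters together, a training timestep is drawn from a Beta distribution, scaled by `noise_s`, and discretized into a bucket. A sketch under the formula stated above (hypothetical helper, not the library's API):

```python
import random

def sample_timestep(alpha: float = 1.5, beta: float = 1.0,
                    noise_s: float = 0.999, num_buckets: int = 1000):
    # Draw from Beta(alpha, beta), map to t = (1 - sample) * noise_s,
    # then discretize t into one of num_timestep_buckets buckets.
    beta_sample = random.betavariate(alpha, beta)
    t = (1.0 - beta_sample) * noise_s
    return t, int(t * num_buckets)
```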

Training parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `tune_projector` | `bool` | `True` | Fine-tune the state encoder, action encoder, and action decoder |
| `tune_diffusion_model` | `bool` | `True` | Fine-tune the DiT transformer in the action head |
| `tune_vlln` | `bool` | `True` | Fine-tune the vision-language layer normalization |
| `state_dropout_prob` | `float` | `0.0` | Probability of dropping out state features during training |
| `state_additive_noise_scale` | `float` | `0.0` | Scale of additive Gaussian noise on state features during training |
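The two state regularizers can be sketched as follows (assumed semantics; whether dropout zeroes the whole vector or individual dimensions is an implementation detail not stated here):

```python
import random

def perturb_state(state: list[float], dropout_prob: float = 0.0,
                  noise_scale: float = 0.0) -> list[float]:
    # With probability dropout_prob, drop the state entirely (zeros);
    # otherwise add Gaussian noise with std noise_scale to each element.
    if random.random() < dropout_prob:
        return [0.0] * len(state)
    return [s + random.gauss(0.0, noise_scale) for s in state]
```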

Multi-embodiment parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `max_num_embodiments` | `int` | `32` | Maximum number of embodiments the model can support |

Methods

to_filtered_dict

Return a dictionary representation, optionally excluding augmentation parameters.
```python
def to_filtered_dict(self, exclude_augment: bool = True) -> dict
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `exclude_augment` | `bool` | `True` | Whether to exclude augmentation-related keys from the dictionary |

Returns `dict`: a dictionary representation of the configuration.
Example:

```python
config = Gr00tN1d6Config()
config_dict = config.to_filtered_dict(exclude_augment=True)
```

to_filtered_json

Return a JSON string representation, optionally excluding augmentation parameters.
```python
def to_filtered_json(self, exclude_augment: bool = True, **kwargs) -> str
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `exclude_augment` | `bool` | `True` | Whether to exclude augmentation-related keys from the JSON |
| `**kwargs` | `dict` | | Additional arguments passed to `json.dumps()` |

Returns `str`: a JSON string representation of the configuration.
Example:

```python
config = Gr00tN1d6Config()
json_str = config.to_filtered_json(exclude_augment=True, indent=2)
print(json_str)
```

Configuration from YAML

Configurations are typically loaded from YAML files during training:
```yaml
model:
  model_type: "GrootN1d6"
  model_name: "nvidia/Eagle-Block2A-2B-v2"

  # Backbone settings
  tune_llm: false
  tune_visual: false
  tune_top_llm_layers: 4
  use_flash_attention: true

  # Action head settings
  action_horizon: 16
  max_action_dim: 29
  max_state_dim: 29
  tune_projector: true
  tune_diffusion_model: true

  # Flow matching
  num_inference_timesteps: 4
  noise_beta_alpha: 1.5
  noise_beta_beta: 1.0

  # Augmentation
  use_albumentations_transforms: true
  shortest_image_edge: 256
  crop_fraction: 0.95
```

Model registration

The configuration is automatically registered with the model registry:
```python
from gr00t.configs.model import register_model_config
from gr00t.configs.model.gr00t_n1d6 import Gr00tN1d6Config

register_model_config("GrootN1d6", Gr00tN1d6Config)
```

Saving and loading

Configurations are automatically saved during training:
```python
# During training setup
with open(save_cfg_dir / "final_model_config.json", "w") as f:
    f.write(model.config.to_filtered_json())

# Loading from checkpoint
from transformers import AutoConfig

config = AutoConfig.from_pretrained("path/to/checkpoint")
```

Backward compatibility

The config includes backward compatibility for legacy arguments:
```python
# Legacy argument (deprecated)
config = Gr00tN1d6Config(collator_overwrite_image_inputs=True)

# ...which is automatically mapped to:
assert config.eagle_collator is True
```
