
Gr00tN1d6Config

Unified configuration dataclass for the Gr00tN1d6 model, combining backbone and action head parameters.
```python
from gr00t.configs.model.gr00t_n1d6 import Gr00tN1d6Config

config = Gr00tN1d6Config(
    model_name="nvidia/Eagle-Block2A-2B-v2",
    action_horizon=16,
    max_action_dim=29,
    max_state_dim=29,
    tune_llm=False,
    tune_visual=False,
    tune_projector=True,
    tune_diffusion_model=True,
)
```

Model identification

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_type` | `str` | `"Gr00tN1d6"` | HuggingFace model type identifier |
| `model_dtype` | `str` | `"bfloat16"` | Model data type (use `bfloat16` for Flash Attention compatibility) |

Backbone configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_name` | `str` | `"nvidia/Eagle-Block2A-2B-v2"` | HuggingFace model name or path for the vision-language backbone |
| `backbone_model_type` | `str` | `"eagle"` | Type of backbone model architecture |
| `model_revision` | `str \| None` | `None` | Specific model revision to use from the HuggingFace Hub |
| `backbone_embedding_dim` | `int` | `2048` | Dimension of backbone output embeddings (`project_to_dim`) |
| `tune_llm` | `bool` | `False` | Whether to fine-tune the LLM component of the backbone |
| `tune_visual` | `bool` | `False` | Whether to fine-tune the visual encoder of the backbone |
| `tune_top_llm_layers` | `int` | `4` | Number of top LLM layers to tune (when `tune_llm` is `True`) |
| `select_layer` | `int` | `16` | Which backbone layer to extract features from |
| `reproject_vision` | `bool` | `False` | Whether to reproject vision features to a different dimension |
| `use_flash_attention` | `bool` | `True` | Enable Flash Attention for efficient attention computation |
| `load_bf16` | `bool` | `True` | Load backbone weights in bfloat16 precision |
| `backbone_trainable_params_fp32` | `bool` | `True` | Keep trainable backbone parameters in FP32 for numerical stability |
| `eagle_collator` | `bool` | `False` | Use the Eagle-specific collator that allows dynamic image size changes (needed for any-resolution input) |
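For example, to unfreeze only the top of the language model while keeping the visual encoder frozen, the tuning flags can be combined like this (illustrative values, not a recommendation):

```python
from gr00t.configs.model.gr00t_n1d6 import Gr00tN1d6Config

# Illustrative: fine-tune only the top 4 LLM layers; the visual encoder
# stays frozen, and trainable backbone parameters are kept in FP32.
config = Gr00tN1d6Config(
    tune_llm=True,
    tune_top_llm_layers=4,
    tune_visual=False,
    backbone_trainable_params_fp32=True,
)
```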

Processing parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `image_crop_size` | `tuple[int, int] \| None` | `None` | Target crop size for images (height, width) |
| `image_target_size` | `tuple[int, int] \| None` | `None` | Target resize for images before cropping (height, width) |
| `shortest_image_edge` | `int \| None` | `256` | Resize the shortest edge of each image to this size |
| `crop_fraction` | `float \| None` | `0.95` | Fraction of the image to keep when center cropping |
| `random_rotation_angle` | `int \| None` | `None` | Maximum rotation angle (in degrees) for data augmentation |
| `color_jitter_params` | `dict[str, float] \| None` | `None` | Parameters for color jitter augmentation (brightness, contrast, saturation, hue) |
| `use_albumentations_transforms` | `bool` | `True` | Use the Albumentations library for image augmentation (vs. torchvision) |
| `formalize_language` | `bool` | `True` | Lowercase and remove punctuation from language instructions |
| `apply_sincos_state_encoding` | `bool` | `False` | Apply sin/cos encoding to state features per embodiment |
| `use_relative_action` | `bool` | `False` | Use relative actions instead of absolute actions |
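The `formalize_language` flag normalizes instructions before tokenization. A minimal sketch of the described behavior (the real implementation may differ, e.g. in exactly which punctuation it strips):

```python
import string

def formalize(text: str) -> str:
    # Sketch of formalize_language=True: lowercase the instruction,
    # strip punctuation, and collapse whitespace.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())
```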

Action head dimensions

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `max_state_dim` | `int` | `29` | Maximum state dimension across all embodiments (for padding) |
| `max_action_dim` | `int` | `29` | Maximum action dimension across all embodiments (for padding) |
| `action_horizon` | `int` | `16` | Number of future action steps to predict |
| `hidden_size` | `int` | `1024` | Hidden dimension for the action head MLPs |
| `input_embedding_dim` | `int` | `1536` | Embedding dimension for state and action inputs to the DiT |
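The `max_state_dim` and `max_action_dim` fields exist so that embodiments with different state and action dimensionalities can share one model: shorter vectors are zero-padded up to the shared maximum. A minimal sketch of that padding (hypothetical helper, not the library's API):

```python
def pad_to_max(vec: list[float], max_dim: int) -> list[float]:
    # Zero-pad an embodiment's state/action vector up to the shared
    # max_state_dim / max_action_dim so batches can be stacked.
    if len(vec) > max_dim:
        raise ValueError(f"vector dim {len(vec)} exceeds max_dim {max_dim}")
    return vec + [0.0] * (max_dim - len(vec))
```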

Diffusion model architecture

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `use_alternate_vl_dit` | `bool` | `True` | Use AlternateVLDiT (`True`) or the standard DiT (`False`) |
| `attend_text_every_n_blocks` | `int` | `2` | Attend to text features every N transformer blocks (AlternateVLDiT only) |
diffusion_model_cfg
dict
Configuration for the DiT transformer:
  • positional_embeddings: Type of positional embeddings (None for learned)
  • num_layers: Number of transformer layers (32 for N1D6)
  • num_attention_heads: Number of attention heads (32)
  • attention_head_dim: Dimension per attention head (48)
  • norm_type: Normalization type ("ada_norm" for adaptive layer norm)
  • dropout: Dropout probability (0.2)
  • final_dropout: Apply dropout before final layer (True)
  • output_dim: Output dimension (1024)
  • interleave_self_attention: Interleave self-attention and cross-attention (True)
Default `diffusion_model_cfg`:

```python
{
    "positional_embeddings": None,
    "num_layers": 32,
    "num_attention_heads": 32,
    "attention_head_dim": 48,
    "norm_type": "ada_norm",
    "dropout": 0.2,
    "final_dropout": True,
    "output_dim": 1024,
    "interleave_self_attention": True,
}
```
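To customize the DiT, pass a modified dict for `diffusion_model_cfg` (illustrative values; whether partial dicts are merged with the defaults depends on the dataclass implementation, so the full dict is restated here):

```python
from gr00t.configs.model.gr00t_n1d6 import Gr00tN1d6Config

# Illustrative override: a shallower DiT with 16 layers instead of 32.
custom_dit_cfg = {
    "positional_embeddings": None,
    "num_layers": 16,
    "num_attention_heads": 32,
    "attention_head_dim": 48,
    "norm_type": "ada_norm",
    "dropout": 0.2,
    "final_dropout": True,
    "output_dim": 1024,
    "interleave_self_attention": True,
}
config = Gr00tN1d6Config(diffusion_model_cfg=custom_dit_cfg)
```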

Global architecture parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `add_pos_embed` | `bool` | `True` | Add learned positional embeddings to action sequences |
| `attn_dropout` | `float` | `0.2` | Dropout probability for attention layers |
| `use_vlln` | `bool` | `True` | Apply layer normalization to vision-language features |
| `max_seq_len` | `int` | `1024` | Maximum sequence length for positional embeddings |

Flow matching parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `num_inference_timesteps` | `int` | `4` | Number of denoising steps during inference |
| `noise_beta_alpha` | `float` | `1.5` | Alpha parameter of the Beta-distribution noise schedule |
| `noise_beta_beta` | `float` | `1.0` | Beta parameter of the Beta-distribution noise schedule |
| `noise_s` | `float` | `0.999` | Noise scaling factor: `t = (1 - beta_sample) * noise_s` |
| `num_timestep_buckets` | `int` | `1000` | Number of discrete timestep buckets for diffusion |
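Putting these parameters together, a training timestep is drawn from a Beta distribution, scaled by `noise_s`, and discretized into a bucket. A sketch under the formula stated above (hypothetical helper, not the library's API):

```python
import random

def sample_timestep(alpha: float = 1.5, beta: float = 1.0,
                    noise_s: float = 0.999, num_buckets: int = 1000):
    # Draw from Beta(alpha, beta), map to t = (1 - sample) * noise_s,
    # then discretize t into one of num_timestep_buckets buckets.
    beta_sample = random.betavariate(alpha, beta)
    t = (1.0 - beta_sample) * noise_s
    return t, int(t * num_buckets)
```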

Training parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `tune_projector` | `bool` | `True` | Fine-tune the state encoder, action encoder, and action decoder |
| `tune_diffusion_model` | `bool` | `True` | Fine-tune the DiT transformer in the action head |
| `tune_vlln` | `bool` | `True` | Fine-tune the vision-language layer normalization |
| `state_dropout_prob` | `float` | `0.0` | Probability of dropping out state features during training |
| `state_additive_noise_scale` | `float` | `0.0` | Scale of additive Gaussian noise on state features during training |
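The two state regularizers can be sketched as follows (assumed semantics; whether dropout zeroes the whole vector or individual dimensions is an implementation detail not stated here):

```python
import random

def perturb_state(state: list[float], dropout_prob: float = 0.0,
                  noise_scale: float = 0.0) -> list[float]:
    # With probability dropout_prob, drop the state entirely (zeros);
    # otherwise add Gaussian noise with std noise_scale to each element.
    if random.random() < dropout_prob:
        return [0.0] * len(state)
    return [s + random.gauss(0.0, noise_scale) for s in state]
```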

Multi-embodiment parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `max_num_embodiments` | `int` | `32` | Maximum number of embodiments the model can support |

Methods

to_filtered_dict

Return a dictionary representation, optionally excluding augmentation parameters.
```python
def to_filtered_dict(self, exclude_augment: bool = True) -> dict
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `exclude_augment` | `bool` | `True` | Whether to exclude augmentation-related keys from the dictionary |

Returns `dict`: a dictionary representation of the configuration.
Example:

```python
config = Gr00tN1d6Config()
config_dict = config.to_filtered_dict(exclude_augment=True)
```

to_filtered_json

Return a JSON string representation, optionally excluding augmentation parameters.
```python
def to_filtered_json(self, exclude_augment: bool = True, **kwargs) -> str
```
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `exclude_augment` | `bool` | `True` | Whether to exclude augmentation-related keys from the JSON |
| `**kwargs` | `dict` | | Additional arguments passed to `json.dumps()` |

Returns `str`: a JSON string representation of the configuration.
Example:

```python
config = Gr00tN1d6Config()
json_str = config.to_filtered_json(exclude_augment=True, indent=2)
print(json_str)
```

Configuration from YAML

Configurations are typically loaded from YAML files during training:
```yaml
model:
  model_type: "GrootN1d6"
  model_name: "nvidia/Eagle-Block2A-2B-v2"

  # Backbone settings
  tune_llm: false
  tune_visual: false
  tune_top_llm_layers: 4
  use_flash_attention: true

  # Action head settings
  action_horizon: 16
  max_action_dim: 29
  max_state_dim: 29
  tune_projector: true
  tune_diffusion_model: true

  # Flow matching
  num_inference_timesteps: 4
  noise_beta_alpha: 1.5
  noise_beta_beta: 1.0

  # Augmentation
  use_albumentations_transforms: true
  shortest_image_edge: 256
  crop_fraction: 0.95
```

Model registration

The configuration is automatically registered with the model registry:
```python
from gr00t.configs.model import register_model_config
from gr00t.configs.model.gr00t_n1d6 import Gr00tN1d6Config

register_model_config("GrootN1d6", Gr00tN1d6Config)
```

Saving and loading

Configurations are automatically saved during training:
```python
# During training setup
with open(save_cfg_dir / "final_model_config.json", "w") as f:
    f.write(model.config.to_filtered_json())

# Loading from checkpoint
from transformers import AutoConfig

config = AutoConfig.from_pretrained("path/to/checkpoint")
```

Backward compatibility

The config includes backward compatibility for legacy arguments:
```python
# Legacy argument (deprecated)
config = Gr00tN1d6Config(collator_overwrite_image_inputs=True)

# ...which is automatically mapped to:
assert config.eagle_collator is True
```
