Gr00tN1d6Config
Unified configuration dataclass for the Gr00tN1d6 model, combining backbone and action head parameters.

Model identification
HuggingFace model type identifier
Model data type (use bfloat16 for Flash Attention compatibility)
Backbone configuration
HuggingFace model name or path for the vision-language backbone
Type of backbone model architecture
Specific model revision to use from HuggingFace Hub
Dimension of backbone output embeddings (project_to_dim)
Whether to fine-tune the LLM component of the backbone
Whether to fine-tune the visual encoder of the backbone
Number of top LLM layers to tune (when tune_llm is True)
Which layer to extract features from in the backbone
Whether to reproject vision features to a different dimension
Enable Flash Attention for efficient attention computation
Load backbone weights in bfloat16 precision
Keep trainable backbone parameters in FP32 for numerical stability
Use Eagle-specific collator that allows dynamic image size changes (needed for any-resolution)
Processing parameters
Target crop size for images (height, width)
Target resize for images before cropping (height, width)
Resize shortest edge of image to this size
Fraction of image to keep when center cropping
Maximum rotation angle (in degrees) for data augmentation
Parameters for color jitter augmentation (brightness, contrast, saturation, hue)
Use Albumentations library for image augmentation (vs torchvision)
Lowercase and remove punctuation from language instructions
Apply sin/cos encoding to state features per-embodiment
Use relative actions instead of absolute actions
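The center-crop fraction above can be read as keeping the central fraction of each spatial dimension. A minimal sketch of that interpretation (the helper name and HWC layout are assumptions, not the actual implementation):

```python
import numpy as np

def center_crop_fraction(img: np.ndarray, fraction: float) -> np.ndarray:
    """Keep the central `fraction` of height and width (HWC layout assumed)."""
    h, w = img.shape[:2]
    ch, cw = int(round(h * fraction)), int(round(w * fraction))
    top, left = (h - ch) // 2, (w - cw) // 2
    return img[top:top + ch, left:left + cw]
```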
Action head dimensions
Maximum state dimension across all embodiments (for padding)
Maximum action dimension across all embodiments (for padding)
Number of future action steps to predict
Hidden dimension for action head MLPs
Embedding dimension for state and action inputs to DiT
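Because state and action dimensions vary per embodiment, the maximum-dimension parameters imply zero-padding to a shared width. A hypothetical padding helper (name and `max_dim` value are illustrative):

```python
import numpy as np

def pad_to_max(x: np.ndarray, max_dim: int) -> np.ndarray:
    """Zero-pad the last axis so embodiments with different state/action
    dimensions share one tensor shape."""
    pad = max_dim - x.shape[-1]
    return np.pad(x, [(0, 0)] * (x.ndim - 1) + [(0, pad)])
```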
Diffusion model architecture
Use AlternateVLDiT (True) or standard DiT (False)
Attend to text features every N transformer blocks (for AlternateVLDiT)
Configuration for the DiT transformer:
- positional_embeddings: Type of positional embeddings (None for learned)
- num_layers: Number of transformer layers (32 for N1D6)
- num_attention_heads: Number of attention heads (32)
- attention_head_dim: Dimension per attention head (48)
- norm_type: Normalization type ("ada_norm" for adaptive layer norm)
- dropout: Dropout probability (0.2)
- final_dropout: Apply dropout before final layer (True)
- output_dim: Output dimension (1024)
- interleave_self_attention: Interleave self-attention and cross-attention (True)
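Collected as a plain dict, the N1D6 DiT settings above would look like the following (values are taken directly from the descriptions; note the implied hidden size of 32 × 48 = 1536):

```python
# DiT transformer settings for N1D6, per the field descriptions above.
dit_config = dict(
    positional_embeddings=None,       # learned positional embeddings
    num_layers=32,
    num_attention_heads=32,
    attention_head_dim=48,            # hidden size = 32 * 48 = 1536
    norm_type="ada_norm",             # adaptive layer norm
    dropout=0.2,
    final_dropout=True,
    output_dim=1024,
    interleave_self_attention=True,
)
```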
Global architecture parameters
Add learned positional embeddings to action sequences
Dropout probability for attention layers
Apply layer normalization to vision-language features
Maximum sequence length for positional embeddings
Flow matching parameters
Number of denoising steps during inference
Alpha parameter for Beta distribution noise schedule
Beta parameter for Beta distribution noise schedule
Noise scaling factor:
t = (1 - beta_sample) * noise_s

Number of discrete timestep buckets for diffusion
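The timestep sampling above can be sketched end to end: draw from a Beta(alpha, beta) distribution, map through the noise scaling factor, and discretize into buckets. The parameter values below are illustrative, not the model's actual defaults:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 1.5, 1.0   # hypothetical Beta-schedule parameters
noise_s = 0.999          # hypothetical noise scaling factor
num_buckets = 1000       # hypothetical number of discrete timestep buckets

beta_sample = rng.beta(alpha, beta, size=(4,))      # one draw per batch element
t = (1.0 - beta_sample) * noise_s                   # continuous timestep in [0, noise_s]
t_discrete = np.floor(t * num_buckets).astype(int)  # bucket index
```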
Training parameters
Fine-tune state encoder, action encoder, and action decoder
Fine-tune the DiT transformer in the action head
Fine-tune the vision-language layer normalization
Probability of dropping out state features during training
Scale of additive Gaussian noise on state features during training
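One plausible reading of the two state-regularization parameters above — dropping the state vector with some probability and adding scaled Gaussian noise — can be sketched as follows (whether dropout is applied per-feature or to the whole vector is an assumption here, as are the values):

```python
import numpy as np

rng = np.random.default_rng(0)
state_dropout_prob = 0.1   # hypothetical dropout probability
state_noise_scale = 0.01   # hypothetical noise scale

def augment_state(state: np.ndarray) -> np.ndarray:
    """Randomly zero the state vector, then add Gaussian noise (training only)."""
    if rng.random() < state_dropout_prob:
        state = np.zeros_like(state)
    return state + rng.normal(0.0, state_noise_scale, size=state.shape)
```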
Multi-embodiment parameters
Maximum number of embodiments the model can support
Methods
to_filtered_dict
Return a dictionary representation, optionally excluding augmentation parameters.

Whether to exclude augmentation-related keys from the dictionary
Dictionary representation of the configuration
to_filtered_json
Return a JSON string representation, optionally excluding augmentation parameters.

Whether to exclude augmentation-related keys from the JSON
Additional arguments passed to json.dumps()

JSON string representation of the configuration
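A trimmed-down sketch of how these two helpers might be implemented on the dataclass — the field names and the exact set of augmentation keys are illustrative, not the full Gr00tN1d6Config schema:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical set of augmentation-related keys to filter out.
_AUGMENTATION_KEYS = {"rotation_max_degrees", "color_jitter_params"}

@dataclass
class Gr00tN1d6ConfigSketch:
    backbone_model_name: str = "eagle"
    rotation_max_degrees: float = 5.0
    color_jitter_params: tuple = (0.3, 0.4, 0.5, 0.1)

    def to_filtered_dict(self, exclude_augmentation: bool = True) -> dict:
        d = asdict(self)
        if exclude_augmentation:
            d = {k: v for k, v in d.items() if k not in _AUGMENTATION_KEYS}
        return d

    def to_filtered_json(self, exclude_augmentation: bool = True, **kwargs) -> str:
        return json.dumps(self.to_filtered_dict(exclude_augmentation), **kwargs)
```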
Configuration from YAML
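A hypothetical YAML fragment for such a config — the keys shown are illustrative and have not been verified against the actual schema:

```yaml
# Illustrative config fragment; key names are assumptions.
model:
  model_type: gr00t_n1d6
  backbone_model_name: eagle
  action_horizon: 16
  num_inference_timesteps: 4
  tune_llm: false
  tune_visual: false
```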
Configurations are typically loaded from YAML files during training.

Model registration
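A minimal sketch of what registry-based registration looks like in general; the real code likely uses HuggingFace's `AutoConfig.register`/`AutoModel.register` machinery rather than this hand-rolled dict:

```python
# Hypothetical registry sketch; not the actual registration mechanism.
MODEL_REGISTRY: dict[str, type] = {}

def register_config(model_type: str):
    """Decorator that maps a model_type string to its config class."""
    def wrapper(cls):
        MODEL_REGISTRY[model_type] = cls
        return cls
    return wrapper

@register_config("gr00t_n1d6")
class Gr00tN1d6Config:
    model_type = "gr00t_n1d6"
```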
The configuration is automatically registered with the model registry.

Saving and loading
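Saving boils down to persisting the config dict as `config.json` alongside the checkpoint and reloading it later; a stdlib-only sketch (real training code may use HuggingFace's `save_pretrained`/`from_pretrained` instead, and the fields shown are illustrative):

```python
import json
import pathlib
import tempfile

# Illustrative config fields.
config = {"model_type": "gr00t_n1d6", "action_horizon": 16}

# Write config.json next to the (hypothetical) checkpoint directory.
out_dir = pathlib.Path(tempfile.mkdtemp())
(out_dir / "config.json").write_text(json.dumps(config, indent=2))

# Later: reload the saved configuration.
reloaded = json.loads((out_dir / "config.json").read_text())
```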
Configurations are automatically saved during training.

Backward compatibility
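Legacy-argument support typically amounts to remapping old keyword names to their current equivalents before construction. The actual legacy names are not documented here, so the aliases below are purely illustrative:

```python
# Hypothetical legacy-name aliases; the real mapping is not documented here.
_LEGACY_ALIASES = {"action_dim": "max_action_dim", "state_dim": "max_state_dim"}

def remap_legacy_kwargs(kwargs: dict) -> dict:
    """Rename legacy keyword arguments to their current field names."""
    return {_LEGACY_ALIASES.get(key, key): value for key, value in kwargs.items()}
```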
The config includes backward compatibility for legacy arguments.

See also
- GR00T model class - Main model class using this configuration
- Training configuration - Training-specific configuration
- Data configuration - Data loading and processing configuration