To use the older GR00T N1.5 version, check out the n1.5-release branch.
GR00T N1.6 is a significant upgrade over GR00T N1.5, with improvements to both the model architecture and the training data that deliver better performance across many tasks.

Model architecture improvements

Vision-language backbone

New VLM foundation

N1.6 uses an internal NVIDIA Cosmos-Reason-2B VLM variant with flexible resolution support and native aspect ratio encoding (no padding required).
The new vision-language model is trained on both general vision-language tasks and embodied reasoning tasks like next action prediction, providing stronger grounding for robotic control. Key advantages:
  • Flexible resolution processing without padding artifacts
  • Native aspect ratio support for diverse camera configurations
  • Embodied reasoning capabilities built into the foundation
  • Improved visual understanding for manipulation tasks
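One way to picture flexible-resolution, no-padding encoding: each frame is tiled into a whole number of patches at its native aspect ratio rather than being padded out to a fixed square. The sketch below is illustrative only; the 14 px patch size is an assumed, typical ViT value, not the model's actual configuration.

```python
# Illustrative only: native-aspect-ratio patch grid with no padding.
# The 14 px patch size is an assumption, not the model's real config.
def patch_grid(height: int, width: int, patch: int = 14) -> tuple[int, int]:
    # Each side maps to a whole number of patches; nothing is padded
    # out to a fixed square resolution.
    return height // patch, width // patch

# A wide camera frame keeps its shape: 32 x 56 patches.
rows, cols = patch_grid(448, 784)
```

Because the patch count adapts to the input shape, wide, tall, and square camera streams can be encoded without the padding artifacts a fixed-resolution tokenizer introduces.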

Diffusion transformer scaling

32-layer DiT

The action head uses a 32-layer diffusion transformer (DiT) for increased model capacity and better action prediction.
# From gr00t/model/gr00t_n1d6/gr00t_n1d6.py
self.model = AlternateVLDiT(
    **config.diffusion_model_cfg,
    cross_attention_dim=config.backbone_embedding_dim,
    attend_text_every_n_blocks=config.attend_text_every_n_blocks,
)
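The attend_text_every_n_blocks argument suggests the DiT interleaves text cross-attention rather than running it in every block. A hypothetical sketch of such a schedule follows; the actual pattern (offset, alternation) inside AlternateVLDiT may differ.

```python
# Hypothetical: which DiT blocks cross-attend to the VLM text tokens when
# text attention runs every `every_n` blocks. The real AlternateVLDiT
# schedule may use a different offset or pattern.
def text_attention_blocks(num_layers: int, every_n: int) -> list[int]:
    return [i for i in range(num_layers) if i % every_n == 0]

blocks = text_attention_blocks(32, 4)  # [0, 4, 8, ..., 28]
```

Interleaving keeps language conditioning in the action head while avoiding the cost of full cross-attention in all 32 layers.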

Architecture simplification

Top-layer tuning

N1.6 removes the 4-layer transformer adapter after the VLM. Instead, the top 4 layers of the VLM are unfrozen during pretraining for more efficient fine-tuning.
Code reference: gr00t/model/modules/eagle_backbone.py:75-78
if tune_top_llm_layers > 0:
    for layer in self.model.language_model.model.layers[-tune_top_llm_layers:]:
        for param in layer.parameters():
            param.requires_grad = True
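The effect of the snippet above can be shown with a framework-free toy model: freeze every block, unfreeze the last k, and count what remains trainable. The block count and parameter sizes below are arbitrary.

```python
# Toy, framework-free illustration of top-layer tuning: only the last
# `tune_top` blocks end up trainable. Parameter counts are arbitrary.
class Block:
    def __init__(self, n_params: int):
        self.n_params = n_params
        self.requires_grad = False  # everything starts frozen

layers = [Block(72) for _ in range(12)]
tune_top = 4
for layer in layers[-tune_top:]:
    layer.requires_grad = True

trainable = sum(b.n_params for b in layers if b.requires_grad)  # 4 * 72
```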

State-relative action prediction

N1.6 predicts state-relative action chunks for most embodiments, rather than the absolute joint angles or end-effector positions used in N1.5.
This architectural change improves:
  • Generalization across different initial configurations
  • Transfer learning between similar embodiments
  • Robustness to calibration differences
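A minimal sketch of the idea, assuming a simple additive offset (the actual N1.6 parameterization may differ, e.g. for orientation dimensions): an action chunk is expressed as deltas from the current state, then converted back to absolute targets at execution time.

```python
import numpy as np

# Sketch only: state-relative chunks as additive offsets from the current
# state. The real parameterization may differ (e.g. for rotation dims).
def to_state_relative(chunk: np.ndarray, state: np.ndarray) -> np.ndarray:
    return chunk - state[None, :]

def to_absolute(rel_chunk: np.ndarray, state: np.ndarray) -> np.ndarray:
    return rel_chunk + state[None, :]

state = np.array([0.1, -0.5, 0.3])       # current joint positions
chunk = np.array([[0.1, -0.4, 0.3],      # absolute targets...
                  [0.2, -0.3, 0.4]])
rel = to_state_relative(chunk, state)    # ...become small deltas
```

Because the model only ever sees deltas, the same learned chunk transfers across different starting configurations, which is where the generalization and calibration-robustness benefits come from.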

Training data expansion

Beyond the N1.5 data mixture, the N1.6 pretraining data additionally includes several thousand hours of teleoperated data from:

  • Bimanual YAM arms: high-quality bimanual manipulation demonstrations
  • AGIBot Genie1: humanoid robot teleoperation data
  • Galaxea R1 Pro (BEHAVIOR): simulated whole-body loco-manipulation on the BEHAVIOR-1K suite
  • Unitree G1: whole-body loco-manipulation demonstrations
The expanded dataset provides:
  • Greater embodiment diversity (bimanual, semi-humanoid, full humanoid)
  • More task variety (tabletop, loco-manipulation, whole-body control)
  • Improved zero-shot generalization
  • Better foundation for downstream fine-tuning

Code-level improvements

Faster data loading

Sharded dataloader

N1.6 ships a new sharded dataloader implementation with significantly higher throughput for multi-GPU training.
Key features:
  • Parallel data loading across GPUs
  • Reduced I/O bottlenecks
  • Better utilization of distributed training
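The core idea of sharding can be sketched in a few lines (a deliberate simplification of any real implementation): each rank owns a disjoint slice of the sample indices, so workers never contend for the same files.

```python
# Simplified sketch of rank-based sharding: every GPU rank reads a
# disjoint stride of the dataset, eliminating duplicate I/O.
def shard_indices(num_samples: int, rank: int, world_size: int) -> list[int]:
    return list(range(rank, num_samples, world_size))

# rank 0 -> [0, 4, 8], rank 1 -> [1, 5, 9], rank 2 -> [2, 6], rank 3 -> [3, 7]
shards = [shard_indices(10, r, world_size=4) for r in range(4)]
```

Together the shards cover the dataset exactly once per epoch, which is what removes the I/O bottleneck of every worker scanning the full dataset.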

Simplified data processing

Single processing script

All data processing is unified in processing_gr00t_n1d6.py via the Gr00tN1d6DataCollator:
# From gr00t/model/gr00t_n1d6/gr00t_n1d6.py:456
from .processing_gr00t_n1d6 import Gr00tN1d6DataCollator

self.collator = Gr00tN1d6DataCollator(
    model_name=config.model_name,
    model_type=config.backbone_model_type,
    transformers_loading_kwargs=transformers_loading_kwargs,
)

Flexible training configuration

N1.6 introduces more granular control over training:
# From gr00t/model/gr00t_n1d6/gr00t_n1d6.py:86-88
def set_trainable_parameters(
    self, tune_projector: bool, tune_diffusion_model: bool, tune_vlln: bool
):
    self.tune_projector = tune_projector
    self.tune_diffusion_model = tune_diffusion_model
    self.tune_vlln = tune_vlln
Training options:
  • tune_projector: Control state encoder and action decoder training
  • tune_diffusion_model: Control DiT training
  • tune_vlln: Control vision-language layer norm training
  • tune_top_llm_layers: Fine-tune specific VLM layers

Inference optimizations

RTC wrapper

Real-time control wrapper for low-latency deployment (coming soon)

Async policy

Asynchronous policy execution for parallel environment rollouts (coming soon)

Performance comparison

Inference timing on RTX 5090 (4 denoising steps, single view):
| Component       | N1.6 (torch.compile) | Notes               |
| --------------- | -------------------- | ------------------- |
| Data Processing | 2 ms                 | Input preprocessing |
| Backbone        | 18 ms                | VLM forward pass    |
| Action Head     | 16 ms                | DiT denoising       |
| End-to-End      | 37 ms                | 27.3 Hz             |
Despite the larger architecture (32 vs 16 DiT layers), N1.6 maintains competitive inference speed through optimization.
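As a quick sanity check on the table: the component latencies roughly sum to the end-to-end figure (about 1 ms of overhead), and 37 ms per step corresponds to roughly 27 Hz, consistent with the reported 27.3 Hz once rounding of the 37 ms figure is accounted for.

```python
# Sanity-check the timing table above.
data_ms, backbone_ms, head_ms = 2, 18, 16
end_to_end_ms = 37

component_sum = data_ms + backbone_ms + head_ms  # 36 ms, ~1 ms overhead
hz = 1000 / end_to_end_ms                        # ~27.0 Hz
```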

Hardware compatibility

Tested configurations:
  • RTX 5090: 27.3 Hz (recommended for development)
  • H100: 26.3 Hz (optimal for training)
  • RTX 4090: 22.8 Hz (good for deployment)
  • L40: Supported for fine-tuning
  • A6000: Longer training time but functional

Migration from N1.5

If you’re upgrading from GR00T N1.5:
1. Update repository: pull the latest main branch (N1.6 is the default).
2. Update dependencies: run uv sync --python 3.10 to pick up the new dependency set.
3. Download new checkpoints: use nvidia/GR00T-N1.6-3B instead of nvidia/GR00T-N1.5-3B.
4. Update configs: review your modality configs for the state-relative action changes.
5. Retrain if needed: consider retraining on your dataset to benefit from the architecture improvements.
N1.6 and N1.5 checkpoints are not compatible: the architecture changes mean N1.5 weights cannot be loaded into the N1.6 model.

What’s next

Upcoming features in development:
  • RTC and Async Policy Wrapper for production deployment
  • Additional pre-trained checkpoints for more embodiments
  • Enhanced TensorRT optimization scripts
  • Improved documentation for real robot deployment
For the latest updates, follow the GitHub repository and research blog.
