To use the older GR00T N1.5 version, check out the n1.5-release branch.
GR00T N1.6 is a significant upgrade over GR00T N1.5, with improvements to both the model architecture and the training data that deliver better performance across many tasks.

Model architecture improvements

Vision-language backbone

New VLM foundation

N1.6 uses an internal NVIDIA Cosmos-Reason-2B VLM variant with flexible resolution support and native aspect ratio encoding (no padding required).
The new vision-language model is trained on both general vision-language tasks and embodied reasoning tasks like next action prediction, providing stronger grounding for robotic control. Key advantages:
  • Flexible resolution processing without padding artifacts
  • Native aspect ratio support for diverse camera configurations
  • Embodied reasoning capabilities built into the foundation
  • Improved visual understanding for manipulation tasks
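One way to picture flexible-resolution, no-padding encoding: each frame is tiled into a whole number of patches at its native aspect ratio rather than being padded out to a fixed square. The sketch below is illustrative only; the 14 px patch size is an assumed, typical ViT value, not the model's actual configuration.

```python
# Illustrative only: native-aspect-ratio patch grid with no padding.
# The 14 px patch size is an assumption, not the model's real config.
def patch_grid(height: int, width: int, patch: int = 14) -> tuple[int, int]:
    # Each side maps to a whole number of patches; nothing is padded
    # out to a fixed square resolution.
    return height // patch, width // patch

# A wide camera frame keeps its shape: 32 x 56 patches.
rows, cols = patch_grid(448, 784)
```

Because the patch count adapts to the input shape, wide, tall, and square camera streams can be encoded without the padding artifacts a fixed-resolution tokenizer introduces.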

Diffusion transformer scaling

32-layer DiT

The action head uses a 32-layer diffusion transformer (DiT) for increased model capacity and better action prediction.
# From gr00t/model/gr00t_n1d6/gr00t_n1d6.py
self.model = AlternateVLDiT(
    **config.diffusion_model_cfg,
    cross_attention_dim=config.backbone_embedding_dim,
    attend_text_every_n_blocks=config.attend_text_every_n_blocks,
)
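The attend_text_every_n_blocks argument suggests the DiT interleaves text cross-attention rather than running it in every block. A hypothetical sketch of such a schedule follows; the actual pattern (offset, alternation) inside AlternateVLDiT may differ.

```python
# Hypothetical: which DiT blocks cross-attend to the VLM text tokens when
# text attention runs every `every_n` blocks. The real AlternateVLDiT
# schedule may use a different offset or pattern.
def text_attention_blocks(num_layers: int, every_n: int) -> list[int]:
    return [i for i in range(num_layers) if i % every_n == 0]

blocks = text_attention_blocks(32, 4)  # [0, 4, 8, ..., 28]
```

Interleaving keeps language conditioning in the action head while avoiding the cost of full cross-attention in all 32 layers.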

Architecture simplification

Top-layer tuning

N1.6 removes the 4-layer transformer adapter after the VLM. Instead, the top 4 layers of the VLM are unfrozen during pretraining for more efficient fine-tuning.
Code reference: gr00t/model/modules/eagle_backbone.py:75-78
if tune_top_llm_layers > 0:
    for layer in self.model.language_model.model.layers[-tune_top_llm_layers:]:
        for param in layer.parameters():
            param.requires_grad = True
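The effect of the snippet above can be shown with a framework-free toy model: freeze every block, unfreeze the last k, and count what remains trainable. The block count and parameter sizes below are arbitrary.

```python
# Toy, framework-free illustration of top-layer tuning: only the last
# `tune_top` blocks end up trainable. Parameter counts are arbitrary.
class Block:
    def __init__(self, n_params: int):
        self.n_params = n_params
        self.requires_grad = False  # everything starts frozen

layers = [Block(72) for _ in range(12)]
tune_top = 4
for layer in layers[-tune_top:]:
    layer.requires_grad = True

trainable = sum(b.n_params for b in layers if b.requires_grad)  # 4 * 72
```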

State-relative action prediction

N1.6 predicts state-relative action chunks for most embodiments, rather than the absolute joint angles or end-effector positions used in N1.5.
This architectural change improves:
  • Generalization across different initial configurations
  • Transfer learning between similar embodiments
  • Robustness to calibration differences
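A minimal sketch of the idea, assuming a simple additive offset (the actual N1.6 parameterization may differ, e.g. for orientation dimensions): an action chunk is expressed as deltas from the current state, then converted back to absolute targets at execution time.

```python
import numpy as np

# Sketch only: state-relative chunks as additive offsets from the current
# state. The real parameterization may differ (e.g. for rotation dims).
def to_state_relative(chunk: np.ndarray, state: np.ndarray) -> np.ndarray:
    return chunk - state[None, :]

def to_absolute(rel_chunk: np.ndarray, state: np.ndarray) -> np.ndarray:
    return rel_chunk + state[None, :]

state = np.array([0.1, -0.5, 0.3])       # current joint positions
chunk = np.array([[0.1, -0.4, 0.3],      # absolute targets...
                  [0.2, -0.3, 0.4]])
rel = to_state_relative(chunk, state)    # ...become small deltas
```

Because the model only ever sees deltas, the same learned chunk transfers across different starting configurations, which is where the generalization and calibration-robustness benefits come from.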

Training data expansion

Beyond the N1.5 data mixture, the N1.6 pretraining data additionally includes several thousand hours of teleoperated data from:

  • Bimanual YAM arms: high-quality bimanual manipulation demonstrations
  • AGIBot Genie1: humanoid robot teleoperation data
  • Galaxea R1 Pro (BEHAVIOR): simulated whole-body loco-manipulation on the BEHAVIOR-1K suite
  • Unitree G1: whole-body loco-manipulation demonstrations
The expanded dataset provides:
  • Greater embodiment diversity (bimanual, semi-humanoid, full humanoid)
  • More task variety (tabletop, loco-manipulation, whole-body control)
  • Improved zero-shot generalization
  • Better foundation for downstream fine-tuning

Code-level improvements

Faster data loading

Sharded dataloader

N1.6 ships a new sharded dataloader implementation with significantly higher throughput for multi-GPU training.
Key features:
  • Parallel data loading across GPUs
  • Reduced I/O bottlenecks
  • Better utilization of distributed training
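The core idea of sharding can be sketched in a few lines (a deliberate simplification of any real implementation): each rank owns a disjoint slice of the sample indices, so workers never contend for the same files.

```python
# Simplified sketch of rank-based sharding: every GPU rank reads a
# disjoint stride of the dataset, eliminating duplicate I/O.
def shard_indices(num_samples: int, rank: int, world_size: int) -> list[int]:
    return list(range(rank, num_samples, world_size))

# rank 0 -> [0, 4, 8], rank 1 -> [1, 5, 9], rank 2 -> [2, 6], rank 3 -> [3, 7]
shards = [shard_indices(10, r, world_size=4) for r in range(4)]
```

Together the shards cover the dataset exactly once per epoch, which is what removes the I/O bottleneck of every worker scanning the full dataset.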

Simplified data processing

Single processing script

All data processing is unified in processing_gr00t_n1d6.py via the Gr00tN1d6DataCollator:
# From gr00t/model/gr00t_n1d6/gr00t_n1d6.py:456
from .processing_gr00t_n1d6 import Gr00tN1d6DataCollator

self.collator = Gr00tN1d6DataCollator(
    model_name=config.model_name,
    model_type=config.backbone_model_type,
    transformers_loading_kwargs=transformers_loading_kwargs,
)

Flexible training configuration

N1.6 introduces more granular control over training:
# From gr00t/model/gr00t_n1d6/gr00t_n1d6.py:86-88
def set_trainable_parameters(
    self, tune_projector: bool, tune_diffusion_model: bool, tune_vlln: bool
):
    self.tune_projector = tune_projector
    self.tune_diffusion_model = tune_diffusion_model
    self.tune_vlln = tune_vlln
Training options:
  • tune_projector: Control state encoder and action decoder training
  • tune_diffusion_model: Control DiT training
  • tune_vlln: Control vision-language layer norm training
  • tune_top_llm_layers: Fine-tune specific VLM layers

Inference optimizations

RTC wrapper

Real-time control wrapper for low-latency deployment (coming soon)

Async policy

Asynchronous policy execution for parallel environment rollouts (coming soon)

Performance comparison

Inference timing on RTX 5090 (4 denoising steps, single view):
| Component       | N1.6 (torch.compile) | Notes               |
| --------------- | -------------------- | ------------------- |
| Data Processing | 2 ms                 | Input preprocessing |
| Backbone        | 18 ms                | VLM forward pass    |
| Action Head     | 16 ms                | DiT denoising       |
| End-to-End      | 37 ms                | 27.3 Hz             |
Despite the larger architecture (32 vs 16 DiT layers), N1.6 maintains competitive inference speed through optimization.
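As a quick sanity check on the table: the component latencies roughly sum to the end-to-end figure (about 1 ms of overhead), and 37 ms per step corresponds to roughly 27 Hz, consistent with the reported 27.3 Hz once rounding of the 37 ms figure is accounted for.

```python
# Sanity-check the timing table above.
data_ms, backbone_ms, head_ms = 2, 18, 16
end_to_end_ms = 37

component_sum = data_ms + backbone_ms + head_ms  # 36 ms, ~1 ms overhead
hz = 1000 / end_to_end_ms                        # ~27.0 Hz
```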

Hardware compatibility

Tested configurations:
  • RTX 5090: 27.3 Hz (recommended for development)
  • H100: 26.3 Hz (optimal for training)
  • RTX 4090: 22.8 Hz (good for deployment)
  • L40: Supported for fine-tuning
  • A6000: Longer training time but functional

Migration from N1.5

If you’re upgrading from GR00T N1.5:
1. Update repository: pull the latest main branch (N1.6 is the default).
2. Update dependencies: run uv sync --python 3.10 to pick up the new dependency set.
3. Download new checkpoints: use nvidia/GR00T-N1.6-3B instead of nvidia/GR00T-N1.5-3B.
4. Update configs: review your modality configs for the state-relative action changes.
5. Retrain if needed: consider retraining on your dataset to benefit from the architecture improvements.
N1.6 and N1.5 checkpoints are not compatible: the architecture changes mean N1.5 weights cannot be loaded into the N1.6 model.

What’s next

Upcoming features in development:
  • RTC and Async Policy Wrapper for production deployment
  • Additional pre-trained checkpoints for more embodiments
  • Enhanced TensorRT optimization scripts
  • Improved documentation for real robot deployment
For the latest updates, follow the GitHub repository and research blog.
