To use the older GR00T N1.5 version, check out the n1.5-release branch.
Model architecture improvements
Vision-language backbone
New VLM foundation
N1.6 uses an internal NVIDIA Cosmos-Reason-2B VLM variant with flexible resolution support and native aspect ratio encoding (no padding required).
- Flexible resolution processing without padding artifacts
- Native aspect ratio support for diverse camera configurations
- Embodied reasoning capabilities built into the foundation
- Improved visual understanding for manipulation tasks
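To make native aspect ratio encoding concrete, here is a minimal, hypothetical sketch of the idea: instead of padding an image to a fixed square, each side is mapped independently to patch-grid units, so a widescreen camera frame keeps its shape in token space. The patch size of 14 is an assumption for illustration, not the actual Cosmos-Reason value.

```python
PATCH = 14  # assumed patch size; the actual Cosmos-Reason VLM value may differ

def patch_grid(height: int, width: int, patch: int = PATCH) -> tuple[int, int]:
    """Map each image side to patch units independently, preserving aspect
    ratio -- no square padding, so no padding tokens are wasted."""
    rows = max(1, height // patch)
    cols = max(1, width // patch)
    return rows, cols

# A 16:9 camera frame keeps roughly its original proportions as a token grid:
rows, cols = patch_grid(720, 1280)
print(rows, cols)  # 51 91
```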
Diffusion transformer scaling
The N1.6 action head uses a 32-layer diffusion transformer (DiT), up from 16 layers in N1.5, for increased model capacity and better action prediction.
Architecture simplification
Top-layer tuning
N1.6 removes the 4-layer transformer adapter after the VLM. Instead, the top 4 layers of the VLM are unfrozen during pretraining for more efficient fine-tuning.
gr00t/model/modules/eagle_backbone.py:75-78
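The mechanism can be sketched in a few lines of PyTorch. This is an illustrative toy, not the actual GR00T code at the path above: freeze the whole backbone, then re-enable gradients on only the top N layers.

```python
import torch.nn as nn

def unfreeze_top_layers(model: nn.Module, layers: nn.ModuleList, n: int = 4) -> None:
    """Freeze every parameter, then unfreeze only the top n transformer layers,
    mirroring N1.6's top-layer tuning (no separate adapter module needed)."""
    for p in model.parameters():
        p.requires_grad = False
    for layer in layers[-n:]:
        for p in layer.parameters():
            p.requires_grad = True

# Toy stand-in for a VLM: 8 stacked linear "layers"
layers = nn.ModuleList(nn.Linear(16, 16) for _ in range(8))
vlm = nn.Sequential(*layers)
unfreeze_top_layers(vlm, layers, n=4)

trainable = sum(p.numel() for p in vlm.parameters() if p.requires_grad)
total = sum(p.numel() for p in vlm.parameters())
print(trainable, total)  # top 4 of 8 identical layers -> half the parameters train
```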
State-relative action prediction
N1.6 predicts state-relative action chunks for most embodiments, rather than absolute joint angles or end-effector positions used in N1.5.
- Generalization across different initial configurations
- Transfer learning between similar embodiments
- Robustness to calibration differences
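The benefits above follow from a simple change of coordinates. A minimal sketch (illustrative only, not the GR00T data pipeline): each action in a chunk is stored as a delta from the current proprioceptive state, and the inverse transform is applied against the live state at execution time.

```python
def to_relative(state: list[float], chunk: list[list[float]]) -> list[list[float]]:
    """Encode each action as a delta from the current state."""
    return [[a - s for a, s in zip(action, state)] for action in chunk]

def to_absolute(state: list[float], rel_chunk: list[list[float]]) -> list[list[float]]:
    """Invert at execution time: add the deltas back onto the live state."""
    return [[d + s for d, s in zip(delta, state)] for delta in rel_chunk]

state = [0.5, -0.25]                      # e.g. two joint angles (rad)
chunk = [[0.75, 0.0], [1.0, 0.25]]        # absolute targets over 2 steps
rel = to_relative(state, chunk)
print(rel)  # [[0.25, 0.25], [0.5, 0.5]] -- the same deltas apply from any start pose
assert to_absolute(state, rel) == chunk
```

Because the deltas are independent of the starting pose, the same predicted chunk transfers across different initial configurations and small calibration offsets.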
Training data expansion
Beyond the N1.5 data mixture, the N1.6 pretraining data additionally includes several thousand hours of teleoperated data from:
- Bimanual YAM arms: high-quality bimanual manipulation demonstrations
- AGIBot Genie1: humanoid robot teleoperation data
- Galaxea R1 Pro (BEHAVIOR): simulated whole-body loco-manipulation on the BEHAVIOR-1K suite
- Unitree G1: whole-body loco-manipulation demonstrations
- Greater embodiment diversity (bimanual, semi-humanoid, full humanoid)
- More task variety (tabletop, loco-manipulation, whole-body control)
- Improved zero-shot generalization
- Better foundation for downstream fine-tuning
Code-level improvements
Faster data loading
Sharded dataloader
New sharded dataloader implementation with significantly improved throughput for multi-GPU training.
- Parallel data loading across GPUs
- Reduced I/O bottlenecks
- Better utilization of distributed training
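The core idea behind sharded loading can be sketched in a few lines (names here are illustrative, not the actual GR00T dataloader API): each rank owns a disjoint slice of the sample index, so no two GPUs read the same data and I/O proceeds in parallel.

```python
def shard_indices(num_samples: int, rank: int, world_size: int) -> list[int]:
    """Strided sharding: rank r takes samples r, r+W, r+2W, ... so shards
    are disjoint and together cover the whole dataset."""
    return list(range(rank, num_samples, world_size))

world_size = 4
shards = [shard_indices(10, r, world_size) for r in range(world_size)]
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```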
Simplified data processing
Single processing script
All data processing is unified in processing_gr00t_n1d6.py with the Gr00tN1d6DataCollator.
Flexible training configuration
N1.6 introduces more granular control over training:
- tune_projector: control state encoder and action decoder training
- tune_diffusion_model: control DiT training
- tune_vlln: control vision-language layer norm training
- tune_top_llm_layers: fine-tune specific VLM layers
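A hedged sketch of how such flags might be grouped in a config object. The structure below is hypothetical (the real training config lives in the GR00T repo and may differ, e.g. tune_top_llm_layers could be a boolean rather than a count):

```python
from dataclasses import dataclass

@dataclass
class TuneConfig:
    tune_projector: bool = True        # state encoder + action decoder
    tune_diffusion_model: bool = True  # DiT action head
    tune_vlln: bool = True             # vision-language layer norms
    tune_top_llm_layers: int = 4       # how many top VLM layers to unfreeze (assumed int)

def trainable_groups(cfg: TuneConfig) -> list[str]:
    """Resolve the flags into named parameter groups to pass to the optimizer."""
    groups = []
    if cfg.tune_projector:
        groups.append("projector")
    if cfg.tune_diffusion_model:
        groups.append("dit")
    if cfg.tune_vlln:
        groups.append("vlln")
    if cfg.tune_top_llm_layers > 0:
        groups.append(f"vlm_top_{cfg.tune_top_llm_layers}")
    return groups

print(trainable_groups(TuneConfig(tune_projector=False)))
# ['dit', 'vlln', 'vlm_top_4']
```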
Inference optimizations
RTC wrapper
Real-time control wrapper for low-latency deployment (coming soon)
Async policy
Asynchronous policy execution for parallel environment rollouts (coming soon)
Performance comparison
Inference timing on RTX 5090 (4 denoising steps, single view):

| Component | N1.6 (torch.compile) | Notes |
|---|---|---|
| Data Processing | 2 ms | Input preprocessing |
| Backbone | 18 ms | VLM forward pass |
| Action Head | 16 ms | DiT denoising |
| End-to-End | 37 ms | 27.3 Hz |
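As a quick sanity check on the table: the measured components sum to 36 ms, and the reported 27.3 Hz implies roughly 36.6 ms end-to-end, consistent with the 37 ms figure up to rounding.

```python
# Latency budget from the table above (ms); the small gap between the
# component sum and the end-to-end figure is dispatch/overhead.
components_ms = {"data_processing": 2, "backbone": 18, "action_head": 16}

def control_rate_hz(latency_ms: float) -> float:
    """Convert per-step latency in milliseconds to control frequency in Hz."""
    return 1000.0 / latency_ms

print(sum(components_ms.values()))  # 36
print(round(1000.0 / 27.3, 1))      # 36.6 ms implied by the reported 27.3 Hz
```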
Despite the larger architecture (32 vs 16 DiT layers), N1.6 maintains competitive inference speed through optimization.
Hardware compatibility
Tested GPU configurations:
- RTX 5090: 27.3 Hz (recommended for development)
- H100: 26.3 Hz (optimal for training)
- RTX 4090: 22.8 Hz (good for deployment)
- L40: Supported for fine-tuning
- A6000: Longer training time but functional
Migration from N1.5
If you’re upgrading from GR00T N1.5:
What’s next
Upcoming features in development:
- RTC and Async Policy Wrapper for production deployment
- Additional pre-trained checkpoints for more embodiments
- Enhanced TensorRT optimization scripts
- Improved documentation for real robot deployment