What is GR00T N1.6?
NVIDIA Isaac GR00T N1.6 is an open vision-language-action (VLA) model for generalized humanoid robot skills. This cross-embodiment model takes multimodal input, including language and images, to perform manipulation tasks in diverse environments. GR00T N1.6 is trained on a diverse mixture of robot data spanning bimanual, semi-humanoid, and full-humanoid platforms, and it is adaptable through post-training to specific embodiments, tasks, and environments. GR00T N1.6 represents a significant upgrade over GR00T N1.5, with improvements in both model architecture and training data that lead to better performance across benchmarks.
Key capabilities
Cross-embodiment learning
Trained on 10,000+ hours of robot data from diverse embodiments including bimanual arms, semi-humanoid robots, and full humanoid platforms like Unitree G1 and Galaxea R1 Pro.
Multimodal understanding
Processes vision, language, and proprioceptive state inputs using a 2B parameter vision-language backbone with flexible resolution support.
Flow matching diffusion
Generates smooth, continuous actions through a 32-layer diffusion transformer that denoises action trajectories.
State-relative actions
Predicts state-relative action chunks for most embodiments, improving generalization across different robot configurations.
Fast inference
Achieves 27.3 Hz on RTX 5090 with torch.compile, with even faster TensorRT deployment options available.
Few-shot adaptation
Fine-tune on small custom datasets to adapt the foundation model to specific robotics tasks with minimal data.
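The flow-matching generation mentioned above can be illustrated with a toy Euler integrator. This is a minimal sketch, not the actual GR00T implementation: the `toy_velocity` function stands in for the DiT's predicted velocity field, and the action dimension, target, and four-step schedule are illustrative assumptions.

```python
# Toy sketch of flow-matching action generation (hypothetical stand-in for
# the real DiT): starting from Gaussian noise, an Euler integrator follows
# a velocity field from t=0 to t=1 over a few denoising steps.
import random

def toy_velocity(x, t, target):
    # Stand-in for the learned velocity. For a linear flow between noise
    # and data, the velocity points from the current sample toward the
    # target, rescaled by the remaining time.
    return [(g - xi) / max(1.0 - t, 1e-6) for xi, g in zip(x, target)]

def denoise(action_dim=4, steps=4, target=None, seed=0):
    rng = random.Random(seed)
    target = target if target is not None else [0.0] * action_dim
    # Start from pure Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = toy_velocity(x, t, target)
        # Euler step: x <- x + v * dt
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return x
```

With this linear flow, the Euler integration is exact and the sample lands on the target after the final step; the real model instead predicts the velocity from vision, language, and state context at each step.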
Model architecture
The neural network architecture of GR00T N1.6 combines a vision-language foundation model with a diffusion transformer head that denoises continuous actions:
The architecture consists of three main components:
- Vision-language backbone: NVIDIA Cosmos-Reason-2B VLM variant with flexible resolution and native aspect ratio support
- Action head: 32-layer diffusion transformer (DiT) with cross-attention to VLM features
- Embodiment-specific projectors: Category-specific MLPs for encoding states and decoding actions per robot
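The dataflow across these three components can be sketched at the shape level. All function names and dimensions below are illustrative assumptions for exposition, not the released GR00T code: feature width, chunk length, and action dimension are placeholders.

```python
# Hypothetical shape-level sketch of the GR00T N1.6 dataflow.
# Dimensions and names are illustrative, not from the actual model.

def vlm_backbone(images, text_tokens, feat_dim=2048, patches_per_image=16):
    # Cosmos-Reason-2B stand-in: encode images + language into a
    # sequence of feature vectors for the action head to attend to.
    n = len(text_tokens) + len(images) * patches_per_image
    return [[0.0] * feat_dim for _ in range(n)]

def state_projector(state, hidden=1024):
    # Embodiment-specific MLP encoding proprioceptive state
    # (one projector per robot embodiment category).
    return [0.0] * hidden

def dit_action_head(vlm_features, state_embedding, noisy_actions):
    # Stand-in for the 32-layer DiT that cross-attends to VLM features
    # and denoises the action chunk; preserves the chunk's shape.
    return [list(row) for row in noisy_actions]

def action_projector(dit_out, action_dim):
    # Embodiment-specific decoder mapping DiT output to robot actions.
    return [row[:action_dim] for row in dit_out]

# Illustrative forward pass: 2 camera views, 8 language tokens,
# a 16-step action chunk with a 32-dim internal representation.
feats = vlm_backbone(images=[0, 1], text_tokens=list(range(8)))
state_emb = state_projector(state=[0.0] * 44)
noisy_chunk = [[0.0] * 32 for _ in range(16)]
denoised = dit_action_head(feats, state_emb, noisy_chunk)
actions = action_projector(denoised, action_dim=23)
```

The key structural point is that only the projectors are embodiment-specific; the backbone and DiT are shared across robots, which is what makes cross-embodiment training and few-shot adaptation tractable.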
Performance benchmarks
GR00T N1.6 achieves state-of-the-art results across multiple simulation benchmarks:

| Benchmark | Task Type | Success Rate |
|---|---|---|
| LIBERO-Spatial | Tabletop manipulation | High performance |
| SimplerEnv | Bimanual tasks | Competitive |
| BEHAVIOR-1K | Loco-manipulation | Strong results |
| RoboCasa | Kitchen tasks | Zero-shot capable |
Inference timing on RTX 5090 with 4 denoising steps: 37 ms end-to-end (18 ms backbone + 16 ms action head), achieving 27.3 Hz throughput.
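As a quick consistency check on these numbers (plain arithmetic, no GR00T code involved): 37 ms per end-to-end call corresponds to roughly 27 Hz, and the backbone and action-head times leave a few milliseconds for the remainder of the pipeline.

```python
# Throughput from latency: frequency (Hz) = 1000 / period (ms).
backbone_ms = 18
action_head_ms = 16
end_to_end_ms = 37

hz = 1000 / end_to_end_ms          # ~27.0 Hz from a 37 ms period
other_ms = end_to_end_ms - (backbone_ms + action_head_ms)  # remaining budget
```

The quoted 27.3 Hz implies a period of about 36.6 ms, which rounds to the 37 ms figure above.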
Getting started
Installation
Set up your environment with the uv package manager and install dependencies
Quick start
Run inference with pre-trained checkpoints in minutes
Fine-tuning guide
Adapt GR00T to your robot embodiment and tasks
Examples
Explore simulation benchmarks and deployment examples
Target audience
GR00T N1.6 is intended for researchers and professionals in robotics. This repository provides tools to:

- Leverage a pre-trained foundation model for robot control
- Fine-tune on small, custom datasets
- Adapt the model to specific robotics tasks with minimal data
- Deploy the model for inference on real hardware
Resources
Research paper
Read the full technical paper on arXiv
Model weights
Download pre-trained checkpoints from Hugging Face
Research blog
Explore the official research blog post
Training dataset
Access the cross-embodiment training dataset