NVIDIA Isaac GR00T N1.6 is an open vision-language-action (VLA) model for generalized humanoid robot skills. This cross-embodiment model takes multimodal input, including language and images, to perform manipulation tasks in diverse environments.

Architecture

The GR00T N1.6 neural network combines a vision-language foundation model with a diffusion transformer head that denoises continuous actions.

(Figure: GR00T N1.6 architecture diagram)

Key components

The model consists of three main components:
  1. Vision-Language Model (VLM): Cosmos-Reason-2B variant that encodes images in their native aspect ratio without padding
  2. Diffusion Transformer (DiT): 32-layer transformer that denoises continuous action chunks
  3. Action Prediction Head: Outputs state-relative action chunks for most embodiments
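To make the diffusion component concrete, here is a minimal illustrative sketch of iterative action-chunk denoising. This is not the GR00T implementation: `toy_denoiser`, the chunk length, the action dimension, and the blending update are all assumptions chosen for readability; only the idea (start from noise, refine over a few steps conditioned on VLM features) reflects the architecture described above.

```python
import numpy as np

rng = np.random.default_rng(0)

CHUNK_LEN, ACTION_DIM = 16, 7   # assumed chunk length / action dimensionality
NUM_STEPS = 4                   # the timing numbers below use 4 denoising steps

def toy_denoiser(actions, vlm_features, t):
    """Stand-in for the 32-layer DiT: predicts the clean action chunk.

    Here it just pulls the noisy chunk toward a fixed target derived from
    the conditioning features, purely for illustration.
    """
    target = np.tanh(vlm_features[:ACTION_DIM])        # pretend conditioning
    return np.broadcast_to(target, actions.shape)

def denoise_action_chunk(vlm_features):
    # Start from pure Gaussian noise and iteratively refine it.
    actions = rng.standard_normal((CHUNK_LEN, ACTION_DIM))
    for step in range(NUM_STEPS):
        t = 1.0 - step / NUM_STEPS                     # noise level from 1 toward 0
        pred_clean = toy_denoiser(actions, vlm_features, t)
        # Crude blend toward the prediction; real samplers use a proper schedule.
        actions = t * actions + (1.0 - t) * pred_clean
    return pred_clean                                   # final clean action chunk

chunk = denoise_action_chunk(rng.standard_normal(128))
print(chunk.shape)  # (16, 7)
```

The key property this sketches is that a single forward pipeline (VLM features in, action chunk out) is run a small fixed number of times, which is why inference latency scales with the number of denoising steps.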

Model capabilities

GR00T N1.6 is trained on a diverse mixture of robot data and is adaptable through post-training for specific embodiments, tasks, and environments.

Training data

The model is pre-trained on 10,000+ hours of robot data from various embodiments:
  • Bimanual YAM arms
  • AGIBot Genie1
  • Simulated Galaxea R1 Pro on the BEHAVIOR suite
  • Whole-body locomanipulation with Unitree G1
  • RoboCasa Panda robot with Omron mobile base
  • Fourier GR1 robot

Cross-embodiment support

GR00T N1.6 supports multiple robot embodiments out of the box. See Embodiment tags for the full list of supported robots.

What’s new in N1.6

GR00T N1.6 represents a significant upgrade over GR00T N1.5, with improvements in both model architecture and data leading to better performance.

Architectural changes

  • Base VLM: Uses an internal NVIDIA Cosmos-Reason-2B VLM variant with flexible resolution support
  • Larger DiT: 2x larger DiT (32 layers vs 16 layers in N1.5)
  • Simplified architecture: Removes N1.5’s post-VLM 4-layer transformer adapter and unfreezes the top 4 layers of the VLM during pretraining
  • State-relative actions: Predicts state-relative action chunks for most embodiments, rather than absolute joint angles or EEF positions
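To make the state-relative convention concrete, here is a minimal sketch. The function name, shapes, and numbers are assumptions for illustration, not GR00T's API: the model emits a chunk of offsets relative to the robot's current state, and the controller adds the state back before execution.

```python
import numpy as np

def to_absolute(current_state, relative_chunk):
    """Convert a state-relative action chunk to absolute targets.

    current_state:  (D,) current joint angles (or EEF pose)
    relative_chunk: (H, D) predicted offsets from current_state
    returns:        (H, D) absolute targets for the controller
    """
    return current_state[None, :] + relative_chunk

state = np.array([0.10, -0.25, 0.50])          # assumed 3-DoF state
rel = np.array([[0.01, 0.00, -0.02],           # assumed 2-step chunk
                [0.02, 0.01, -0.04]])
print(to_absolute(state, rel))
# [[ 0.11 -0.25  0.48]
#  [ 0.12 -0.24  0.46]]
```

A common motivation for this convention is that offsets are invariant to where the robot happens to be, so the same predicted chunk generalizes across starting configurations better than absolute joint angles or EEF positions.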

Performance improvements

GR00T-N1.6-3B achieves real-time inference at 27.3 Hz on an RTX 5090 with torch.compile optimization.
Inference timing (4 denoising steps, single view):
| Device   | Mode          | Data Processing | Backbone | Action Head | E2E    | Frequency |
|----------|---------------|-----------------|----------|-------------|--------|-----------|
| RTX 5090 | torch.compile | 2 ms            | 18 ms    | 16 ms       | 37 ms  | 27.3 Hz   |
| H100     | torch.compile | 4 ms            | 23 ms    | 11 ms       | 38 ms  | 26.3 Hz   |
| RTX 4090 | torch.compile | 2 ms            | 25 ms    | 17 ms       | 44 ms  | 22.8 Hz   |
| Thor     | torch.compile | 5 ms            | 39 ms    | 61 ms       | 105 ms | 9.5 Hz    |
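The frequency column closely tracks the reciprocal of end-to-end latency (small deviations presumably come from measurement granularity). A quick back-of-the-envelope check:

```python
def hz_from_latency_ms(e2e_ms):
    """Sustained inference frequency implied by end-to-end latency."""
    return 1000.0 / e2e_ms

# E2E latencies (ms) from the table above
for device, e2e in [("RTX 5090", 37), ("H100", 38), ("RTX 4090", 44), ("Thor", 105)]:
    print(f"{device}: {hz_from_latency_ms(e2e):.1f} Hz")
```

For example, the H100's 38 ms E2E latency implies 1000 / 38 ≈ 26.3 Hz, matching the reported value.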

Use cases

GR00T N1.6 is designed for researchers and professionals in robotics to:
  • Leverage a pre-trained foundation model for robot control
  • Fine-tune on small, custom datasets
  • Adapt the model to specific robotics tasks with minimal data
  • Deploy the model for inference on various hardware platforms
The focus is on enabling customization of robot behaviors through fine-tuning rather than training from scratch.

Model checkpoints

Base models

Pre-trained base VLA model checkpoints, such as GR00T-N1.6-3B, are available for fine-tuning.

Fine-tuned models

Fine-tuned checkpoints are available for various robot platforms and benchmarks:
  • GR00T-N1.6-bridge: Fine-tuned on Bridge dataset for WidowX robot
  • GR00T-N1.6-fractal: Fine-tuned on Fractal dataset for Google robot
  • GR00T-N1.6-BEHAVIOR1k: Fine-tuned on BEHAVIOR-1K for Galaxea R1 Pro
  • GR00T-N1.6-G1-PnPAppleToPlate: Fine-tuned for Unitree G1 loco-manipulation
  • GR00T-N1.6-DROID: Fine-tuned for DROID robot

Next steps

Embodiment tags

Learn about supported robot embodiments

Modality configs

Configure your robot’s data processing

Data format

Understand the LeRobot v2 data format

Fine-tuning guide

Start fine-tuning on your robot
