NVIDIA Isaac GR00T N1.6 is an open vision-language-action (VLA) model for generalized humanoid robot skills. This cross-embodiment model takes multimodal input, including language and images, to perform manipulation tasks in diverse environments.

Architecture

The GR00T N1.6 neural network combines a vision-language foundation model with a diffusion transformer head that denoises continuous actions.

(Figure: GR00T N1.6 architecture diagram)

Key components

The model consists of three main components:
  1. Vision-Language Model (VLM): Cosmos-Reason-2B variant that encodes images in their native aspect ratio without padding
  2. Diffusion Transformer (DiT): 32-layer transformer that denoises continuous action chunks
  3. Action Prediction Head: Outputs state-relative action chunks for most embodiments
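To make the diffusion component concrete, here is a minimal illustrative sketch of iterative action-chunk denoising. This is not the GR00T implementation: `toy_denoiser`, the chunk length, the action dimension, and the blending update are all assumptions chosen for readability; only the idea (start from noise, refine over a few steps conditioned on VLM features) reflects the architecture described above.

```python
import numpy as np

rng = np.random.default_rng(0)

CHUNK_LEN, ACTION_DIM = 16, 7   # assumed chunk length / action dimensionality
NUM_STEPS = 4                   # the timing numbers below use 4 denoising steps

def toy_denoiser(actions, vlm_features, t):
    """Stand-in for the 32-layer DiT: predicts the clean action chunk.

    Here it just pulls the noisy chunk toward a fixed target derived from
    the conditioning features, purely for illustration.
    """
    target = np.tanh(vlm_features[:ACTION_DIM])        # pretend conditioning
    return np.broadcast_to(target, actions.shape)

def denoise_action_chunk(vlm_features):
    # Start from pure Gaussian noise and iteratively refine it.
    actions = rng.standard_normal((CHUNK_LEN, ACTION_DIM))
    for step in range(NUM_STEPS):
        t = 1.0 - step / NUM_STEPS                     # noise level from 1 toward 0
        pred_clean = toy_denoiser(actions, vlm_features, t)
        # Crude blend toward the prediction; real samplers use a proper schedule.
        actions = t * actions + (1.0 - t) * pred_clean
    return pred_clean                                   # final clean action chunk

chunk = denoise_action_chunk(rng.standard_normal(128))
print(chunk.shape)  # (16, 7)
```

The key property this sketches is that a single forward pipeline (VLM features in, action chunk out) is run a small fixed number of times, which is why inference latency scales with the number of denoising steps.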

Model capabilities

GR00T N1.6 is trained on a diverse mixture of robot data and is adaptable through post-training for specific embodiments, tasks, and environments.

Training data

The model is pre-trained on 10,000+ hours of robot data from various embodiments:
  • Bimanual YAM arms
  • AGIBot Genie1
  • Simulated Galaxea R1 Pro on the BEHAVIOR suite
  • Whole-body locomanipulation with Unitree G1
  • RoboCasa Panda robot with Omron mobile base
  • Fourier GR1 robot

Cross-embodiment support

GR00T N1.6 supports multiple robot embodiments out of the box. See Embodiment tags for the full list of supported robots.

What’s new in N1.6

GR00T N1.6 represents a significant upgrade over GR00T N1.5, with improvements in both model architecture and data leading to better performance.

Architectural changes

  • Base VLM: Uses an internal NVIDIA Cosmos-Reason-2B VLM variant with flexible resolution support
  • Larger DiT: 2x larger DiT (32 layers vs 16 layers in N1.5)
  • Simplified architecture: Removes N1.5’s post-VLM 4-layer transformer adapter and unfreezes the top 4 layers of the VLM during pretraining
  • State-relative actions: Predicts state-relative action chunks for most embodiments, rather than absolute joint angles or EEF positions
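To make the state-relative convention concrete, here is a minimal sketch. The function name, shapes, and numbers are assumptions for illustration, not GR00T's API: the model emits a chunk of offsets relative to the robot's current state, and the controller adds the state back before execution.

```python
import numpy as np

def to_absolute(current_state, relative_chunk):
    """Convert a state-relative action chunk to absolute targets.

    current_state:  (D,) current joint angles (or EEF pose)
    relative_chunk: (H, D) predicted offsets from current_state
    returns:        (H, D) absolute targets for the controller
    """
    return current_state[None, :] + relative_chunk

state = np.array([0.10, -0.25, 0.50])          # assumed 3-DoF state
rel = np.array([[0.01, 0.00, -0.02],           # assumed 2-step chunk
                [0.02, 0.01, -0.04]])
print(to_absolute(state, rel))
# [[ 0.11 -0.25  0.48]
#  [ 0.12 -0.24  0.46]]
```

A common motivation for this convention is that offsets are invariant to where the robot happens to be, so the same predicted chunk generalizes across starting configurations better than absolute joint angles or EEF positions.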

Performance improvements

GR00T-N1.6-3B achieves real-time inference at 27.3 Hz on an RTX 5090 with torch.compile optimization.
Inference timing (4 denoising steps, single view):
| Device   | Mode          | Data Processing | Backbone | Action Head | E2E    | Frequency |
|----------|---------------|-----------------|----------|-------------|--------|-----------|
| RTX 5090 | torch.compile | 2 ms            | 18 ms    | 16 ms       | 37 ms  | 27.3 Hz   |
| H100     | torch.compile | 4 ms            | 23 ms    | 11 ms       | 38 ms  | 26.3 Hz   |
| RTX 4090 | torch.compile | 2 ms            | 25 ms    | 17 ms       | 44 ms  | 22.8 Hz   |
| Thor     | torch.compile | 5 ms            | 39 ms    | 61 ms       | 105 ms | 9.5 Hz    |
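The frequency column closely tracks the reciprocal of end-to-end latency (small deviations presumably come from measurement granularity). A quick back-of-the-envelope check:

```python
def hz_from_latency_ms(e2e_ms):
    """Sustained inference frequency implied by end-to-end latency."""
    return 1000.0 / e2e_ms

# E2E latencies (ms) from the table above
for device, e2e in [("RTX 5090", 37), ("H100", 38), ("RTX 4090", 44), ("Thor", 105)]:
    print(f"{device}: {hz_from_latency_ms(e2e):.1f} Hz")
```

For example, the H100's 38 ms E2E latency implies 1000 / 38 ≈ 26.3 Hz, matching the reported value.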

Use cases

GR00T N1.6 is designed for researchers and professionals in robotics to:
  • Leverage a pre-trained foundation model for robot control
  • Fine-tune on small, custom datasets
  • Adapt the model to specific robotics tasks with minimal data
  • Deploy the model for inference on various hardware platforms
The focus is on enabling customization of robot behaviors through fine-tuning rather than training from scratch.

Model checkpoints

Base models

Pre-trained base VLA model checkpoints, such as GR00T-N1.6-3B, are available for fine-tuning.

Fine-tuned models

Fine-tuned checkpoints are available for various robot platforms and benchmarks:
  • GR00T-N1.6-bridge: Fine-tuned on Bridge dataset for WidowX robot
  • GR00T-N1.6-fractal: Fine-tuned on Fractal dataset for Google robot
  • GR00T-N1.6-BEHAVIOR1k: Fine-tuned on BEHAVIOR-1K for Galaxea R1 Pro
  • GR00T-N1.6-G1-PnPAppleToPlate: Fine-tuned for Unitree G1 loco-manipulation
  • GR00T-N1.6-DROID: Fine-tuned for DROID robot

Next steps

Embodiment tags

Learn about supported robot embodiments

Modality configs

Configure your robot’s data processing

Data format

Understand the LeRobot v2 data format

Fine-tuning guide

Start fine-tuning on your robot
