NVIDIA Isaac GR00T N1.6

What is GR00T N1.6?

NVIDIA Isaac GR00T N1.6 is an open vision-language-action (VLA) model for generalized humanoid robot skills. This cross-embodiment model takes multimodal input, including language and images, to perform manipulation tasks in diverse environments. GR00T N1.6 is trained on a diverse mixture of robot data spanning bimanual arms, semi-humanoid robots, and an expansive humanoid dataset, and can be adapted through post-training to specific embodiments, tasks, and environments.
GR00T N1.6 represents a significant upgrade over GR00T N1.5, with improvements in both model architecture and data leading to better performance across benchmarks.

Key capabilities

Cross-embodiment learning

Trained on 10,000+ hours of robot data from diverse embodiments including bimanual arms, semi-humanoid robots, and full humanoid platforms like Unitree G1 and Galaxea R1 Pro.

Multimodal understanding

Processes vision, language, and proprioceptive state inputs using a 2B parameter vision-language backbone with flexible resolution support.
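The exact observation schema is defined by the repository's data configs; as a loose, hypothetical sketch of what a single policy step consumes (field names are illustrative, not GR00T's real keys):

```python
# Hypothetical bundle of one multimodal observation for a VLA policy.
# Keys and shapes are illustrative only, not the repo's actual schema.

def make_observation(frames, instruction, proprio):
    return {
        "video": frames,          # camera frames, flexible resolution
        "language": instruction,  # free-form task instruction
        "state": proprio,         # proprioceptive readings (joints, etc.)
    }

obs = make_observation(
    frames=[[[0, 0, 0]]],                 # dummy 1x1 RGB frame
    instruction="pick up the red cube",
    proprio=[0.0, 0.1, -0.2],
)
```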

Flow matching diffusion

Generates smooth, continuous actions through a 32-layer diffusion transformer that denoises action trajectories.
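At inference time, flow-matching models integrate a learned velocity field from noise toward an action over a handful of denoising steps. A toy sketch (the analytic "oracle" velocity below stands in for the DiT; names are illustrative):

```python
# Toy flow-matching sampler: a few Euler steps along a velocity field
# carry a noisy sample to the target action.

def velocity(x, t, target):
    # For straight-line probability paths, the ideal velocity points from
    # the current sample toward the data, rescaled by the remaining time.
    return (target - x) / (1.0 - t)

def sample_action(x0, target, steps=4):
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * velocity(x, t, target)  # Euler step along the flow
    return x

action = sample_action(x0=0.9, target=0.3, steps=4)
# With straight-line paths, a few Euler steps land on the target
# (up to floating-point rounding).
```

In the real model the velocity is predicted by the 32-layer DiT conditioned on VLM features, and the small step count (e.g. 4) is what keeps inference fast.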

State-relative actions

Predicts state-relative action chunks for most embodiments, improving generalization across different robot configurations.
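A state-relative chunk can be decoded by adding each predicted offset to the robot state at the start of the chunk. A minimal sketch, assuming offsets are expressed relative to that anchor state (function and variable names are hypothetical):

```python
# Hypothetical decoding of a state-relative action chunk: each predicted
# action is an offset from the state at the start of the chunk.

def decode_chunk(state, relative_chunk):
    """Convert offsets-from-anchor-state into absolute targets."""
    return [[s + d for s, d in zip(state, step)] for step in relative_chunk]

state = [0.5, -0.25]
chunk = [[0.0, 0.0], [0.25, -0.25], [0.5, -0.5]]
absolute = decode_chunk(state, chunk)
# absolute -> [[0.5, -0.25], [0.75, -0.5], [1.0, -0.75]]
```

Predicting offsets rather than absolute targets means the same chunk transfers across robots whose rest poses differ, which is what aids cross-configuration generalization.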

Fast inference

Achieves 27.3 Hz on RTX 5090 with torch.compile, with even faster TensorRT deployment options available.

Few-shot adaptation

Fine-tune on small custom datasets to adapt the foundation model to specific robotics tasks with minimal data.

Model architecture

The neural network architecture of GR00T N1.6 combines a vision-language foundation model with a diffusion transformer head that denoises continuous actions. The architecture consists of three main components:
  1. Vision-language backbone: NVIDIA Cosmos-Reason-2B VLM variant with flexible resolution and native aspect ratio support
  2. Action head: 32-layer diffusion transformer (DiT) with cross-attention to VLM features
  3. Embodiment-specific projectors: Category-specific MLPs for encoding states and decoding actions per robot
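The projector idea can be sketched as a small per-embodiment encoder mapping each robot's state into a shared latent width consumed by the embodiment-agnostic trunk. Category names, dimensions, and the identity-like weights below are made up for illustration; the real projectors are learned MLPs:

```python
# Sketch of embodiment-specific projectors: one small encoder per robot
# category, all landing in a shared latent space for the common trunk.

def linear(weights, x):
    """Minimal dense layer: y = W @ x (no bias, no activation)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# Per-embodiment state encoders into a shared 3-d latent (toy weights).
ENCODERS = {
    "bimanual": [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],                     # 2 -> 3
    "humanoid": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],      # 3 -> 3
}

def encode_state(embodiment, state):
    return linear(ENCODERS[embodiment], state)

z = encode_state("bimanual", [0.5, -1.0])
# z -> [0.5, -1.0, 0.0]: both embodiments share the same latent width
```

A mirrored set of per-embodiment decoders would map trunk outputs back to each robot's action dimensionality.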

Performance benchmarks

GR00T N1.6 achieves state-of-the-art results across multiple simulation benchmarks:
| Benchmark | Task type | Reported result |
| --- | --- | --- |
| LIBERO-Spatial | Tabletop manipulation | High performance |
| SimplerEnv | Bimanual tasks | Competitive |
| BEHAVIOR-1K | Loco-manipulation | Strong results |
| RoboCasa | Kitchen tasks | Zero-shot capable |
Inference timing on RTX 5090 with 4 denoising steps: 37 ms end-to-end (18 ms backbone + 16 ms action head), achieving 27.3 Hz throughput.

Getting started

Installation

Set up your environment with uv package manager and install dependencies

Quick start

Run inference with pre-trained checkpoints in minutes

Fine-tuning guide

Adapt GR00T to your robot embodiment and tasks

Examples

Explore simulation benchmarks and deployment examples

Target audience

GR00T N1.6 is intended for researchers and professionals in robotics. This repository provides tools to:
  • Leverage a pre-trained foundation model for robot control
  • Fine-tune on small, custom datasets
  • Adapt the model to specific robotics tasks with minimal data
  • Deploy the model for inference on real hardware
The focus is on enabling customization of robot behaviors through fine-tuning.

Resources

Research paper

Read the full technical paper on arXiv

Model weights

Download pre-trained checkpoints from Hugging Face

Research blog

Explore the official research blog post

Training dataset

Access the cross-embodiment training dataset
