What is GR00T N1.6?
NVIDIA Isaac GR00T N1.6 is an open vision-language-action (VLA) model for generalized humanoid robot skills. This cross-embodiment model takes multimodal input, including language and images, to perform manipulation tasks in diverse environments. GR00T N1.6 is trained on a diverse mixture of robot data spanning bimanual, semi-humanoid, and full-humanoid platforms, and it is adaptable through post-training to specific embodiments, tasks, and environments. GR00T N1.6 represents a significant upgrade over GR00T N1.5, with improvements in both model architecture and training data that lead to better performance across benchmarks.
Key capabilities
Cross-embodiment learning
Trained on 10,000+ hours of robot data from diverse embodiments including bimanual arms, semi-humanoid robots, and full humanoid platforms like Unitree G1 and Galaxea R1 Pro.
Multimodal understanding
Processes vision, language, and proprioceptive state inputs using a 2B parameter vision-language backbone with flexible resolution support.
Flow matching diffusion
Generates smooth, continuous actions through a 32-layer diffusion transformer that denoises action trajectories.
State-relative actions
Predicts state-relative action chunks for most embodiments, improving generalization across different robot configurations.
Fast inference
Achieves 27.3 Hz on RTX 5090 with torch.compile, with even faster TensorRT deployment options available.
Few-shot adaptation
Fine-tune on small custom datasets to adapt the foundation model to specific robotics tasks with minimal data.
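The flow-matching generation mentioned above can be illustrated with a toy Euler integrator. This is a minimal sketch, not the actual GR00T implementation: the `toy_velocity` function stands in for the DiT's predicted velocity field, and the action dimension, target, and four-step schedule are illustrative assumptions.

```python
# Toy sketch of flow-matching action generation (hypothetical stand-in for
# the real DiT): starting from Gaussian noise, an Euler integrator follows
# a velocity field from t=0 to t=1 over a few denoising steps.
import random

def toy_velocity(x, t, target):
    # Stand-in for the learned velocity. For a linear flow between noise
    # and data, the velocity points from the current sample toward the
    # target, rescaled by the remaining time.
    return [(g - xi) / max(1.0 - t, 1e-6) for xi, g in zip(x, target)]

def denoise(action_dim=4, steps=4, target=None, seed=0):
    rng = random.Random(seed)
    target = target if target is not None else [0.0] * action_dim
    # Start from pure Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = toy_velocity(x, t, target)
        # Euler step: x <- x + v * dt
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return x
```

With this linear flow, the Euler integration is exact and the sample lands on the target after the final step; the real model instead predicts the velocity from vision, language, and state context at each step.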
Model architecture
The neural network architecture of GR00T N1.6 combines a vision-language foundation model with a diffusion transformer head that denoises continuous actions:
The architecture consists of three main components:
- Vision-language backbone: NVIDIA Cosmos-Reason-2B VLM variant with flexible resolution and native aspect ratio support
- Action head: 32-layer diffusion transformer (DiT) with cross-attention to VLM features
- Embodiment-specific projectors: Category-specific MLPs for encoding states and decoding actions per robot
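The dataflow across these three components can be sketched at the shape level. All function names and dimensions below are illustrative assumptions for exposition, not the released GR00T code: feature width, chunk length, and action dimension are placeholders.

```python
# Hypothetical shape-level sketch of the GR00T N1.6 dataflow.
# Dimensions and names are illustrative, not from the actual model.

def vlm_backbone(images, text_tokens, feat_dim=2048, patches_per_image=16):
    # Cosmos-Reason-2B stand-in: encode images + language into a
    # sequence of feature vectors for the action head to attend to.
    n = len(text_tokens) + len(images) * patches_per_image
    return [[0.0] * feat_dim for _ in range(n)]

def state_projector(state, hidden=1024):
    # Embodiment-specific MLP encoding proprioceptive state
    # (one projector per robot embodiment category).
    return [0.0] * hidden

def dit_action_head(vlm_features, state_embedding, noisy_actions):
    # Stand-in for the 32-layer DiT that cross-attends to VLM features
    # and denoises the action chunk; preserves the chunk's shape.
    return [list(row) for row in noisy_actions]

def action_projector(dit_out, action_dim):
    # Embodiment-specific decoder mapping DiT output to robot actions.
    return [row[:action_dim] for row in dit_out]

# Illustrative forward pass: 2 camera views, 8 language tokens,
# a 16-step action chunk with a 32-dim internal representation.
feats = vlm_backbone(images=[0, 1], text_tokens=list(range(8)))
state_emb = state_projector(state=[0.0] * 44)
noisy_chunk = [[0.0] * 32 for _ in range(16)]
denoised = dit_action_head(feats, state_emb, noisy_chunk)
actions = action_projector(denoised, action_dim=23)
```

The key structural point is that only the projectors are embodiment-specific; the backbone and DiT are shared across robots, which is what makes cross-embodiment training and few-shot adaptation tractable.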
Performance benchmarks
GR00T N1.6 achieves state-of-the-art results across multiple simulation benchmarks:

| Benchmark | Task Type | Success Rate |
|---|---|---|
| LIBERO-Spatial | Tabletop manipulation | High performance |
| SimplerEnv | Bimanual tasks | Competitive |
| BEHAVIOR-1K | Loco-manipulation | Strong results |
| RoboCasa | Kitchen tasks | Zero-shot capable |
Inference timing on RTX 5090 with 4 denoising steps: 37 ms end-to-end (18 ms backbone + 16 ms action head), achieving 27.3 Hz throughput.
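As a quick consistency check on these numbers (plain arithmetic, no GR00T code involved): 37 ms per end-to-end call corresponds to roughly 27 Hz, and the backbone and action-head times leave a few milliseconds for the remainder of the pipeline.

```python
# Throughput from latency: frequency (Hz) = 1000 / period (ms).
backbone_ms = 18
action_head_ms = 16
end_to_end_ms = 37

hz = 1000 / end_to_end_ms          # ~27.0 Hz from a 37 ms period
other_ms = end_to_end_ms - (backbone_ms + action_head_ms)  # remaining budget
```

The quoted 27.3 Hz implies a period of about 36.6 ms, which rounds to the 37 ms figure above.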
Getting started
Installation
Set up your environment with the uv package manager and install dependencies
Quick start
Run inference with pre-trained checkpoints in minutes
Fine-tuning guide
Adapt GR00T to your robot embodiment and tasks
Examples
Explore simulation benchmarks and deployment examples
Target audience
GR00T N1.6 is intended for researchers and professionals in robotics. This repository provides tools to:

- Leverage a pre-trained foundation model for robot control
- Fine-tune on small, custom datasets
- Adapt the model to specific robotics tasks with minimal data
- Deploy the model for inference on real hardware
Resources
Research paper
Read the full technical paper on arXiv
Model weights
Download pre-trained checkpoints from Hugging Face
Research blog
Explore the official research blog post
Training dataset
Access the cross-embodiment training dataset