GR00T N models are designed for training on high-performance GPU servers and deployment on edge devices like NVIDIA Jetson AGX Thor. This guide outlines the recommended hardware configurations for different use cases.

Overview

The GR00T workflow consists of two main phases:
  1. Post-training (Finetuning): Train or finetune GR00T models on your custom robot data
  2. Deployment: Run optimized inference on robot hardware for real-time control
[Workflow diagram]

Getting started

For teams getting started with GR00T, the recommended configuration includes:

RTX PRO Server (Training)

| Component | Specification |
| --- | --- |
| GPUs | 8x RTX PRO 6000 Blackwell Server Edition GPUs |
| GPU Memory | 800 GB total GPU memory |
| Virtualization | VMware, Red Hat |
| Containerization | Kubernetes |
| Storage | Scalable, interoperable, secure storage |

Jetson AGX Thor Developer Kit (Deployment)

| Component | Specification |
| --- | --- |
| GPU | Blackwell GPU with 2560 CUDA cores |
| CPU | 14-core Arm Neoverse-V3AE CPU |
| Memory | 128 GB LP5 memory |

Center of excellence

For larger-scale deployments and research centers:

DGX B300 Server (Training)

| Component | Specification |
| --- | --- |
| GPUs | NVIDIA Blackwell Ultra GPUs |
| GPU Memory | 2.3 TB total GPU memory |
| Virtualization | VMware, Red Hat |
| Containerization | Kubernetes |
| Storage | Scalable, interoperable, secure storage |

Jetson AGX Thor Developer Kit (Deployment)

| Component | Specification |
| --- | --- |
| GPU | Blackwell GPU with 2560 CUDA cores |
| CPU | 14-core Arm Neoverse-V3AE CPU |
| Memory | 128 GB LP5 memory |

Training performance

GR00T finetuning performance varies with GPU hardware. We recommend a single H100 or L40 node for optimal finetuning performance; other configurations (e.g., A6000) will also work but may require longer training times.

Batch size recommendations

Optimal batch size depends on:
  • Available GPU memory
  • Which model components are being tuned (full model vs. adapter-only)
  • Number of GPUs available
For best results, maximize your batch size based on available hardware and train for a few thousand steps.
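As an illustration of how a global batch size relates to per-GPU memory use, the micro-batch each GPU processes can be derived as follows (the helper name and the gradient-accumulation parameter are illustrative, not part of the GR00T API):

```python
def per_gpu_batch(global_batch_size: int, num_gpus: int, grad_accum_steps: int = 1) -> int:
    """Split a global batch size across GPUs and gradient-accumulation steps."""
    denom = num_gpus * grad_accum_steps
    if global_batch_size % denom != 0:
        raise ValueError("global batch size must divide evenly across GPUs/accum steps")
    return global_batch_size // denom

# The multi-GPU example in this guide uses a global batch size of 32 on 8 GPUs:
print(per_gpu_batch(32, 8))  # -> 4 samples per GPU per step
```

If the per-GPU batch exceeds available VRAM, raising `grad_accum_steps` keeps the effective global batch size while shrinking the memory footprint.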

Multi-GPU training

GR00T supports distributed training across multiple GPUs:
export NUM_GPUS=8

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 uv run python \
    gr00t/experiment/launch_finetune.py \
    --base-model-path nvidia/GR00T-N1.6-3B \
    --dataset-path <DATASET_PATH> \
    --embodiment-tag NEW_EMBODIMENT \
    --num-gpus $NUM_GPUS \
    --global-batch-size 32 \
    --output-dir <OUTPUT_PATH>

Inference performance

GR00T N1.6 3B inference timing (4 denoising steps, single view):

Desktop GPUs

| Device | Mode | Data Processing | Backbone | Action Head | E2E | Frequency |
| --- | --- | --- | --- | --- | --- | --- |
| RTX 5090 | torch.compile | 2 ms | 18 ms | 16 ms | 37 ms | 27.3 Hz |
| RTX 5090 | TensorRT | 2 ms | 18 ms | 11 ms | 31 ms | 32.1 Hz |
| RTX 4090 | torch.compile | 2 ms | 25 ms | 17 ms | 44 ms | 22.8 Hz |
| RTX 4090 | TensorRT | 2 ms | 24 ms | 16 ms | 43 ms | 23.3 Hz |

Data center GPUs

| Device | Mode | Data Processing | Backbone | Action Head | E2E | Frequency |
| --- | --- | --- | --- | --- | --- | --- |
| H100 | torch.compile | 4 ms | 23 ms | 11 ms | 38 ms | 26.3 Hz |
| H100 | TensorRT | 4 ms | 22 ms | 10 ms | 36 ms | 27.9 Hz |

Edge devices

| Device | Mode | Data Processing | Backbone | Action Head | E2E | Frequency |
| --- | --- | --- | --- | --- | --- | --- |
| Thor | torch.compile | 5 ms | 39 ms | 61 ms | 105 ms | 9.5 Hz |
| Thor | TensorRT | 5 ms | 38 ms | 49 ms | 92 ms | 10.9 Hz |
| Orin | torch.compile | 6 ms | 93 ms | 101 ms | 199 ms | 5.0 Hz |
| Orin | TensorRT | 6 ms | 95 ms | 72 ms | 173 ms | 5.8 Hz |
The backbone (Vision Encoder + Language Model) timing is the same across all modes. Only the Action Head (DiT) is optimized with torch.compile or TensorRT, which is why you see significant speedups in the Action Head column while the Backbone column remains constant.
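The Frequency column is the reciprocal of the end-to-end latency. A minimal sanity check (pure Python; note the published figures appear to use unrounded component timings, so reciprocals of the rounded E2E values can differ slightly for some rows):

```python
def control_frequency_hz(e2e_ms: float) -> float:
    """Convert end-to-end latency in milliseconds to control frequency in Hz."""
    return 1000.0 / e2e_ms

# e.g. Jetson Thor with TensorRT: 92 ms end-to-end
print(round(control_frequency_hz(92), 1))  # -> 10.9
```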

Speedup vs PyTorch eager mode

| Device | torch.compile | TensorRT |
| --- | --- | --- |
| RTX 5090 | 1.58x | 1.86x |
| H100 | 2.02x | 2.14x |
| RTX 4090 | 1.87x | 1.92x |
| Thor | 1.11x | 1.27x |
| Orin | 1.50x | 1.73x |
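These ratios let you back out the approximate eager-mode latency from the optimized E2E timings above (an estimate, since the published values are rounded):

```python
def implied_eager_ms(optimized_e2e_ms: float, speedup: float) -> float:
    """Estimate eager-mode end-to-end latency from an optimized latency and its speedup factor."""
    return optimized_e2e_ms * speedup

# e.g. H100 with TensorRT: 36 ms at 2.14x implies roughly 77 ms in eager mode
print(round(implied_eager_ms(36, 2.14)))  # -> 77
```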

Minimum requirements

For training/finetuning

  • GPU: NVIDIA GPU with 24GB+ VRAM (e.g., RTX 4090, A6000, or better)
  • CUDA: Version 12.4 (recommended) or 11.8
  • System Memory: 32GB+ RAM recommended
  • Storage: 100GB+ free space for datasets and checkpoints

For inference only

  • GPU: NVIDIA GPU with 8GB+ VRAM
  • CUDA: Version 12.4 (recommended) or 11.8
  • System Memory: 16GB+ RAM
  • Storage: 50GB+ free space for model checkpoints

For edge deployment

  • Jetson AGX Thor: For optimal performance
  • Jetson AGX Orin: Supported but with reduced inference speed
GR00T requires CUDA-capable NVIDIA GPUs. CPU-only inference is not supported for production use due to performance constraints.
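The VRAM minimums above can be encoded in a simple pre-flight check (the function and threshold table are illustrative, taken from the requirements listed here, not from the GR00T codebase):

```python
# Minimum GPU memory in GB per use case, as listed above.
MIN_VRAM_GB = {"training": 24, "inference": 8}

def meets_vram_minimum(use_case: str, vram_gb: float) -> bool:
    """Return True if the GPU has enough memory for the given use case."""
    return vram_gb >= MIN_VRAM_GB[use_case]

print(meets_vram_minimum("training", 24))   # -> True (e.g. RTX 4090)
print(meets_vram_minimum("training", 16))   # -> False
```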

CUDA compatibility

| CUDA Version | Status | Notes |
| --- | --- | --- |
| 12.8 | Tested | For RTX 5090, use flash-attn==2.8.0.post2, pytorch-cu128 |
| 12.4 | Recommended | Officially tested and recommended |
| 11.8 | Supported | Requires manual installation of compatible flash-attn==2.8.2 |
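For automated setup scripts, the table above can be expressed as a lookup (the mapping and status strings mirror the table; the function itself is illustrative):

```python
# CUDA support status, mirroring the compatibility table above.
CUDA_STATUS = {"12.8": "tested", "12.4": "recommended", "11.8": "supported"}

def cuda_support_status(version: str) -> str:
    """Look up a CUDA version against the compatibility table; unknown versions are unverified."""
    return CUDA_STATUS.get(version, "unverified")

print(cuda_support_status("12.4"))  # -> recommended
print(cuda_support_status("13.0"))  # -> unverified
```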

Storage considerations

Dataset storage

  • Robot demonstration datasets can range from 10GB to 1TB+ depending on:
    • Number of episodes
    • Video resolution
    • Number of camera views
    • Episode length
  • Use fast storage (NVMe SSD) for training datasets to avoid I/O bottlenecks
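A rough back-of-the-envelope for dataset size based on the factors above (the per-stream bitrate is an assumption for illustration, not a GR00T default; tune it for your resolution and codec):

```python
def estimate_dataset_gb(episodes: int, episode_seconds: float, cameras: int,
                        video_mbps: float = 8.0) -> float:
    """Estimate compressed video storage: episodes x duration x camera streams x bitrate.

    video_mbps is an assumed per-stream H.264-class bitrate.
    """
    total_stream_seconds = episodes * episode_seconds * cameras
    total_bits = total_stream_seconds * video_mbps * 1e6
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

# e.g. 1000 episodes x 30 s x 3 camera views at ~8 Mbps per stream
print(round(estimate_dataset_gb(1000, 30, 3)))  # -> 90
```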

Model checkpoints

  • Base GR00T N1.6 3B model: ~6GB
  • Finetuned checkpoints: ~6GB each
  • ONNX exported models: ~3GB
  • TensorRT engines: ~2GB (GPU-specific)
Recommended capacity:
  • Training server: 1TB+ NVMe SSD for datasets and checkpoints
  • Deployment device: 64GB+ for model checkpoints and TensorRT engines
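Summing the approximate artifact sizes above gives a per-model storage budget (sizes are the rough figures listed here; the helper is illustrative):

```python
# Approximate artifact sizes in GB, from the list above.
SIZES_GB = {"base_model": 6, "finetuned_ckpt": 6, "onnx_export": 3, "tensorrt_engine": 2}

def storage_budget_gb(finetuned_checkpoints: int = 1) -> int:
    """Total storage for the base model, N finetuned checkpoints, one ONNX export,
    and one TensorRT engine."""
    return (SIZES_GB["base_model"]
            + finetuned_checkpoints * SIZES_GB["finetuned_ckpt"]
            + SIZES_GB["onnx_export"]
            + SIZES_GB["tensorrt_engine"])

# e.g. keeping three finetuned checkpoints around: 6 + 3*6 + 3 + 2
print(storage_budget_gb(3))  # -> 29
```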

Network requirements

Model download

  • First-time model download from Hugging Face: ~6GB bandwidth
  • Ensure stable internet connection for initial setup

Distributed training

  • For multi-node training, high-bandwidth interconnect (InfiniBand or 100GbE) recommended
  • For single-node multi-GPU, PCIe 4.0 or higher

Next steps

Installation

Install GR00T on your hardware

Quick start

Run your first inference example

TensorRT optimization

Optimize inference with TensorRT

Finetuning

Finetune GR00T on your data
