This guide covers running inference with PyTorch or TensorRT acceleration for the GR00T policy.

Prerequisites

  • Model checkpoint (e.g., nvidia/GR00T-N1.6-3B)
  • Dataset in LeRobot format
  • CUDA-enabled GPU

Installation

uv sync

Quick start: PyTorch mode

python scripts/deployment/standalone_inference_script.py \
  --model-path nvidia/GR00T-N1.6-3B \
  --dataset-path /path/to/dataset \
  --embodiment-tag GR1 \
  --traj-ids 0 1 2 \
  --inference-mode pytorch \
  --action-horizon 8
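The `--action-horizon` flag controls how many actions each inference call produces. One common receding-horizon pattern (sketched below with a stub policy, not the real GR00T API) executes the whole chunk before querying the model again, so `--steps 200` with `--action-horizon 8` would take 25 inference calls per trajectory:

```python
# Receding-horizon execution sketch. `StubPolicy` is a stand-in for the real
# policy object; the actual script's control loop may differ in detail.

class StubPolicy:
    def __init__(self):
        self.calls = 0

    def get_action_chunk(self, obs, horizon):
        # One model inference returns a chunk of `horizon` actions.
        self.calls += 1
        return [f"action_{self.calls}_{i}" for i in range(horizon)]

def run_trajectory(policy, steps=200, action_horizon=8):
    executed = []
    obs = "obs_0"
    while len(executed) < steps:
        chunk = policy.get_action_chunk(obs, action_horizon)
        for action in chunk:
            if len(executed) >= steps:
                break
            executed.append(action)  # execute one action, observe next state
            obs = f"obs_{len(executed)}"
    return executed

policy = StubPolicy()
actions = run_trajectory(policy, steps=200, action_horizon=8)
print(policy.calls)  # 200 steps / chunks of 8 -> 25 inference calls
```

A larger horizon means fewer (expensive) inference calls but longer open-loop stretches between observations.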

TensorRT mode (2x faster)

TensorRT roughly doubles end-to-end inference throughput by accelerating the action head (DiT), the main computational bottleneck (see Benchmarks below).
Step 1: Export to ONNX

python scripts/deployment/export_onnx_n1d6.py \
  --model-path nvidia/GR00T-N1.6-3B \
  --dataset-path /path/to/dataset \
  --embodiment-tag GR1 \
  --output-dir ./groot_n1d6_onnx
Output: ./groot_n1d6_onnx/dit_model.onnx

Step 2: Build the TensorRT engine

python scripts/deployment/build_tensorrt_engine.py \
  --onnx ./groot_n1d6_onnx/dit_model.onnx \
  --engine ./groot_n1d6_onnx/dit_model_bf16.trt \
  --precision bf16
Output: ./groot_n1d6_onnx/dit_model_bf16.trt

Engine build takes approximately 5-10 minutes depending on GPU. The engine is GPU-specific and needs to be rebuilt for different GPU architectures.

Step 3: Run inference with TensorRT

python scripts/deployment/standalone_inference_script.py \
  --model-path nvidia/GR00T-N1.6-3B \
  --dataset-path /path/to/dataset \
  --embodiment-tag GR1 \
  --traj-ids 0 1 2 \
  --inference-mode tensorrt \
  --trt-engine-path ./groot_n1d6_onnx/dit_model_bf16.trt \
  --action-horizon 8

Command-line arguments

standalone_inference_script.py

| Argument | Default | Description |
|----------|---------|-------------|
| `--model-path` | (required) | Path to model checkpoint |
| `--dataset-path` | (required) | Path to LeRobot dataset |
| `--embodiment-tag` | `GR1` | Embodiment tag |
| `--traj-ids` | `[0]` | List of trajectory IDs to evaluate |
| `--steps` | `200` | Max steps per trajectory |
| `--action-horizon` | `16` | Action horizon for inference |
| `--inference-mode` | `pytorch` | `pytorch` or `tensorrt` |
| `--trt-engine-path` | `./groot_n1d6_onnx/dit_model_bf16.trt` | TensorRT engine path |
| `--denoising-steps` | `4` | Number of denoising steps |
| `--skip-timing-steps` | `1` | Steps to skip for timing (warmup) |
| `--seed` | `42` | Random seed for reproducibility |
| `--video-backend` | `torchcodec` | Video backend (`decord`, `torchvision_av`, `torchcodec`) |
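The documented defaults can be mirrored with a plain `argparse` parser. This is a sketch of the CLI surface described above, not the script's actual source:

```python
import argparse

# Defaults copied from the argument table; this mirrors, not reproduces,
# standalone_inference_script.py's real parser.
def build_parser():
    p = argparse.ArgumentParser(description="GR00T standalone inference (sketch)")
    p.add_argument("--model-path", required=True)
    p.add_argument("--dataset-path", required=True)
    p.add_argument("--embodiment-tag", default="GR1")
    p.add_argument("--traj-ids", type=int, nargs="+", default=[0])
    p.add_argument("--steps", type=int, default=200)
    p.add_argument("--action-horizon", type=int, default=16)
    p.add_argument("--inference-mode", choices=["pytorch", "tensorrt"],
                   default="pytorch")
    p.add_argument("--trt-engine-path",
                   default="./groot_n1d6_onnx/dit_model_bf16.trt")
    p.add_argument("--denoising-steps", type=int, default=4)
    p.add_argument("--skip-timing-steps", type=int, default=1)
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--video-backend",
                   choices=["decord", "torchvision_av", "torchcodec"],
                   default="torchcodec")
    return p

# Only the two required arguments need to be supplied; everything else defaults.
args = build_parser().parse_args(
    ["--model-path", "nvidia/GR00T-N1.6-3B", "--dataset-path", "/tmp/ds"]
)
print(args.action_horizon, args.inference_mode)  # 16 pytorch
```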

export_onnx_n1d6.py

| Argument | Default | Description |
|----------|---------|-------------|
| `--model-path` | (required) | Path to model checkpoint |
| `--dataset-path` | (required) | Path to dataset (for input shape capture) |
| `--embodiment-tag` | `GR1` | Embodiment tag |
| `--output-dir` | `./groot_n1d6_onnx` | Output directory for ONNX model |
| `--video-backend` | `torchcodec` | Video backend |

build_tensorrt_engine.py

| Argument | Default | Description |
|----------|---------|-------------|
| `--onnx` | (required) | Path to ONNX model |
| `--engine` | (required) | Path to save TensorRT engine |
| `--precision` | `bf16` | Precision (`fp32`, `fp16`, `bf16`, `fp8`) |
| `--workspace` | `8192` | Workspace size in MB |

Benchmarks

GR00T-N1.6-3B inference timing with 4 denoising steps.

The backbone (Vision Encoder + Language Model) timing is the same across all modes; only the Action Head (DiT) is optimized with torch.compile or TensorRT.

Component-wise breakdown

| Device | Mode | Data Processing | Backbone | Action Head | E2E | Frequency |
|--------|------|-----------------|----------|-------------|-----|-----------|
| RTX 5090 | PyTorch Eager | 2 ms | 18 ms | 38 ms | 58 ms | 17.3 Hz |
| RTX 5090 | torch.compile | 2 ms | 18 ms | 16 ms | 37 ms | 27.3 Hz |
| RTX 5090 | TensorRT | 2 ms | 18 ms | 11 ms | 31 ms | 32.1 Hz |
| H100 | PyTorch Eager | 4 ms | 23 ms | 49 ms | 77 ms | 13.0 Hz |
| H100 | torch.compile | 4 ms | 23 ms | 11 ms | 38 ms | 26.3 Hz |
| H100 | TensorRT | 4 ms | 22 ms | 10 ms | 36 ms | 27.9 Hz |
| RTX 4090 | PyTorch Eager | 2 ms | 25 ms | 55 ms | 82 ms | 12.2 Hz |
| RTX 4090 | torch.compile | 2 ms | 25 ms | 17 ms | 44 ms | 22.8 Hz |
| RTX 4090 | TensorRT | 2 ms | 24 ms | 16 ms | 43 ms | 23.3 Hz |
| Orin | PyTorch Eager | 6 ms | 93 ms | 202 ms | 300 ms | 3.3 Hz |
| Orin | torch.compile | 6 ms | 93 ms | 101 ms | 199 ms | 5.0 Hz |
| Orin | TensorRT | 6 ms | 95 ms | 72 ms | 173 ms | 5.8 Hz |
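The columns are internally consistent: E2E is the sum of the three component timings and Frequency is 1000 / E2E, up to whole-millisecond rounding. A quick check over a few rows from the table:

```python
# Sanity-check the breakdown: E2E ~= Data + Backbone + Action Head, and
# Frequency ~= 1000 / E2E. Table values are rounded to whole ms, so the
# checks allow 1 ms / 0.2 Hz of rounding slack.
rows = {
    # "device / mode": (data_ms, backbone_ms, action_head_ms, e2e_ms, hz)
    "RTX 5090 / PyTorch Eager": (2, 18, 38, 58, 17.3),
    "H100 / TensorRT": (4, 22, 10, 36, 27.9),
    "Orin / torch.compile": (6, 93, 101, 199, 5.0),
}
for name, (data, backbone, head, e2e, hz) in rows.items():
    assert abs((data + backbone + head) - e2e) <= 1, name
    assert abs(1000 / e2e - hz) <= 0.2, name
print("component breakdown is self-consistent")
```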

Speedup vs PyTorch Eager

| Device | Mode | E2E Speedup | Action Head Speedup |
|--------|------|-------------|---------------------|
| RTX 5090 | torch.compile | 1.58x | 2.32x |
| RTX 5090 | TensorRT | 1.86x | 3.59x |
| H100 | torch.compile | 2.02x | 4.60x |
| H100 | TensorRT | 2.14x | 4.80x |
| RTX 4090 | torch.compile | 1.87x | 3.26x |
| RTX 4090 | TensorRT | 1.92x | 3.48x |
| Orin | torch.compile | 1.50x | 2.00x |
| Orin | TensorRT | 1.73x | 2.80x |

Run `python scripts/deployment/benchmark_inference.py` to generate benchmarks for your hardware.
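Each speedup is simply eager time divided by optimized time, computed from the component-wise table. Because the published timings are rounded to whole milliseconds, recomputing the ratios from them reproduces the listed speedups only approximately, especially for the ~10 ms action-head entries:

```python
# speedup = eager_time / optimized_time, using (e2e_ms, action_head_ms) pairs
# from the component-wise table. Whole-ms rounding limits agreement with the
# listed ratios to ~0.05x for E2E and ~0.2x for the fast action-head timings.
eager    = {"RTX 5090": (58, 38), "H100": (77, 49), "Orin": (300, 202)}
tensorrt = {"RTX 5090": (31, 11), "H100": (36, 10), "Orin": (173, 72)}
listed   = {"RTX 5090": (1.86, 3.59), "H100": (2.14, 4.80), "Orin": (1.73, 2.80)}

for device in eager:
    e2e_speedup = eager[device][0] / tensorrt[device][0]
    head_speedup = eager[device][1] / tensorrt[device][1]
    assert abs(e2e_speedup - listed[device][0]) <= 0.05, device
    assert abs(head_speedup - listed[device][1]) <= 0.2, device
print("speedup table consistent with component timings")
```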

Architecture

The TensorRT optimization targets the DiT (Diffusion Transformer) component of the action head, which is the main computational bottleneck during inference.
┌─────────────────────────────────────────────────────────────┐
│                    GR00T Policy                             │
│  ┌───────────────┐  ┌───────────────┐  ┌─────────────────┐  │
│  │ Vision Encoder│  │Language Model │  │  Action Head    │  │
│  │(Cosmos-Reason)│──│(Cosmos-Reason)│──│    (DiT)        │  │
│  └───────────────┘  └───────────────┘  └─────────────────┘  │
│                                              ▲              │
│                                              │              │
│                                    ┌─────────┴─────────┐    │
│                                    │ TensorRT Engine   │    │
│                                    │ (dit_model.trt)   │    │
│                                    └───────────────────┘    │
└─────────────────────────────────────────────────────────────┘
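The split above can be sketched as a toy inference loop: the backbone runs once per inference call, while the DiT runs once per denoising step (4 by default), which is why it dominates latency and is the component worth swapping out for a TensorRT engine. All functions below are stand-ins, not the real GR00T modules:

```python
# Structural sketch only: dummy data, no real model code.

def backbone(obs):
    # Stand-in for Vision Encoder + Language Model; always runs in PyTorch.
    return {"embedding": obs}

def dit_step(latent, embedding, counters):
    # Stand-in for one DiT denoising step; this is the call that can be
    # served by PyTorch eager, torch.compile, or the TensorRT engine.
    counters["dit_calls"] += 1
    return latent

def infer(obs, denoising_steps=4, action_horizon=8):
    counters = {"dit_calls": 0}
    emb = backbone(obs)                  # runs once per inference call
    latent = [0.0] * action_horizon      # noisy action chunk
    for _ in range(denoising_steps):     # DiT runs once per denoising step
        latent = dit_step(latent, emb, counters)
    return latent, counters

actions, counters = infer("frame")
print(counters["dit_calls"])  # 4 denoising steps -> 4 DiT calls per inference
```

This is also why `--denoising-steps` directly scales the action-head share of the end-to-end latency.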

Troubleshooting

Engine build fails

  • Ensure you have enough GPU memory (8GB+ recommended)
  • Try reducing workspace size: --workspace 4096
  • Ensure TensorRT version matches your CUDA version

ONNX export issues

  • If export fails, ensure the model loads correctly in PyTorch first
  • Check that the dataset path is valid and contains at least one trajectory
