TensorRT provides the fastest GR00T inference by compiling the DiT action head to GPU-specific optimized kernels. This guide covers the complete workflow from ONNX export to TensorRT inference.

Performance gains

TensorRT delivers significant speedups across all GPU platforms:
| Device   | PyTorch Eager   | TensorRT        | Speedup |
|----------|-----------------|-----------------|---------|
| RTX 5090 | 58 ms (17.3 Hz) | 31 ms (32.1 Hz) | 1.86x   |
| H100     | 77 ms (13.0 Hz) | 36 ms (27.9 Hz) | 2.14x   |
| RTX 4090 | 82 ms (12.2 Hz) | 43 ms (23.3 Hz) | 1.92x   |
| Thor     | 117 ms (8.6 Hz) | 92 ms (10.9 Hz) | 1.27x   |
| Orin     | 300 ms (3.3 Hz) | 173 ms (5.8 Hz) | 1.73x   |
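The frequency and speedup columns follow directly from the latencies; the table's figures were computed from unrounded measurements, so ratios of the rounded milliseconds can differ in the last digit. A quick sanity check:

```python
def hz(latency_ms: float) -> float:
    """Convert a per-inference latency in milliseconds to a control frequency in Hz."""
    return 1000.0 / latency_ms

def speedup(eager_ms: float, trt_ms: float) -> float:
    """Speedup of TensorRT over PyTorch eager, expressed as a latency ratio."""
    return eager_ms / trt_ms

# RTX 5090 row: 58 ms eager vs 31 ms TensorRT
print(round(hz(31), 1))           # 32.3 (table reports 32.1 from unrounded timings)
print(round(speedup(58, 31), 2))  # 1.87 (table reports 1.86)
```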

Prerequisites

Installation

Install TensorRT dependencies:
uv sync --extra tensorrt
This installs:
  • ONNX export tools
  • TensorRT Python bindings
  • Additional optimization libraries

Hardware requirements

  • CUDA-enabled GPU with 8GB+ VRAM (recommended)
  • Compatible CUDA version (12.4 recommended, 11.8 also supported)
  • Sufficient disk space (~2GB for engine cache)

Complete workflow

1. Export model to ONNX

Convert the DiT action head to ONNX format:
python scripts/deployment/export_onnx_n1d6.py \
  --model-path nvidia/GR00T-N1.6-3B \
  --dataset-path demo_data/gr1.PickNPlace \
  --embodiment-tag GR1 \
  --output-dir ./groot_n1d6_onnx
Output: ./groot_n1d6_onnx/dit_model.onnx

This captures the input shapes from a sample trajectory and exports only the action head component.
2. Build TensorRT engine

Compile the ONNX model to a TensorRT engine:
python scripts/deployment/build_tensorrt_engine.py \
  --onnx ./groot_n1d6_onnx/dit_model.onnx \
  --engine ./groot_n1d6_onnx/dit_model_bf16.trt \
  --precision bf16
Output: ./groot_n1d6_onnx/dit_model_bf16.trt

Engine build takes 5-10 minutes depending on the GPU. The engine is GPU-specific and must be rebuilt for each GPU architecture.
3. Run inference with TensorRT

Use the compiled engine for accelerated inference:
python scripts/deployment/standalone_inference_script.py \
  --model-path nvidia/GR00T-N1.6-3B \
  --dataset-path demo_data/gr1.PickNPlace \
  --embodiment-tag GR1 \
  --traj-ids 0 1 2 \
  --inference-mode tensorrt \
  --trt-engine-path ./groot_n1d6_onnx/dit_model_bf16.trt \
  --action-horizon 8

ONNX export options

The export_onnx_n1d6.py script supports these arguments:
| Argument | Default | Description |
|---|---|---|
| --model-path | (required) | Model checkpoint path |
| --dataset-path | (required) | Dataset for capturing input shapes |
| --embodiment-tag | GR1 | Embodiment tag |
| --output-dir | ./groot_n1d6_onnx | Output directory |
| --video-backend | torchcodec | Video backend (decord, torchvision_av, torchcodec) |

Example: Export for custom embodiment

python scripts/deployment/export_onnx_n1d6.py \
  --model-path /path/to/finetuned/checkpoint \
  --dataset-path /path/to/custom/dataset \
  --embodiment-tag NEW_EMBODIMENT \
  --output-dir ./custom_onnx

TensorRT engine build options

The build_tensorrt_engine.py script provides fine-grained control:
| Argument | Default | Description |
|---|---|---|
| --onnx | (required) | Path to ONNX model |
| --engine | (required) | Path to save TensorRT engine |
| --precision | bf16 | Precision mode |
| --workspace | 8192 | Workspace size in MB |

Precision modes

BF16

Best balance of speed and accuracy:
python scripts/deployment/build_tensorrt_engine.py \
  --onnx ./groot_n1d6_onnx/dit_model.onnx \
  --engine ./groot_n1d6_onnx/dit_model_bf16.trt \
  --precision bf16

FP16

More mantissa precision than BF16, at the cost of a narrower dynamic range:
python scripts/deployment/build_tensorrt_engine.py \
  --onnx ./groot_n1d6_onnx/dit_model.onnx \
  --engine ./groot_n1d6_onnx/dit_model_fp16.trt \
  --precision fp16

FP32

Full precision (slowest):
python scripts/deployment/build_tensorrt_engine.py \
  --onnx ./groot_n1d6_onnx/dit_model.onnx \
  --engine ./groot_n1d6_onnx/dit_model_fp32.trt \
  --precision fp32

FP8

Maximum speed (requires Ada Lovelace or newer):
python scripts/deployment/build_tensorrt_engine.py \
  --onnx ./groot_n1d6_onnx/dit_model.onnx \
  --engine ./groot_n1d6_onnx/dit_model_fp8.trt \
  --precision fp8
FP8 requires RTX 40-series or newer GPUs. Verify your GPU supports FP8 before using this mode.
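One way to compare the precision modes is storage cost per element, which drives engine size and memory bandwidth (accuracy depends on the model, not just the width). A rough stdlib-only estimate; the 500M parameter count below is illustrative, not the actual DiT size:

```python
# Bytes per element for each supported precision mode
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}

def weight_mb(num_params: int, precision: str) -> float:
    """Approximate weight storage in MB for a given element width."""
    return num_params * BYTES_PER_ELEMENT[precision] / (1024 ** 2)

# Illustrative 500M-parameter action head
for p in ("fp32", "bf16", "fp8"):
    print(p, round(weight_mb(500_000_000, p)), "MB")
# fp32 1907 MB, bf16 954 MB, fp8 477 MB
```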

Workspace size

Increase workspace for complex models or reduce for memory-constrained environments:
# Larger workspace for better optimization
python scripts/deployment/build_tensorrt_engine.py \
  --onnx ./groot_n1d6_onnx/dit_model.onnx \
  --engine ./groot_n1d6_onnx/dit_model_bf16.trt \
  --precision bf16 \
  --workspace 16384  # 16GB

# Smaller workspace for GPUs with limited memory
python scripts/deployment/build_tensorrt_engine.py \
  --onnx ./groot_n1d6_onnx/dit_model.onnx \
  --engine ./groot_n1d6_onnx/dit_model_bf16.trt \
  --precision bf16 \
  --workspace 4096  # 4GB
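If you script builds across heterogeneous machines, a small helper can pick the `--workspace` value from available VRAM. This function is hypothetical (not part of the repo), and the half-of-free-VRAM heuristic is just one reasonable policy:

```python
def choose_workspace_mb(free_vram_mb: int, requested_mb: int = 8192,
                        floor_mb: int = 2048) -> int:
    """Clamp the requested TensorRT workspace to at most half of free VRAM,
    but never below a floor that still allows useful tactic selection."""
    return max(floor_mb, min(requested_mb, free_vram_mb // 2))

print(choose_workspace_mb(32768))  # ample VRAM -> full 8192
print(choose_workspace_mb(6144))   # constrained GPU -> 3072
```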

Inference script arguments

Key arguments for standalone_inference_script.py:
| Argument | Default | Description |
|---|---|---|
| --inference-mode | pytorch | Set to tensorrt for TensorRT inference |
| --trt-engine-path | ./groot_n1d6_onnx/dit_model_bf16.trt | TensorRT engine path |
| --denoising-steps | 4 | Number of denoising steps |
| --action-horizon | 16 | Action horizon |

Performance tuning

Denoising steps

Fewer denoising steps = faster inference, but may reduce action quality:
# Fast inference (2 steps)
python scripts/deployment/standalone_inference_script.py \
  --inference-mode tensorrt \
  --trt-engine-path ./groot_n1d6_onnx/dit_model_bf16.trt \
  --denoising-steps 2

# Balanced (4 steps, recommended)
python scripts/deployment/standalone_inference_script.py \
  --inference-mode tensorrt \
  --trt-engine-path ./groot_n1d6_onnx/dit_model_bf16.trt \
  --denoising-steps 4

# High quality (8 steps)
python scripts/deployment/standalone_inference_script.py \
  --inference-mode tensorrt \
  --trt-engine-path ./groot_n1d6_onnx/dit_model_bf16.trt \
  --denoising-steps 8
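Because the DiT runs once per denoising step, action-head latency grows roughly linearly with the step count while the backbone cost stays fixed. A back-of-the-envelope model; the per-step and backbone costs below are illustrative values chosen to roughly match the RTX 5090 row, not measurements:

```python
def estimated_latency_ms(steps: int, per_step_ms: float = 2.75,
                         backbone_ms: float = 20.0) -> float:
    """Rough end-to-end latency: fixed backbone cost plus one DiT pass per step."""
    return backbone_ms + steps * per_step_ms

for steps in (2, 4, 8):
    print(steps, estimated_latency_ms(steps), "ms")
# 2 -> 25.5 ms, 4 -> 31.0 ms, 8 -> 42.0 ms
```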

Benchmark engine performance

Measure TensorRT speedup on your hardware:
python scripts/deployment/benchmark_inference.py \
  --model-path nvidia/GR00T-N1.6-3B \
  --dataset-path demo_data/gr1.PickNPlace \
  --embodiment-tag GR1 \
  --trt-engine-path ./groot_n1d6_onnx/dit_model_bf16.trt \
  --num-iterations 20 \
  --warmup 5
Output includes component-wise timing:
=== TensorRT Benchmark ===
Data Processing: 2 ms
Backbone: 18 ms
Action Head: 11 ms  (3.59x faster than eager)
E2E: 31 ms  (1.86x faster than eager)
Frequency: 32.1 Hz
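The `--warmup 5` flag matters because the first few TensorRT executions pay one-time costs (context creation, memory allocation, kernel caching). The measurement pattern, sketched with the stdlib against a stand-in workload:

```python
import time

def benchmark(fn, num_iterations: int = 20, warmup: int = 5) -> float:
    """Run fn repeatedly, discard the warmup iterations, return mean latency in ms."""
    timings = []
    for i in range(num_iterations + warmup):
        start = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if i >= warmup:  # keep only post-warmup samples
            timings.append(elapsed_ms)
    return sum(timings) / len(timings)

# Stand-in workload; in the real script fn would be one policy inference
mean_ms = benchmark(lambda: sum(range(10_000)))
print(f"{mean_ms:.3f} ms")
```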

Platform-specific notes

Jetson platforms (Thor, Orin)

Experiments on Thor used CUDA 13, PyTorch 2.9 from Jetson AI Lab cu130 index. Orin used CUDA 12.6, PyTorch 2.8 from Jetson AI Lab cu126 index.
Optimize workspace for embedded GPUs:
# Jetson Orin (64GB)
python scripts/deployment/build_tensorrt_engine.py \
  --onnx ./groot_n1d6_onnx/dit_model.onnx \
  --engine ./groot_n1d6_onnx/dit_model_bf16.trt \
  --precision bf16 \
  --workspace 4096

RTX 5090

RTX 5090 tested with CUDA 12.8, flash-attn==2.8.0.post2, pytorch-cu128. Requires uv v0.8.4+.
Leverage large VRAM for maximum workspace:
python scripts/deployment/build_tensorrt_engine.py \
  --onnx ./groot_n1d6_onnx/dit_model.onnx \
  --engine ./groot_n1d6_onnx/dit_model_bf16.trt \
  --precision bf16 \
  --workspace 16384

Troubleshooting

Engine build fails

Symptoms: Build crashes or fails during optimization

Solutions:
  1. Reduce workspace size:
    --workspace 4096
    
  2. Verify GPU memory:
    nvidia-smi
    
  3. Check TensorRT version matches CUDA:
    python -c "import tensorrt; print(tensorrt.__version__)"
    

ONNX export issues

Symptoms: Export fails with shape mismatch errors

Solutions:
  1. Verify model loads in PyTorch:
    from gr00t.policy.gr00t_policy import Gr00tPolicy
    policy = Gr00tPolicy(
        embodiment_tag="GR1",
        model_path="nvidia/GR00T-N1.6-3B"
    )
    
  2. Check dataset path contains valid trajectories:
    ls demo_data/gr1.PickNPlace/
    

Engine not portable between GPUs

TensorRT engines are GPU-specific. An engine built on RTX 4090 will not work on H100. Rebuild the engine on each target platform.
Solution: Build separate engines for each GPU architecture:
# On RTX 4090
python scripts/deployment/build_tensorrt_engine.py \
  --onnx ./groot_n1d6_onnx/dit_model.onnx \
  --engine ./engines/dit_model_rtx4090.trt \
  --precision bf16

# On H100  
python scripts/deployment/build_tensorrt_engine.py \
  --onnx ./groot_n1d6_onnx/dit_model.onnx \
  --engine ./engines/dit_model_h100.trt \
  --precision bf16
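When automating per-GPU builds, a tiny helper can derive a filesystem-safe engine path from the device name so engines from different machines never collide. This function is hypothetical (not part of the repo); in practice the name would come from something like `torch.cuda.get_device_name()`:

```python
import re

def engine_path(gpu_name: str, precision: str = "bf16",
                root: str = "./engines") -> str:
    """Map a GPU device name to a per-architecture engine path."""
    slug = re.sub(r"[^a-z0-9]+", "_", gpu_name.lower()).strip("_")
    return f"{root}/dit_model_{slug}_{precision}.trt"

print(engine_path("NVIDIA GeForce RTX 4090"))
# ./engines/dit_model_nvidia_geforce_rtx_4090_bf16.trt
print(engine_path("NVIDIA H100 PCIe", precision="fp8"))
# ./engines/dit_model_nvidia_h100_pcie_fp8.trt
```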

Slow first inference

Symptoms: First inference takes much longer than subsequent ones

Expected behavior: TensorRT engines pay one-time warmup overhead

Solution: Exclude the warmup iterations from timing:
python scripts/deployment/standalone_inference_script.py \
  --inference-mode tensorrt \
  --trt-engine-path ./groot_n1d6_onnx/dit_model_bf16.trt \
  --skip-timing-steps 5  # Skip first 5 steps from timing

Out of memory during build

Symptoms: CUDA out of memory error during engine compilation

Solutions:
  1. Reduce workspace:
    --workspace 2048
    
  2. Close other GPU processes:
    nvidia-smi
    kill <pid>
    
  3. Use FP32 instead of FP16/BF16 (uses less memory during build):
    --precision fp32
    

Architecture details

TensorRT optimizes only the DiT action head:
┌─────────────────────────────────────────────────────────────┐
│                    GR00T Policy                             │
│  ┌───────────────┐  ┌───────────────┐  ┌─────────────────┐  │
│  │ Vision Encoder│  │Language Model │  │  Action Head    │  │
│  │(Cosmos-Reason)│──│(Cosmos-Reason)│──│    (DiT)        │  │
│  └───────────────┘  └───────────────┘  └─────────────────┘  │
│         │                  │                     ▲          │
│         │                  │                     │          │
│         └──────────────────┘           ┌─────────┴────────┐ │
│          (PyTorch Eager)                │ TensorRT Engine  │ │
│                                         │ (dit_model.trt)  │ │
│                                         └──────────────────┘ │
└─────────────────────────────────────────────────────────────┘
The backbone (vision encoder + language model) remains in PyTorch, while the action head runs in TensorRT for maximum performance.
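The hybrid split is ultimately just function composition: the backbone callable lives in one runtime and the action head in another, joined at the embedding boundary. A minimal stand-in showing the control flow in plain Python (no actual PyTorch or TensorRT; the lambdas below are placeholders for the real components):

```python
class HybridPolicy:
    """Compose a backbone from one runtime with an action head from another.
    In the real pipeline the backbone is PyTorch eager and the action head
    is a TensorRT engine; only the call interface matters here."""

    def __init__(self, backbone, action_head):
        self.backbone = backbone        # stand-in for vision encoder + language model
        self.action_head = action_head  # stand-in for the compiled DiT engine

    def get_action(self, observation):
        embeddings = self.backbone(observation)  # PyTorch eager in practice
        return self.action_head(embeddings)      # TensorRT execution in practice

# Toy callables so the flow is runnable end to end
policy = HybridPolicy(backbone=lambda obs: [x * 2 for x in obs],
                      action_head=lambda emb: sum(emb))
print(policy.get_action([1, 2, 3]))  # -> 12
```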

Advanced configuration

Custom layer precision

Override precision for specific layers (requires editing build script):
import tensorrt as trt

# In build_tensorrt_engine.py, after the network is parsed
config.set_flag(trt.BuilderFlag.FP16)
# Per-layer precision is only honored when precision constraints are enabled
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
for layer in network:
    if "attention" in layer.name:
        layer.precision = trt.float32  # Force FP32 for attention layers

DLA acceleration (Jetson)

Offload layers to Deep Learning Accelerator on Jetson:
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0
# Fall back to the GPU for layers DLA does not support
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

Dynamic shapes

For variable batch sizes or action horizons, configure dynamic shapes during export (requires code modification).
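In TensorRT, dynamic shapes are declared as min/opt/max optimization profiles per network input, then passed to `IOptimizationProfile.set_shape(name, min, opt, max)` during the build. The shape triples themselves are plain data, sketched here without importing TensorRT; the input names and dimensions below are illustrative, not the DiT's actual I/O:

```python
# (min, opt, max) shape triples that a modified build_tensorrt_engine.py
# would pass to IOptimizationProfile.set_shape(name, min, opt, max)
DYNAMIC_PROFILES = {
    # batch dimension varies from 1 to 8, optimized for 1
    "sample": ((1, 16, 32), (1, 16, 32), (8, 16, 32)),
    # action horizon varies from 8 to 32 steps, optimized for 16
    "timestep_embedding": ((1, 8, 256), (1, 16, 256), (1, 32, 256)),
}

def validate(profiles):
    """Each triple must satisfy min <= opt <= max, dimension by dimension."""
    for name, (lo, opt, hi) in profiles.items():
        assert all(a <= b <= c for a, b, c in zip(lo, opt, hi)), name
    return True

print(validate(DYNAMIC_PROFILES))  # True
```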
