Benchmarks and examples

This section provides comprehensive examples and benchmarks for evaluating and fine-tuning GR00T models across various robotic tasks and embodiments.

Available benchmarks

LIBERO

Lifelong robot learning benchmark with spatial reasoning, object generalization, and long-horizon tasks

SimplerEnv

Real-world robot manipulation policy evaluation framework with GPU-accelerated simulations

BEHAVIOR

50 household tasks testing loco-manipulation capabilities with Galaxea R1 Pro

RoboCasa

Large-scale kitchen simulation with 2,500+ 3D assets and 100 diverse manipulation tasks

G1 locomanipulation

Whole-body control tasks for Unitree G1 humanoid robot

DROID

Real-world manipulation tasks using the DROID dataset

PointNav

Point navigation tasks with COMPASS-generated datasets

SO-100

Teleoperation and deployment for SO-100 robot arms

Evaluation approach

GR00T supports a two-stage evaluation workflow:

Open-loop evaluation

Offline assessment comparing predicted actions against ground-truth actions from your dataset. This provides a quick check of model accuracy without running a simulator:

uv run python gr00t/eval/open_loop_eval.py \
    --dataset-path <DATASET_PATH> \
    --embodiment-tag <EMBODIMENT_TAG> \
    --model-path <CHECKPOINT_PATH> \
    --traj-ids 0 \
    --action-horizon 16
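To make the comparison concrete, here is a minimal sketch of an open-loop error computation. The source does not say which metric `open_loop_eval.py` reports; mean squared error is assumed here as one common choice. The function name `open_loop_mse` and the sample data are hypothetical and are not part of GR00T.

```python
def open_loop_mse(predicted, ground_truth):
    """Mean squared error between two action sequences.

    Both arguments are lists of per-timestep action vectors with
    identical shapes (timesteps x action_dim). Hypothetical helper,
    not a GR00T API.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("sequences differ in number of timesteps")
    total = 0.0
    count = 0
    for pred_step, true_step in zip(predicted, ground_truth):
        if len(pred_step) != len(true_step):
            raise ValueError("action vectors differ in dimension")
        for p, g in zip(pred_step, true_step):
            total += (p - g) ** 2
            count += 1
    if count == 0:
        raise ValueError("no actions to compare")
    return total / count


# Hypothetical data: 16 timesteps of 7-dimensional actions, with every
# predicted value offset from the ground truth by 0.1.
ground_truth = [[0.0] * 7 for _ in range(16)]
predicted = [[0.1] * 7 for _ in range(16)]
mse = open_loop_mse(predicted, ground_truth)
print(f"open-loop MSE: {mse:.4f}")
```

A lower value means the predicted actions track the recorded trajectory more closely. It does not by itself guarantee closed-loop task success.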

Closed-loop evaluation

Testing in simulation environments using a server-client architecture.

Server (GPU machine):

uv run python gr00t/eval/run_gr00t_server.py \
    --embodiment-tag <EMBODIMENT_TAG> \
    --model-path <CHECKPOINT_PATH> \
    --use-sim-policy-wrapper

Client (simulation environment):

uv run python gr00t/eval/rollout_policy.py \
    --n_episodes 10 \
    --policy_client_host 127.0.0.1 \
    --policy_client_port 5555 \
    --env_name <ENV_NAME> \
    --n_action_steps 8
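The client loop can be pictured roughly as follows. This sketch assumes that the policy returns a chunk of future actions per query and that `--n_action_steps 8` means the client executes 8 of them before querying the server again; the source does not confirm that interpretation. `DummyPolicy`, `DummyEnv`, and `rollout` are hypothetical stand-ins with no network code, not GR00T classes.

```python
class DummyPolicy:
    """Hypothetical stand-in for a remote policy server."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.calls = 0  # number of times the policy was queried

    def get_action(self, observation):
        self.calls += 1
        # Return a chunk of `horizon` placeholder actions.
        return [observation + i for i in range(self.horizon)]


class DummyEnv:
    """Hypothetical environment whose observation is a step counter."""

    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, self.t >= self.max_steps  # (observation, done)


def rollout(env, policy, n_action_steps):
    """Run one episode, re-querying the policy every n_action_steps."""
    obs = env.reset()
    done = False
    executed = []
    while not done:
        chunk = policy.get_action(obs)
        # Execute only the first n_action_steps actions of the chunk.
        for action in chunk[:n_action_steps]:
            obs, done = env.step(action)
            executed.append(action)
            if done:
                break
    return executed


policy = DummyPolicy(horizon=16)
env = DummyEnv(max_steps=20)
actions = rollout(env, policy, n_action_steps=8)
print(len(actions), policy.calls)
```

Under this assumption, executing fewer actions per chunk means more frequent re-planning at the cost of more server round trips.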

Pre-registered embodiments

GR00T provides several pre-registered embodiment tags with ready-to-use configurations:

LIBERO_PANDA - Franka Emika Panda for LIBERO tasks
OXE_GOOGLE - Google Robot for manipulation
OXE_WIDOWX - WidowX robot for Bridge dataset
UNITREE_G1 - Unitree G1 humanoid for loco-manipulation
BEHAVIOR_R1_PRO - Galaxea R1 Pro for household tasks
ROBOCASA_PANDA_OMRON - Panda with Omron gripper for kitchen tasks
OXE_DROID - DROID dataset embodiment

Training variance

Expect performance to vary by 5-6% between training runs, even with identical configurations and seeds. The variance comes from non-deterministic operations in image augmentations and other stochastic components, so account for it when comparing your results against reported benchmarks.
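One way to account for this when comparing numbers is to train several runs and compare their mean against the reported figure. The sketch below is illustrative only: it assumes the 5-6% figure refers to percentage points of success rate, and the helper names, the tolerance of 6.0, and the sample values are all hypothetical.

```python
import statistics


def summarize_runs(success_rates):
    """Return (mean, sample standard deviation, max-min spread)."""
    mean = statistics.mean(success_rates)
    stdev = statistics.stdev(success_rates) if len(success_rates) > 1 else 0.0
    spread = max(success_rates) - min(success_rates)
    return mean, stdev, spread


def within_run_variance(reported, measured_mean, tolerance=6.0):
    """True if the gap to the reported number is within the tolerance.

    `tolerance` is in percentage points (an assumption about how the
    5-6% figure should be read).
    """
    return abs(reported - measured_mean) <= tolerance


# Hypothetical success rates (in percent) from three identical runs.
runs = [88.0, 92.0, 90.0]
mean, stdev, spread = summarize_runs(runs)
print(f"mean={mean:.1f} stdev={stdev:.1f} spread={spread:.1f}")
print(within_run_variance(reported=93.0, measured_mean=mean))
```

A gap inside the tolerance is consistent with run-to-run noise. A larger gap is more likely to indicate a real difference in setup.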

Getting started

Select a benchmark from the list above to view detailed setup instructions, fine-tuning commands, evaluation procedures, and reported results.
