Available benchmarks
LIBERO
Lifelong robot learning benchmark with spatial reasoning, object generalization, and long-horizon tasks
SimplerEnv
Simulated evaluation environments for real-world robot manipulation policies, with GPU-accelerated simulation
BEHAVIOR
50 household tasks testing loco-manipulation capabilities with Galaxea R1 Pro
RoboCasa
Large-scale kitchen simulation with 2,500+ 3D assets and 100 diverse manipulation tasks
G1 loco-manipulation
Whole-body control tasks for Unitree G1 humanoid robot
DROID
Real-world manipulation tasks using the DROID dataset
PointNav
Point navigation tasks with COMPASS-generated datasets
SO-100
Teleoperation and deployment for SO-100 robot arms
Evaluation approach
GR00T supports a two-stage evaluation workflow.

Open-loop evaluation
Offline assessment comparing predicted actions against ground-truth actions from your dataset. This provides a quick validation of model accuracy.

Closed-loop evaluation
Testing in simulation environments using a server-client architecture: the policy server runs on a GPU machine, and the simulation client queries it for actions.

Pre-registered embodiments
GR00T provides several pre-registered embodiment tags with ready-to-use configurations:
LIBERO_PANDA - Franka Emika Panda for LIBERO tasks
OXE_GOOGLE - Google Robot for manipulation
OXE_WIDOWX - WidowX robot for the Bridge dataset
UNITREE_G1 - Unitree G1 humanoid for loco-manipulation
BEHAVIOR_R1_PRO - Galaxea R1 Pro for household tasks
ROBOCASA_PANDA_OMRON - Panda with Omron gripper for kitchen tasks
OXE_DROID - DROID dataset embodiment
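The open-loop stage described above can be sketched as below. This is a minimal illustration, not the actual GR00T API: `policy`, `dataset`, and the field names `observation` and `action` are placeholders for your trained model and evaluation split.

```python
import numpy as np

def open_loop_mse(policy, dataset):
    """Mean squared error between predicted and ground-truth actions.

    `policy` is any callable mapping an observation to an action vector;
    `dataset` is an iterable of {"observation": ..., "action": ...} steps.
    Both are hypothetical stand-ins for your model and eval data.
    """
    errors = []
    for step in dataset:
        predicted = np.asarray(policy(step["observation"]))
        ground_truth = np.asarray(step["action"])
        errors.append(np.mean((predicted - ground_truth) ** 2))
    return float(np.mean(errors))

# Toy example: a "policy" that always predicts zeros, scored on two steps.
dataset = [
    {"observation": None, "action": [0.1, -0.2]},
    {"observation": None, "action": [0.3, 0.0]},
]
zero_policy = lambda obs: np.zeros(2)
print(open_loop_mse(zero_policy, dataset))  # → 0.035
```

Because open-loop scoring never feeds predictions back into an environment, it is cheap to run after every checkpoint, but a low MSE does not guarantee closed-loop task success.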
Training variance
You may observe performance variance of 5-6% between training runs, even with identical configurations and seeds. This is due to non-deterministic operations in image augmentations and other stochastic components. Keep this inherent variance in mind when comparing to reported benchmarks.
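Given that run-to-run variance, a single benchmark number can be misleading; one simple practice is to evaluate several training runs and report the mean and spread. The success rates below are made-up placeholders, not reported results:

```python
import statistics

# Illustrative only: aggregate per-run benchmark success rates and report
# the mean and the run-to-run spread, since ~5-6% variance between runs
# is expected even with identical configurations and seeds.
success_rates = [0.82, 0.78, 0.84, 0.79]  # one entry per training run

mean = statistics.mean(success_rates)
spread = max(success_rates) - min(success_rates)

print(f"mean success rate: {mean:.4f}")   # → 0.8075
print(f"run-to-run spread: {spread:.2f}") # → 0.06, within the expected 5-6%
```

Comparing the spread of your own runs against the expected 5-6% band helps distinguish a real regression from ordinary training noise.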