We recommend a two-stage evaluation approach: open-loop evaluation followed by closed-loop evaluation to comprehensively assess model quality.
Open-loop evaluation
Open-loop evaluation provides an offline assessment by comparing the model’s predicted actions against ground truth data from your dataset.
Running the evaluation
Execute the evaluation script with your trained model:
Parameters
| Parameter | Description |
|---|---|
| `--dataset-path` | Path to your dataset in LeRobot format |
| `--embodiment-tag` | Embodiment tag for your robot |
| `--model-path` | Path to the trained model checkpoint |
| `--traj-ids` | List of trajectory IDs to evaluate |
| `--action-horizon` | Action horizon (must be within the `delta_indices` of the action’s modality config) |
| `--steps` | Maximum number of steps to evaluate |
| `--modality-keys` | List of modality keys to plot |
Interpreting results
The evaluation generates a visualization saved at `/tmp/open_loop_eval/traj_{traj_id}.jpeg`, which includes:
- Ground truth actions vs. predicted actions
- Unnormalized mean squared error (MSE) metrics
- Unnormalized mean absolute error (MAE) metrics
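To make the two error metrics concrete, here is a minimal sketch of how MSE and MAE can be computed between ground-truth and predicted actions in unnormalized (physical) units. The function name `action_errors` and the choice to average over all timesteps and action dimensions together are assumptions for illustration. The evaluation script may aggregate differently, for example per modality key.

```python
from statistics import fmean


def action_errors(ground_truth, predicted):
    """Return (mse, mae) averaged over all timesteps and action dimensions.

    Both arguments are lists of per-step action vectors in unnormalized
    units and must have identical shapes. This helper is hypothetical and
    not part of the GR00T codebase.
    """
    if len(ground_truth) != len(predicted):
        raise ValueError("trajectories must have the same number of steps")

    diffs = []
    for gt_step, pred_step in zip(ground_truth, predicted):
        if len(gt_step) != len(pred_step):
            raise ValueError("action vectors must have the same dimension")
        diffs.extend(p - g for g, p in zip(gt_step, pred_step))

    mse = fmean([d * d for d in diffs])   # mean squared error
    mae = fmean([abs(d) for d in diffs])  # mean absolute error
    return mse, mae


# Toy trajectory: 3 steps, 2 action dimensions.
gt = [[0.0, 1.0], [0.5, 1.0], [1.0, 1.0]]
pred = [[0.0, 1.0], [0.5, 0.5], [1.0, 2.0]]
mse, mae = action_errors(gt, pred)
print(f"MSE={mse:.4f} MAE={mae:.4f}")
```

Because MSE squares each error, a few large deviations raise it much more than they raise MAE, so comparing the two hints at whether errors are spread evenly or concentrated in a few steps.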
Closed-loop evaluation
After validating performance through open-loop evaluation, test your model in closed-loop environments.
Server-client architecture
GR00T uses a server-client architecture for closed-loop evaluation, which allows you to:
- Run policy inference on a GPU server while controlling the robot/simulation from a different machine
- Avoid dependency conflicts between the policy and environment code
- Easily switch between different policies without modifying environment code
Starting the policy server
Launch the server using the `run_gr00t_server.py` script:
Parameters
| Parameter | Description |
|---|---|
| `--embodiment-tag` | The embodiment tag for your robot |
| `--model-path` | Path to your trained model checkpoint directory |
| `--device` | Device to run inference on (`cuda:0`, `cuda:1`, `cpu`, etc.) |
| `--host` | Host address (`127.0.0.1` for local only, `0.0.0.0` to accept external connections) |
| `--port` | Port number (default: 5555) |
| `--strict` | Enable input/output validation (default: True) |
| `--use-sim-policy-wrapper` | Whether to use `Gr00tSimPolicyWrapper` for GR00T simulation environments |
Using the policy client
On the client side, use `PolicyClient` to connect to the server:
Running simulation evaluation
For simulation environments, use the `rollout_policy.py` script:
Parameters
| Parameter | Description |
|---|---|
| `--n_episodes` | Number of episodes to run |
| `--policy_client_host` | Host address of the policy server |
| `--policy_client_port` | Port number of the policy server |
| `--max_episode_steps` | Maximum number of steps per episode |
| `--env_name` | Name of the gym environment |
| `--n_action_steps` | Number of action steps to execute per inference |
| `--n_envs` | Number of parallel environments |
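The interaction between `--n_action_steps` and `--max_episode_steps` can be illustrated with a toy rollout loop. This is a sketch under assumptions, not the logic of `rollout_policy.py`: `ToyPolicy`, `ToyEnv`, and `run_episode` are hypothetical stand-ins, and the policy is assumed to return a chunk of actions per inference call, of which only the first `n_action_steps` are executed before querying again.

```python
class ToyPolicy:
    """Hypothetical policy that returns a chunk of actions per call."""

    def __init__(self, action_horizon):
        self.action_horizon = action_horizon
        self.calls = 0  # number of inference calls made so far

    def get_action(self, observation):
        self.calls += 1
        return [observation + i for i in range(self.action_horizon)]


class ToyEnv:
    """Hypothetical environment whose observation is its step counter."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t  # next observation


def run_episode(policy, env, max_episode_steps, n_action_steps):
    """Execute n_action_steps actions per inference until the step limit."""
    if n_action_steps > policy.action_horizon:
        raise ValueError("n_action_steps cannot exceed the action horizon")
    obs = env.reset()
    steps = 0
    while steps < max_episode_steps:
        chunk = policy.get_action(obs)         # one inference call
        for action in chunk[:n_action_steps]:  # execute part of the chunk
            obs = env.step(action)
            steps += 1
            if steps >= max_episode_steps:
                break
    return steps


policy = ToyPolicy(action_horizon=16)
steps = run_episode(policy, ToyEnv(), max_episode_steps=100, n_action_steps=8)
print(steps, policy.calls)
```

In this sketch, a smaller `n_action_steps` means more frequent re-planning from fresh observations at the cost of more inference calls per episode.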
Debugging with ReplayPolicy
When developing a new environment integration or debugging your inference loop, you can use `ReplayPolicy` to replay recorded actions from an existing dataset:
ReplayPolicy is an excellent first step when integrating a new environment. Debug with replay first, then switch to model inference once the pipeline is validated.
Call `policy.reset(options={"episode_index": N})` on the client to switch to a different episode.
If your environment is set up correctly, replaying ground-truth actions should achieve high (often 100%) success rates. Low success rates indicate issues with:
- Environment reset state not matching the dataset
- Observation preprocessing differences
- Action space mismatches
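One way to narrow down an action space mismatch before running full replays is to compare the recorded actions against the environment's action bounds. The sketch below is illustrative only: `check_actions` is a hypothetical helper, and it assumes the environment exposes per-dimension lower and upper bounds as plain lists.

```python
def check_actions(actions, low, high):
    """Return a list of problems found in recorded actions.

    actions: list of per-step action vectors from the dataset.
    low, high: per-dimension bounds of the environment's action space.
    This helper is hypothetical and not part of the GR00T codebase.
    """
    if len(low) != len(high):
        raise ValueError("low and high must have the same length")
    dim = len(low)
    problems = []
    for step, action in enumerate(actions):
        # A dimension mismatch makes the bounds check meaningless, so skip it.
        if len(action) != dim:
            problems.append(f"step {step}: expected {dim} dims, got {len(action)}")
            continue
        for i, (value, lo, hi) in enumerate(zip(action, low, high)):
            if not lo <= value <= hi:
                problems.append(f"step {step}, dim {i}: {value} outside [{lo}, {hi}]")
    return problems


recorded = [[0.0, 0.5], [2.0, 0.0], [0.1]]
for problem in check_actions(recorded, low=[-1.0, -1.0], high=[1.0, 1.0]):
    print(problem)
```

An empty result does not prove the action spaces match (units or joint ordering can still differ), but a non-empty one points to a concrete step and dimension to inspect.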
Available benchmarks
GR00T supports evaluation on several public benchmarks:
Zero-shot evaluation
- RoboCasa: General manipulation tasks
- RoboCasa GR1 Tabletop Tasks: GR1-specific tabletop manipulation
Fine-tuned evaluation
- G1 LocoManipulation: Whole-body control tasks
- LIBERO: Long-horizon manipulation benchmarks
- SimplerEnv: Google robot and WidowX environments
- BEHAVIOR: Household tasks with the Galaxea R1 Pro
- PointNav: Navigation tasks
- SO-100: Custom robot demonstrations
See the `examples/` directory for detailed setup instructions for each benchmark.