The closed-loop evaluation script (gr00t/eval/rollout_policy.py) executes policies in simulation environments and measures task completion success rates.

Usage

python gr00t/eval/rollout_policy.py \
  --env-name gr1_unified/PnPCanToDrawerClose_GR1ArmsAndWaistFourierHands_Env \
  --model-path checkpoints/checkpoint-5000 \
  --n-episodes 50 \
  --n-envs 8 \
  --n-action-steps 8 \
  --max-episode-steps 504

Parameters

env-name
str
required
Name of the gymnasium environment to evaluate. Must be registered in one of:
  • RoboCasa environments (prefix: robocasa_panda_omron/, gr1_unified/)
  • SimplerEnv environments (prefix: simpler_env_google/, simpler_env_widowx/)
  • LIBERO environments (prefix: libero_sim/)
  • BEHAVIOR environments (prefix: sim_behavior_r1_pro/)
  • GR00T LocoManip environments (prefix: gr00tlocomanip_g1_sim/)
model-path
str
default:""
Path to the model checkpoint directory. Required if not using policy-client-host. Example: checkpoints/checkpoint-5000
policy-client-host
str
default:""
Host address of the policy server. Use this with policy-client-port instead of model-path to connect to a remote policy server.
policy-client-port
int
default:"None"
Port number of the policy server. Required when using policy-client-host.
n-episodes
int
default:"50"
Number of episodes to run for evaluation.
n-envs
int
default:"8"
Number of parallel environments to run simultaneously. Automatically uses AsyncVectorEnv for n-envs > 1.
n-action-steps
int
default:"8"
Number of action steps to execute from each policy prediction. This is the execution horizon.
max-episode-steps
int
default:"504"
Maximum number of steps per episode before truncation.
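To see how n-action-steps and max-episode-steps interact: each policy inference produces a chunk of n-action-steps actions that are executed before the next inference, so an episode of max-episode-steps requires at most ceil(max_episode_steps / n_action_steps) policy calls. A minimal sketch (plain Python, independent of gr00t):

```python
import math

def max_inferences(max_episode_steps: int, n_action_steps: int) -> int:
    """Upper bound on policy calls per episode: each call returns a
    chunk of n_action_steps actions executed before the next call."""
    return math.ceil(max_episode_steps / n_action_steps)

# With the defaults above (504 steps, 8 actions per inference):
print(max_inferences(504, 8))  # → 63
```

Larger n-action-steps means fewer (and cheaper) policy calls per episode, at the cost of acting open-loop for longer between inferences.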

Outputs

The script outputs:
  • Success rate: Percentage of episodes that completed the task successfully
  • Episode info: Additional metrics like task progress, episode lengths, and environment-specific scores
  • Videos: Saved to /tmp/sim_eval_videos_{model_name}_ac{n_action_steps}_{uuid}/ (except for BEHAVIOR environments)

Example output

Running collecting 50 episodes for gr1_unified/PnPCanToDrawerClose with 8 vec envs
Episodes: 100%|██████████| 50/50 [12:34<00:00,  1.51s/it]
Collecting 50 episodes took 754.2 seconds
results: ('gr1_unified/PnPCanToDrawerClose', [True, True, False, ...], {...})
success rate: 0.76
Video saved to: /tmp/sim_eval_videos_checkpoint-5000_ac8_a1b2c3d4/
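The results tuple pairs the environment name with a per-episode success list and an info dict; the reported success rate is the mean of that list. A sketch of the aggregation, assuming the tuple layout shown in the example output above:

```python
def success_rate(results):
    # results mirrors the printed tuple:
    # (env_name, per-episode success booleans, extra info)
    env_name, successes, _info = results
    return sum(successes) / len(successes)

example = ("gr1_unified/PnPCanToDrawerClose", [True, True, False, True], {})
print(success_rate(example))  # → 0.75
```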

Supported environments

RoboCasa (GR1 and Panda)

python gr00t/eval/rollout_policy.py \
  --env-name gr1_unified/PnPCanToDrawerClose_GR1ArmsAndWaistFourierHands_Env \
  --model-path checkpoints/gr1_model

SimplerEnv (Google Robot and WidowX)

python gr00t/eval/rollout_policy.py \
  --env-name simpler_env_google/google_robot_pick_coke_can \
  --model-path checkpoints/google_robot_model \
  --n-action-steps 16

LIBERO (Panda manipulation)

python gr00t/eval/rollout_policy.py \
  --env-name libero_sim/LIVING_ROOM_SCENE2_put_both_the_alphabet_soup_and_the_tomato_sauce_in_the_basket \
  --model-path checkpoints/libero_model

BEHAVIOR (R1 Pro humanoid)

python gr00t/eval/rollout_policy.py \
  --env-name sim_behavior_r1_pro/picking_up_trash \
  --model-path checkpoints/behavior_model \
  --n-envs 4 \
  --max-episode-steps 1440

GR00T LocoManipulation (G1)

python gr00t/eval/rollout_policy.py \
  --env-name gr00tlocomanip_g1_sim/LMPnPAppleToPlateDC_G1_gear_wbc \
  --model-path checkpoints/g1_model

Using policy server

For distributed evaluation, start a policy server and connect to it:
# Terminal 1: Start the policy server
python gr00t/eval/run_gr00t_server.py \
  --model-path checkpoints/checkpoint-5000 \
  --port 5555

# Terminal 2: Run evaluation
python gr00t/eval/rollout_policy.py \
  --env-name gr1_unified/PnPCanToDrawerClose_GR1ArmsAndWaistFourierHands_Env \
  --policy-client-host 127.0.0.1 \
  --policy-client-port 5555 \
  --n-episodes 50 \
  --n-envs 8

Environment wrappers

The evaluation script automatically applies:

MultiStepWrapper

Executes multiple action steps from each policy prediction:
  • video_delta_indices: Controls temporal stacking of video observations
  • state_delta_indices: Controls temporal stacking of state observations
  • n_action_steps: Number of actions to execute per inference
  • max_episode_steps: Maximum steps before truncation
  • terminate_on_success: Whether to end episode immediately on task success
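The behavior described by these options can be sketched as follows. This is an illustrative stand-in, not gr00t's actual MultiStepWrapper (the real class wraps a gymnasium environment and also handles observation stacking via the delta indices); ToyEnv and the simplified step signature are assumptions for the sake of a self-contained example:

```python
class ToyEnv:
    """Hypothetical single-step env: succeeds on its 5th step."""
    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        return {"t": self.t}, 0.0, self.t == 5  # obs, reward, success

class MultiStepSketch:
    """Executes n_action_steps low-level actions per call, truncates at
    max_episode_steps, and optionally terminates early on success."""
    def __init__(self, env, n_action_steps=8, max_episode_steps=504,
                 terminate_on_success=True):
        self.env = env
        self.n_action_steps = n_action_steps
        self.max_episode_steps = max_episode_steps
        self.terminate_on_success = terminate_on_success
        self.elapsed = 0

    def step(self, action_chunk):
        assert len(action_chunk) == self.n_action_steps
        for action in action_chunk:
            obs, reward, success = self.env.step(action)
            self.elapsed += 1
            if success and self.terminate_on_success:
                return obs, reward, True, False   # terminated early
            if self.elapsed >= self.max_episode_steps:
                return obs, reward, False, True   # truncated
        return obs, reward, False, False

env = MultiStepSketch(ToyEnv(), n_action_steps=8, max_episode_steps=16)
obs, reward, terminated, truncated = env.step([None] * 8)
print(terminated)  # → True (success on the 5th low-level step)
```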

VideoRecordingWrapper (optional)

Records videos of episodes:
  • Videos saved to /tmp/sim_eval_videos_{model_name}_ac{n_action_steps}_{uuid}/
  • Configurable FPS, codec, and quality settings
  • Automatically disabled for BEHAVIOR environments

Embodiment detection

The script automatically determines the embodiment tag from the environment name prefix:
Environment Prefix → Embodiment Tag
  • robocasa_panda_omron/ → ROBOCASA_PANDA_OMRON
  • gr1_unified/, gr1/ → GR1
  • gr00tlocomanip_g1_sim/ → UNITREE_G1
  • simpler_env_google/ → OXE_GOOGLE
  • simpler_env_widowx/ → OXE_WIDOWX
  • libero_sim/ → LIBERO_PANDA
  • sim_behavior_r1_pro/ → BEHAVIOR_R1_PRO
For BEHAVIOR environments, video recording is automatically disabled to avoid conflicts with the simulator’s internal rendering.
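The prefix lookup above amounts to a simple mapping. A sketch transcribing the table; the mapping data comes from this page, but the helper function itself is illustrative, not gr00t's actual implementation:

```python
# Prefix-to-tag mapping, transcribed from the table above.
EMBODIMENT_TAGS = {
    "robocasa_panda_omron/": "ROBOCASA_PANDA_OMRON",
    "gr1_unified/": "GR1",
    "gr1/": "GR1",
    "gr00tlocomanip_g1_sim/": "UNITREE_G1",
    "simpler_env_google/": "OXE_GOOGLE",
    "simpler_env_widowx/": "OXE_WIDOWX",
    "libero_sim/": "LIBERO_PANDA",
    "sim_behavior_r1_pro/": "BEHAVIOR_R1_PRO",
}

def embodiment_tag(env_name: str) -> str:
    """Resolve an embodiment tag from the environment name's prefix."""
    for prefix, tag in EMBODIMENT_TAGS.items():
        if env_name.startswith(prefix):
            return tag
    raise ValueError(f"No embodiment tag for {env_name!r}")

print(embodiment_tag(
    "gr1_unified/PnPCanToDrawerClose_GR1ArmsAndWaistFourierHands_Env"
))  # → GR1
```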
