We recommend a two-stage evaluation approach: an offline open-loop evaluation against ground-truth data, followed by a closed-loop evaluation in a live environment or simulation, to comprehensively assess model quality.

Open-loop evaluation

Open-loop evaluation provides an offline assessment by comparing the model’s predicted actions against ground truth data from your dataset.

Running the evaluation

Execute the evaluation script with your trained model:
python gr00t/eval/open_loop_eval.py \
    --dataset-path <DATASET_PATH> \
    --embodiment-tag NEW_EMBODIMENT \
    --model-path <CHECKPOINT_PATH> \
    --traj-ids 0 \
    --action-horizon 16 \
    --steps 400 \
    --modality-keys single_arm gripper

Parameters

  • --dataset-path: Path to your dataset in LeRobot format
  • --embodiment-tag: Embodiment tag for your robot
  • --model-path: Path to the trained model checkpoint
  • --traj-ids: List of trajectory IDs to evaluate
  • --action-horizon: Action horizon (must be within the delta_indices of the action modality config)
  • --steps: Maximum number of steps to evaluate
  • --modality-keys: List of modality keys to plot

Interpreting results

The evaluation generates a visualization saved at /tmp/open_loop_eval/traj_{traj_id}.jpeg, which includes:
  • Ground truth actions vs. predicted actions
  • Unnormalized mean squared error (MSE) metrics
  • Unnormalized mean absolute error (MAE) metrics
These plots provide a quick indicator of the policy’s accuracy on the training dataset distribution.
The evaluation script outputs both MSE and MAE metrics in the console for each trajectory, as well as average metrics across all evaluated trajectories.
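The reported metrics reduce to simple means over the unnormalized per-step action errors. A minimal sketch of that computation (the function name, shapes, and example values are illustrative assumptions, not the script's actual API):

```python
def action_errors(pred, gt):
    """Unnormalized MSE and MAE over per-step action vectors.

    pred, gt: lists of action vectors (one per step), in unnormalized units.
    """
    # Flatten the elementwise differences across all steps and dimensions.
    diffs = [p - g for pv, gv in zip(pred, gt) for p, g in zip(pv, gv)]
    n = len(diffs)
    mse = sum(d * d for d in diffs) / n
    mae = sum(abs(d) for d in diffs) / n
    return mse, mae

# Hypothetical 2-step, 2-dim trajectory for illustration.
pred = [[0.1, 0.2], [0.3, 0.4]]
gt = [[0.0, 0.2], [0.3, 0.5]]
mse, mae = action_errors(pred, gt)
```

Per-trajectory values like these are what the script averages to produce the aggregate metrics across all evaluated trajectories.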

Closed-loop evaluation

After validating performance through open-loop evaluation, test your model in closed-loop environments.

Server-client architecture

GR00T uses a server-client architecture for closed-loop evaluation, which allows you to:
  • Run policy inference on a GPU server while controlling the robot/simulation from a different machine
  • Avoid dependency conflicts between the policy and environment code
  • Easily switch between different policies without modifying environment code

Starting the policy server

Launch the server using the run_gr00t_server.py script:
python gr00t/eval/run_gr00t_server.py \
    --embodiment-tag NEW_EMBODIMENT \
    --model-path <CHECKPOINT_PATH> \
    --device cuda:0 \
    --host 0.0.0.0 \
    --port 5555

Parameters

  • --embodiment-tag: The embodiment tag for your robot
  • --model-path: Path to your trained model checkpoint directory
  • --device: Device to run inference on (cuda:0, cuda:1, cpu, etc.)
  • --host: Host address (127.0.0.1 for local only, 0.0.0.0 to accept external connections)
  • --port: Port number (default: 5555)
  • --strict: Enable input/output validation (default: True)
  • --use-sim-policy-wrapper: Whether to use Gr00tSimPolicyWrapper for GR00T simulation environments

Using the policy client

On the client side, use PolicyClient to connect to the server:
from gr00t.policy.server_client import PolicyClient

# Connect to the policy server
policy = PolicyClient(
    host="localhost",  # or IP address of your GPU server
    port=5555,
    timeout_ms=15000,
)

# Verify connection
if not policy.ping():
    raise RuntimeError("Cannot connect to policy server!")

# Use just like a regular policy
observation = get_observation()  # Your observation in Policy API format
action, info = policy.get_action(observation)
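In a closed loop, the client typically requests an action chunk, executes part of it, then queries again with a fresh observation. The driver below sketches that pattern with a stub standing in for PolicyClient; the stub, the placeholder observation, and the chunk shape are assumptions for illustration only:

```python
class StubPolicy:
    """Stand-in for PolicyClient; returns a fixed chunk of 8 zero actions."""

    def get_action(self, observation):
        # A real policy returns a chunk of future actions (the action horizon)
        # plus an info dict.
        return [[0.0] * 6 for _ in range(8)], {}

def run_episode(policy, n_steps=24, n_action_steps=8):
    """Execute n_action_steps from each predicted chunk until n_steps total."""
    executed = 0
    while executed < n_steps:
        obs = {"state": [0.0] * 6}  # placeholder for a real observation
        chunk, _info = policy.get_action(obs)
        for action in chunk[:n_action_steps]:
            # A real loop would send `action` to the robot or simulator here.
            executed += 1
            if executed >= n_steps:
                break
    return executed

steps = run_episode(StubPolicy())
```

Executing only part of each chunk before re-querying trades inference frequency against reaction latency; the --n_action_steps flag in the rollout script controls the same trade-off.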

Running simulation evaluation

For simulation environments, use the rollout_policy.py script:
python gr00t/eval/rollout_policy.py \
    --n_episodes 50 \
    --policy_client_host 127.0.0.1 \
    --policy_client_port 5555 \
    --max_episode_steps 720 \
    --env_name libero_sim/LIVING_ROOM_SCENE2_put_soup_in_basket \
    --n_action_steps 8 \
    --n_envs 8

Parameters

  • --n_episodes: Number of episodes to run
  • --policy_client_host: Host address of the policy server
  • --policy_client_port: Port number of the policy server
  • --max_episode_steps: Maximum number of steps per episode
  • --env_name: Name of the gym environment
  • --n_action_steps: Number of action steps to execute per inference
  • --n_envs: Number of parallel environments
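The headline number from a rollout run is the success rate over episodes. A tiny sketch of that aggregation (the per-episode record format is an assumption, not the script's actual output):

```python
# Hypothetical per-episode results collected from 4 rollouts.
episodes = [
    {"success": True, "steps": 310},
    {"success": False, "steps": 720},  # hit max_episode_steps
    {"success": True, "steps": 415},
    {"success": True, "steps": 288},
]

success_rate = sum(e["success"] for e in episodes) / len(episodes)
mean_steps = sum(e["steps"] for e in episodes) / len(episodes)
```

With parallel environments (--n_envs), episodes complete out of order, but the aggregation is the same once all n_episodes results are collected.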

Debugging with ReplayPolicy

When developing a new environment integration or debugging your inference loop, you can use ReplayPolicy to replay recorded actions from an existing dataset:
# Start server with ReplayPolicy
python gr00t/eval/run_gr00t_server.py \
    --dataset-path <DATASET_PATH> \
    --embodiment-tag NEW_EMBODIMENT \
    --execution-horizon 8  # should match the executed action horizon in the environment
ReplayPolicy is an excellent first step when integrating a new environment. Debug with replay first, then switch to model inference once the pipeline is validated.
The server will replay actions from the first episode of the dataset. Use policy.reset(options={"episode_index": N}) on the client to switch to a different episode. If your environment is set up correctly, replaying ground-truth actions should achieve high (often 100%) success rates. Low success rates indicate issues with:
  • Environment reset state not matching the dataset
  • Observation preprocessing differences
  • Action space mismatches
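The replay workflow above can be exercised end to end from the client side. The sketch below mimics the reset/get_action interface with a stub so the episode-switching pattern is visible; the stub class and its episode data are invented for illustration and are not the real ReplayPolicy:

```python
class StubReplayPolicy:
    """Stand-in for a replay policy: steps through recorded actions per episode."""

    def __init__(self, episodes):
        self.episodes = episodes  # episode_index -> list of recorded actions
        self.idx = 0
        self.t = 0

    def reset(self, options=None):
        # Mirrors policy.reset(options={"episode_index": N}) on the client.
        self.idx = (options or {}).get("episode_index", 0)
        self.t = 0

    def get_action(self, observation):
        actions = self.episodes[self.idx]
        action = actions[min(self.t, len(actions) - 1)]  # hold last action at end
        self.t += 1
        return action, {}

policy = StubReplayPolicy({0: [[0.0]], 1: [[1.0], [2.0]]})
policy.reset(options={"episode_index": 1})  # switch to episode 1
a0, _ = policy.get_action({})
a1, _ = policy.get_action({})
```

If feeding replayed ground-truth actions into your environment does not reproduce the recorded behavior, check the three failure causes listed above before suspecting the model.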

Available benchmarks

GR00T supports evaluation on several public benchmarks:

Zero-shot evaluation

  • RoboCasa: General manipulation tasks
  • RoboCasa GR1 Tabletop Tasks: GR1-specific tabletop manipulation

Fine-tuned evaluation

  • G1 LocoManipulation: Whole-body control tasks
  • LIBERO: Long-horizon manipulation benchmarks
  • SimplerEnv: Google robot and WidowX environments
  • BEHAVIOR: Household tasks with the Galaxea R1 Pro
  • PointNav: Navigation tasks
  • SO-100: Custom robot demonstrations
Refer to the examples/ directory for detailed setup instructions for each benchmark.
