The open-loop evaluation script (gr00t/eval/open_loop_eval.py) compares policy predictions against ground truth actions from demonstration datasets without actually executing the actions in an environment.
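At its core, this means each predicted action chunk is scored directly against the demonstration's recorded actions. A minimal sketch of that comparison (a hypothetical helper, not the script's actual API; arrays are assumed to have shape (horizon, action_dim)):

```python
import numpy as np

def open_loop_error(predicted, ground_truth):
    """Score a predicted action chunk against ground-truth demonstration
    actions. No environment step is taken, so errors from earlier
    predictions never feed back into later observations."""
    err = np.asarray(predicted) - np.asarray(ground_truth)
    return {"mse": float(np.mean(err ** 2)),
            "mae": float(np.mean(np.abs(err)))}
```

Because the metrics are computed element-wise over the whole chunk, a policy that is accurate early but drifts late in the horizon is penalized the same as one with uniform error.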

Usage

python gr00t/eval/open_loop_eval.py \
  --dataset-path demo_data/cube_to_bowl_5/ \
  --model-path checkpoints/checkpoint-1000 \
  --traj-ids 0 1 2 \
  --steps 200 \
  --action-horizon 16

Parameters

host (str, default: "127.0.0.1")
Host to connect to when using a policy server instead of a local model.

port (int, default: 5555)
Port to connect to when using a policy server instead of a local model.

steps (int, default: 200)
Maximum number of steps to evaluate per trajectory; capped by the actual trajectory length.

traj-ids (list[int], default: [0])
List of trajectory IDs to evaluate from the dataset. Example: --traj-ids 0 1 2 3 evaluates trajectories 0 through 3.

action-horizon (int, default: 16)
Action horizon for policy inference. The policy predicts this many future actions at each step.

dataset-path (str, default: "demo_data/cube_to_bowl_5/")
Path to the LeRobot-format dataset containing demonstration trajectories.

embodiment-tag (EmbodimentTag, default: NEW_EMBODIMENT)
Embodiment tag identifying the robot configuration. See embodiment tags for available options.

model-path (str, default: None)
Path to the model checkpoint directory. If not provided, the script connects to a policy server using host and port. Example: checkpoints/checkpoint-1000

denoising-steps (int, default: 4)
Number of denoising steps used during diffusion policy inference.

save-plot-path (str, default: None)
Path to save trajectory comparison plots. If not provided, plots are saved to /tmp/open_loop_eval/traj_{id}.jpeg.

modality-keys (list[str], default: None)
List of action modality keys to evaluate and plot. If None, all action keys in the dataset are evaluated. Example: --modality-keys joint_positions gripper
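For reference, the documented flags and defaults correspond to a CLI surface along these lines (an argparse sketch for illustration; the actual script may use a different argument-parsing library):

```python
import argparse

def build_parser():
    """Hypothetical argparse mirror of the documented open_loop_eval flags."""
    p = argparse.ArgumentParser(description="Open-loop policy evaluation")
    p.add_argument("--host", type=str, default="127.0.0.1")
    p.add_argument("--port", type=int, default=5555)
    p.add_argument("--steps", type=int, default=200)
    p.add_argument("--traj-ids", type=int, nargs="+", default=[0])
    p.add_argument("--action-horizon", type=int, default=16)
    p.add_argument("--dataset-path", type=str, default="demo_data/cube_to_bowl_5/")
    p.add_argument("--embodiment-tag", type=str, default="NEW_EMBODIMENT")
    p.add_argument("--model-path", type=str, default=None)
    p.add_argument("--denoising-steps", type=int, default=4)
    p.add_argument("--save-plot-path", type=str, default=None)
    p.add_argument("--modality-keys", type=str, nargs="+", default=None)
    return p
```

Note that list-valued flags such as --traj-ids take space-separated values, matching the usage examples in this page.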

Outputs

The script outputs:
  • MSE (Mean Squared Error): Unnormalized mean squared error between predicted and ground truth actions
  • MAE (Mean Absolute Error): Unnormalized mean absolute error between predicted and ground truth actions
  • Trajectory plots: Visual comparison of state, ground truth actions, and predicted actions saved as JPEG files

Plot format

Each plot shows:
  • State joints trajectory (if action space matches state space)
  • Ground truth actions from the demonstration
  • Predicted actions from the policy
  • Red dots indicating inference points (every action-horizon steps)
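A simplified version of such a plot could be produced as follows (hypothetical function name and layout, assuming one subplot per action dimension and red dots at each inference point):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; no display required
import matplotlib.pyplot as plt

def plot_comparison(gt, pred, action_horizon, path, state=None):
    """Plot ground-truth vs. predicted actions, one subplot per dimension.

    gt, pred: arrays of shape (steps, action_dim); state is optional and
    only overlaid when the state space matches the action space.
    """
    n_dims = gt.shape[1]
    fig, axes = plt.subplots(n_dims, 1, figsize=(8, 2 * n_dims), squeeze=False)
    # Steps at which the policy was queried (every action_horizon steps).
    inference_points = np.arange(0, len(gt), action_horizon)
    for d in range(n_dims):
        ax = axes[d, 0]
        if state is not None:
            ax.plot(state[:, d], label="state")
        ax.plot(gt[:, d], label="ground truth")
        ax.plot(pred[:, d], label="predicted")
        ax.plot(inference_points, pred[inference_points, d], "r.",
                label="inference point")
        ax.legend(loc="upper right", fontsize=7)
    fig.savefig(path)
    plt.close(fig)
```

The red markers make it easy to spot discontinuities at chunk boundaries, where the policy re-plans from a fresh observation.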

Example workflows

Evaluate with local model

python gr00t/eval/open_loop_eval.py \
  --model-path checkpoints/checkpoint-5000 \
  --dataset-path datasets/pick_place_demos \
  --traj-ids 0 1 2 3 4 \
  --steps 300 \
  --action-horizon 16 \
  --embodiment-tag GR1

Evaluate with policy server

# First, start the policy server
python gr00t/eval/run_gr00t_server.py \
  --model-path checkpoints/checkpoint-5000 \
  --port 5555

# Then run evaluation
python gr00t/eval/open_loop_eval.py \
  --host 127.0.0.1 \
  --port 5555 \
  --dataset-path datasets/pick_place_demos \
  --traj-ids 0 1 2

Evaluate specific action modalities

python gr00t/eval/open_loop_eval.py \
  --model-path checkpoints/checkpoint-5000 \
  --dataset-path datasets/bimanual_demos \
  --modality-keys left_arm_joints right_arm_joints \
  --save-plot-path results/trajectory_comparison.jpeg

Implementation details

The evaluation process:
  1. Loads the dataset using LeRobotEpisodeLoader
  2. For each trajectory:
    • Extracts observations at intervals of action-horizon steps
    • Runs policy inference to predict action chunks
    • Compares predicted actions against ground truth
    • Computes MSE and MAE metrics
    • Generates comparison plots
  3. Reports average metrics across all trajectories
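The per-trajectory loop above can be sketched roughly as follows (a hypothetical helper, not the repo's actual interface; policy(obs) is assumed to return an (action_horizon, action_dim) chunk):

```python
import numpy as np

def evaluate_trajectory(policy, observations, gt_actions,
                        action_horizon=16, steps=200):
    """Open-loop evaluation of one trajectory.

    Queries the policy every action_horizon steps, stitches the predicted
    chunks into a full trajectory, and scores it against ground truth.
    """
    steps = min(steps, len(gt_actions))  # cap by trajectory length
    chunks = []
    for t in range(0, steps, action_horizon):  # inference points
        chunk = policy(observations[t])
        chunks.append(chunk[: steps - t])      # trim the final partial chunk
    pred = np.concatenate(chunks)[:steps]
    err = pred - np.asarray(gt_actions)[:steps]
    return {"mse": float(np.mean(err ** 2)),
            "mae": float(np.mean(np.abs(err)))}
```

Note that between inference points the observations fed to the policy come from the demonstration, not from the policy's own predictions, which is what makes the evaluation open-loop.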
Open-loop evaluation measures prediction accuracy but doesn’t account for compounding errors that occur during closed-loop execution in simulation or on real robots.
