The open-loop evaluation script (`gr00t/eval/open_loop_eval.py`) compares policy predictions against ground-truth actions from demonstration datasets without actually executing the actions in an environment.
Usage
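The script is run directly with Python; a minimal sketch of the invocation shape (the available options are described in the parameter list below):

```shell
# General shape; see the parameter list below for what each option controls.
python gr00t/eval/open_loop_eval.py [OPTIONS]
```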
Parameters
- `--host`: Host to connect to when using a policy server instead of a local model.
- `--port`: Port to connect to when using a policy server instead of a local model.
- Max steps: Maximum number of steps to evaluate per trajectory, capped by the actual trajectory length.
- `--traj-ids`: List of trajectory IDs to evaluate from the dataset. Example: `--traj-ids 0 1 2 3` evaluates trajectories 0 through 3.
- `--action-horizon`: Action horizon for policy inference. The policy predicts this many future actions at each step.
- Dataset path: Path to the LeRobot-format dataset containing demonstration trajectories.
- Embodiment tag: Embodiment tag identifying the robot configuration. See embodiment tags for available options.
- Model path: Path to the model checkpoint directory. If not provided, the script connects to a policy server using host and port. Example: `checkpoints/checkpoint-1000`
- Denoising steps: Number of denoising steps to use during diffusion policy inference.
- Plot save path: Path to save trajectory comparison plots. If not provided, plots are saved to `/tmp/open_loop_eval/traj_{id}.jpeg`.
- `--modality-keys`: List of action modality keys to evaluate and plot. If None, all action keys in the dataset are evaluated. Example: `--modality-keys joint_positions gripper`

Outputs
The script outputs:

- MSE (Mean Squared Error): unnormalized mean squared error between predicted and ground-truth actions
- MAE (Mean Absolute Error): unnormalized mean absolute error between predicted and ground-truth actions
- Trajectory plots: visual comparisons of the state, ground-truth actions, and predicted actions, saved as JPEG files
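Both metrics are plain elementwise averages over every timestep and action dimension, computed in unnormalized (physical) units. A minimal sketch with NumPy (the array shapes here are illustrative assumptions):

```python
import numpy as np

# Assumed shapes: (timesteps, action_dim), already in unnormalized units.
ground_truth = np.array([[0.10, 0.50], [0.20, 0.55], [0.30, 0.60]])
predicted    = np.array([[0.12, 0.48], [0.18, 0.57], [0.33, 0.60]])

error = predicted - ground_truth
mse = np.mean(error ** 2)      # mean squared error over all steps and dims
mae = np.mean(np.abs(error))   # mean absolute error over all steps and dims
print(f"MSE={mse:.6f} MAE={mae:.6f}")
```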
Plot format
Each plot shows:

- State joints trajectory (if the action space matches the state space)
- Ground-truth actions from the demonstration
- Predicted actions from the policy
- Red dots indicating inference points (every `action-horizon` steps)
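The inference points follow directly from the action horizon: the policy is queried once, its predicted chunk covers the next `action-horizon` steps, and then it is queried again. A small sketch of the step indices involved (the concrete values are illustrative):

```python
action_horizon = 16   # example value for the --action-horizon parameter
traj_length = 100     # hypothetical trajectory length

# Steps at which the policy is queried (the red dots on the plots).
inference_points = list(range(0, traj_length, action_horizon))
print(inference_points)  # → [0, 16, 32, 48, 64, 80, 96]
```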
Example workflows
Evaluate with local model
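A representative invocation with a local checkpoint. The flag names and paths below are assumptions inferred from the parameter descriptions above, not the script's verified interface; consult the script's `--help` output for the exact flags:

```shell
# Flag names are assumptions; <path-to-lerobot-dataset> is a placeholder.
python gr00t/eval/open_loop_eval.py \
    --model-path checkpoints/checkpoint-1000 \
    --dataset-path <path-to-lerobot-dataset> \
    --traj-ids 0 1 2
```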
Evaluate with policy server
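When no model path is given, the script connects to a running policy server instead. A sketch under the same caveat that the flag names and the port value are assumptions:

```shell
# Flag names and port are assumptions; <path-to-lerobot-dataset> is a placeholder.
python gr00t/eval/open_loop_eval.py \
    --host localhost \
    --port 5555 \
    --dataset-path <path-to-lerobot-dataset>
```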
Evaluate specific action modalities
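To restrict evaluation and plotting to particular action keys, pass `--modality-keys` (this flag appears in the parameter list above; the other flag names remain assumptions):

```shell
# --modality-keys limits which action dimensions are evaluated and plotted.
python gr00t/eval/open_loop_eval.py \
    --model-path checkpoints/checkpoint-1000 \
    --dataset-path <path-to-lerobot-dataset> \
    --modality-keys joint_positions gripper
```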
Implementation details
The evaluation process:

1. Loads the dataset using `LeRobotEpisodeLoader`
2. For each trajectory:
   - Extracts observations at intervals of `action-horizon` steps
   - Runs policy inference to predict action chunks
   - Compares predicted actions against ground truth
   - Computes MSE and MAE metrics
   - Generates comparison plots
3. Reports average metrics across all trajectories
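The per-trajectory loop above can be sketched as follows. The dataset layout, the policy interface, and all names here are simplified stand-ins for illustration, not the script's actual classes:

```python
import numpy as np

def open_loop_eval(policy, trajectory, action_horizon):
    """Compare chunked policy predictions against ground-truth actions.

    trajectory: dict with "observations" (length T) and "actions" (T, act_dim).
    policy: callable mapping one observation to an (action_horizon, act_dim) chunk.
    """
    gt = trajectory["actions"]
    preds = np.zeros_like(gt)
    # Query the policy only at inference points, one action chunk per query.
    for t in range(0, len(gt), action_horizon):
        chunk = policy(trajectory["observations"][t])
        preds[t:t + action_horizon] = chunk[: len(gt) - t]
    err = preds - gt
    return {"mse": float(np.mean(err ** 2)), "mae": float(np.mean(np.abs(err)))}

# Toy check: a "policy" that looks up the demonstration itself scores zero error.
rng = np.random.default_rng(0)
traj = {"observations": np.arange(10), "actions": rng.normal(size=(10, 2))}

def perfect(obs_t, horizon=4):
    t = int(obs_t)  # observations are just step indices in this toy example
    return traj["actions"][t:t + horizon]

print(open_loop_eval(perfect, traj, action_horizon=4))  # → {'mse': 0.0, 'mae': 0.0}
```

Note that the policy is queried only every `action-horizon` steps; the intermediate actions come from the previously predicted chunk, which is exactly why the plots mark those query steps with red dots.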
Open-loop evaluation measures prediction accuracy but doesn’t account for compounding errors that occur during closed-loop execution in simulation or on real robots.