## Installation issues

### CUDA version compatibility
If you’re using a different CUDA version:

- Make sure to install a compatible version of `flash-attn` manually
- For CUDA 11.8, `flash-attn==2.8.2` has been confirmed to work
- For RTX 5090, use CUDA 12.8 with `flash-attn==2.8.0.post2` and `pytorch-cu128`
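For reference, the confirmed combinations above can be captured in a small lookup table. This is an illustrative sketch, not part of the project; the helper and table names are hypothetical:

```python
# Hypothetical helper mapping the CUDA versions mentioned above to the
# flash-attn pins confirmed in this guide. Illustrative only.
CONFIRMED_PINS = {
    "11.8": "flash-attn==2.8.2",
    "12.8": "flash-attn==2.8.0.post2",  # RTX 5090, paired with pytorch-cu128
}

def flash_attn_pin(cuda_version: str) -> str:
    """Return a confirmed flash-attn pin for a CUDA version, or raise."""
    pin = CONFIRMED_PINS.get(cuda_version)
    if pin is None:
        raise ValueError(
            f"No confirmed flash-attn pin for CUDA {cuda_version}; "
            "install a compatible version manually."
        )
    return pin
```

For other CUDA versions, check the flash-attn release notes for a compatible build before pinning.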
### uv version too old
**Error:** `uv` fails to parse `[tool.uv.extra-build-dependencies]` in `pyproject.toml`

**Solution:** Upgrade to uv v0.8.4 or later, e.g. `uv self update` or `pip install --upgrade uv`.
### Submodules not initialized
**Error:** Missing dependencies or import errors for submodule packages

**Solution:** Initialize submodules if you didn’t clone with `--recurse-submodules`, e.g. `git submodule update --init --recursive`.
## Docker issues

### GPU not detected in container
**Symptoms:** CUDA not available when running `nvidia-smi` inside the container

**Solutions:**

1. Verify the NVIDIA Container Toolkit is installed, e.g. `nvidia-ctk --version`.
2. Restart the Docker daemon: `sudo systemctl restart docker`.
3. Test GPU access, e.g. `docker run --rm --gpus all ubuntu nvidia-smi`.
### Permission errors with Docker
**Error:** Permission denied when running Docker commands

**Solution:** Add your user to the `docker` group with `sudo usermod -aG docker $USER`, then log out and back in for the change to take effect.
### Build failures
**Symptoms:** Docker build fails with disk space or network errors

**Solutions:**

1. Check disk space: `df -h`.
2. Clean the Docker cache: `docker system prune`.
3. Rebuild without cache by passing `--no-cache` to `docker build`.
## Policy and inference issues

### RuntimeError: Cannot connect to policy server
**Error:** `RuntimeError("Cannot connect to policy server!")`

**Solutions:**

1. Verify the server is running.
2. Check that the host and port match between server and client.
3. Try using the IP address instead of the hostname.
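To distinguish "server not running" from "wrong host or port", a quick TCP reachability check can help. A minimal sketch; the host and port shown in the comment are placeholders for your own server settings:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False

# Example (placeholder address): can_reach("127.0.0.1", 5555)
```

If this returns False for the address the client is using but True for another, the server and client configurations disagree.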
### Observation format mismatch
**Error:** Shape or dtype errors when calling `policy.get_action()`

**Solution:** Enable strict mode and check the modality configs.
### Data type errors
**Error:** `TypeError: expected dtype uint8 but got float32` or similar

**Solution:** Ensure correct data types:

- Videos must be `np.uint8` arrays with RGB pixel values in the range [0, 255]
- States must be `np.float32` arrays
- Language instructions are lists of lists of strings
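The rules above can be enforced with a small coercion helper before calling the policy. This is an illustrative sketch, not the project's API; the heuristic of rescaling floats in [0, 1] is an assumption about how normalized pixels are usually stored:

```python
import numpy as np

def coerce_observation(video, state, instruction):
    """Coerce raw inputs into the dtypes described above (illustrative only)."""
    video = np.asarray(video)
    if video.dtype != np.uint8:
        # Assumption: floats in [0, 1] are normalized pixels; rescale first.
        if np.issubdtype(video.dtype, np.floating) and video.max() <= 1.0:
            video = video * 255.0
        video = np.clip(video, 0, 255).astype(np.uint8)

    state = np.asarray(state, dtype=np.float32)

    # Language instructions: a list of lists of strings.
    if isinstance(instruction, str):
        instruction = [[instruction]]

    return video, state, instruction
```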
### TensorRT errors
**Error:** `RuntimeError: CUDA not available for TensorRT` or engine loading fails

**Solutions:**

1. Verify CUDA is available, e.g. `python -c "import torch; print(torch.cuda.is_available())"`.
2. Check the TensorRT installation, e.g. `python -c "import tensorrt; print(tensorrt.__version__)"`.
3. Rebuild the TensorRT engine if it was built on a different system; engines are not portable across GPU architectures or TensorRT versions.
## Training issues

### Out of memory (OOM) errors
**Symptoms:** CUDA out of memory during training

**Solutions:**

1. Reduce the batch size.
2. Reduce the number of dataloader workers.
3. Optimize dataloading parameters.
4. Use gradient accumulation (if supported).
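If gradient accumulation is available, the arithmetic for keeping the effective batch size constant while lowering the per-device batch is simple. A sketch with illustrative parameter names (not the trainer's actual flags):

```python
def accumulation_steps(target_effective_batch: int,
                       per_device_batch: int,
                       num_gpus: int = 1) -> int:
    """Steps needed so per_device_batch * num_gpus * steps == target_effective_batch."""
    per_step = per_device_batch * num_gpus
    if target_effective_batch % per_step != 0:
        raise ValueError("target batch must be divisible by the per-step batch")
    return target_effective_batch // per_step

# e.g. keep an effective batch of 128 after dropping per-GPU batch to 16 on 2 GPUs:
# accumulation_steps(128, 16, 2) -> 4
```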
### Training variance
**Issue:** Performance varies significantly between training runs with the same configuration

Users may observe some variance in post-training results across runs, even when using the same configuration, seed, and dropout settings. In our experiments, we have observed performance differences as large as 5-6% between runs.

Possible causes:

- Non-deterministic operations in image augmentations
- Stochastic components in the training pipeline
- Hardware differences

Recommendations:

- Run multiple training runs and select the best checkpoint
- Use validation metrics to track performance
- Keep this inherent variance in mind when comparing to reported benchmarks
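Selecting the best of several runs, as recommended above, amounts to an argmax over validation metrics. A trivial sketch with hypothetical checkpoint names:

```python
# Sketch: pick the checkpoint with the best validation metric.
# Checkpoint identifiers and metric values here are hypothetical.
def best_checkpoint(val_metrics: dict) -> str:
    """Return the checkpoint id with the highest validation metric."""
    return max(val_metrics, key=val_metrics.get)
```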
### Slow dataloading
**Issue:** Training is bottlenecked by data loading

**Solutions:**

1. Increase the number of workers.
2. Increase the shard size for better IID sampling.
3. Reduce the episode sampling rate for more IID sampling.
## Simulation evaluation issues

### EGL/GLX errors
**Error:** Failed to initialize EGL or GLX libraries

**Solution:** Ensure the necessary graphics libraries are installed (e.g. `libegl1` and `libgl1` on Debian/Ubuntu).

### Environment prefix not found
**Error:** `ValueError: Unknown environment prefix`

**Solution:** Register your environment prefix in `gr00t/eval/sim/env_utils.py`.
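Conceptually, the registration is a prefix-to-embodiment lookup. The sketch below is hypothetical; the actual table and function names in `gr00t/eval/sim/env_utils.py` may differ:

```python
# Hypothetical prefix table; the real one lives in gr00t/eval/sim/env_utils.py.
ENV_PREFIX_TO_EMBODIMENT = {
    "libero_sim": "LIBERO_PANDA",  # note: prefix and embodiment tag often differ
    # "my_env": "MY_EMBODIMENT",   # <- register your own prefix here
}

def resolve_embodiment(env_name: str) -> str:
    """Map an env_name to its embodiment tag by longest-known prefix match."""
    for prefix, tag in ENV_PREFIX_TO_EMBODIMENT.items():
        if env_name.startswith(prefix):
            return tag
    raise ValueError(f"Unknown environment prefix: {env_name}")
```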
The `env_name` prefix and the `EmbodimentTag` value are often different. For example, `libero_sim` maps to `EmbodimentTag.LIBERO_PANDA`.

### Low success rate with ReplayPolicy
**Issue:** Ground-truth action replay achieves low success rates

Possible causes:

- Environment reset state not matching the dataset
- Observation preprocessing differences
- Action space mismatches
- Execution horizon mismatch
## Data preparation issues

### Video decoding errors
**Error:** `RuntimeError: Simulated decode error` or video file corruption

**Solutions:**

1. Verify video file integrity, e.g. `ffprobe <video>`.
2. Re-encode the video with compatible settings, e.g. `ffmpeg -i in.mp4 -c:v libx264 -pix_fmt yuv420p out.mp4`.
3. Check the video backend.
### Dataset conversion issues
**Error:** Missing episodes or incorrect data format after conversion

**Solutions:**

1. Verify the source dataset structure.
2. Check the conversion script output for errors.
3. Validate the converted dataset.
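One simple validation, sketched below with hypothetical helper names, is to compare episode IDs before and after conversion:

```python
# Hypothetical validation sketch: flag episodes lost or invented by conversion.
def validate_episode_count(source_episodes, converted_episodes):
    missing = set(source_episodes) - set(converted_episodes)
    if missing:
        raise RuntimeError(f"Missing episodes after conversion: {sorted(missing)}")
    extra = set(converted_episodes) - set(source_episodes)
    if extra:
        raise RuntimeError(f"Unexpected episodes after conversion: {sorted(extra)}")
```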
## Getting help
If you’re still experiencing issues:

1. **Check existing issues:** Search GitHub issues to see if your problem has been reported.
2. **Verify environment:** Run the verification script.
3. **Enable debug logging:** Add verbose flags to scripts.
4. **Report a bug:** If you’ve found a new issue, open a GitHub issue with:
   - A clear description of the problem
   - Steps to reproduce
   - Error messages and stack traces
   - Environment information (CUDA version, GPU model, OS, etc.)