Overview
This script trains a reinforcement learning policy that learns to modulate gait parameters (step height, step length, cycle time, body height) in addition to applying residual joint corrections. This enables the quadruped robot to adapt to rough terrain by adjusting both high-level gait characteristics and low-level motor commands.

Key Features:
- 16D action space (4 gait parameter deltas + 12 joint residuals)
- 69D observation space (includes current gait parameters)
- PPO algorithm with parallel environment training
- Automatic checkpointing and TensorBoard logging
- VecNormalize for observation and reward normalization
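The 16D action layout described above can be sketched as a simple split. The slice order and gait-parameter names below are assumptions for illustration, not taken from the source:

```python
# Sketch of the 16D action split: the first 4 entries are gait-parameter
# deltas, the remaining 12 are per-joint residual corrections.
# The ordering of the gait parameters is an assumption.
GAIT_PARAM_NAMES = ["step_height", "step_length", "cycle_time", "body_height"]
NUM_JOINTS = 12

def split_action(action):
    """Split a 16D policy action into gait deltas and joint residuals."""
    assert len(action) == len(GAIT_PARAM_NAMES) + NUM_JOINTS
    gait_deltas = dict(zip(GAIT_PARAM_NAMES, action[:4]))
    joint_residuals = action[4:]
    return gait_deltas, joint_residuals

deltas, residuals = split_action([0.01, -0.02, 0.0, 0.005] + [0.0] * 12)
```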
Usage
Training is configured entirely through the TrainingConfig dataclass. To modify training settings, edit the dataclass in the source file.
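A hypothetical sketch of what the TrainingConfig dataclass might look like. The field names and the defaults not stated in this document (gamma, gae_lambda, ent_coef, clip_range, max_grad_norm, net_size, run_name) are inferred from the parameters described below, not copied from the source file:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical reconstruction of TrainingConfig; names and several
# defaults are assumptions based on the documented parameters.
@dataclass
class TrainingConfig:
    total_timesteps: int = 30_000_000   # total environment steps
    n_envs: int = 84                    # parallel environments
    vec_env_type: str = "subproc"       # "subproc" or "dummy"
    n_steps: int = 4096                 # steps per env per policy update
    batch_size: int = 2048              # minibatch size
    learning_rate: float = 3e-4         # Adam learning rate
    gamma: float = 0.99                 # discount factor (assumed)
    gae_lambda: float = 0.95            # GAE lambda (assumed)
    n_epochs: int = 10                  # epochs per rollout buffer
    ent_coef: float = 0.01              # entropy coefficient (assumed)
    clip_range: float = 0.2             # PPO clip epsilon (assumed)
    max_grad_norm: float = 0.5          # gradient clipping (assumed)
    net_size: str = "medium"            # "small", "medium", or "large"
    run_name: str = "adaptive_gait"     # base name for the run (assumed)
    checkpoint_freq: int = 500_000      # save every N env steps
    device: str = "auto"                # "auto", "cpu", or "cuda"
    seed: Optional[int] = None          # None -> system randomness

config = TrainingConfig(n_envs=8)  # any field can be overridden this way
```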
Training Configuration
All training parameters are controlled via the TrainingConfig dataclass:
Duration Settings
Total number of environment steps to train for (30 million)
Parallelization Settings
Number of parallel environments for training. Higher values increase throughput but require more CPU cores and memory.
Vectorized environment backend:
- "subproc": Uses multiprocessing for true parallelism (recommended)
- "dummy": Sequential execution for debugging
PPO Hyperparameters
Number of steps to collect per environment before each policy update. Total rollout buffer size = n_steps * n_envs.
Minibatch size for each gradient descent step during policy optimization.
Learning rate for the Adam optimizer (3e-4).
Discount factor for future rewards.
Lambda parameter for Generalized Advantage Estimation (GAE).
Number of epochs to train on each rollout buffer.
Entropy coefficient for exploration bonus.
Clipping range for PPO objective (epsilon in the PPO paper).
Maximum gradient norm for gradient clipping.
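The hyperparameters above can be gathered as keyword arguments in the form stable-baselines3's PPO constructor accepts. This is a sketch: the values for gamma, gae_lambda, ent_coef, clip_range, and max_grad_norm are illustrative assumptions, since the document states only their meanings, not their defaults:

```python
# PPO hyperparameters as constructor keyword arguments.
# Values not stated in the docs (gamma, gae_lambda, ent_coef,
# clip_range, max_grad_norm) are illustrative assumptions.
ppo_kwargs = {
    "n_steps": 4096,        # rollout steps per environment
    "batch_size": 2048,     # minibatch size per gradient step
    "learning_rate": 3e-4,  # Adam optimizer step size
    "gamma": 0.99,          # discount factor (assumed)
    "gae_lambda": 0.95,     # GAE lambda (assumed)
    "n_epochs": 10,         # passes over each rollout buffer
    "ent_coef": 0.01,       # entropy bonus coefficient (assumed)
    "clip_range": 0.2,      # PPO clipping epsilon (assumed)
    "max_grad_norm": 0.5,   # gradient clipping threshold (assumed)
}

# Rollout buffer size with the default 84 parallel environments:
rollout_size = ppo_kwargs["n_steps"] * 84

# In the actual script these would typically be passed as:
#   model = PPO("MlpPolicy", vec_env, **ppo_kwargs, verbose=1)
```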
Network Architecture
Neural network architecture size:
- "small": [128, 64]
- "medium": [256, 128, 64]
- "large": [512, 256, 128]
Scaling factor applied to residual joint corrections. Lower values make residuals more conservative.
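A minimal sketch of how the architecture presets and the residual scaling factor could be wired up. In a stable-baselines3 setup the presets would typically be passed via `policy_kwargs={"net_arch": ...}`; that wiring, and the 0.1 default scale, are assumptions, not confirmed by the source:

```python
# The three architecture presets named above, as a lookup table.
# Typical stable-baselines3 usage (assumed, not from the source):
#   policy_kwargs = {"net_arch": NET_ARCHS[config.net_size]}
NET_ARCHS = {
    "small": [128, 64],
    "medium": [256, 128, 64],
    "large": [512, 256, 128],
}

def scale_residuals(raw_residuals, residual_scale=0.1):
    """Apply the residual scaling factor; 0.1 is an illustrative default.

    Lower residual_scale values keep the policy closer to the nominal
    gait, making corrections more conservative.
    """
    return [residual_scale * r for r in raw_residuals]

scaled = scale_residuals([1.0] * 12)
```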
Logging and Checkpointing
Base name for the training run. Output directory will be {run_name}_{timestamp}.
Root directory for training outputs.
Save checkpoint every N environment steps.
Miscellaneous
Enable domain randomization (not currently implemented in the environment).
PyTorch device:
- "auto": Automatically select CUDA if available, otherwise CPU
- "cpu": Force CPU
- "cuda": Force CUDA
Random seed for reproducibility. If None, uses system randomness.
Output Files
Training artifacts are saved to runs/{run_name}_{timestamp}/:
Key Files
- final_model.zip: Trained policy weights (actor + critic networks)
- vec_normalize.pkl: Running mean/std statistics for observations and rewards. Required for evaluation.
- checkpoints/: Intermediate models saved every checkpoint_freq steps
- monitor_*.csv: Episode returns, lengths, and timestamps for each parallel environment
- config.txt: Human-readable training configuration
Training Dynamics
With default settings:
- Steps per iteration: 344,064 (84 envs × 4,096 steps)
- Updates per iteration: ~1,680 (10 epochs × 344,064 / 2,048 batch size)
- Total iterations: ~87 (30M / 344,064)
- Checkpoint frequency: Every 500k steps (~1.45 iterations)
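The figures above follow directly from the default settings and can be reproduced with a few lines of arithmetic:

```python
# Reproducing the training-dynamics arithmetic from the defaults.
n_envs, n_steps = 84, 4096
n_epochs, batch_size = 10, 2048
total_timesteps = 30_000_000
checkpoint_freq = 500_000

steps_per_iteration = n_envs * n_steps                                # 344,064
updates_per_iteration = n_epochs * steps_per_iteration // batch_size  # 1,680
total_iterations = total_timesteps / steps_per_iteration              # ~87.2
checkpoint_interval = checkpoint_freq / steps_per_iteration           # ~1.45
```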
Example: Monitoring Training
View TensorBoard Logs
- rollout/ep_rew_mean: Average episode reward
- train/policy_loss: Policy gradient loss
- train/value_loss: Critic (value function) loss
- train/entropy_loss: Policy entropy (exploration)
Analyze Monitor Logs
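A minimal way to compute episode statistics from a monitor log, assuming the standard stable-baselines3 Monitor CSV layout (one '#'-prefixed JSON metadata line, then r/l/t columns for reward, length, and time). The sample data below is synthetic:

```python
import csv

# Synthetic stand-in for a monitor_*.csv file. Real files start with a
# '#'-prefixed JSON line, then columns: r (episode reward), l (episode
# length), t (wall-clock seconds since training start).
sample = """#{"t_start": 0.0, "env_id": "AdaptiveGait"}
r,l,t
12.5,512,4.1
15.0,480,8.0
"""

def load_monitor(text):
    """Parse monitor CSV text into a list of per-episode dicts."""
    lines = text.splitlines()
    reader = csv.DictReader(lines[1:])  # skip the JSON comment line
    return [{k: float(v) for k, v in row.items()} for row in reader]

episodes = load_monitor(sample)
mean_reward = sum(e["r"] for e in episodes) / len(episodes)
```

For real runs, read each monitor_*.csv from the run directory and concatenate the per-environment episode lists before averaging.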
Expected Behavior
A well-trained adaptive gait policy should:
- Increase step height on rough terrain to avoid stumbling
- Adjust step length to balance speed vs. stability
- Modulate cycle time based on terrain difficulty (slower on rough terrain)
- Maintain appropriate body height to avoid collisions
Differences from Residual-Only Training
| Aspect | Adaptive Gait | Residual-Only |
|---|---|---|
| Action space | 16D (4 params + 12 residuals) | 12D (residuals only) |
| Observation space | 69D (includes gait params) | 65D |
| Control level | High-level + low-level | Low-level only |
| Terrain adaptation | Explicit gait modulation | Implicit via residuals |