Overview
The robot employs Proximal Policy Optimization (PPO) reinforcement learning to adapt its gait for rough terrain locomotion. The learned controller adds residual corrections to a baseline kinematic gait, enabling robust performance on irregular surfaces. This approach combines the stability of hand-designed gaits with the adaptability of learned policies, achieving the best of both worlds.
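The residual idea can be sketched in a few lines. This is an illustrative example, not the actual controller code: `baseline_foot_position` and the half-ellipse trajectory are assumptions chosen to show how a learned offset is layered on top of a nominal gait.

```python
import math

def baseline_foot_position(phase, step_length=0.04, step_height=0.06):
    """Hand-designed gait: a simple arc-shaped swing trajectory (illustrative).

    phase in [0, 1]: 0 = start of swing, 1 = end of swing.
    """
    x = step_length * (2.0 * phase - 1.0)        # foot travels back to front
    z = step_height * math.sin(math.pi * phase)  # foot lifts during swing
    return (x, z)

def corrected_foot_position(phase, residual, residual_scale=0.01):
    """Learned controller adds a small scaled residual to the nominal target."""
    x, z = baseline_foot_position(phase)
    dx, dz = residual
    return (x + residual_scale * dx, z + residual_scale * dz)
```

The policy never has to learn locomotion from scratch; it only learns small corrections around a trajectory that already walks.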
Two Learning Approaches
The codebase supports two RL strategies:

1. Residual Learning (Original)

Learn low-level corrections to baseline foot trajectories:
- Baseline gait controller generates nominal foot positions
- Policy outputs 3D offset vectors per leg (12D action)
- Total action space: 12 dimensions
2. Adaptive Gait Learning (Extended)
Learn both high-level parameters and low-level residuals:
- Policy modulates gait parameters (step height, length, cycle time, body height)
- Policy adds per-leg residual corrections
- Total action space: 16 dimensions (4 params + 12 residuals)
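The 16D action can be decomposed as follows. This sketch assumes the layout described above (4 gait-parameter entries first, then 12 residuals as 3D offsets for 4 legs); the exact ordering inside `AdaptiveGaitEnv` may differ.

```python
def split_action(action):
    """Split the 16D adaptive-gait action into gait parameters and residuals.

    Assumed layout: first 4 entries modulate gait parameters (step height,
    step length, cycle time, body height); the remaining 12 are per-leg
    3D foot-position residuals.
    """
    assert len(action) == 16, "adaptive gait action must be 16D"
    gait_params = action[:4]
    residuals = [tuple(action[4 + 3 * leg : 7 + 3 * leg]) for leg in range(4)]
    return gait_params, residuals
```

The residual-only approach corresponds to dropping the first 4 entries and keeping just the 12D residual part.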
This documentation focuses on the Adaptive Gait Learning approach (AdaptiveGaitEnv, introduced in envs/adaptive_gait_env.py:1-10), which is the more advanced method.
PPO Algorithm
Proximal Policy Optimization (Schulman et al., 2017) is used via Stable Baselines 3. From README.md:5:

Why PPO?
- ✅ Sample efficient: learns from fewer environment interactions
- ✅ Stable training: clipped objective prevents destructive policy updates
- ✅ On-policy: suitable for continuous control with changing dynamics
- ✅ Proven track record: state of the art for robotic locomotion tasks
PPO Key Concepts
Clipped Surrogate Objective:

L^CLIP(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]

Where:
- r_t(θ) = probability ratio (new policy / old policy)
- Â_t = advantage estimate (how much better than expected)
- ε = clip parameter (typically 0.2)
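For a single sample, the clipped surrogate is straightforward to compute. A minimal sketch (scalar form, not SB3's batched implementation):

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    Clipping removes the incentive to push the probability ratio far
    outside [1 - eps, 1 + eps], which is what keeps updates "proximal".
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, the objective stops growing once the ratio exceeds 1 + ε; with a negative advantage, it stops shrinking once the ratio drops below 1 − ε.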
Observation Space (69D)
From envs/adaptive_gait_env.py:62-68:
Observation Components
Body State (13D)
From envs/adaptive_gait_env.py:165-173:
Joint States (24D)
All 12 joints (3 per leg × 4 legs):
- 12D joint positions (angles in radians)
- 12D joint velocities (angular velocity in rad/s)
Foot Information (24D)
From envs/adaptive_gait_env.py:179-183:
Foot Contacts (4D)
Binary indicators (0.0 or 1.0) for ground contact per leg. From envs/adaptive_gait_env.py:186-188:
Current Gait Parameters (4D)
Normalized values of the current gait configuration. From envs/adaptive_gait_env.py:190-205:
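The component dimensions sum to 69 (13 + 12 + 12 + 24 + 4 + 4). A sketch of assembling the observation vector, with argument names assumed for illustration:

```python
def build_observation(body_state, joint_pos, joint_vel,
                      foot_info, contacts, gait_params):
    """Concatenate observation components into the flat 69D vector:

    13 (body state) + 12 (joint positions) + 12 (joint velocities)
    + 24 (foot information) + 4 (foot contacts) + 4 (gait parameters) = 69
    """
    obs = (list(body_state) + list(joint_pos) + list(joint_vel)
           + list(foot_info) + list(contacts) + list(gait_params))
    assert len(obs) == 69, f"expected 69D observation, got {len(obs)}"
    return obs
```

Keeping the layout fixed matters: the trained policy and the VecNormalize statistics both assume this exact ordering.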
Action Space (16D)
From envs/adaptive_gait_env.py:70-73 and 121-125:
Action Processing
From envs/adaptive_gait_env.py:212-237:
Action Scaling
From envs/adaptive_gait_env.py:77-83:

Policy actions lie in [-1, +1] and are scaled to reasonable parameter adjustments:
- step_height: action of +1.0 → +5 mm increase per timestep
- residuals: action of +1.0 → +10 mm offset (if residual_scale=0.01)
These scales are per timestep, not per episode. Parameters accumulate changes over time, with limits enforced by the controller.
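A sketch of this scaling and accumulation, assuming the scales above; the parameter limits (0.02-0.10 m) are illustrative placeholders, since the actual bounds are enforced by the controller:

```python
def apply_action(params, action, residual_scale=0.01):
    """Scale raw policy outputs (each in [-1, +1]) into per-timestep
    parameter deltas and residual offsets.

    Scales follow the text: +1.0 on step_height -> +5 mm per timestep,
    +1.0 on a residual -> +10 mm offset when residual_scale=0.01.
    """
    params = dict(params)
    params["step_height"] += 0.005 * action[0]  # metres, accumulates over time
    # ...other gait parameters would be scaled similarly...
    # Clamp to assumed controller limits so accumulation cannot run away.
    params["step_height"] = min(max(params["step_height"], 0.02), 0.10)
    residuals = [residual_scale * a for a in action[4:]]  # up to ±10 mm
    return params, residuals
```

Because the parameter deltas accumulate, the policy effectively performs integral control over the gait configuration, while residuals are applied fresh every timestep.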
Reward Function
From envs/adaptive_gait_env.py:259-300:
Reward Components
- Forward velocity (primary objective)
  - Weight: 2000.0 × forward velocity (m/s)
  - Encourages fast locomotion in the +X direction
- Lateral velocity penalty
  - Weight: -2.0 × |lateral velocity|
  - Discourages sideways drift
- Contact pattern reward
  - +0.2 for correct swing (no contact) or stance (has contact)
  - -0.5 for incorrect foot state
  - Encourages maintaining a proper diagonal trot
- Stability penalty
  - Weight: -2.0 × (roll² + pitch² + yaw²)
  - Penalizes body orientation deviations
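Putting the listed weights together, the per-timestep reward can be sketched as follows (argument names are assumptions; the real function reads these values from the simulator state):

```python
def compute_reward(vx, vy, roll, pitch, yaw, expected_stance, contacts):
    """Dense reward, one value per timestep, using the weights listed above.

    expected_stance / contacts: per-leg booleans (True = foot should be /
    is on the ground).
    """
    reward = 2000.0 * vx                  # forward velocity (primary objective)
    reward -= 2.0 * abs(vy)               # lateral drift penalty
    for want_contact, has_contact in zip(expected_stance, contacts):
        reward += 0.2 if want_contact == has_contact else -0.5  # trot pattern
    reward -= 2.0 * (roll**2 + pitch**2 + yaw**2)  # stability penalty
    return reward
```

Note how dominant the velocity term is: at 0.1 m/s forward it contributes 200 per step, so the other terms act as shaping corrections rather than competing objectives.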
The reward is dense (received every timestep), not sparse. This helps PPO learn faster by providing continuous feedback about policy quality.
Termination Conditions
From envs/adaptive_gait_env.py:302-311:
- Terminated: Robot falls over (roll/pitch > 60°)
- Truncated: Episode reaches maximum length (default 60,000 steps)
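These two conditions map directly onto the Gymnasium-style `terminated` / `truncated` pair. A minimal sketch, with the 60° threshold and 60,000-step limit from the text:

```python
import math

def check_done(roll, pitch, step_count, max_episode_steps=60_000):
    """Return (terminated, truncated) for the current state.

    terminated: the robot has fallen (|roll| or |pitch| beyond 60 degrees).
    truncated: the episode hit its maximum length without falling.
    """
    fall_limit = math.radians(60)
    terminated = abs(roll) > fall_limit or abs(pitch) > fall_limit
    truncated = step_count >= max_episode_steps
    return terminated, truncated
```

Distinguishing the two matters for PPO: a truncated episode is bootstrapped from the value function, while a terminated one is treated as a genuine end state.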
Training Environment
From envs/adaptive_gait_env.py:85-114:
Key Configuration
- model_path: world_train.xml contains rough terrain with a heightfield
- residual_scale: 0.01 m (10 mm) maximum residual correction
- max_episode_steps: 60,000 steps ≈ 30 seconds at 2000 Hz
- settle_steps: Pre-simulation steps to stabilize before training
Performance Results
From README.md:129-136 (test comparison summary):

Quantitative Metrics (17-second trials)
| Controller | Terrain | Distance | Velocity | vs Baseline Rough |
|---|---|---|---|---|
| Baseline | Flat | 0.506 m | 0.030 m/s | +69.2% |
| Baseline | Rough | 0.299 m | 0.018 m/s | Baseline |
| Adaptive RL | Rough | 3.191 m | 0.188 m/s | +967.2% |
The adaptive RL controller achieves more than a 10× improvement over the baseline on rough terrain (3.191 m vs 0.299 m), demonstrating the power of learned adaptation.
Training Workflow
From README.md:77-87:

Trained Model Files
- final_model.zip: PPO policy network weights
- vec_normalize.pkl: Observation normalization statistics (mean/std)
Why normalize observations?
From the VecNormalize wrapper (Stable Baselines 3):
- Observations have vastly different scales (positions in meters, velocities in m/s, angles in radians)
- Neural networks train better with inputs normalized to ~[-1, +1] range
- Running mean/std are computed during training and frozen during evaluation
- Improves learning speed by 2-5× in practice
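The mechanism can be sketched with a running mean/variance tracker. This is a simplified scalar version in the spirit of SB3's VecNormalize (the real wrapper is vectorized, also normalizes rewards, and clips the result):

```python
class RunningNormalizer:
    """Online mean/variance via Welford's algorithm; stats can be frozen
    for evaluation, mirroring how vec_normalize.pkl is used."""

    def __init__(self, eps=1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps
        self.frozen = False  # set True at evaluation time

    def update(self, x):
        if self.frozen:
            return
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)  # accumulates sum of squared devs

    def normalize(self, x):
        var = self.m2 / self.count if self.count > 1 else 1.0
        return (x - self.mean) / ((var + self.eps) ** 0.5)
```

Freezing the statistics at evaluation time is essential: the policy was trained on normalized inputs, so the deployed controller must apply the same mean/std it saw during training.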
Adaptive vs Residual Control
Residual-Only Approach
- Policy only adjusts foot positions
- Gait parameters fixed at design time
- Simpler action space (12D)
Adaptive Gait Approach (Current)
- Policy adjusts both strategy and tactics
- Can learn to step higher on rough terrain
- Can slow down cycle time for stability
- More expressive but harder to train (16D action space)
The adaptive approach allows the policy to learn hierarchical control: high-level gait strategies (when to take bigger steps) combined with low-level corrections (where exactly to place each foot).
Related Topics
- Gait Control - Baseline controller that RL builds upon
- Robot Design - Hardware constraints affecting learning
- Inverse Kinematics - Converting learned actions to joint commands