Overview

The train_adaptive_gait_ppo.py script trains a PPO (Proximal Policy Optimization) policy that learns to adapt gait parameters and apply residual corrections for robust locomotion on rough terrain.

What the Policy Learns

High-Level Control

Gait Parameter Adaptation (4D)
  • step_height - Foot clearance adjustment
  • step_length - Stride length modulation
  • cycle_time - Gait frequency tuning
  • body_height - Body clearance control

Low-Level Control

Residual Corrections (12D)
  • Fine-grained joint angle adjustments
  • 3 DOF per leg × 4 legs
  • Applied on top of base gait
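Taken together, the policy outputs a 16D action: 4 gait-parameter deltas followed by 12 joint residuals. A minimal numpy sketch of how such an action might be split, assuming the layout above and the residual scale of 0.01 shown in the training printout (function and variable names are illustrative, not the script's API):

```python
import numpy as np

RESIDUAL_SCALE = 0.01  # matches "Residual scale" in the training printout

def split_action(action: np.ndarray):
    """Split a 16D policy action into gait-parameter deltas and joint residuals.

    Layout assumed from the docs: the first 4 entries adjust
    (step_height, step_length, cycle_time, body_height); the remaining
    12 are per-joint residuals (3 DOF x 4 legs) applied on top of the
    base gait's joint targets.
    """
    assert action.shape == (16,)
    param_deltas = action[:4]
    residuals = RESIDUAL_SCALE * action[4:]  # fine-grained corrections
    return param_deltas, residuals

action = np.zeros(16)
action[0] = 0.5   # raise step_height
action[4] = 1.0   # nudge one joint (hypothetical joint ordering)
deltas, residuals = split_action(action)
```

The small residual scale keeps low-level corrections from overpowering the base gait early in training.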

Quick Start

1. Navigate to Source Directory

cd ~/workspace/source
2. Start Training

Run the training script with default configuration:
python3 train_adaptive_gait_ppo.py
Training will begin with 84 parallel environments.
3. Monitor Progress

Open TensorBoard in a new terminal to visualize training metrics:
tensorboard --logdir runs/
Navigate to http://localhost:6006 in your browser.
4. Wait for Completion

Training runs for 30 million timesteps by default (several hours):
  • Progress bar shows completion percentage
  • Checkpoints saved every 500,000 steps
  • Final model saved at completion

Training Configuration

All hyperparameters are defined in the TrainingConfig dataclass (train_adaptive_gait_ppo.py:42):
total_timesteps = 30_000_000  # 30M total steps
n_envs = 84                   # Parallel environments
vec_env_backend = "subproc"   # Multi-process execution
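For orientation, here is a sketch of what such a dataclass can look like. The three fields above and the values in the training printout come from this page; any field name not shown there is an assumption and may differ from the real script:

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # Values taken from this page's config printout; names not shown
    # there (e.g. ent_coef) are assumed and may differ in the script.
    total_timesteps: int = 30_000_000
    n_envs: int = 84
    n_steps: int = 4096
    batch_size: int = 2048
    learning_rate: float = 3e-4
    ent_coef: float = 0.01            # entropy coefficient (assumed name)
    n_epochs: int = 10
    max_grad_norm: float = 1.0
    network_size: str = "large"
    residual_scale: float = 0.01
    device: str = "auto"
    vec_env_backend: str = "subproc"  # "subproc" or "dummy"

cfg = TrainingConfig()
```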

Understanding Training Output

When you start training, you’ll see:
================================================================================
Adaptive Gait Training Configuration
================================================================================
Run name:            adaptive_gait_20260304_143022
Log directory:       runs/adaptive_gait_20260304_143022
Total timesteps:     30,000,000
Parallel envs:       84
Steps per update:    4096
Batch size:          2048
Learning rate:       0.0003
Entropy coef:        0.01
N epochs:            10
Max grad norm:       1.0
Network size:        large
Residual scale:      0.01
Device:              auto
Vec env backend:     SubprocVecEnv
================================================================================

Key differences from residual-only training:
  - Action space: 16D (4 param deltas + 12 residuals)
  - Observation: 69D (includes current gait params)
  - Policy learns both high-level and low-level control
================================================================================

Creating environments...

Network architecture (large):
  Actor:  [512, 256, 128]
  Critic: [512, 256, 128]
  Activation: ELU

Initializing PPO model...

Training details:
  Steps per iteration:   344,064
  Updates per iteration: 1,680
  Total iterations:      87
  Checkpoint frequency:  Every 500,000 steps

================================================================================
Starting training...
================================================================================

Monitor training progress:
  TensorBoard: tensorboard --logdir runs/adaptive_gait_20260304_143022
  Checkpoints: runs/adaptive_gait_20260304_143022/checkpoints/
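The "Training details" figures in the printout follow directly from the configuration; a quick arithmetic check using the values shown above:

```python
# Values from the configuration printout above.
n_envs = 84
n_steps = 4096
batch_size = 2048
n_epochs = 10
total_timesteps = 30_000_000

# One iteration collects n_steps transitions from every parallel env.
steps_per_iteration = n_envs * n_steps                                   # 344,064
# Each iteration runs n_epochs passes over the rollout in minibatches.
updates_per_iteration = (steps_per_iteration // batch_size) * n_epochs   # 1,680
# Iterations needed to reach the timestep budget (last one truncated).
total_iterations = total_timesteps // steps_per_iteration                # 87

print(steps_per_iteration, updates_per_iteration, total_iterations)
```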

Monitoring with TensorBoard

Important Graphs to Watch

rollout/ep_rew_mean
  • Episode reward (higher is better)
  • Should increase over time
  • Indicates learning progress
rollout/ep_len_mean
  • Episode length in steps
  • Longer episodes = more stable walking
  • Should increase as policy improves
train/entropy_loss
  • Policy exploration measure
  • Decreases as policy becomes more deterministic
  • Too low too fast = premature convergence
train/policy_gradient_loss
  • How much policy is changing
  • Should decrease and stabilize
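Raw reward curves from 84 environments are noisy. TensorBoard's smoothing slider applies an exponential moving average; a minimal standalone equivalent (without TensorBoard's debiasing) can be handy for offline analysis of exported scalars:

```python
def ema_smooth(values, weight=0.9):
    """Exponential moving average, similar to TensorBoard's smoothing slider.

    weight=0 returns the raw values; values closer to 1 smooth more.
    """
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed

rewards = [10, 12, 8, 30, 28, 35]   # illustrative ep_rew_mean samples
print(ema_smooth(rewards, weight=0.5))
```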

Training Outputs

After training completes, you'll find these artifacts in runs/adaptive_gait_YYYYMMDD_HHMMSS/:

final_model.zip

The trained PPO policy weights. Use this file to run the trained robot:
python3 play_adaptive_policy.py \
    --model runs/adaptive_gait_20260304_143022/final_model.zip \
    --normalize runs/adaptive_gait_20260304_143022/vec_normalize.pkl \
    --seconds 30 \
    --deterministic
vec_normalize.pkl

Observation and reward normalization statistics. Required for inference:
  • Mean and standard deviation of observations
  • Running reward statistics
  • Must be loaded with the model for correct behavior
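Conceptually, these statistics standardize and clip each observation at inference time. A numpy sketch of the standard standardize-and-clip formula applied by VecNormalize-style wrappers (the clip and epsilon values here are common defaults, not read from the script):

```python
import numpy as np

def normalize_obs(obs, mean, var, clip=10.0, eps=1e-8):
    """Standardize with running mean/variance, then clip outliers."""
    return np.clip((obs - mean) / np.sqrt(var + eps), -clip, clip)

# Illustrative 3D slice of the 69D observation.
obs = np.array([1.0, -2.0, 100.0])
mean = np.array([0.0, 0.0, 0.0])
var = np.array([1.0, 1.0, 1.0])
out = normalize_obs(obs, mean, var)  # third entry clipped to 10.0
```

Running the policy on raw, unnormalized observations is why skipping the .pkl file produces incorrect behavior.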
checkpoints/

Intermediate model saves every 500,000 steps:
checkpoints/
├── rl_model_500000_steps.zip
├── rl_model_1000000_steps.zip
├── rl_model_1500000_steps.zip
└── ...
Useful for:
  • Recovering from training interruptions
  • Comparing policy evolution over time
  • Finding the best checkpoint if training diverges
Episode monitor logs

Per-environment episode statistics (one CSV file per parallel environment):
r,l,t
1234.56,5000,123.45
1456.78,6000,234.56
  • r: Episode reward
  • l: Episode length (steps)
  • t: Wall-clock time (seconds)
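These columns can be loaded with the standard csv module. The sketch below parses the sample rows shown above; real Stable-Baselines3 monitor files also begin with a JSON comment line, which is skipped if present:

```python
import csv
import io

sample = """r,l,t
1234.56,5000,123.45
1456.78,6000,234.56
"""

def read_episodes(fileobj):
    """Yield (reward, length, time) tuples, skipping a leading '#' comment line."""
    first = fileobj.readline()
    if not first.startswith("#"):
        fileobj.seek(0)  # no comment line; rewind so the header is re-read
    reader = csv.DictReader(fileobj)
    for row in reader:
        yield float(row["r"]), int(row["l"]), float(row["t"])

episodes = list(read_episodes(io.StringIO(sample)))
mean_reward = sum(r for r, _, _ in episodes) / len(episodes)
```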
Training configuration file

Complete training configuration for reproducibility:
Adaptive Gait Training Configuration
========================================
total_timesteps: 30000000
n_envs: 84
n_steps: 4096
batch_size: 2048
learning_rate: 0.0003
...

Customizing Training

To modify training parameters, edit the TrainingConfig dataclass in train_adaptive_gait_ppo.py:42:
total_timesteps = 10_000_000  # Reduce from 30M
n_envs = 16                   # Fewer parallel envs
network_size = "medium"       # Smaller network

Expected Training Time

With default configuration (30M timesteps, 84 environments):
Hardware          Approximate Time
32+ CPU cores     4-6 hours
16 CPU cores      8-12 hours
8 CPU cores       16-24 hours
Training time scales roughly as total_timesteps / (n_envs × per-environment steps per second), so more parallel environments significantly reduce wall-clock time.
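A back-of-envelope estimate using that relation; the per-environment throughput below is hypothetical, so measure your own with a short run before relying on the number:

```python
total_timesteps = 30_000_000
n_envs = 84
steps_per_env_per_sec = 25.0  # hypothetical throughput; measure on your hardware

# Aggregate throughput grows linearly with the number of parallel envs.
seconds = total_timesteps / (n_envs * steps_per_env_per_sec)
hours = seconds / 3600
print(f"~{hours:.1f} hours")
```

At this assumed throughput the estimate lands near the low end of the 32+ core row above.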

Testing the Trained Policy

After training completes:
1. Run Trained Policy

python3 play_adaptive_policy.py \
    --model runs/adaptive_gait_20260304_143022/final_model.zip \
    --normalize runs/adaptive_gait_20260304_143022/vec_normalize.pkl \
    --seconds 30 \
    --deterministic
2. Compare with Baseline

See the Baseline vs Adaptive guide for detailed comparison workflow.

Troubleshooting

If training runs out of memory, reduce memory usage:
n_envs = 16          # Reduce from 84
n_steps = 2048       # Reduce from 4096
network_size = "medium"  # Reduce from "large"
Switch to single-process mode for debugging:
vec_env_backend = "dummy"
n_envs = 1
This makes errors easier to trace.
If the policy isn't learning, check:
  • Episode length remains very low → environment may be too difficult
  • Reward not increasing → try reducing learning rate
  • High variance → increase n_envs for more stable gradients

Next Steps

Compare Performance

Quantitatively compare baseline vs trained policy

Deploy to Robot

Use the trained policy with ROS2 integration
