
ResidualWalkEnv

A Gymnasium-compatible reinforcement learning environment for learning low-level residual corrections on top of a fixed diagonal Bézier gait controller. This environment focuses purely on foot position adjustments without modifying high-level gait parameters.

Overview

The ResidualWalkEnv provides a simpler learning problem than AdaptiveGaitEnv by keeping gait parameters (step height, cycle time, etc.) fixed. The policy learns only per-leg residual corrections—small 3D foot position offsets that compensate for terrain irregularities and improve locomotion quality.
Key Features
  • 65-dimensional observation space (no gait parameter feedback)
  • 12-dimensional action space (3D residual per leg)
  • Fixed gait parameters for consistent base behavior
  • Reward optimized for velocity tracking and contact quality

Class Definition

from envs.residual_walk_env import ResidualWalkEnv

env = ResidualWalkEnv(
    model_path="model/world_train.xml",
    gait_params=None,
    residual_scale=0.02,
    max_episode_steps=1000,
    settle_steps=500,
    seed=None
)

Constructor Parameters

model_path
str
default:"model/world_train.xml"
Path to the MuJoCo XML model file defining the robot and environment.
gait_params
GaitParameters | None
default:"None"
Fixed gait parameters for the base controller. If None, uses controller defaults. These parameters remain constant throughout training.
residual_scale
float
default:"0.02"
Scaling factor for residual corrections. Actions in [-1, 1] are multiplied by this value to get foot position offsets in meters.
max_episode_steps
int
default:"1000"
Maximum number of simulation steps per episode before truncation.
settle_steps
int
default:"500"
Number of steps to run with zero residuals at episode start. This allows the robot to settle into a stable gait on the terrain before learning begins.
seed
int | None
default:"None"
Random seed for reproducibility.

Spaces

Observation Space

Shape: (65,), a Box space with values in [-inf, inf]. The observation is a concatenation of the following components:
body_state
float[13]
Robot body state in world frame:
  • Position (3D): x, y, z coordinates in meters
  • Orientation (4D): quaternion [w, x, y, z] (normalized)
  • Linear velocity (3D): body velocity in m/s
  • Angular velocity (3D): body angular velocity in rad/s
joint_states
float[24]
Joint positions and velocities for all 12 joints (3 per leg × 4 legs):
  • Positions (12D): joint angles in radians
  • Velocities (12D): joint angular velocities in rad/s
foot_positions
float[12]
3D positions of each foot in body frame (3D per leg × 4 legs).
foot_velocities
float[12]
3D velocities of each foot in body frame (3D per leg × 4 legs).
foot_contacts
float[4]
Binary contact indicators for each leg (FL, FR, RL, RR):
  • 1.0: foot in contact with ground
  • 0.0: foot not in contact
Unlike AdaptiveGaitEnv, this observation does NOT include current gait parameters since they are fixed.
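The component sizes above sum to exactly 65 (13 + 24 + 12 + 12 + 4). As an illustrative sketch (the function name and argument layout are assumptions, not the environment's actual internals), the observation could be assembled like this:

```python
import numpy as np

def build_observation(body_state, joint_pos, joint_vel,
                      foot_pos, foot_vel, foot_contacts) -> np.ndarray:
    """Concatenate the documented components into the 65D observation vector."""
    obs = np.concatenate([
        body_state,      # 13: position (3) + quaternion (4) + lin vel (3) + ang vel (3)
        joint_pos,       # 12: joint angles in radians
        joint_vel,       # 12: joint angular velocities in rad/s
        foot_pos,        # 12: foot positions in body frame (3 per leg)
        foot_vel,        # 12: foot velocities in body frame (3 per leg)
        foot_contacts,   # 4:  binary contact flags (FL, FR, RL, RR)
    ]).astype(np.float32)
    assert obs.shape == (65,)
    return obs

obs = build_observation(np.zeros(13), np.zeros(12), np.zeros(12),
                        np.zeros(12), np.zeros(12), np.zeros(4))
```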

Action Space

Shape: (12,), a Box space with values in [-1.0, 1.0]. Actions represent 3D foot position residual corrections:
residual_corrections
float[12]
3D foot position offsets for each leg:
  • FL residual [0:3]: Front-Left leg (x, y, z)
  • FR residual [3:6]: Front-Right leg (x, y, z)
  • RL residual [6:9]: Rear-Left leg (x, y, z)
  • RR residual [9:12]: Rear-Right leg (x, y, z)
Each value in [-1, 1] is multiplied by residual_scale to get the actual offset in meters.
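A minimal sketch of this mapping (the helper name is hypothetical; the environment performs the equivalent scaling internally):

```python
import numpy as np

def action_to_residuals(action: np.ndarray, residual_scale: float = 0.02) -> dict:
    """Map a 12D action in [-1, 1] to per-leg 3D foot offsets in meters."""
    action = np.clip(action, -1.0, 1.0)          # guard against out-of-range policies
    offsets = action.reshape(4, 3) * residual_scale  # rows: FL, FR, RL, RR
    legs = ["FL", "FR", "RL", "RR"]
    return {leg: offsets[i] for i, leg in enumerate(legs)}

residuals = action_to_residuals(np.ones(12), residual_scale=0.02)
# A full-scale action of 1.0 yields a 0.02 m offset on each axis.
```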

Methods

reset

def reset(
    self,
    *,
    seed: int | None = None,
    options: dict | None = None
) -> tuple[np.ndarray, dict]
Resets the environment to its initial state.
seed
int | None
Optional seed for episode randomization.
options
dict | None
Additional options:
  • randomize (bool): If True, applies small random perturbations to initial position and orientation
Returns: (observation, info)
  • observation: Initial 65D observation vector
  • info: Empty dict (metadata for future use)
During reset, the environment runs settle_steps steps with zero residuals to allow the robot to stabilize on the terrain before the episode begins.
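The settle phase can be pictured as a loop that steps the base controller with zero residuals (a hypothetical helper for illustration; the real internals may differ):

```python
import numpy as np

def settle(env_step_fn, settle_steps: int = 500) -> None:
    """Step the simulation with zero residual corrections so the robot
    stabilizes on the terrain before the learning episode starts."""
    zero_action = np.zeros(12)  # no foot position offsets
    for _ in range(settle_steps):
        env_step_fn(zero_action)
```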

step

def step(
    self,
    action: np.ndarray
) -> tuple[np.ndarray, float, bool, bool, dict]
Executes one simulation step with the given action.
action
np.ndarray
12D action vector in range [-1, 1].
Returns: (observation, reward, terminated, truncated, info)
observation
np.ndarray
65D observation of new state.
reward
float
Scalar reward value (see Reward Function section).
terminated
bool
True if episode ended due to failure condition (robot fell over).
truncated
bool
True if episode reached max_episode_steps.
info
dict
Additional information:
  • reward_components (dict): Breakdown of reward by component
  • body_height (float): Current body height in meters

set_terrain_scale

def set_terrain_scale(self, scale: float) -> None
Placeholder method for curriculum learning support. Currently stores the scale value for compatibility with training callbacks.
scale
float
Terrain difficulty scale (future use).

Reward Function

The reward function is computed in _compute_reward() and focuses on velocity tracking and gait quality:

Forward Velocity Tracking

expected_velocity = step_length / cycle_time
vel_error = abs(forward_vel - expected_velocity)
reward = 300.0 * (1.0 - vel_error)
Tracks the expected velocity derived from gait parameters. This is more principled than simply maximizing velocity, as it accounts for the gait’s natural speed.

Contact Pattern Reward

# For each leg:
if is_swing and has_contact:      # Swing leg touching (BAD)
    contact_reward -= 0.2
elif is_swing and not has_contact:  # Swing leg in air (GOOD)
    contact_reward += 0.05
elif not is_swing and has_contact:  # Stance leg on ground (GOOD)
    contact_reward += 0.05
else:                               # Stance leg floating (BAD)
    contact_reward -= 0.2
Encourages proper contact patterns: swing legs should be in the air, stance legs should be on the ground.
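The four branches reduce to one condition: a leg is "correct" when its swing flag and contact flag disagree. A vectorized sketch of the same logic (function name assumed for illustration):

```python
import numpy as np

def contact_pattern_reward(is_swing: np.ndarray, has_contact: np.ndarray) -> float:
    """Per-leg contact pattern reward: +0.05 when swing/contact disagree
    (swing leg in the air, or stance leg on the ground), -0.2 otherwise."""
    correct = is_swing != has_contact
    return float(np.where(correct, 0.05, -0.2).sum())
```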

Stability Penalty

tilt_penalty = -3.0 * (roll**2 + pitch**2 + yaw**2)
Penalizes body orientation deviations to maintain upright posture.

Lateral Stability

lateral_error = abs(body_pos_y)
penalty = -1.0 * lateral_error
Penalizes lateral drift from the initial y-position to maintain straight-line walking.
Total reward: the sum of all four components above.
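Putting the components together, the full reward could be sketched as follows (an illustrative reconstruction of _compute_reward(); the exact signature is an assumption):

```python
import numpy as np

def compute_reward(forward_vel: float, expected_velocity: float,
                   contact_reward: float, roll: float, pitch: float,
                   yaw: float, body_pos_y: float) -> tuple[float, dict]:
    """Sum the four documented reward components and return a breakdown,
    mirroring the reward_components dict exposed in step()'s info."""
    vel_error = abs(forward_vel - expected_velocity)
    components = {
        "velocity": 300.0 * (1.0 - vel_error),            # velocity tracking
        "contact": contact_reward,                         # contact pattern term
        "tilt": -3.0 * (roll**2 + pitch**2 + yaw**2),      # stability penalty
        "lateral": -1.0 * abs(body_pos_y),                 # lateral drift penalty
    }
    return sum(components.values()), components
```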

Termination Conditions

Terminated (Failure)

Episode terminates early if:
  • |roll| > π/3 (60°) - robot tipped sideways
  • |pitch| > π/3 (60°) - robot tipped forward/backward

Truncated (Timeout)

Episode truncates when step_count >= max_episode_steps.
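These two conditions map directly onto Gymnasium's terminated/truncated pair; a minimal sketch (helper name assumed):

```python
import math

def check_done(roll: float, pitch: float, step_count: int,
               max_episode_steps: int = 1000) -> tuple[bool, bool]:
    """Return (terminated, truncated): terminated when the body tips past
    60 degrees in roll or pitch, truncated at the step limit."""
    terminated = abs(roll) > math.pi / 3 or abs(pitch) > math.pi / 3
    truncated = step_count >= max_episode_steps
    return terminated, truncated
```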

Usage Examples

Basic Training Setup

import numpy as np
from envs.residual_walk_env import ResidualWalkEnv

# Create environment
env = ResidualWalkEnv(
    model_path="model/world_train.xml",
    residual_scale=0.02,
    max_episode_steps=1000,
    settle_steps=500
)

# Reset and run episode
obs, info = env.reset(seed=42)
total_reward = 0.0

for _ in range(1000):
    # Random policy (replace with trained policy)
    action = env.action_space.sample()
    
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    
    if terminated or truncated:
        print(f"Episode finished with reward: {total_reward}")
        print(f"Reward breakdown: {info['reward_components']}")
        print(f"Final body height: {info['body_height']:.3f}m")
        break

With Custom Gait Parameters

from gait_controller import GaitParameters
from envs.residual_walk_env import ResidualWalkEnv

# Define custom gait (these remain fixed during training)
custom_gait = GaitParameters(
    step_height=0.05,
    step_length=0.10,
    cycle_time=0.8,
    body_height=0.06
)

env = ResidualWalkEnv(
    gait_params=custom_gait,
    residual_scale=0.025
)

obs, info = env.reset()
print(f"Expected velocity: {env.expected_velocity:.3f} m/s")

Training with PPO

from stable_baselines3 import PPO
from envs.residual_walk_env import ResidualWalkEnv

# Create environment
env = ResidualWalkEnv(
    residual_scale=0.02,
    max_episode_steps=1000
)

# Train PPO agent
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    verbose=1
)

model.learn(total_timesteps=1_000_000)
model.save("residual_walk_policy")

With Randomization

env = ResidualWalkEnv()

# Reset with randomized initial conditions for robustness
obs, info = env.reset(
    seed=123,
    options={"randomize": True}
)

Expected Velocity Calculation

The environment automatically computes expected forward velocity from gait parameters:
expected_velocity = step_length / cycle_time
This value is used in the velocity tracking reward component and represents the natural forward speed of the base gait controller.
Example: with step_length=0.08 m and cycle_time=0.6 s, the expected velocity is 0.08 / 0.6 ≈ 0.133 m/s.

Comparison with AdaptiveGaitEnv

Feature             | ResidualWalkEnv           | AdaptiveGaitEnv
--------------------|---------------------------|-----------------------------------
Observation dim     | 65                        | 69
Action dim          | 12                        | 16
Gait parameters     | Fixed                     | Adaptive
Learning complexity | Lower                     | Higher
Terrain adaptation  | Limited                   | Full
Use case            | Baseline, simple terrains | Complex terrains, advanced control
Start with ResidualWalkEnv for initial experiments and baselines. Once residual learning is working well, progress to AdaptiveGaitEnv for full adaptive capabilities.

See Also

  • AdaptiveGaitEnv: Extended environment with adaptive gait parameters
  • BezierGaitController: Base Bézier gait controller used in this environment
