Overview
The tests/compare_baseline_adaptive.py script provides an automated performance comparison between:
- Step 1: Baseline gait on flat terrain (upper bound)
- Step 2: Baseline gait on rough terrain (performance degradation)
- Step 3: Adaptive RL policy on rough terrain (learned recovery)
Quick Start
Ensure Model is Trained
Run Comparison Script
Execute the three-simulation comparison, replacing 20260304_143022 with your actual training run timestamp.

Watch Simulations
Three MuJoCo viewer windows will open sequentially:
- Step 1: Baseline on flat terrain (smooth walking)
- Step 2: Baseline on rough terrain (struggling)
- Step 3: Adaptive policy on rough terrain (adapted walking)
Command-Line Options
- Model path: path to the trained PPO model (.zip file). Example: runs/adaptive_gait_20260304_143022/final_model.zip
- VecNormalize path: path to the VecNormalize statistics (.pkl file). Example: runs/adaptive_gait_20260304_143022/vec_normalize.pkl
- Duration (--seconds): duration of each simulation, in seconds.
  - Longer runs = more stable statistics
  - Shorter runs = faster iteration
  - Recommended: 15-20 seconds
- Output: output path for the comparison plot. Format: PNG image (150 DPI)
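Putting these options together, a full invocation might look like the sketch below. Only --seconds appears verbatim elsewhere on this page; --model, --vecnorm, and --output are assumed flag names, and the timestamp is this page's running example.

```shell
# Hypothetical invocation; --model, --vecnorm, and --output are assumed flag
# names. The fallback echo keeps the snippet from failing silently when the
# paths are wrong.
python tests/compare_baseline_adaptive.py \
  --model runs/adaptive_gait_20260304_143022/final_model.zip \
  --vecnorm runs/adaptive_gait_20260304_143022/vec_normalize.pkl \
  --seconds 15 \
  --output tests/comparison.png \
  || echo "comparison failed; verify the run timestamp and file paths"
```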
Understanding the Output
Three-Panel Plot
The script generates a side-by-side comparison:
- Step 1: Baseline (Flat)
- Step 2: Baseline (Rough)
- Step 3: Adaptive (Rough)
Left Panel - Reference Performance

Shows the baseline gait controller on ideal terrain:
- Smooth, linear progression
- Consistent forward velocity
- No obstacles or disturbances
Console Summary
After the simulations complete, the script prints a performance summary to the console.

Interpreting Results
Good Learning Outcome
Indicators:
- Step 2 (Baseline Rough) shows significant degradation: -40% to -80%
- Step 3 (Adaptive Rough) shows a large improvement over Step 2: +50% to +200%
- Step 3 approaches Step 1 performance: within 10-20% of the flat-terrain baseline

✅ Policy successfully learned terrain adaptation
Marginal Learning
Indicators:
- Step 3 shows a small improvement over Step 2: +10% to +30%
- Still significantly below Step 1 performance: -30% to -50%

⚠️ Policy learned some adaptation, but not enough

Solutions:
- Train longer (increase total_timesteps)
- Tune the reward function
- Increase network size
- Adjust hyperparameters
No Learning / Regression
Indicators:
- Step 3 similar to or worse than Step 2: -10% to +5%
- Far below Step 1 performance
- Training diverged or plateaued early

❌ Policy did not learn effectively

Likely Causes:
- Learning rate too high
- Observation space issues
- Reward function not aligned with the task
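The percentages above are typically derived from net forward displacement over a run. As a sketch of how to check a trajectory file by hand, the snippet below generates a synthetic sample (the top-level key names are assumptions; only the time/x/y/z record fields are documented on this page) and computes the net x displacement:

```shell
# Build a tiny synthetic trajectory file so the check is self-contained.
# Key names ("duration", "positions", ...) are assumptions; the time/x/y/z
# record fields match the trajectory format documented below.
cat > /tmp/sample_trajectory.json <<'EOF'
{"duration": 1.0, "num_positions": 2, "terrain": "rough", "mode": "adaptive",
 "positions": [{"time": 0.0, "x": 0.0, "y": 0.0, "z": 0.3},
               {"time": 1.0, "x": 0.8, "y": 0.1, "z": 0.3}]}
EOF
# Net forward displacement = final x minus initial x.
python3 -c "
import json
d = json.load(open('/tmp/sample_trajectory.json'))
p = d['positions']
print('net x displacement: %.2f m' % (p[-1]['x'] - p[0]['x']))
"
# prints: net x displacement: 0.80 m
```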
Saved Files
The script saves several files in the tests/ directory: a trajectory JSON file for each of the three simulations, and the comparison plot PNG.
Trajectory Data Format
Each JSON file contains:
- Total simulation time in seconds
- Number of recorded positions
- Terrain type: "flat" or "rough"
- Control mode: "baseline" or "adaptive"
- An array of position records:
  - time: simulation time (seconds)
  - x, y, z: robot body position (meters)
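Concretely, a file matching the fields above might look like this sketch (the key names are assumptions; only the field meanings and the time/x/y/z record layout come from this page):

```json
{
  "duration": 15.0,
  "num_positions": 3,
  "terrain": "rough",
  "mode": "adaptive",
  "positions": [
    {"time": 0.00, "x": 0.00, "y": 0.00, "z": 0.31},
    {"time": 0.05, "x": 0.01, "y": 0.00, "z": 0.31},
    {"time": 0.10, "x": 0.02, "y": 0.00, "z": 0.30}
  ]
}
```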
Advanced Usage
Comparing Multiple Checkpoints
Test different training checkpoints to see how performance evolved over training.

Longer Evaluation Runs

Run longer simulations for more stable statistics.

Batch Comparison Script

Compare all checkpoints automatically with a batch_compare.sh script.
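A sketch of such a script, assuming checkpoints are saved under runs/&lt;run&gt;/checkpoints/ as .zip files (that layout, and every flag except --seconds, is an assumption):

```shell
#!/usr/bin/env bash
# batch_compare.sh -- sketch of automatic checkpoint comparison.
# Assumes checkpoints live at runs/<run>/checkpoints/*.zip; flag names other
# than --seconds are assumptions.
for ckpt in runs/*/checkpoints/*.zip; do
  [ -e "$ckpt" ] || continue          # glob matched nothing; skip
  run_dir=${ckpt%/checkpoints/*}      # e.g. runs/adaptive_gait_20260304_143022
  name=$(basename "$ckpt" .zip)
  python tests/compare_baseline_adaptive.py \
    --model "$ckpt" \
    --vecnorm "$run_dir/vec_normalize.pkl" \
    --seconds 15 \
    --output "tests/comparison_$name.png"
done
```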
Troubleshooting
Simulation Crashes or Exits Early
Symptoms: the MuJoCo viewer closes before 17 seconds have elapsed.

Causes:
- Robot fell and the simulation terminated
- Episode length limit reached
- Model produces NaN values (diverged training)

Solutions:
- Check training metrics for divergence
- Try an earlier checkpoint
- Reduce the simulation duration (--seconds 10)
- Review model-loading errors in the console
FileNotFoundError for Model
Error: FileNotFoundError: [Errno 2] No such file or directory: 'runs/.../final_model.zip'

Solutions:
- List available runs
- Use the correct timestamp in the path
- Ensure training completed successfully
- Check for final_model.zip in the run directory
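For the first and last solutions, a hedged sketch (the directory layout and timestamp are the ones used in this page's examples):

```shell
# List available training runs, then check that the example run contains
# final_model.zip. Substitute your own timestamp.
ls runs/ 2>/dev/null || echo "no runs/ directory here"
ls runs/adaptive_gait_20260304_143022/final_model.zip 2>/dev/null \
  || echo "final_model.zip not found in that run directory"
```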
Plot Looks Strange
Symptoms: backwards motion, flat lines, or extreme values in the plot.

Interpretations:
- Backwards motion: robot is falling/flipping
- Flat line: robot stuck or not moving
- Extreme jumps: simulation instability

Debugging steps:
- Watch the MuJoCo viewer during runs
- Check the trajectory JSON files manually
- Run each simulation individually
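For the last step, the playback script covered under Implementation Details can run one simulation at a time. A hypothetical invocation (only the script name and --save-trajectory appear on this page; the other flags are assumptions):

```shell
# Hypothetical: run a single simulation and record its trajectory for manual
# inspection. Flag names other than --save-trajectory are assumptions.
python play_adaptive_policy.py \
  --model runs/adaptive_gait_20260304_143022/final_model.zip \
  --terrain rough \
  --save-trajectory tests/trajectory_adaptive_rough.json \
  || echo "playback failed; check the model path and console errors"
```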
Adaptive Worse Than Baseline
Symptoms: Step 3 shows negative improvement vs. Step 2.

This indicates training failed. Check:
- TensorBoard metrics:
  - Did reward increase?
  - Did episode length increase?
  - When did training plateau?
- Try an earlier checkpoint (the final model may have diverged)
- Retrain with adjusted hyperparameters:
  - Lower learning rate: 3e-4 → 1e-4
  - Increase entropy coefficient: 0.01 → 0.05
  - More environments: 84 → 128
Implementation Details
How It Works
The comparison script (tests/compare_baseline_adaptive.py) runs the three simulations in sequence, records each trajectory, and then generates the comparison plot and console summary.
Trajectory Recording
The play_adaptive_policy.py script records the trajectory when --save-trajectory is provided.
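For example, the per-step JSON files described under Saved Files could each be produced this way (a sketch; only the script name and --save-trajectory come from this page, while --mode, as a way to select the baseline or adaptive control mode, and --model are assumed):

```shell
# Hypothetical: record one trajectory per control mode. Only the script name
# and --save-trajectory come from this page; --mode and --model are assumed.
python play_adaptive_policy.py --mode baseline \
  --save-trajectory tests/trajectory_baseline_rough.json \
  || echo "baseline playback failed"
python play_adaptive_policy.py --mode adaptive \
  --model runs/adaptive_gait_20260304_143022/final_model.zip \
  --save-trajectory tests/trajectory_adaptive_rough.json \
  || echo "adaptive playback failed"
```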
Next Steps
Retrain with Tuning
Adjust hyperparameters based on comparison results
Deploy Best Model
Use the best-performing checkpoint in ROS2 setup