Documentation Index
Fetch the complete documentation index at: https://mintlify.com/jackvice/RoboTerrain/llms.txt
Use this file to discover all available pages before exploring further.
The reward function is the primary mechanism by which the rover learns to navigate efficiently to goals while avoiding collisions and pathological behaviors like spinning in place or getting stuck. Every component is computed inside task_reward() in ros2_ws/src/sb3/environments/rover_env.py, which is called once per environment step after the rover publishes a velocity command. The function balances five competing objectives: goal-seeking, obstacle avoidance, forward progress, turning efficiency, and episode termination detection.
Reward Components
Goal Reward
When the rover reaches within success_distance (0.5 m) of the current target, the episode does not terminate — instead the target is immediately moved to a new random position and the agent receives a large positive reward. This encourages the agent to chain multiple goals within a single episode.
| Condition | Value |
|---|
current_distance < 0.5 m | +100.0 |
if current_distance < success_distance: # success_distance = 0.5
self.update_target_pos()
return self.goal_reward # goal_reward = 100
Distance Reward
Progress toward the goal is rewarded only when two conditions are both satisfied: the rover has moved a meaningful distance (|distance_delta| > 0.001 m) and the heading alignment is positive (heading_reward > 0.002, i.e., facing within ~90° of the target). This prevents the agent from earning distance reward by backing away from a target while facing it.
| Condition | Formula |
|---|
| Good heading + meaningful movement | distance_delta × distance_delta_scale × final_reward_multiplier |
| Otherwise (idle or bad heading) | −0.03 (idle penalty) |
distance_delta = self.previous_distance - current_distance # positive = got closer
if abs(distance_delta) > 0.001 and heading_reward > 0.002:
distance_reward = distance_delta * distance_delta_scale # × 0.3
else:
distance_reward = -0.03 # idle penalty
# Applied with final_reward_multiplier = 1.5
reward = (distance_reward * final_reward_multiplier) + ...
The effective gain per metre of progress toward the goal is 0.3 × 1.5 = 0.45.
Velocity Bonus
A small bonus proportional to the rover’s actual measured linear velocity (from odometry) is added to encourage the agent to maintain forward momentum rather than creeping:
velocity_bonus = self.current_linear_velocity * 0.0025
At maximum speed (~1.0 m/s), this contributes +0.0025 per step — small enough not to dominate but sufficient to break ties in favor of faster movement.
Turn Penalty
Unnecessary rotation is penalized proportionally to the measured angular velocity. This discourages spinning in place and rewards efficient straight-line driving:
turning_penalty_coeff = 0.0015
turn_penalty = -turning_penalty_coeff * abs(self.current_angular_velocity)
At the PID controller’s maximum output of 7.0 rad/s, this penalty reaches −0.0105 per step.
Collision Penalty
On each step, the minimum finite LIDAR reading is checked. If anything is closer than collision_threshold (0.3 m), a flat penalty is returned immediately — no other reward components are computed for that step.
| Condition | Value |
|---|
min(lidar_data) < 0.3 m | −0.5 |
min_distance = np.min(self.lidar_data[np.isfinite(self.lidar_data)])
if min_distance < collision_threshold: # collision_threshold = 0.3
return collision_penalty # -0.5
The collision_count counter also increments each time this fires. If collision_count exceeds 600 steps the episode terminates with general_penalty (see Episode Termination Conditions).
Idle Penalty
When the rover is either not moving meaningfully (|distance_delta| ≤ 0.001 m) or has a poor heading alignment (heading_reward ≤ 0.002), the distance reward is replaced by an idle penalty of −0.03. This creates constant pressure to make progress, discouraging the agent from hovering near an obstacle or oscillating.
Stuck Penalty
A sliding window of the last stuck_window (600) step positions is maintained. If the total displacement over that window is less than stuck_threshold (0.001 m), the episode terminates with a severe penalty:
| Condition | Value |
|---|
distance_moved < 0.001 m over 600 steps | −25.0 → episode ends |
if len(self.position_history) >= self.stuck_window: # 600 steps
start_pos = self.position_history[0]
end_pos = self.position_history[-1]
distance_moved = math.sqrt((end_pos[0] - start_pos[0])**2 +
(end_pos[1] - start_pos[1])**2)
if distance_moved < self.stuck_threshold: # 0.001 m
return self.get_observation(), self.stuck_penilty, True, False, {}
# stuck_penilty = -25.0
General Penalty
A general terminal penalty of −20.0 is applied for two severe failure modes:
- Robot flipped: roll or pitch exceeds 85° (
1.48 rad) and the flip occurred after step 500 (early flips from spawn are penalized 0 to avoid discouraging exploration).
- Stuck in collision:
collision_count exceeds stuck_window (600) — the rover has been in continuous collision proximity for over 600 consecutive steps.
| Condition | Value |
|---|
| Flipped (>85°, step > 500) | −20.0 → episode ends |
| Stuck in collision (> 600 steps) | −20.0 → episode ends |
Out-of-Bounds Penalty
Each world defines a rectangular boundary. If the rover’s position exceeds those bounds, the episode terminates with:
| Condition | Value |
|---|
| Position outside world boundary | −10.0 → episode ends |
World boundaries are set per environment at construction time:
| World | X bounds | Y bounds |
|---|
inspect | [−29, −13] | [−29, −17] |
moon (island) | [−20, 10] | [−20, 20] |
maze (default) | [−30, 30] | [−30, 30] |
Reward Function Source
The complete task_reward() method as implemented in rover_env.py:
def task_reward(self):
"""
Reward function that accounts for robot dynamics and gradual acceleration
"""
# Constants
final_reward_multiplier = 1.5
collision_threshold = 0.3
collision_penalty = -0.5
success_distance = 0.5
distance_delta_scale = 0.3
heading_tolerance = math.pi / 4 # 45 degrees
# Get current state info
distance_heading_info = self.get_target_info()
current_distance = distance_heading_info[0]
heading_diff = distance_heading_info[1]
# Initialize previous distance if needed
if self.previous_distance is None:
self.previous_distance = current_distance
return 0.0
# Check for goal achievement
if current_distance < success_distance:
self.update_target_pos()
return self.goal_reward
# Check for collisions
min_distance = np.min(self.lidar_data[np.isfinite(self.lidar_data)])
if min_distance < collision_threshold:
print('Collision!')
return collision_penalty
# Calculate distance change (positive means got closer)
distance_delta = self.previous_distance - current_distance
# Heading component
heading_diff = math.atan2(math.sin(heading_diff), math.cos(heading_diff))
abs_heading_diff = abs(heading_diff)
if abs_heading_diff <= math.pi / 2:
# From 0 to 90 degrees: scale from 1.0 to 0.0
heading_alignment = 1.0 - (2 * abs_heading_diff / math.pi)
else:
# From 90 to 180 degrees: scale from 0.0 to -1.0
heading_alignment = -2 * (abs_heading_diff - math.pi / 2) / math.pi
heading_reward = 0.01 * heading_alignment
# Distance component
if abs(distance_delta) > 0.001 and heading_reward > 0.002:
distance_reward = distance_delta * distance_delta_scale
else:
distance_reward = -0.03 # idle penalty
# Turn penalty
turning_penalty_coeff = 0.0015
turn_penalty = -turning_penalty_coeff * abs(self.current_angular_velocity)
# Combine rewards
reward = (
(distance_reward * final_reward_multiplier)
+ (self.current_linear_velocity * 0.0025)
+ turn_penalty
)
self.previous_distance = current_distance
return reward
Heading Alignment
The heading_alignment scalar converts the angular error between the rover’s current heading and the direction to the goal into a value in [−1, 1]. It uses a piecewise linear function with a break point at 90°:
heading_alignment
1.0 ┤╲
│ ╲
0.5 ┤ ╲
│ ╲
0.0 ┤────╲────────────── angle
0° 90° 180°
-0.5 ┤ ╲
│ ╲
-1.0 ┤ ╲
| Heading error range | heading_alignment formula | Result range | | |
|---|
| 0° – 90° | `1.0 − (2 × | θ | / π)` | 1.0 → 0.0 |
| 90° – 180° | `−2 × ( | θ | − π/2) / π` | 0.0 → −1.0 |
A heading_reward threshold of 0.002 is used as the gate for distance reward eligibility. Back-calculating: 0.01 × heading_alignment > 0.002 → heading_alignment > 0.2 → the rover must be within ~72° of the target bearing to earn forward progress reward.
Heading alignment is computed after normalizing heading_diff into [−π, π] using math.atan2(sin, cos). This prevents wrap-around discontinuities from affecting the reward signal.
Tunable Constants
| Constant | Default Value | Effect on Training |
|---|
collision_threshold | 0.3 m | Distance at which a collision penalty fires. Increase for a larger safety margin; decrease to allow the rover to operate closer to obstacles. |
collision_penalty | −0.5 | Per-step cost of a near-collision. Increasing magnitude makes the agent more collision-averse but can slow progress in tight spaces. |
success_distance | 0.5 m | Goal tolerance radius. Larger values make goals easier to claim; smaller values require more precise navigation. |
distance_delta_scale | 0.3 | Scales distance-progress reward before applying final_reward_multiplier. Higher values increase the reward gradient toward the goal. |
final_reward_multiplier | 1.5 | Global multiplier applied to distance_reward. Scales overall reward magnitude without changing relative component weights. |
goal_reward | 100 | Reward for reaching the goal. Should remain substantially larger than the maximum possible accumulated reward over a short episode to keep goal-seeking as the dominant objective. |
stuck_penalty | −25.0 | Terminal penalty for being stuck. Larger magnitude strongly discourages stationary behavior but can cause instability early in training. (Source variable is spelled stuck_penilty — typo in rover_env.py.) |
stuck_threshold | 0.001 m | Minimum displacement over stuck_window steps before a stuck penalty fires. |
stuck_window | 600 steps | Look-back window for stuck detection. Shorter windows detect stuck faster but may false-positive during deliberate slow traversal. |
turning_penalty_coeff | 0.0015 | Coefficient on angular velocity penalty. Tune upward to strongly discourage spinning; tune downward if the rover needs to make sharp turns frequently. |
Not exposed as a named constant in source but hardcoded as current_linear_velocity × 0.0025. Increase to encourage faster forward movement; decrease if the rover tends to over-speed into obstacles.
Applied per step when neither meaningful movement nor good heading alignment is detected. Acts as a constant pressure to make progress.
Terminal penalty for flip or extended collision events. Set at self.general_penility = -20 in __init__. (Source variable is spelled general_penility — typo in rover_env.py.)
Out-of-bounds terminal penalty. Set at self.too_far_away_penilty = -10 in __init__.
PID Controller
The rover’s heading is controlled by a PID controller (heading_controller()) that converts the desired absolute heading into an angular velocity command, clamped to max_angular_velocity. The action space outputs a relative heading offset, which is added to the current yaw before being fed into the PID.
# PID gains (set in __init__)
self.Kp = 2.0 # Proportional — main correction term
self.Ki = 0.0 # Integral — disabled (avoids wind-up on terrain)
self.Kd = 0.1 # Derivative — damping to reduce overshoot
self.max_angular_velocity = 7.0 # rad/s — output clamp
def heading_controller(self, desired_heading, current_heading):
now = time.time()
dt = now - self.last_time
self.last_time = now
# Shortest-path angular error in [-π, π]
error = math.atan2(
math.sin(desired_heading - current_heading),
math.cos(desired_heading - current_heading)
)
self.integral_error += error * dt
derivative_error = (error - self.last_error) / dt
self.last_error = error
control = (
self.Kp * error
+ self.Ki * self.integral_error
+ self.Kd * derivative_error
)
return np.clip(control, -self.max_angular_velocity, self.max_angular_velocity)
| Parameter | Value | Notes |
|---|
Kp | 2.0 | High proportional gain for responsive heading correction. |
Ki | 0.0 | Integral disabled — avoids integrator wind-up on uneven terrain where persistent heading error is expected. |
Kd | 0.1 | Light derivative damping reduces overshoot during sharp turns. |
max_angular_velocity | 7.0 rad/s | Hard clamp on PID output sent to cmd_vel. |
The integral term Ki = 0.0 means steady-state heading error is not corrected automatically. If your terrain consistently biases the rover off-course, consider enabling a small Ki (e.g., 0.01) along with an integrator wind-up guard.
Episode Termination Conditions
An episode ends (returning terminated=True) under any of the following conditions, checked in priority order within step():
-
Max steps reached —
_step >= length (default: 2000 steps). Returns done=True with the final computed reward; not a failure condition.
-
Stuck in collision —
collision_count > stuck_window (> 600 steps in near-collision proximity). Returns general_penalty (−20.0).
-
Robot flipped —
|roll| > 1.48 rad (~85°) or |pitch| > 1.48 rad. Returns general_penalty (−20.0) if the flip occurred after step 500; returns reward 0 for early-episode flips to avoid penalising random spawn orientations.
-
Robot stuck — total displacement over the last 600 steps is less than 0.001 m. Returns
stuck_penilty (−25.0).
-
Out of bounds — rover position exceeds world boundary box. Returns
too_far_away_penilty (−10.0).
# Priority order in step():
# 1. Stuck in collision
if self.collision_count > self.stuck_window: # > 600
return obs, self.general_penility, True, False, {} # -20.0
# 2. Flip detection
flip_status = self.is_robot_flipped() # |roll| or |pitch| > 1.48 rad
if flip_status:
if self._step > 500:
return obs, self.general_penility, True, False, {} # -20.0
else:
return obs, 0, True, False, {}
# 3. Stuck detection (sliding window over stuck_window=600 steps)
if distance_moved < self.stuck_threshold: # < 0.001 m
return obs, self.stuck_penilty, True, False, {} # -25.0
# 4. Out of bounds
if self.too_far_away():
return obs, self.too_far_away_penilty, True, False, {} # -10.0
# 5. Max steps (inside task_reward, done flag set after reward computation)
done = (self._step >= self._length) # default length = 2000
When starting a new training run, use ent_coef="auto_0.5" in your SAC or PPO configuration to let the algorithm self-tune its entropy coefficient. Monitor the entropy in TensorBoard: if it collapses below 0.5 early in training the agent is likely over-committing to suboptimal behaviors (e.g., always rotating). If entropy stays high beyond 500k steps, consider increasing the turning_penalty_coeff or idle_penalty to push the agent toward more decisive goal-seeking.