DDPG learns a deterministic policy, which can be brittle: the policy commits to a single action for each state, and any exploration has to come from noise added by hand. Soft Actor-Critic (SAC) addresses this by augmenting the reward with an entropy bonus. The agent is incentivised to act as randomly as possible while still maximising reward, which encourages exploration, makes the policy more robust to model errors, and often speeds up convergence. SAC also employs two Q-networks (twin critics) whose minimum is used to compute targets, reducing the systematic overestimation that plagues single-critic methods.
Maximum Entropy Framework
The SAC objective is:
J(π) = E [ Σ_t ( r_t + α · H(π(·|s_t)) ) ]
where H(π) = -E[log π(a|s)] is the entropy of the policy. The temperature parameter α controls the trade-off between reward maximisation and entropy maximisation. In full SAC, α is itself a learnable parameter automatically tuned to hit a target entropy level.
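As a concrete check of the entropy definition, the sketch below estimates H = -E[log π(a|s)] for a one-dimensional Gaussian policy by sampling and compares it with the closed form 0.5 · log(2πe · σ²). It is plain Python rather than PyTorch, and the function names are illustrative; none of them appear in the tutorial code.

```python
import math
import random


def gaussian_log_prob(a, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at a."""
    z = (a - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def entropy_closed_form(sigma):
    """Differential entropy of a 1-D Gaussian: 0.5 * log(2*pi*e*sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


def entropy_monte_carlo(mu, sigma, n=100_000, seed=0):
    """Estimate H = -E[log pi(a)] by averaging -log pi(a) over samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a = rng.gauss(mu, sigma)
        total -= gaussian_log_prob(a, mu, sigma)
    return total / n


for sigma in (0.1, 1.0, 2.0):
    print(sigma, entropy_closed_form(sigma), entropy_monte_carlo(0.0, sigma))
```

Differential entropy grows with σ and becomes negative for narrow distributions (σ = 0.1 here), which is why a negative target entropy such as -dim(A) is a meaningful setting.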
Architecture Overview
Stochastic actor
Outputs the mean μ and standard deviation σ of a Gaussian. Actions are sampled via the reparameterisation trick a = 2 · tanh(μ + ε·σ) with ε ~ N(0, 1), which keeps the sample differentiable with respect to μ and σ. The entropy uses a corrected log-probability that accounts for the tanh squashing.
Twin critics
Two separate Q-networks Q1(s,a) and Q2(s,a). The target uses min(Q1, Q2) to counteract overestimation bias.
Target networks
Each critic has a corresponding target network updated via soft update (τ=0.005).
Automatic temperature tuning
α = exp(log_α) is a learnable scalar optimised so that the current entropy stays near the target entropy -dim(A).
Environment (Pendulum-v1)
Pendulum-v1 has a 3-dimensional state and a 1-dimensional continuous action in [-2, 2].
import gym

class MyWrapper(gym.Wrapper):

    def __init__(self):
        env = gym.make('Pendulum-v1', render_mode='rgb_array')
        super().__init__(env)
        self.env = env
        self.step_n = 0

    def reset(self):
        state, _ = self.env.reset()
        self.step_n = 0
        return state

    def step(self, action):
        state, reward, terminated, truncated, info = self.env.step(action)
        done = terminated or truncated
        self.step_n += 1
        # Cap every episode at 200 steps
        if self.step_n >= 200:
            done = True
        return state, reward, done, info

env = MyWrapper()
env.reset()
Actor Network (Gaussian Policy with Reparameterisation)
The actor returns a pair (action, entropy), where entropy is the single-sample estimate -log π(a|s) rather than an exact expectation. The action is sampled using the reparameterisation trick so that gradients flow through the sampling operation. Because the raw sample is passed through tanh, the log-probability must be corrected to account for the squashing transformation.
import torch

class ModelAction(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.fc_state = torch.nn.Sequential(
            torch.nn.Linear(3, 128),
            torch.nn.ReLU(),
        )
        self.fc_mu = torch.nn.Linear(128, 1)
        self.fc_std = torch.nn.Sequential(
            torch.nn.Linear(128, 1),
            torch.nn.Softplus(),
        )

    def forward(self, state):
        # [b, 3] → [b, 128]
        state = self.fc_state(state)
        # [b, 128] → [b, 1]
        mu = self.fc_mu(state)
        # [b, 128] → [b, 1] (std > 0 via Softplus)
        std = self.fc_std(state)
        # Define the distribution and reparameterise
        dist = torch.distributions.Normal(mu, std)
        sample = dist.rsample()
        # Squash sample to (-1, 1); it is scaled to [-2, 2] on return
        action = torch.tanh(sample)
        # Entropy with tanh-squashing correction:
        # entropy = -(log_prob(sample) - log(1 - tanh(sample)^2 + ε))
        # `action` already equals tanh(sample), so it is not squashed again
        log_prob = dist.log_prob(sample)
        entropy = log_prob - (1 - action**2 + 1e-7).log()
        entropy = -entropy
        return action * 2, entropy

model_action = ModelAction()
model_action(torch.randn(2, 3))
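The tanh correction is a change of variables: for a = scale · tanh(u) with u ~ N(μ, σ²), log p(a) = log N(u) - log(scale · (1 - tanh²(u))). The sketch below is a plain-Python illustration with hypothetical function names, not the tutorial's implementation. It checks numerically that the corrected density integrates to 1 over the action range, while the uncorrected Gaussian log-probability does not.

```python
import math


def normal_log_prob(u, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at u."""
    z = (u - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def squashed_log_prob(u, mu, sigma, scale=1.0, eps=1e-7):
    """Log-density of a = scale * tanh(u), where u ~ N(mu, sigma^2)."""
    jac = scale * (1 - math.tanh(u) ** 2)  # da/du
    return normal_log_prob(u, mu, sigma) - math.log(jac + eps)


def integrate_over_actions(mu, sigma, scale=2.0, corrected=True, n=20_000):
    """Midpoint-rule integral of the density over action space."""
    lo, hi = mu - 8 * sigma, mu + 8 * sigma
    du = (hi - lo) / n
    total = 0.0
    for i in range(n):
        u = lo + (i + 0.5) * du
        jac = scale * (1 - math.tanh(u) ** 2)  # da/du at this point
        log_p = normal_log_prob(u, mu, sigma)
        if corrected:
            log_p -= math.log(jac + 1e-12)
        # jac * du is the width of the matching slice in action space
        total += math.exp(log_p) * jac * du
    return total


print(integrate_over_actions(0.3, 0.5, corrected=True))
print(integrate_over_actions(0.3, 0.5, corrected=False))
```

One detail worth noting: the actor listing applies the factor of 2 after computing the correction, so its log-probability omits a constant log 2 term. A constant does not change any gradient, but it shifts the reported entropy by log 2 ≈ 0.69 relative to the true entropy of the scaled action.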
Twin Critic Networks
class ModelValue(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.sequential = torch.nn.Sequential(
            torch.nn.Linear(4, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, 1),
        )

    def forward(self, state, action):
        # Concatenate state and action: [b, 3+1] → [b, 4]
        state = torch.cat([state, action], dim=1)
        # [b, 4] → [b, 1]
        return self.sequential(state)

# Twin critics + their target copies
model_value1 = ModelValue()
model_value2 = ModelValue()
model_value_next1 = ModelValue()
model_value_next2 = ModelValue()
model_value_next1.load_state_dict(model_value1.state_dict())
model_value_next2.load_state_dict(model_value2.state_dict())
model_value1(torch.randn(2, 3), torch.randn(2, 1))
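To see why taking the minimum of two critics helps, consider a toy setting (an assumption made for illustration, not part of the tutorial): a handful of discrete actions whose true value is 0, with each critic adding independent Gaussian noise. Maximising over a single noisy critic gives a positive average, which is pure overestimation. Maximising over the element-wise minimum of two critics gives a smaller one.

```python
import random


def overestimation_bias(n_actions=10, noise_std=1.0, trials=5000, seed=0):
    """Average of max-over-actions for one noisy critic vs. the twin minimum.

    Every action's true value is 0, so any positive average is bias.
    Returns (single_critic_bias, twin_critic_bias).
    """
    rng = random.Random(seed)
    single_total = 0.0
    twin_total = 0.0
    for _ in range(trials):
        q1 = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        q2 = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        single_total += max(q1)
        twin_total += max(min(a, b) for a, b in zip(q1, q2))
    return single_total / trials, twin_total / trials


single, twin = overestimation_bias()
print("single critic:", single, "twin critics:", twin)
```

In real training the two critics learn from the same replay data, so their errors are correlated and the reduction is smaller than in this independent-noise toy.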
Automatic Temperature Parameter
import math
# alpha is stored in log-space for numerical stability;
# alpha.exp() recovers the actual temperature
alpha = torch.tensor(math.log(0.01))
alpha.requires_grad = True
alpha
Experience Replay Buffer
import random

# Replay buffer (capacity 100 000)
datas = []

def get_action(state):
    state = torch.FloatTensor(state).reshape(1, 3)
    action, _ = model_action(state)
    return action.item()

def update_data():
    state = env.reset()
    over = False
    while not over:
        action = get_action(state)
        next_state, reward, over, _ = env.step([action])
        datas.append((state, action, reward, next_state, over))
        state = next_state
    # Evict oldest samples beyond capacity
    while len(datas) > 100000:
        datas.pop(0)

def get_sample():
    samples = random.sample(datas, 64)
    state = torch.FloatTensor([i[0] for i in samples]).reshape(-1, 3)
    action = torch.FloatTensor([i[1] for i in samples]).reshape(-1, 1)
    reward = torch.FloatTensor([i[2] for i in samples]).reshape(-1, 1)
    next_state = torch.FloatTensor([i[3] for i in samples]).reshape(-1, 3)
    over = torch.LongTensor([i[4] for i in samples]).reshape(-1, 1)
    return state, action, reward, next_state, over
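The list-based buffer evicts with `datas.pop(0)`, which shifts every remaining element on each call. A possible alternative, sketched here with a hypothetical `ReplayBuffer` class that is not part of the tutorial, is `collections.deque` with `maxlen`, which drops the oldest transition automatically once the capacity is reached.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer; the class name and API are illustrative."""

    def __init__(self, capacity=100_000):
        # deque(maxlen=...) discards the oldest item when a new one is appended
        self.data = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, over):
        self.data.append((state, action, reward, next_state, over))

    def sample(self, batch_size=64):
        # Uniform sampling without replacement
        return random.sample(self.data, batch_size)

    def __len__(self):
        return len(self.data)


buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.add(state=i, action=0.0, reward=-1.0, next_state=i + 1, over=False)
print(len(buffer), [t[0] for t in buffer.data])
```

With a capacity of 3 and five insertions, only the three most recent transitions remain.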
Target Computation
The target includes the entropy bonus from the next state:
def get_target(reward, next_state, over):
    # Sample action and entropy for next_state
    action, entropy = model_action(next_state)  # [b, 1], [b, 1]
    # Min of two target critics
    target1 = model_value_next1(next_state, action)
    target2 = model_value_next2(next_state, action)
    target = torch.min(target1, target2)
    # Entropy-augmented target: Q_target + α · H
    target += alpha.exp() * entropy
    target *= 0.99
    target *= (1 - over)
    target += reward
    return target
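Written as a single formula, the target computed above is y = r + γ · (1 - over) · (min(Q1', Q2') + α · H). The scalar sketch below mirrors that arithmetic for one transition; the function name and the example numbers are made up for illustration.

```python
def soft_td_target(reward, q1_next, q2_next, entropy, alpha, over, gamma=0.99):
    """y = r + gamma * (1 - over) * (min(Q1', Q2') + alpha * H)."""
    soft_value = min(q1_next, q2_next) + alpha * entropy
    return reward + gamma * (1 - over) * soft_value


# Non-terminal step: 0.5 + 0.99 * (0.8 + 0.1 * 2.0), approximately 1.49
print(soft_td_target(0.5, q1_next=1.0, q2_next=0.8, entropy=2.0, alpha=0.1, over=0))

# Terminal step: the bootstrap term is masked out, leaving only the reward
print(soft_td_target(0.5, q1_next=1.0, q2_next=0.8, entropy=2.0, alpha=0.1, over=1))
```

The `over` mask removes both the next-state value and its entropy bonus, so a terminal transition contributes only its immediate reward.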
Actor Loss
def get_loss_action(state):
    action, entropy = model_action(state)  # [b, 1], [b, 1]
    value1 = model_value1(state, action)
    value2 = model_value2(state, action)
    value = torch.min(value1, value2)
    # Maximise (Q + α·H) → minimise -(α·H + Q)
    loss_action = -alpha.exp() * entropy
    loss_action -= value
    return loss_action.mean(), entropy
Soft Update and Training Loop
def soft_update(model, model_next):
    for param, param_next in zip(model.parameters(), model_next.parameters()):
        # θ_target ← 0.995 · θ_target + 0.005 · θ
        value = param_next.data * 0.995 + param.data * 0.005
        param_next.data.copy_(value)
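To get a feel for how slowly τ = 0.005 moves the target network, the scalar sketch below (illustrative names, plain Python) applies the same update to a single weight. If the online weight stays fixed, the gap between target and online shrinks by a factor of (1 - τ) per step, so it takes roughly 139 updates to halve.

```python
def soft_update_scalar(target, online, tau=0.005):
    """One Polyak step: target <- (1 - tau) * target + tau * online."""
    return (1 - tau) * target + tau * online


def gap_after(steps, tau=0.005):
    """Fraction of the initial target/online gap left after `steps` updates,
    assuming the online weight does not change: (1 - tau) ** steps."""
    return (1 - tau) ** steps


target, online = 0.0, 1.0
for _ in range(200):  # one epoch of the training loop performs 200 updates
    target = soft_update_scalar(target, online)
print(target, 1 - gap_after(200))
```

Both printed values agree: after the 200 updates of one epoch the target has covered a little under two thirds of the distance to the online weight.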
def train():
    optimizer_action = torch.optim.Adam(model_action.parameters(), lr=3e-4)
    optimizer_value1 = torch.optim.Adam(model_value1.parameters(), lr=3e-3)
    optimizer_value2 = torch.optim.Adam(model_value2.parameters(), lr=3e-3)
    optimizer_alpha = torch.optim.Adam([alpha], lr=3e-4)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(100):
        update_data()
        for i in range(200):
            state, action, reward, next_state, over = get_sample()
            # Normalise rewards: Pendulum rewards lie in roughly [-16.3, 0],
            # so this maps them to about [-1, 1]
            reward = (reward + 8) / 8
            target = get_target(reward, next_state, over).detach()
            # ── Critic 1 update ───────────────────────────────────────
            loss_value1 = loss_fn(model_value1(state, action), target)
            optimizer_value1.zero_grad()
            loss_value1.backward()
            optimizer_value1.step()
            # ── Critic 2 update ───────────────────────────────────────
            loss_value2 = loss_fn(model_value2(state, action), target)
            optimizer_value2.zero_grad()
            loss_value2.backward()
            optimizer_value2.step()
            # ── Actor update ──────────────────────────────────────────
            loss_action, entropy = get_loss_action(state)
            optimizer_action.zero_grad()
            loss_action.backward()
            optimizer_action.step()
            # ── Alpha (temperature) update ────────────────────────────
            # Drive entropy toward target = -1 (= -dim(action) for Pendulum)
            loss_alpha = (entropy + 1).detach() * alpha.exp()
            loss_alpha = loss_alpha.mean()
            optimizer_alpha.zero_grad()
            loss_alpha.backward()
            optimizer_alpha.step()
            # ── Soft update target networks ───────────────────────────
            soft_update(model_value1, model_value_next1)
            soft_update(model_value2, model_value_next2)
        if epoch % 10 == 0:
            # test() is not defined in this section; it is assumed to play one
            # evaluation episode and return the total reward
            print(epoch, len(datas), alpha.exp().item(),
                  sum([test(play=False) for _ in range(10)]) / 10)

train()
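The training loop calls `test(play=False)`, which this section does not define. A minimal sketch follows, assuming it plays a single episode with the current policy and returns the summed reward. The environment and policy are passed in as arguments so the snippet runs on its own; `run_episode` and `CountdownEnv` are hypothetical names introduced only for this illustration.

```python
def run_episode(env, act):
    """Play one episode with policy `act` and return the summed reward.

    Assumes the wrapper interface used in this tutorial:
    env.reset() -> state, env.step([action]) -> (state, reward, over, info).
    """
    state = env.reset()
    total_reward = 0.0
    over = False
    while not over:
        action = act(state)
        state, reward, over, _ = env.step([action])
        total_reward += reward
    return total_reward


class CountdownEnv:
    """Stand-in environment for the demo: 5 steps, reward -1 per step."""

    def reset(self):
        self.steps_left = 5
        return 0.0

    def step(self, action):
        self.steps_left -= 1
        return 0.0, -1.0, self.steps_left == 0, {}


print(run_episode(CountdownEnv(), act=lambda state: 0.0))
```

With the objects defined earlier on this page, `test` could be a thin wrapper that calls `run_episode(env, get_action)`. The `play` flag presumably toggles rendering, which this sketch leaves out.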
SAC vs DDPG
| Feature | DDPG | SAC |
|---|---|---|
| Policy type | Deterministic | Stochastic (maximum entropy) |
| Exploration | Manual Gaussian noise | Intrinsic via entropy bonus |
| Critics | Single Q-network | Twin Q-networks (min target) |
| Temperature | N/A | Automatic tuning |
| Sample efficiency | Moderate | High |
The automatic alpha tuning is a key advantage of full SAC. The target entropy is set to -dim(A), which is -1 for Pendulum. If the current entropy is below this target, the alpha update increases alpha to encourage more exploration; if it is above, alpha decreases so the policy can become more focused.
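The direction of the alpha update follows from the gradient of the temperature loss α · (H - H_target) with respect to log α. The sketch below uses a plain gradient-descent step instead of the Adam optimiser used in training, purely to make the sign visible; `alpha_step` is an illustrative name.

```python
import math


def alpha_step(log_alpha, entropy, target_entropy=-1.0, lr=3e-4):
    """One gradient-descent step on loss = exp(log_alpha) * (entropy - target).

    d(loss)/d(log_alpha) = exp(log_alpha) * (entropy - target_entropy)
    """
    grad = math.exp(log_alpha) * (entropy - target_entropy)
    return log_alpha - lr * grad


log_alpha = math.log(0.01)
# Entropy below the target (-2 < -1): log_alpha rises, so alpha grows
print(math.exp(alpha_step(log_alpha, entropy=-2.0)))
# Entropy above the target (0.5 > -1): log_alpha falls, so alpha shrinks
print(math.exp(alpha_step(log_alpha, entropy=0.5)))
```

Because the optimised variable is log α, the temperature α = exp(log α) stays positive no matter how large a step is taken.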
Twin critics add computational overhead but are critical for stability. Using a single critic in SAC often leads to aggressive overestimation, causing the actor to exploit erroneous Q-values and destabilise training.