DDPG produces deterministic policies that can be brittle: once training converges the policy commits to a single action for each state, losing all exploratory behaviour. Soft Actor-Critic (SAC) addresses this by augmenting the reward with an entropy bonus. The agent is incentivised to be as random as possible while still maximising reward, which encourages exploration, robustness to model errors, and often faster convergence. SAC also employs two Q-networks (twin critics) whose minimum is used to compute targets, reducing the systematic overestimation that plagues single-critic methods.

Maximum Entropy Framework

The SAC objective is:
J(π) = E [ Σ_t ( r_t + α · H(π(·|s_t)) ) ]
where H(π) = -E[log π(a|s)] is the entropy of the policy. The temperature parameter α controls the trade-off between reward maximisation and entropy maximisation. In full SAC, α is itself a learnable parameter automatically tuned to hit a target entropy level.
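
As a rough illustration of the objective, the sketch below computes the closed-form entropy of a 1-D Gaussian and the undiscounted entropy-augmented return of a short trajectory. The function names `gaussian_entropy` and `soft_return` and the sample numbers are illustrative assumptions, not part of the original code, and the entropy is that of the Gaussian before any tanh squashing.

```python
import math

def gaussian_entropy(std):
    """Differential entropy of a 1-D Gaussian: 0.5 * log(2*pi*e*std^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * std ** 2)

def soft_return(rewards, entropies, alpha):
    """Sum of (r_t + alpha * H_t) over a trajectory, as in J(pi) above."""
    return sum(r + alpha * h for r, h in zip(rewards, entropies))

# Hypothetical 3-step trajectory: identical rewards, wide vs narrow policy.
rewards = [-1.0, -0.5, 0.0]
wide = [gaussian_entropy(1.0)] * 3
narrow = [gaussian_entropy(0.1)] * 3

print(soft_return(rewards, wide, alpha=0.2))
print(soft_return(rewards, narrow, alpha=0.2))
```

With equal rewards, the wider policy scores higher under this objective, and α sets how much that extra randomness is worth relative to reward.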

Architecture Overview

1. Stochastic actor: outputs the mean μ and standard deviation σ of a Gaussian. Actions are sampled via the reparameterisation trick, a = 2 · tanh(μ + ε·σ) with ε ~ N(0, 1), so the sample stays differentiable with respect to μ and σ. The entropy term uses a corrected log-probability that accounts for the tanh squashing.

2. Twin critics: two separate Q-networks, Q1(s,a) and Q2(s,a). The target uses min(Q1, Q2) to counteract overestimation bias.

3. Target networks: each critic has a corresponding target network, updated via a soft update (τ = 0.005).

4. Automatic temperature tuning: α = exp(log_α) is a learnable scalar, optimised so that the policy entropy stays near the target entropy -dim(A).

Environment (Pendulum-v1)

Pendulum-v1 has a 3-dimensional state (cos θ, sin θ, angular velocity) and a 1-dimensional continuous action (the applied torque) in [-2, 2].
import gym

class MyWrapper(gym.Wrapper):
    def __init__(self):
        env = gym.make('Pendulum-v1', render_mode='rgb_array')
        super().__init__(env)
        self.env = env
        self.step_n = 0

    def reset(self):
        state, _ = self.env.reset()
        self.step_n = 0
        return state

    def step(self, action):
        state, reward, terminated, truncated, info = self.env.step(action)
        done = terminated or truncated
        self.step_n += 1
        if self.step_n >= 200:
            done = True
        return state, reward, done, info


env = MyWrapper()
env.reset()
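
To give a sense of the reward scale this environment produces: Pendulum's documented per-step reward is -(θ² + 0.1·θ̇² + 0.001·u²), with θ normalised to [-π, π], angular velocity limited to [-8, 8] and torque u to [-2, 2]. The helper name `pendulum_reward` below is hypothetical; it only reproduces that formula in plain Python and does not touch the environment.

```python
import math

def pendulum_reward(theta, theta_dot, torque):
    """Per-step Pendulum reward: -(theta^2 + 0.1*theta_dot^2 + 0.001*torque^2)."""
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2)

best = pendulum_reward(0.0, 0.0, 0.0)        # upright, at rest, no torque
worst = pendulum_reward(math.pi, 8.0, 2.0)   # hanging down, max speed, max torque

print(best, round(worst, 4))
```

The worst case is about -16.27, which is why the training loop further down rescales rewards with `(reward + 8) / 8` to land in roughly [-1, 1].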

Actor Network (Gaussian Policy with Reparameterisation)

The actor returns a pair (action, entropy), where entropy is the single-sample estimate -log π(a|s) rather than a closed-form value. The action is drawn with the reparameterisation trick so that gradients flow through the sampling step. Because the raw sample is passed through tanh, the log-probability must be corrected for the squashing transformation.
import torch

class ModelAction(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc_state = torch.nn.Sequential(
            torch.nn.Linear(3, 128),
            torch.nn.ReLU(),
        )
        self.fc_mu  = torch.nn.Linear(128, 1)
        self.fc_std = torch.nn.Sequential(
            torch.nn.Linear(128, 1),
            torch.nn.Softplus(),
        )

    def forward(self, state):
        # [b, 3] → [b, 128]
        state = self.fc_state(state)

        # [b, 128] → [b, 1]
        mu  = self.fc_mu(state)

        # [b, 128] → [b, 1]  (std > 0 via Softplus)
        std = self.fc_std(state)

        # Define the distribution and reparameterise
        dist   = torch.distributions.Normal(mu, std)
        sample = dist.rsample()

        # Squash sample to (-1, 1), then scale to action range [-2, 2]
        action = torch.tanh(sample)

        # Single-sample entropy estimate with tanh-squashing correction:
        #   entropy = -(log_prob(sample) - log(1 - tanh(sample)^2 + ε))
        # `action` already equals tanh(sample), so it is squared directly.
        log_prob = dist.log_prob(sample)
        entropy  = log_prob - (1 - action**2 + 1e-7).log()
        entropy  = -entropy

        return action * 2, entropy


model_action = ModelAction()

model_action(torch.randn(2, 3))
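
One way to sanity-check the squashing correction is to confirm that the corrected density still integrates to 1 over a in (-1, 1). The sketch below does this numerically in plain Python for the tanh step only (the final scaling by 2 is left out). The names `squashed_log_prob` and `total_mass` and the values μ = 0.3, σ = 0.5 are assumptions chosen for illustration.

```python
import math

def normal_log_pdf(u, mu, std):
    """Log-density of Normal(mu, std) at u."""
    z = (u - mu) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2 * math.pi)

def squashed_log_prob(u, mu, std, eps=1e-7):
    """Log-density of a = tanh(u) when u ~ Normal(mu, std)."""
    a = math.tanh(u)
    return normal_log_pdf(u, mu, std) - math.log(1 - a ** 2 + eps)

def total_mass(mu, std, n=20000):
    """Midpoint-rule integral of the squashed density over a in (-1, 1)."""
    step = 2.0 / n
    total = 0.0
    for i in range(n):
        a = -1.0 + (i + 0.5) * step
        u = math.atanh(a)  # pre-squash value that maps to a
        total += math.exp(squashed_log_prob(u, mu, std)) * step
    return total

print(total_mass(0.3, 0.5))
```

If the `- log(1 - tanh(u)^2)` term were dropped, the same integral would come out below 1, which is the error the correction removes.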

Twin Critic Networks

class ModelValue(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.sequential = torch.nn.Sequential(
            torch.nn.Linear(4, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, 1),
        )

    def forward(self, state, action):
        # Concatenate state and action: [b, 3+1] → [b, 4]
        state = torch.cat([state, action], dim=1)
        # [b, 4] → [b, 1]
        return self.sequential(state)


# Twin critics + their target copies
model_value1 = ModelValue()
model_value2 = ModelValue()

model_value_next1 = ModelValue()
model_value_next2 = ModelValue()

model_value_next1.load_state_dict(model_value1.state_dict())
model_value_next2.load_state_dict(model_value2.state_dict())

model_value1(torch.randn(2, 3), torch.randn(2, 1))
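
The reason for taking the minimum of two critics can be seen in a toy simulation. This is a simplified sketch, not SAC itself: it assumes every action has a true value of 0, each critic observes that value plus independent unit Gaussian noise, and a greedy actor picks whichever action looks best. The function name `overestimation_demo` and all its parameters are hypothetical.

```python
import random

def overestimation_demo(n_actions=5, n_trials=5000, seed=0):
    """Average value a greedy actor believes it gets when the true Q is 0."""
    rng = random.Random(seed)
    single_sum, twin_sum = 0.0, 0.0
    for _ in range(n_trials):
        q1 = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
        q2 = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
        # Single critic: the actor exploits the noisiest optimistic estimate.
        single_sum += max(q1)
        # Twin critics: an action only looks good if both critics agree.
        twin_sum += max(min(a, b) for a, b in zip(q1, q2))
    return single_sum / n_trials, twin_sum / n_trials

single, twin = overestimation_demo()
print(f"single critic: {single:.3f}, min of twin critics: {twin:.3f}")
```

In this toy setting the single-critic estimate sits well above the true value of 0, while the twin-critic minimum is noticeably lower. The minimum reduces the bias rather than removing it.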

Automatic Temperature Parameter

import math

# alpha is stored in log-space for numerical stability;
# alpha.exp() recovers the actual temperature
alpha = torch.tensor(math.log(0.01))
alpha.requires_grad = True

alpha
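
To see how the log-space parameter reacts to the entropy gap, the sketch below applies one plain gradient-descent step to the same loss form the training loop uses, `alpha.exp() * (entropy + 1)`. It is an assumption-laden simplification: the original code uses Adam, while `alpha_step` is a hypothetical helper that uses vanilla gradient descent, so only the direction of the change carries over.

```python
import math

def alpha_step(log_alpha, entropy, target_entropy=-1.0, lr=3e-4):
    """One gradient-descent step on loss = exp(log_alpha) * (entropy - target)."""
    # d(loss)/d(log_alpha) = exp(log_alpha) * (entropy - target_entropy)
    grad = math.exp(log_alpha) * (entropy - target_entropy)
    return log_alpha - lr * grad

log_alpha = math.log(0.01)
too_narrow = alpha_step(log_alpha, entropy=-2.0)  # entropy below the target
too_wide = alpha_step(log_alpha, entropy=0.5)     # entropy above the target

print(too_narrow > log_alpha, too_wide < log_alpha)
```

Entropy below the target pushes log α up, and entropy above it pushes log α down. Storing the parameter in log-space also keeps the temperature positive after any update.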

Experience Replay Buffer

import random

# Replay buffer (capacity 100 000)
datas = []

def get_action(state):
    state = torch.FloatTensor(state).reshape(1, 3)
    action, _ = model_action(state)
    return action.item()

def update_data():
    state = env.reset()
    over  = False

    while not over:
        action = get_action(state)
        next_state, reward, over, _ = env.step([action])
        datas.append((state, action, reward, next_state, over))
        state = next_state

    # Evict oldest samples beyond capacity
    while len(datas) > 100000:
        datas.pop(0)

def get_sample():
    samples = random.sample(datas, 64)

    state      = torch.FloatTensor([i[0] for i in samples]).reshape(-1, 3)
    action     = torch.FloatTensor([i[1] for i in samples]).reshape(-1, 1)
    reward     = torch.FloatTensor([i[2] for i in samples]).reshape(-1, 1)
    next_state = torch.FloatTensor([i[3] for i in samples]).reshape(-1, 3)
    over       = torch.LongTensor( [i[4] for i in samples]).reshape(-1, 1)

    return state, action, reward, next_state, over
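
As an alternative to trimming a list with `pop(0)`, a bounded `collections.deque` evicts the oldest transition automatically. The `ReplayBuffer` class below is a hypothetical sketch of that idea, not the buffer used on this page.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # deque(maxlen=...) silently drops the oldest item once full.
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        # Draw distinct indices, then look the transitions up.
        idx = random.sample(range(len(self.data)), batch_size)
        return [self.data[i] for i in idx]

    def __len__(self):
        return len(self.data)

# Tiny capacity to make the eviction visible.
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add((t, 0.0, -1.0, t + 1, False))

print([tr[0] for tr in buffer.data])
```

Indexing into the middle of a deque is O(n), so for very large buffers a preallocated ring buffer may be a better fit.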

Target Computation

The target includes the entropy bonus from the next state:
def get_target(reward, next_state, over):
    # Sample action and entropy for next_state
    action, entropy = model_action(next_state)  # [b, 1], [b, 1]

    # Min of two target critics
    target1 = model_value_next1(next_state, action)
    target2 = model_value_next2(next_state, action)
    target  = torch.min(target1, target2)

    # Entropy-augmented target: Q_target + α · H
    target  += alpha.exp() * entropy

    target  *= 0.99
    target  *= (1 - over)
    target  += reward
    return target
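
The arithmetic in `get_target` can be followed on a single transition with scalar values. The helper `soft_target` and the numbers below are illustrative assumptions; the formula mirrors the code above with γ = 0.99.

```python
def soft_target(reward, q1_next, q2_next, entropy, alpha, over, gamma=0.99):
    """reward + gamma * (1 - over) * (min(Q1', Q2') + alpha * entropy)."""
    bootstrap = min(q1_next, q2_next) + alpha * entropy
    return reward + gamma * (1 - over) * bootstrap

# Same transition, once mid-episode and once flagged as the final step.
mid_episode = soft_target(0.5, q1_next=2.0, q2_next=1.5, entropy=1.0, alpha=0.2, over=0)
last_step = soft_target(0.5, q1_next=2.0, q2_next=1.5, entropy=1.0, alpha=0.2, over=1)

print(mid_episode, last_step)
```

When `over` is 1, the bootstrap term disappears and the target reduces to the immediate reward. Since the wrapper sets `over` at the 200-step limit, that cut-off applies to time-limit endings as well.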

Actor Loss

def get_loss_action(state):
    action, entropy = model_action(state)   # [b, 1], [b, 1]

    value1 = model_value1(state, action)
    value2 = model_value2(state, action)
    value  = torch.min(value1, value2)

    # Maximise (Q + α·H) → minimise -(α·H + Q)
    loss_action  = -alpha.exp() * entropy
    loss_action -= value

    return loss_action.mean(), entropy

Soft Update and Training Loop

def soft_update(model, model_next):
    for param, param_next in zip(model.parameters(), model_next.parameters()):
        # θ_target ← 0.995 · θ_target + 0.005 · θ
        value = param_next.data * 0.995 + param.data * 0.005
        param_next.data.copy_(value)


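def test(play=False):
    # Assumed helper: this page calls test() but never defines it. This sketch
    # runs one episode with the current policy and returns the total reward.
    # `play` is accepted for call compatibility; rendering is not implemented.
    state = env.reset()
    reward_sum = 0.0
    over = False
    while not over:
        action = get_action(state)
        state, reward, over, _ = env.step([action])
        reward_sum += reward
    return reward_sum


def train():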
    optimizer_action = torch.optim.Adam(model_action.parameters(),  lr=3e-4)
    optimizer_value1 = torch.optim.Adam(model_value1.parameters(),  lr=3e-3)
    optimizer_value2 = torch.optim.Adam(model_value2.parameters(),  lr=3e-3)
    optimizer_alpha  = torch.optim.Adam([alpha], lr=3e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(100):
        update_data()

        for i in range(200):
            state, action, reward, next_state, over = get_sample()

            # Normalise rewards to roughly [-1, 1]
            # (Pendulum rewards lie in about [-16.3, 0])
            reward = (reward + 8) / 8

            target = get_target(reward, next_state, over).detach()

            # ── Critic 1 update ───────────────────────────────────────
            loss_value1 = loss_fn(model_value1(state, action), target)
            optimizer_value1.zero_grad()
            loss_value1.backward()
            optimizer_value1.step()

            # ── Critic 2 update ───────────────────────────────────────
            loss_value2 = loss_fn(model_value2(state, action), target)
            optimizer_value2.zero_grad()
            loss_value2.backward()
            optimizer_value2.step()

            # ── Actor update ──────────────────────────────────────────
            loss_action, entropy = get_loss_action(state)
            optimizer_action.zero_grad()
            loss_action.backward()
            optimizer_action.step()

            # ── Alpha (temperature) update ────────────────────────────
            # Drive entropy toward target = -1 (= -dim(action) for Pendulum)
            loss_alpha = (entropy + 1).detach() * alpha.exp()
            loss_alpha = loss_alpha.mean()
            optimizer_alpha.zero_grad()
            loss_alpha.backward()
            optimizer_alpha.step()

            # ── Soft update target networks ───────────────────────────
            soft_update(model_value1, model_value_next1)
            soft_update(model_value2, model_value_next2)

        if epoch % 10 == 0:
            print(epoch, len(datas), alpha.exp().item(),
                  sum([test(play=False) for _ in range(10)]) / 10)

train()
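
To get a feel for how slowly the target networks follow the critics with τ = 0.005, the sketch below applies the same soft-update rule to a single scalar weight. `soft_update_scalar` and `track` are hypothetical helpers, and the online weight is held fixed at 1.0 purely for illustration.

```python
def soft_update_scalar(target, online, tau=0.005):
    """theta_target <- (1 - tau) * theta_target + tau * theta_online."""
    return (1 - tau) * target + tau * online

def track(n_updates, tau=0.005):
    """Fraction of the gap closed after n updates toward a fixed online weight."""
    target, online = 0.0, 1.0
    for _ in range(n_updates):
        target = soft_update_scalar(target, online, tau)
    return target

print(track(1), track(200), track(1000))
```

After 200 updates, which is one epoch of the loop above, the target has closed roughly 63% of the gap, so the effective time constant is about 1/τ = 200 gradient steps.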

SAC vs DDPG

| Feature           | DDPG                  | SAC                          |
|-------------------|-----------------------|------------------------------|
| Policy type       | Deterministic         | Stochastic (maximum entropy) |
| Exploration       | Manual Gaussian noise | Intrinsic via entropy bonus  |
| Critics           | Single Q-network      | Twin Q-networks (min target) |
| Temperature       | N/A                   | Automatic tuning             |
| Sample efficiency | Moderate              | High                         |

The automatic alpha tuning is a key advantage of full SAC. The target entropy is set to -dim(A), which for Pendulum is -1. If the current entropy is below this target, the alpha update increases alpha to encourage more exploration; if above, alpha decreases to focus the policy.

Twin critics add computational overhead but are critical for stability. Using a single critic in SAC often leads to aggressive overestimation, causing the actor to exploit erroneous Q-values and destabilise training.
