Deep Q-Networks (DQN) combine classical Q-learning with deep learning by using a neural network to approximate the action-value function Q(s, a). Rather than storing Q-values in a table, the network generalises across a continuous state space. Two techniques make training stable: an experience replay buffer that breaks temporal correlations in the data, and a periodically frozen target network that provides stable regression targets. This page walks through the single-model CartPole baseline, then shows how Double DQN and Dueling DQN improve on the original design using the Pendulum environment.

Environment Setup

The single-model DQN notebook wraps CartPole in a MyWrapper class that returns only the state from reset(), merges terminated and truncated into a single done flag in step(), and caps episodes at 200 steps.
import gym

class MyWrapper(gym.Wrapper):

    def __init__(self):
        env = gym.make('CartPole-v1', render_mode='rgb_array')
        super().__init__(env)
        self.env = env
        self.step_n = 0

    def reset(self):
        state, _ = self.env.reset()
        self.step_n = 0
        return state

    def step(self, action):
        state, reward, terminated, truncated, info = self.env.step(action)
        done = terminated or truncated
        self.step_n += 1
        if self.step_n >= 200:
            done = True
        return state, reward, done, info


env = MyWrapper()
env.reset()
CartPole-v1 has a 4-dimensional state (position, velocity, angle, angular velocity) and 2 discrete actions (push left / push right).

Key Concepts

1. Q-Network

A neural network maps state → Q-values for every action. The agent picks the action with the highest Q-value (greedy), or explores randomly with probability ε (epsilon-greedy).

2. Experience Replay

Transitions (s, a, r, s', done) are stored in a Python list called datas. At each training step a random mini-batch is drawn from this buffer, breaking harmful temporal correlations.

3. Target Network

A second network with frozen weights provides the TD targets. Its weights are periodically hard-copied from the online network, preventing the “chasing a moving target” instability.

4. TD Update

The loss is MSE(Q(s,a), r + γ·max_a' Q_target(s',a')). Minimising this loss nudges the online network toward the Bellman optimality equation.
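To make the TD update concrete, here is a small worked example in plain Python. It is an illustrative sketch, not code from the notebook: the function name td_target and the reward and Q-values are made up, while γ = 0.98 matches the discount used in the code on this page.

```python
GAMMA = 0.98  # discount factor used throughout this page

def td_target(reward, next_q_values, done):
    """Return r + gamma * max_a' Q(s', a'), or just r when the episode has ended."""
    if done:
        return reward
    return reward + GAMMA * max(next_q_values)

# Illustrative numbers, not taken from a real run
print(td_target(1.0, [10.0, 12.0], done=False))  # 1 + 0.98 * 12
print(td_target(1.0, [10.0, 12.0], done=True))   # terminal: no bootstrap term
```

The online network's prediction Q(s, a) is then regressed toward this value with a mean-squared-error loss.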

Single-Model DQN (Baseline)

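The simplest variant uses one network for both action selection and target computation. There is no separate target network, so the TD targets come from the same model that is being trained.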

Network Architecture

import torch

# Q-network: state (4) → hidden (128) → Q-values (2)
model = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
)

Epsilon-Greedy Action Selection

import random

def get_action(state):
    if random.random() < 0.01:
        return random.choice([0, 1])

    # Forward pass through the network
    state = torch.FloatTensor(state).reshape(1, 4)
    return model(state).argmax().item()
The exploration rate here is fixed at 1%. In practice you would anneal ε from a high value (e.g. 1.0) down to a small floor during training.
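One possible annealing schedule is a linear decay. The sketch below is an illustration rather than code from the notebook; the function name linear_epsilon and the default decay_steps value are assumptions chosen for the example.

```python
def linear_epsilon(step, start=1.0, end=0.01, decay_steps=10_000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps`, then hold at `end`."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)

# Epsilon at a few points in training (step counts are illustrative)
for step in (0, 5_000, 10_000, 20_000):
    print(step, round(linear_epsilon(step), 3))
```

To use it, get_action would compare random.random() against linear_epsilon(step) instead of the constant 0.01, which requires tracking a global step counter.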

Experience Replay Buffer

# Replay buffer — a plain Python list
datas = []

def update_data():
    old_count = len(datas)

    # Collect at least 200 new transitions
    while len(datas) - old_count < 200:
        state = env.reset()
        over = False
        while not over:
            action = get_action(state)
            next_state, reward, over, _ = env.step(action)
            datas.append((state, action, reward, next_state, over))
            state = next_state

    update_count = len(datas) - old_count
    drop_count = max(len(datas) - 10000, 0)

    # Evict oldest samples beyond the 10 000 cap
    while len(datas) > 10000:
        datas.pop(0)

    return update_count, drop_count
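As an alternative to evicting with datas.pop(0), which shifts the whole list on every call, a bounded collections.deque drops the oldest item automatically. This is a sketch of that alternative, not the notebook's implementation; the name buffer and the dummy transitions are assumptions for illustration.

```python
from collections import deque
import random

# Capacity chosen to match the 10 000 cap used by update_data
buffer = deque(maxlen=10_000)

# Fill with dummy transitions shaped like (state, action, reward, next_state, over)
for t in range(10_050):
    buffer.append((t, 0, 1.0, t + 1, False))

print(len(buffer))    # never exceeds maxlen
print(buffer[0][0])   # index of the oldest surviving transition

# Uniform mini-batch; converting to a list avoids slow random access into the deque
batch = random.sample(list(buffer), 64)
```

The trade-off is that the number of dropped samples is no longer reported explicitly, so drop_count would have to be computed before appending.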

Sampling a Mini-Batch

def get_sample():
    samples = random.sample(datas, 64)

    state      = torch.FloatTensor([i[0] for i in samples])   # [b, 4]
    action     = torch.LongTensor( [i[1] for i in samples])   # [b]
    reward     = torch.FloatTensor([i[2] for i in samples])   # [b]
    next_state = torch.FloatTensor([i[3] for i in samples])   # [b, 4]
    over       = torch.LongTensor( [i[4] for i in samples])   # [b]

    return state, action, reward, next_state, over

Computing Q-Values and TD Targets

def get_value(state, action):
    # [b, 4] → [b, 2] → [b]
    value = model(state)
    value = value[range(64), action]
    return value

def get_target(reward, next_state, over):
    with torch.no_grad():
        target = model(next_state)          # [b, 4] → [b, 2]

    target = target.max(dim=1)[0]           # [b]

    # Zero out terminal states
    for i in range(64):
        if over[i]:
            target[i] = 0

    target *= 0.98      # discount factor γ
    target += reward
    return target
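The per-sample loop that zeroes terminal states can also be written as a single tensor expression, which removes the hard-coded batch size of 64. The following is a hedged sketch of that idea: the function name get_target_vectorised and the explicit model argument are assumptions, and the inputs are random data used only as a smoke test.

```python
import torch

GAMMA = 0.98  # discount factor used on this page

def get_target_vectorised(model, reward, next_state, over):
    """TD target r + gamma * max_a' Q(s', a'), with the bootstrap term zeroed at terminal states."""
    with torch.no_grad():
        next_q = model(next_state).max(dim=1).values     # [b]
    return reward + GAMMA * next_q * (1 - over.float())  # [b]

# Smoke test with random data and the same layer sizes as the CartPole model
model = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
)
reward = torch.ones(8)
next_state = torch.randn(8, 4)
over = torch.tensor([0, 1, 0, 0, 1, 0, 0, 0])
target = get_target_vectorised(model, reward, next_state, over)
print(target.shape)
```

For terminal transitions the target reduces to the reward alone, exactly as in the loop version.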

Training Loop

def train():
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
    loss_fn   = torch.nn.MSELoss()

    for epoch in range(500):
        update_count, drop_count = update_data()

        for i in range(200):
            state, action, reward, next_state, over = get_sample()

            value  = get_value(state, action)
            target = get_target(reward, next_state, over)

            loss = loss_fn(value, target)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

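        # Every 50 epochs, print buffer statistics and the mean return over
        # 20 evaluation episodes. test() is not defined on this page; it is
        # expected to play one episode and return its total reward.
        if epoch % 50 == 0:
            print(epoch, len(datas), update_count, drop_count,
                  sum([test(play=False) for _ in range(20)]) / 20)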

train()

Double DQN

The Double DQN and Dueling DQN variants switch to the Pendulum-v1 environment, which has a 3-dimensional state (cos θ, sin θ, angular velocity) and a single continuous torque action. Because DQN requires a discrete action space, the torque range [-2, 2] is discretised into 11 evenly spaced values.
import gym

class MyWrapper(gym.Wrapper):
    def __init__(self):
        env = gym.make('Pendulum-v1', render_mode='rgb_array')
        super().__init__(env)
        self.env = env
        self.step_n = 0

    def reset(self):
        state, _ = self.env.reset()
        self.step_n = 0
        return state

    def step(self, action):
        state, reward, terminated, truncated, info = self.env.step(action)
        done = terminated or truncated
        self.step_n += 1
        if self.step_n >= 200:
            done = True
        return state, reward, done, info


env = MyWrapper()
env.reset()
Vanilla DQN tends to overestimate Q-values because the max in the TD target uses the same network to both select and evaluate the next action, so any positive estimation noise is picked up by the max. Double DQN decouples these two decisions:
  • Action selection → online network (model)
  • Action evaluation → target network (next_model)
Both networks have input dimension 3 (Pendulum state) and output 11 (discretised actions).
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(3, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 11),
)

next_model = torch.nn.Sequential(
    torch.nn.Linear(3, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 11),
)
next_model.load_state_dict(model.state_dict())
Actions are selected by the online network, then mapped back to a continuous value for the environment:
import random

def get_action(state):
    state = torch.FloatTensor(state).reshape(1, 3)
    action = model(state).argmax().item()

    if random.random() < 0.01:
        action = random.choice(range(11))

    # Map discrete bin index to continuous action in [-2, 2]
    action_continuous = action / 10 * 4 - 2
    return action, action_continuous
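The index-to-torque mapping can be checked by listing all 11 values. The helper below is an illustrative generalisation of the expression action / 10 * 4 - 2; the name to_torque and its parameters are assumptions, not part of the notebook.

```python
N_BINS = 11

def to_torque(index, low=-2.0, high=2.0, n_bins=N_BINS):
    """Map a bin index in [0, n_bins - 1] to an evenly spaced value in [low, high]."""
    return low + index * (high - low) / (n_bins - 1)

# Rounded for display; the values step from -2.0 to 2.0 in increments of 0.4
torques = [round(to_torque(i), 1) for i in range(N_BINS)]
print(torques)
```

Index 0 maps to the lower bound, index 5 to zero torque, and index 10 to the upper bound. Pendulum-v1 is expected to take the action as a length-1 array, so the continuous value would typically be passed as env.step([action_continuous]).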
The key difference from standard DQN is in get_target: the online network selects the best next action, but the target network evaluates its value.
def get_target(reward, next_state, over):
    # reward and over must have shape [b, 1] to line up with the gathered target
    with torch.no_grad():
        target = next_model(next_state)     # target net: [b, 11]
        model_target = model(next_state)    # online net: [b, 11]

    # Double DQN: online net selects the best action index
    best_actions = model_target.max(dim=1)[1].reshape(-1, 1)   # [b, 1]

    # Target net evaluates the value at that action
    target = target.gather(dim=1, index=best_actions)          # [b, 1]

    target *= 0.98          # discount factor γ
    target *= (1 - over)    # zero out terminal states
    target += reward
    return target
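The overestimation effect can be reproduced without any neural network. The simulation below is a simplified sketch under a strong assumption: the true Q-value of every action is 0 and the two sets of estimates carry independent Gaussian noise. In real Double DQN the target network is a stale copy of the online network, so the two estimates are correlated and the bias is reduced rather than removed.

```python
import random

random.seed(0)
N_ACTIONS, N_TRIALS = 11, 2000

def noisy_q():
    """True Q-values are all 0; each estimate carries independent unit Gaussian noise."""
    return [random.gauss(0.0, 1.0) for _ in range(N_ACTIONS)]

single, double = 0.0, 0.0
for _ in range(N_TRIALS):
    q_online, q_target = noisy_q(), noisy_q()
    # Single estimator: select and evaluate with the same noisy values
    single += max(q_online)
    # Double estimator: select with one set of values, evaluate with the other
    best = max(range(N_ACTIONS), key=lambda a: q_online[a])
    double += q_target[best]

single /= N_TRIALS
double /= N_TRIALS
print(f"single: {single:.2f}, double: {double:.2f}")
```

The single estimator averages well above the true value of 0, while the double estimator stays close to it.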
The target network is hard-copied from the online network every 50 inner training steps, where i is the inner-loop index:
if (i + 1) % 50 == 0:
    next_model.load_state_dict(model.state_dict())

Dueling DQN

Dueling DQN decomposes the Q-value into a state-value stream V(s) and an advantage stream A(s, a) via a custom network architecture. This lets the network learn the value of a state independently of which action is taken. The VAnet class below also operates on the Pendulum state (3 inputs, 11 discretised actions):
import torch

class VAnet(torch.nn.Module):
    def __init__(self):
        super().__init__()

        # Shared feature extractor
        self.fc = torch.nn.Sequential(
            torch.nn.Linear(3, 128),
            torch.nn.ReLU(),
        )

        self.fc_A = torch.nn.Linear(128, 11)   # Advantage stream
        self.fc_V = torch.nn.Linear(128, 1)    # Value stream

    def forward(self, x):
        h = self.fc(x)                         # shared features, computed once
        A = self.fc_A(h)                       # [b, 11]
        V = self.fc_V(h)                       # [b, 1]

        # Centre advantages so V and A are identifiable
        A_mean = A.mean(dim=1).reshape(-1, 1)  # [b, 1]
        A -= A_mean

        # Q = V + (A - mean(A))
        return A + V
# Online and target networks share the same VAnet architecture
model      = VAnet()
next_model = VAnet()
next_model.load_state_dict(model.state_dict())
Subtracting the mean advantage (A -= A_mean) ensures identifiability: without this step, adding a constant to V and subtracting it from every A leaves Q unchanged, so V and A are not uniquely recoverable from Q.
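The identifiability problem can be seen with a few hand-picked numbers. This plain-Python sketch is an illustration only; the helper q_values and the example values are assumptions, not part of the notebook.

```python
def q_values(v, advantages, centre=False):
    """Combine V and A into Q, optionally subtracting the mean advantage first."""
    if centre:
        mean = sum(advantages) / len(advantages)
        advantages = [a - mean for a in advantages]
    return [v + a for a in advantages]

# Two different (V, A) pairs: the second shifts V up by 5 and every A down by 5
v1, a1 = 1.0, [0.0, 2.0, 4.0]
v2, a2 = 6.0, [-5.0, -3.0, -1.0]

# Without centring, both pairs produce the same Q-values
print(q_values(v1, a1), q_values(v2, a2))
# With centring, a different V produces a different Q
print(q_values(v1, a1, centre=True), q_values(v2, a2, centre=True))
```

With centring, V equals the mean of Q over actions, so a given set of Q-values corresponds to exactly one V.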

Comparison

| Variant | Environment | Key Difference | Benefit |
| --- | --- | --- | --- |
| DQN (single model) | CartPole-v1 | One network for selection and evaluation | Simple baseline |
| Double DQN | Pendulum-v1 (discretised) | Online net selects, target net evaluates | Reduces overestimation |
| Dueling DQN | Pendulum-v1 (discretised) | Separate V and A streams | Better state-value estimation |
All examples here use a hard copy of weights into the target network (next_model.load_state_dict(model.state_dict())). Production implementations often prefer a soft update (θ_target ← τ θ + (1-τ) θ_target with small τ) as used in DDPG and SAC.
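A soft update can be sketched in a few lines of PyTorch. This is an illustrative example, not code from the notebook: the function name soft_update, the default τ = 0.005, and the small Linear layers used for the check are all assumptions.

```python
import torch

def soft_update(online, target, tau=0.005):
    """Polyak-average the online weights into the target network, in place."""
    with torch.no_grad():
        for p_target, p_online in zip(target.parameters(), online.parameters()):
            # theta_target <- tau * theta + (1 - tau) * theta_target
            p_target.mul_(1.0 - tau).add_(p_online, alpha=tau)

# Check against the formula on a small pair of layers
online = torch.nn.Linear(3, 11)
target = torch.nn.Linear(3, 11)
before = target.weight.clone()
soft_update(online, target, tau=0.1)
expected = 0.9 * before + 0.1 * online.weight
print(torch.allclose(target.weight, expected))
```

Such an update would be called after every optimiser step, replacing the periodic load_state_dict copy. It only touches parameters, which is sufficient for the networks on this page because they have no buffers.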
