Deep Q-Networks (DQN) combine classical Q-learning with deep learning by using a neural network to approximate the action-value function Q(s, a). Rather than storing Q-values in a table, the network generalises across a continuous state space. Two techniques make training stable: an experience replay buffer that breaks temporal correlations in the data, and a target network, frozen between periodic weight copies, that provides stable regression targets. This page walks through the single-model CartPole baseline, then shows how Double DQN and Dueling DQN improve on the original design using the Pendulum environment.
Environment Setup
The single-model DQN notebook wraps CartPole in a MyWrapper class that normalises the step return signature and enforces a 200-step episode limit.
```python
import gym


class MyWrapper(gym.Wrapper):

    def __init__(self):
        env = gym.make('CartPole-v1', render_mode='rgb_array')
        super().__init__(env)
        self.env = env
        self.step_n = 0

    def reset(self):
        # Drop the info dict so reset() returns only the state
        state, _ = self.env.reset()
        self.step_n = 0
        return state

    def step(self, action):
        state, reward, terminated, truncated, info = self.env.step(action)
        # Collapse terminated/truncated into a single done flag
        done = terminated or truncated
        self.step_n += 1
        # Enforce the 200-step episode limit
        if self.step_n >= 200:
            done = True
        return state, reward, done, info


env = MyWrapper()
env.reset()
```
CartPole-v1 has a 4-dimensional state (cart position, cart velocity, pole angle, pole angular velocity) and 2 discrete actions (push left / push right).
Key Concepts
Q-Network
A neural network maps state → Q-values for every action. The agent picks the action with the highest Q-value (greedy), or explores randomly with probability ε (epsilon-greedy).
Experience Replay
Transitions (s, a, r, s', done) are stored in a Python list called datas. At each training step a random mini-batch is drawn from this buffer, breaking harmful temporal correlations.
Target Network
A second network with frozen weights provides the TD targets. Its weights are periodically hard-copied from the online network, preventing the “chasing a moving target” instability.
TD Update
The loss is MSE(Q(s,a), r + γ·max_a' Q_target(s',a')), where the bootstrap term γ·max_a' Q_target(s',a') is dropped when s' is terminal. Minimising this loss nudges the online network toward satisfying the Bellman optimality equation.
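As a worked instance of this loss, the sketch below computes the TD target and the squared error for a single transition in plain Python. The Q-values, the reward, and the helper name `td_target` are made up for illustration; γ = 0.98 matches the discount used later on this page.

```python
def td_target(reward, next_q_values, done, gamma=0.98):
    """One-step TD target: r + gamma * max_a' Q(s', a'), with no bootstrap on terminal steps."""
    bootstrap = 0.0 if done else max(next_q_values)
    return reward + gamma * bootstrap


# Hypothetical numbers for one CartPole transition
q_sa = 9.0               # current estimate Q(s, a)
next_q = [10.0, 12.0]    # Q_target(s', left), Q_target(s', right)

target = td_target(1.0, next_q, done=False)          # 1 + 0.98 * 12
loss = (q_sa - target) ** 2                          # squared TD error for this sample
terminal_target = td_target(1.0, next_q, done=True)  # reward only, no bootstrap

print(target, loss, terminal_target)
```

With these numbers the target is about 12.76, so the gradient step pushes Q(s, a) up from 9.0. On the terminal transition the target collapses to the reward alone.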
Single-Model DQN (Baseline)
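The simplest variant uses one network for both action selection and target computation, so there is no separate target network: the TD targets come from the same weights that are being trained.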
Network Architecture
```python
import torch

# Q-network: state (4) → hidden (128) → Q-values (2)
model = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
)
```
Epsilon-Greedy Action Selection
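```python
import random


def get_action(state):
    # Explore: with probability 0.01 pick a random action
    if random.random() < 0.01:
        return random.choice([0, 1])

    # Exploit: forward pass through the network, then take the greedy action
    state = torch.FloatTensor(state).reshape(1, 4)
    return model(state).argmax().item()
```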
The exploration rate here is fixed at 1%. In practice you would anneal ε from a high value (e.g. 1.0) down to a small floor during training.
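One way to anneal ε is a linear schedule, sketched below in plain Python. This is not part of the notebook: the names `epsilon_at` and `choose_action`, the 10 000-step decay horizon, and the 0.01 floor are all assumptions chosen for illustration.

```python
import random


def epsilon_at(step, eps_start=1.0, eps_end=0.01, decay_steps=10_000):
    """Linearly interpolate from eps_start to eps_end, then hold at eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


def choose_action(q_values, step, rng=random):
    """Epsilon-greedy over a list of Q-values using the annealed epsilon."""
    if rng.random() < epsilon_at(step):
        return rng.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])    # exploit


for step in (0, 5_000, 10_000, 50_000):
    print(step, round(epsilon_at(step), 3))
```

To use this with the notebook's get_action, the global step counter would have to be passed in or tracked alongside the replay buffer.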
Experience Replay Buffer
```python
# Replay buffer: a plain Python list
datas = []


def update_data():
    old_count = len(datas)

    # Collect at least 200 new transitions, always finishing the current episode
    while len(datas) - old_count < 200:
        state = env.reset()
        over = False
        while not over:
            action = get_action(state)
            next_state, reward, over, _ = env.step(action)
            datas.append((state, action, reward, next_state, over))
            state = next_state

    update_count = len(datas) - old_count
    drop_count = max(len(datas) - 10000, 0)

    # Evict oldest samples beyond the 10 000 cap
    while len(datas) > 10000:
        datas.pop(0)

    return update_count, drop_count
```
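The list-plus-pop(0) eviction above can also be expressed with `collections.deque`, which drops the oldest entry automatically once `maxlen` is reached. The sketch below is an alternative, not the notebook's code; the dummy transitions and the name `buffer` are placeholders.

```python
import random
from collections import deque

CAPACITY = 10_000
buffer = deque(maxlen=CAPACITY)   # oldest items fall off the left end when full

# Push 10 050 dummy transitions shaped like (state, action, reward, next_state, over)
for t in range(10_050):
    buffer.append((t, 0, 1.0, t + 1, False))

# Copy to a list before sampling so random access stays cheap
batch = random.sample(list(buffer), 64)
oldest_kept = buffer[0][0]

print(len(buffer), oldest_kept, len(batch))
```

The trade-off is that the deque no longer reports how many samples were dropped, which the notebook's update_data returns as drop_count.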
Sampling a Mini-Batch
```python
def get_sample():
    # Draw 64 transitions uniformly at random, without replacement
    samples = random.sample(datas, 64)

    state = torch.FloatTensor([i[0] for i in samples])       # [b, 4]
    action = torch.LongTensor([i[1] for i in samples])       # [b]
    reward = torch.FloatTensor([i[2] for i in samples])      # [b]
    next_state = torch.FloatTensor([i[3] for i in samples])  # [b, 4]
    over = torch.LongTensor([i[4] for i in samples])         # [b]

    return state, action, reward, next_state, over
```
Computing Q-Values and TD Targets
```python
def get_value(state, action):
    # [b, 4] → [b, 2] → [b]
    value = model(state)
    # Pick the Q-value of the action that was actually taken
    value = value[range(64), action]
    return value


def get_target(reward, next_state, over):
    with torch.no_grad():
        target = model(next_state)     # [b, 4] → [b, 2]
    target = target.max(dim=1)[0]      # [b]

    # Zero out terminal states
    for i in range(64):
        if over[i]:
            target[i] = 0

    target *= 0.98  # discount factor γ
    target += reward
    return target
```
Training Loop
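```python
def train():
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(500):
        # Collect at least 200 fresh transitions with the current policy
        update_count, drop_count = update_data()

        # 200 gradient steps per round of data collection
        for i in range(200):
            state, action, reward, next_state, over = get_sample()

            value = get_value(state, action)
            target = get_target(reward, next_state, over)
            loss = loss_fn(value, target)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Every 50 epochs, report the mean result of 20 evaluation episodes.
        # test() is the notebook's evaluation helper and is not shown on this page.
        if epoch % 50 == 0:
            test_result = sum([test(play=False) for _ in range(20)]) / 20
            print(epoch, len(datas), update_count, drop_count, test_result)


train()
```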
Double DQN
The Double DQN and Dueling DQN variants switch to the Pendulum-v1 environment. Pendulum has a 3-dimensional state (cos θ, sin θ, angular velocity) and a continuous torque action. Because DQN requires a discrete action space, the torque range [-2, 2] is discretised into 11 evenly spaced values (a step of 0.4).
```python
import gym


class MyWrapper(gym.Wrapper):

    def __init__(self):
        env = gym.make('Pendulum-v1', render_mode='rgb_array')
        super().__init__(env)
        self.env = env
        self.step_n = 0

    def reset(self):
        state, _ = self.env.reset()
        self.step_n = 0
        return state

    def step(self, action):
        state, reward, terminated, truncated, info = self.env.step(action)
        done = terminated or truncated
        self.step_n += 1
        # Enforce the 200-step episode limit
        if self.step_n >= 200:
            done = True
        return state, reward, done, info


env = MyWrapper()
env.reset()
```
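Vanilla DQN tends to overestimate Q-values because the max in the TD target uses the same network to both select and evaluate the next action, so any upward estimation error is exactly what the max picks up. Double DQN decouples these two decisions: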
- Action selection → online network (model)
- Action evaluation → target network (next_model)
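The overestimation effect can be illustrated without any neural network. In the toy simulation below, every action has a true value of 0 and each estimator sees those values through independent Gaussian noise; the noise level, trial count, and variable names are assumptions made for this sketch only.

```python
import random

rng = random.Random(0)
N_ACTIONS, TRIALS, NOISE = 11, 5000, 1.0

single_sum = 0.0
double_sum = 0.0
for _ in range(TRIALS):
    # True Q-values are all 0; two estimators observe them with independent noise
    est_a = [rng.gauss(0.0, NOISE) for _ in range(N_ACTIONS)]
    est_b = [rng.gauss(0.0, NOISE) for _ in range(N_ACTIONS)]

    # Single estimator: select and evaluate with the same noisy values
    single_sum += max(est_a)

    # Double estimator: select with A, evaluate with B
    best = max(range(N_ACTIONS), key=lambda a: est_a[a])
    double_sum += est_b[best]

single_bias = single_sum / TRIALS
double_bias = double_sum / TRIALS
print(f"single: {single_bias:.3f}, double: {double_bias:.3f}")
```

The single-estimator average lands well above the true value of 0, while the double-estimator average stays close to it. In an actual DQN the target network is a lagged copy of the online network rather than an independent estimator, so Double DQN reduces the bias rather than removing it.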
Both networks have input dimension 3 (Pendulum state) and output 11 (discretised actions).
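```python
import torch

# Online network: state (3) → hidden (128) → Q-values (11)
model = torch.nn.Sequential(
    torch.nn.Linear(3, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 11),
)

# Target network: same architecture, used only to compute TD targets
next_model = torch.nn.Sequential(
    torch.nn.Linear(3, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 11),
)

# Start both networks from identical weights
next_model.load_state_dict(model.state_dict())
```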
Actions are selected by the online network, then mapped back to a continuous value for the environment:
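```python
import random


def get_action(state):
    # Greedy action from the online network
    state = torch.FloatTensor(state).reshape(1, 3)
    action = model(state).argmax().item()

    # With probability 0.01, replace it with a random bin
    if random.random() < 0.01:
        action = random.choice(range(11))

    # Map discrete bin index to continuous action in [-2, 2]
    action_continuous = action / 10 * 4 - 2

    # Return both: the bin index and the torque value
    return action, action_continuous
```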
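To make the bin-to-torque mapping concrete, the sketch below lists all 11 torque values and adds an inverse mapping. `bin_to_torque` mirrors the formula `action / 10 * 4 - 2` from get_action; `torque_to_bin` is a hypothetical helper that does not appear in the notebook.

```python
N_BINS = 11
MAX_TORQUE = 2.0


def bin_to_torque(index):
    """Map a bin index 0..10 to an evenly spaced torque in [-2, 2]."""
    return index / (N_BINS - 1) * 2 * MAX_TORQUE - MAX_TORQUE


def torque_to_bin(torque):
    """Hypothetical inverse: clip to [-2, 2], then snap to the nearest bin."""
    clipped = max(-MAX_TORQUE, min(MAX_TORQUE, torque))
    return round((clipped + MAX_TORQUE) / (2 * MAX_TORQUE) * (N_BINS - 1))


torques = [round(bin_to_torque(i), 1) for i in range(N_BINS)]
print(torques)
```

Bin 5 corresponds to zero torque, and neighbouring bins differ by 0.4. A finer grid gives smoother control but enlarges the output layer and the set of actions to explore.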
The key difference from standard DQN is in get_target: the online network selects the best next action, but the target network evaluates its value.
```python
def get_target(reward, next_state, over):
    # reward and over are expected with shape [b, 1] so that they
    # broadcast against the gathered target below
    with torch.no_grad():
        target = next_model(next_state)   # target net: [b, 11]

        # Double DQN: online net selects the best action index
        model_target = model(next_state)  # online net: [b, 11]
        best_actions = model_target.max(dim=1)[1].reshape(-1, 1)  # [b, 1]

    # Target net evaluates the value at that action
    target = target.gather(dim=1, index=best_actions)  # [b, 1]
    target *= 0.98          # discount factor γ
    target *= (1 - over)    # zero out terminal states
    target += reward
    return target
```
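The select-then-evaluate step is easier to follow on a tiny batch. The sketch below uses hand-written Q-value tensors with 3 actions in place of real network outputs; all numbers are invented, and `reward` and `over` are given shape [b, 1] to match the gathered values.

```python
import torch

gamma = 0.98
q_online = torch.tensor([[0.0, 9.0, 1.0], [7.0, 0.0, 1.0]])  # online net Q(s', ·)
q_target = torch.tensor([[1.0, 2.0, 3.0], [4.0, 6.0, 5.0]])  # target net Q(s', ·)
reward = torch.tensor([[1.0], [1.0]])  # [b, 1]
over = torch.tensor([[0], [1]])        # [b, 1]; the second transition is terminal

# Online net picks the action, target net supplies its value
best_actions = q_online.argmax(dim=1, keepdim=True)      # [b, 1]
evaluated = q_target.gather(dim=1, index=best_actions)   # [b, 1]
double_target = reward + gamma * evaluated * (1 - over)  # [b, 1]

# For comparison: the target net's own max, as in standard DQN
vanilla_target = reward + gamma * q_target.max(dim=1, keepdim=True)[0] * (1 - over)

print(double_target.squeeze(1).tolist(), vanilla_target.squeeze(1).tolist())
```

In the first row the online network prefers action 1, so the target uses the target network's value for that action (2.0) rather than its maximum (3.0). If `reward` or `over` had shape [b] instead of [b, 1], the arithmetic would broadcast to [b, b] or fail, which is why the shapes matter here.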
The target network is hard-copied every 50 inner steps:
```python
# Inside the inner training loop, where i is the gradient-step index
if (i + 1) % 50 == 0:
    next_model.load_state_dict(model.state_dict())
```
Dueling DQN
Dueling DQN factorises the Q-value into a state-value stream V(s) and an advantage stream A(s, a) via a custom network architecture. This helps the network learn the baseline value of a state independently of which action is taken.
The VAnet class below also operates on the Pendulum state (dim 3, 11 actions):
```python
import torch


class VAnet(torch.nn.Module):

    def __init__(self):
        super().__init__()
        # Shared feature extractor
        self.fc = torch.nn.Sequential(
            torch.nn.Linear(3, 128),
            torch.nn.ReLU(),
        )
        self.fc_A = torch.nn.Linear(128, 11)  # Advantage stream
        self.fc_V = torch.nn.Linear(128, 1)   # Value stream

    def forward(self, x):
        # Run the shared layers once and feed both streams
        h = self.fc(x)
        A = self.fc_A(h)  # [b, 11]
        V = self.fc_V(h)  # [b, 1]

        # Centre advantages so V and A are identifiable
        A_mean = A.mean(dim=1).reshape(-1, 1)  # [b, 1]
        A -= A_mean

        # Q = V + (A - mean(A))
        return A + V


# Online and target networks share the same VAnet architecture
model = VAnet()
next_model = VAnet()
next_model.load_state_dict(model.state_dict())
```
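Subtracting the mean advantage (A -= A_mean) ensures identifiability. Without it, V and A are not uniquely recoverable from Q: adding any constant c to V and subtracting it from every A(s, a) leaves Q unchanged. With the centring, the mean of Q(s, ·) over actions equals V(s).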
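The identifiability argument can be checked with a few numbers. The sketch below uses plain Python lists instead of tensors; the function name `dueling_q` and the values are illustrative and not taken from the notebook.

```python
def dueling_q(value, advantages):
    """Combine V(s) and A(s, a) as Q = V + (A - mean(A))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


# Two (V, A) pairs that differ only by a constant shift c = 5
v1, a1 = 3.0, [1.0, 2.0, 6.0]
v2, a2 = v1 + 5.0, [a - 5.0 for a in a1]

# Without centring, both pairs give the same Q, so V and A cannot be told apart
naive_q1 = [v1 + a for a in a1]
naive_q2 = [v2 + a for a in a2]

# With centring, the mean of Q over actions is pinned to V
q1 = dueling_q(v1, a1)
q2 = dueling_q(v2, a2)

print(naive_q1 == naive_q2, sum(q1) / len(q1), sum(q2) / len(q2))
```

After centring, the two pairs no longer collapse onto the same Q-values, and the average Q over actions recovers each V exactly.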
Comparison
| Variant | Environment | Key Difference | Benefit |
|---|---|---|---|
| DQN (single model) | CartPole-v1 | One network for selection and evaluation | Simple baseline |
| Double DQN | Pendulum-v1 (discretised) | Online net selects, target net evaluates | Reduces overestimation |
| Dueling DQN | Pendulum-v1 (discretised) | Separate V and A streams | Better state-value estimation |
All examples here use a hard copy of weights into the target network (next_model.load_state_dict(model.state_dict())). Production implementations often prefer a soft update (θ_target ← τ θ + (1-τ) θ_target with small τ) as used in DDPG and SAC.
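A soft update could be sketched as follows. This is not code from the notebooks: the function name `soft_update` and the value τ = 0.005 are assumptions, and the two small `Linear` layers only stand in for `model` and `next_model`.

```python
import torch


def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: theta_target <- tau * theta + (1 - tau) * theta_target."""
    with torch.no_grad():
        for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
            p_target.mul_(1.0 - tau).add_(p_online, alpha=tau)


online = torch.nn.Linear(3, 11)
target = torch.nn.Linear(3, 11)
target.load_state_dict(online.state_dict())  # start from identical weights

# Simulate training by shifting every online parameter by 1.0
with torch.no_grad():
    for p in online.parameters():
        p.add_(1.0)

soft_update(target, online, tau=0.005)

# The target moved only a fraction tau of the way toward the online weights
gap = (online.weight - target.weight).abs().max().item()
print(round(gap, 4))
```

With a soft update the call would run after every gradient step instead of every 50 steps. This version copies parameters only, which is sufficient for the networks on this page because they contain no buffers such as batch-norm statistics.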