Value-based methods such as DQN learn a Q-function and derive a policy implicitly by acting greedily with respect to it. Policy gradient methods take a more direct approach: they parameterise the policy itself as a neural network and optimise it by following the gradient of the expected return. The REINFORCE algorithm is the simplest member of this family. It collects a complete episode using the current policy, computes the discounted return at each time step, and then updates the network parameters so that actions that led to high returns become more probable.
Algorithm Overview
Collect an episode
Run the current policy π(a|s;θ) until the episode ends. Record states, actions, and rewards for every step.
Compute discounted returns
Work backwards from the final step: G_t = r_t + γ·G_{t+1}. Accumulating from the end means only one backward pass through the rewards is needed.
Compute policy gradient loss
For each step, loss = -log π(a_t|s_t) · G_t. The negative sign converts the gradient ascent objective into a gradient descent loss.
Environment
Policy Network
REINFORCE requires a stochastic policy that outputs a probability distribution over actions. Adding a Softmax layer at the output ensures the network produces valid probabilities.
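A minimal sketch of such a network in PyTorch. The layer sizes (4 state features, 128 hidden units, 2 actions) are assumptions chosen for illustration, not values taken from this page.

```python
import torch
from torch import nn

# Hypothetical sizes: 4 state features, 2 discrete actions.
STATE_DIM, HIDDEN_DIM, N_ACTIONS = 4, 128, 2

policy = nn.Sequential(
    nn.Linear(STATE_DIM, HIDDEN_DIM),
    nn.ReLU(),
    nn.Linear(HIDDEN_DIM, N_ACTIONS),
    nn.Softmax(dim=1),  # normalise over the action dimension
)

# A batch containing one dummy state, shape (1, STATE_DIM).
state = torch.zeros(1, STATE_DIM)
probs = policy(state)  # shape (1, N_ACTIONS); each row sums to 1
```

Applying the softmax over dim=1 normalises each row of a batch independently, so every state gets its own distribution over actions.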
Action Selection
Unlike epsilon-greedy, the action is sampled from the probability distribution output by the network. Python's random.choices weights the selection by the action probabilities.
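The sampling step can be sketched with the standard library alone. The helper name select_action and the probabilities [0.9, 0.1] are hypothetical; in practice the probabilities would come from the policy network.

```python
import random


def select_action(probs):
    """Sample an action index, weighted by the policy's probabilities."""
    actions = range(len(probs))
    # random.choices returns a list of k samples; take the single element.
    return random.choices(actions, weights=probs, k=1)[0]


random.seed(0)
probs = [0.9, 0.1]  # hypothetical policy output for one state
samples = [select_action(probs) for _ in range(1000)]
frac_zero = samples.count(0) / len(samples)  # close to 0.9 for a large sample
```

Because every action with non-zero probability can be drawn, exploration comes from the policy itself rather than from a separate epsilon parameter.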
Data Collection
REINFORCE collects a complete episode before updating, which makes it a Monte Carlo method.
Training Loop
The return G_t is computed on the fly inside the training loop by iterating backwards through the episode. The gradient is accumulated over every step before a single optimizer update.
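A minimal sketch of one such loop, under stated assumptions: CoinEnv is a hypothetical toy environment invented for this example, and the discount factor 0.98, the network sizes, and the learning rate are illustrative choices. The episode's states are passed through the network in a single forward pass, so every step's loss shares one graph.

```python
import random
import torch
from torch import nn


class CoinEnv:
    """Hypothetical toy environment: reward 1 for action 1, else 0."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.horizon
        return [self.t / self.horizon], reward, done


def play_episode(env, policy):
    """Run the current policy until the episode ends."""
    states, actions, rewards = [], [], []
    state, done = env.reset(), False
    while not done:
        with torch.no_grad():
            probs = policy(torch.tensor([state]))[0].tolist()
        action = random.choices(range(len(probs)), weights=probs, k=1)[0]
        next_state, reward, done = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
    return states, actions, rewards


def train_episode(env, policy, optimizer, gamma=0.98):
    states, actions, rewards = play_episode(env, policy)
    # One forward pass for the whole episode: all steps share this graph.
    probs = policy(torch.tensor(states))
    optimizer.zero_grad()
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = r_t + gamma * G_{t+1}
        loss = -torch.log(probs[t, actions[t]]) * g
        loss.backward(retain_graph=True)  # gradients accumulate in .grad
    optimizer.step()  # single update per episode
    return sum(rewards)


torch.manual_seed(0)
random.seed(0)
env = CoinEnv()
policy = nn.Sequential(
    nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 2), nn.Softmax(dim=1)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
episode_rewards = [train_episode(env, policy, optimizer) for _ in range(200)]
```

Summing the per-step losses and calling backward() once would produce the same accumulated gradient; the per-step calls are kept here to mirror the description above.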
retain_graph=True is needed because the same computational graph is reused for every time step in the episode. Without it, PyTorch would free the graph after the first backward() call and the subsequent calls would fail.
Key Equations
The REINFORCE gradient estimator is:
∇_θ J(θ) ≈ Σ_{t=0}^{T} ∇_θ log π(a_t|s_t;θ) · G_t
where G_t = Σ_{k=t}^{T} γ^{k-t} r_k is the discounted return from step t.
In code this maps to:
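A minimal PyTorch sketch of that mapping, under stated assumptions: the episode data, the discount factor 0.98, and the state-independent logits standing in for the policy network are all made up for illustration.

```python
import torch

gamma = 0.98               # assumed discount factor
rewards = [1.0, 0.0, 1.0]  # hypothetical three-step episode
actions = [1, 0, 1]

# Discounted returns, accumulated backwards: G_t = r_t + gamma * G_{t+1}
returns, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)
returns.reverse()

# Stand-in for the policy: two logits that ignore the state (an assumption
# that keeps the example small enough to check by hand).
logits = torch.zeros(2, requires_grad=True)
log_probs = torch.log_softmax(logits, dim=0)

# Surrogate loss: sum over steps of -log pi(a_t) * G_t.
loss = -sum(log_probs[a] * g_t for a, g_t in zip(actions, returns))
loss.backward()  # logits.grad now holds the negated gradient estimate
```

The returns are plain Python floats, so they act as constants and the gradient flows only through the log-probabilities, which is what the estimator prescribes.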
Comparison with DQN
The points below characterise Policy Gradient (REINFORCE) in contrast to DQN:
- On-policy: uses data collected by the current policy only.
- Stochastic policy: outputs a probability distribution; the action is sampled via random.choices.
- Monte Carlo returns: requires a full episode before any update.
- Directly optimises the policy objective.
- High variance due to Monte Carlo estimation of the return.