Reinforcement learning (RL) is fundamentally different from supervised learning: instead of learning from labelled examples, an agent learns by interacting with an environment, taking actions, and receiving scalar reward signals. Chapter 18 introduces the core RL concepts (environments, observations, actions, policies, and returns) and then implements two foundational algorithms: REINFORCE (policy gradients) and Deep Q-Network (DQN), complete with a replay buffer and target network. All experiments use the Farama Foundation’s Gymnasium library (the successor to OpenAI Gym).
What you’ll learn
- The RL framework: environments, observations, actions, rewards, episodes
- Gymnasium API: make(), reset(), step(), render()
- CartPole and Atari Breakout as benchmark environments
- Policy gradient methods: the REINFORCE algorithm
- Estimating policy gradients and computing discounted returns
- Deep Q-Networks (DQN): the Q-value function and the Bellman equation
- Experience replay buffer to break sample correlation
- Target network to stabilise Q-learning targets
- ε-greedy exploration and ε decay schedules
- Overview of state-of-the-art algorithms: DQN variants, PPO, SAC
Key concepts
The Gymnasium API
Gymnasium provides a standardised interface for a wide variety of environments:
- env.reset(seed=42) resets the environment and returns the initial observation and an info dict.
- env.step(action) applies the action and returns (observation, reward, terminated, truncated, info).
- env.render() returns an RGB array when render_mode="rgb_array" is set.
- env.action_space and env.observation_space describe the shape of valid actions and observations.
In CartPole, each observation is a four-element array [cart position, cart velocity, pole angle, pole angular velocity], and the action space is discrete with two choices: push left (0) or push right (1). The episode terminates when the pole tilts too far from vertical or the cart leaves the track, and it is truncated when the environment's step limit is reached.
Policy gradient methods (REINFORCE)
In REINFORCE you represent the policy as a neural network that maps observations to action probabilities. At each training step you:
- Run several episodes using the current policy.
- For each step in each episode, compute the discounted return — the sum of future rewards, exponentially discounted by factor γ.
- Update the policy network by gradient ascent on the log-probability of taken actions, weighted by the discounted return.
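The discounted-return computation in the second step can be sketched in NumPy. This is a minimal illustration rather than the notebook's exact code: the helper name discount_rewards, the sample rewards, and the discount factor 0.8 are assumptions chosen for the example.

```python
import numpy as np

def discount_rewards(rewards, discount_factor):
    """Return, for each step, the reward at that step plus all discounted future rewards."""
    discounted = np.array(rewards, dtype=float)
    # Walk backwards so each step reuses the return already computed for the next step.
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discount_factor * discounted[step + 1]
    return discounted

returns = discount_rewards([10, 0, -50], discount_factor=0.8)
print(returns)
```

With these numbers the last step's return is -50, the middle step's is 0 + 0.8 × (-50) = -40, and the first step's is 10 + 0.8 × (-40) = -22. In REINFORCE, each action's log-probability gradient would then be weighted by the corresponding return.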
Deep Q-Networks
DQN learns a Q-function Q(s, a): the expected discounted return when taking action a in state s and following the optimal policy thereafter. A neural network estimates the Q-values of all actions in a single forward pass. The training target for action a in state s is r + γ * max_a' Q(s', a') (from the Bellman equation), where r is the reward received and s' is the next state. The target network provides Q(s', a') and is updated only occasionally (every few thousand steps) to stabilise training.
Two key techniques:
- Replay buffer: transitions (s, a, r, s', done) are stored and sampled randomly for training, breaking temporal correlations.
- Target network: a separate copy of the Q-network whose parameters are held fixed for a number of steps, preventing the bootstrapping target from chasing itself.
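To make the target r + γ * max_a' Q(s', a') concrete, here is a small worked example in NumPy. All numbers are made up for illustration, and γ = 0.95 is an assumed value. The example also applies the usual convention, not spelled out above, that the bootstrap term is dropped when the transition ended the episode (done is true), since no future reward can follow.

```python
import numpy as np

gamma = 0.95  # discount factor (assumed value for illustration)

# A hypothetical mini-batch of three transitions.
rewards = np.array([1.0, 1.0, 1.0])
dones = np.array([False, False, True])  # the third transition ended its episode

# Q(s', a') for both actions, as a target network might output them (made-up numbers).
next_q_values = np.array([[2.0, 3.0],
                          [0.5, 0.1],
                          [4.0, 9.0]])

# r + gamma * max over a' of Q(s', a'), with no bootstrap term for terminal transitions.
max_next_q = next_q_values.max(axis=1)
targets = rewards + (1.0 - dones.astype(float)) * gamma * max_next_q
print(targets)
```

The three targets work out to 1 + 0.95 × 3 = 3.85, 1 + 0.95 × 0.5 = 1.475, and 1 (terminal, so only the reward remains).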
Code examples
Creating and exploring a Gymnasium environment
Simple policy: always push left or right based on pole angle
DQN Q-network architecture
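A possible Q-network for CartPole is sketched below with tf.keras. The layer sizes, the ELU activation, and the helper epsilon_greedy_policy are assumptions for illustration, not necessarily the notebook's architecture. The network takes a 4-element observation and outputs one Q-value per action.

```python
import numpy as np
import tensorflow as tf

n_inputs = 4   # CartPole observation size
n_outputs = 2  # one Q-value per action (push left, push right)

# Layer sizes and activation are assumptions for illustration.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_inputs,)),
    tf.keras.layers.Dense(32, activation="elu"),
    tf.keras.layers.Dense(32, activation="elu"),
    tf.keras.layers.Dense(n_outputs),  # linear output: Q-values are unbounded
])

def epsilon_greedy_policy(state, epsilon=0.0):
    """Pick a random action with probability epsilon, else the highest-Q action."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(n_outputs))
    q_values = model(state[np.newaxis], training=False).numpy()[0]
    return int(q_values.argmax())

state = np.zeros(n_inputs, dtype=np.float32)
print(model(state[np.newaxis]).shape)  # one row, one Q-value per action
print(epsilon_greedy_policy(state, epsilon=0.1))
```

During training, epsilon would usually start near 1 and decay over time, so the agent explores widely at first and exploits its Q-value estimates later.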
Replay buffer and training step sketch
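The sketch below combines a replay buffer with a single DQN training step that takes its targets from a separate target network. It is a simplified illustration under stated assumptions: the buffer size, network sizes, learning rate, γ = 0.95, and the helper names sample_experiences and training_step are all choices made for this example, and the buffer is filled with random made-up transitions instead of real environment interaction.

```python
from collections import deque
import numpy as np
import tensorflow as tf

n_inputs, n_outputs = 4, 2  # CartPole sizes
gamma = 0.95                # discount factor (assumed)

def build_q_network():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(n_inputs,)),
        tf.keras.layers.Dense(32, activation="elu"),
        tf.keras.layers.Dense(n_outputs),
    ])

model = build_q_network()   # online network, trained at every step
target = build_q_network()  # target network, synced only occasionally
target.set_weights(model.get_weights())

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.MeanSquaredError()
replay_buffer = deque(maxlen=2000)  # oldest transitions are dropped when full

def sample_experiences(batch_size):
    """Draw a random batch and return one array per transition field."""
    indices = np.random.randint(len(replay_buffer), size=batch_size)
    batch = [replay_buffer[i] for i in indices]
    states, actions, rewards, next_states, dones = zip(*batch)
    return (np.array(states, dtype=np.float32), np.array(actions),
            np.array(rewards, dtype=np.float32),
            np.array(next_states, dtype=np.float32),
            np.array(dones, dtype=np.float32))

def training_step(batch_size=32):
    states, actions, rewards, next_states, dones = sample_experiences(batch_size)
    # Bellman target from the target network; no bootstrap when the episode ended.
    next_q = target(next_states, training=False).numpy()
    targets = rewards + (1.0 - dones) * gamma * next_q.max(axis=1)
    mask = tf.one_hot(actions, n_outputs)  # selects the Q-value of the action taken
    with tf.GradientTape() as tape:
        q_taken = tf.reduce_sum(model(states, training=True) * mask, axis=1)
        loss = loss_fn(targets, q_taken)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return float(loss)

# Fill the buffer with made-up transitions just to exercise the code.
rng = np.random.default_rng(0)
for _ in range(100):
    replay_buffer.append((rng.normal(size=n_inputs), int(rng.integers(n_outputs)),
                          1.0, rng.normal(size=n_inputs), bool(rng.random() < 0.1)))

print(training_step())
target.set_weights(model.get_weights())  # periodic sync of the target network
```

In a real training loop the transitions would come from env.step() calls under an ε-greedy policy, and the final set_weights() sync would run only every so many steps or episodes.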
Running this notebook
Install Gymnasium
On Colab, the notebook automatically removes the old gym package and installs gymnasium with Box2D support.
Keras 2 compatibility
This chapter sets the environment variable TF_USE_LEGACY_KERAS=1 so that tf.keras resolves to Keras 2. The variable must be set before the first import tensorflow call.
Exercises
Exercises include implementing the REINFORCE algorithm from scratch, adding a baseline to reduce gradient variance, and training a DQN to play Atari Breakout using pixel observations. Solutions are in the notebook.
Gymnasium is the Farama Foundation’s maintained fork of the original OpenAI Gym. It is designed as a drop-in replacement for recent Gym releases: write import gymnasium as gym and existing code should run unchanged. The notebook imports gymnasium directly.