One of the most powerful ideas in reinforcement learning is to make the most of every real interaction with the environment by also learning a model of how the environment works, then using that model to generate additional synthetic experience for free. DynaQ, introduced by Sutton (1990), does exactly this: it combines standard model-free Q-Learning with a learned world model so that each real transition triggers not just one Q-update but K additional planning updates replayed from the model’s stored history. On the same cliff-walking task that takes plain Q-Learning 1,500 episodes to solve, DynaQ converges in around 300 episodes with K = 20 planning steps per real transition.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/lansinuote/Simple_Reinforcement_Learning/llms.txt
Use this file to discover all available pages before exploring further.
Environment Setup
DynaQ runs on the same 4×12 cliff-walking grid used throughout this series. The grid has ground cells, trap cells along the bottom (reward −100), and a terminal goal at the bottom-right (reward −1). Cells outside the bottom row yield a reward of −1 per step.The Three DynaQ Components
DynaQ unifies three processes that run in tight coordination every time step:- Direct RL — take a real action, observe a real transition, and update Q immediately with Q-Learning.
- Model learning — store the observed transition in a dictionary keyed by
(state, action)so it can be replayed later. - Planning — sample K random previously-seen transitions from the model and perform K additional Q-updates on them, amplifying learning without any new real interactions.
Data Structures
history dictionary is the agent’s internal model of the world. After observing that taking action a in state (r, c) led to (r', c') with reward rwd, the agent stores:
The DynaQ Algorithm
Choose an action with ε-greedy exploration
At each step the agent consults the Q-table: with probability 0.1 it picks a random action, otherwise it picks the action with the highest Q-value.
Execute the action and compute the Q-Learning update
The direct RL update follows the standard Q-Learning rule: the target is built from the maximum Q-value at the next state, making it off-policy.
Update the world model
Every observed transition is stored in the
history dictionary. Because the environment is deterministic, each (state, action) key maps to exactly one outcome.Plan: replay K transitions from the model
The planning loop draws K = 20 random entries from
history and applies the same Q-Learning update to each. These are purely simulated — no real environment steps occur. This is the core of DynaQ: cheap simulated experience amplifies learning from each costly real interaction.Train the full DynaQ loop
The complete training loop ties all four steps together. Notice that Notice how quickly the cumulative reward drops: by episode 40 the agent is already near-optimal. Plain Q-Learning typically needs hundreds more episodes to reach the same performance.
q_planning() is called inside the inner while loop, so 20 planning updates fire for every single real environment step.Visualising the Learned Policy
Once training is complete, print the greedy action at every grid position to see the policy the agent has discovered:Why Planning Accelerates Learning
Each real environment step is expensive: it requires an actual interaction. Each planning step costs only a dictionary lookup and an arithmetic update. With K = 20, every single real transition propagates reward information through 21 updates (1 direct + 20 simulated). This is especially powerful early in training when the Q-table is sparse — planning recycled transitions from the first few episodes rapidly fill in Q-values that would otherwise take many more real interactions to discover.Increasing K further speeds up convergence but increases computation per real step. When the model contains errors (e.g., in stochastic environments), aggressive planning can also propagate those errors — in such cases a smaller K or periodic model validation is advisable.
DynaQ vs Plain Q-Learning
| Property | Q-Learning | DynaQ |
|---|---|---|
| Update source | Real transitions only | Real transitions + simulated replays |
| Episodes to converge | ~1,000–1,500 | ~40–60 |
| Updates per real step | 1 | 1 + K (K = 20 here) |
| Memory required | Q-table only | Q-table + history model |
| Works in stochastic envs | Yes | Yes (model must capture distribution) |
| Planning overhead | None | O(K) dictionary lookups per step |