Dynamic Programming (DP) algorithms find the optimal policy for an MDP when the full model — the transition probabilitiesDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/lansinuote/Simple_Reinforcement_Learning/llms.txt
Use this file to discover all available pages before exploring further.
P and reward function R — is known. Two classic algorithms are Policy Iteration and Value Iteration.
Assumptions
- The state space S and action space A are finite and enumerable.
- The transition model
P(s′ | s, a)and rewardR(s, a, s′)are fully known. - We seek the optimal value function
V*(s)and the optimal policyπ*(s).
CliffWalking Grid World
The first two notebooks use a4×12 grid with traps along the bottom row and a goal at the bottom-right corner.
Environment Helpers
Initialisation
Action-Value (Q) Function
Policy Iteration
Policy Iteration alternates between two steps until convergence:- Policy Evaluation — compute
V(s)under the current policyπ. - Policy Improvement — update
πgreedily with respect toV.
# 策略评估 — Policy evaluation
def get_values():
new_values = np.zeros([4, 12])
for row in range(4):
for col in range(12):
action_value = np.zeros(4)
for action in range(4):
action_value[action] = get_qsa(row, col, action)
# 每个动作的分数和它的概率相乘 — Weighted by policy probabilities
action_value *= pi[row, col]
# 最后这个格子的分数 = 所有动作分数求和
new_values[row, col] = action_value.sum()
return new_values
Update the policy to be greedy: assign equal probability to all actions that achieve the maximum Q-value.
# 策略提升 — Policy improvement
def get_pi():
new_pi = np.zeros([4, 12, 4])
for row in range(4):
for col in range(12):
action_value = np.zeros(4)
for action in range(4):
action_value[action] = get_qsa(row, col, action)
# 计算当前状态下,达到最大分数的动作有几个
count = (action_value == action_value.max()).sum()
# 让这些动作均分概率 — Share probability equally
for action in range(4):
if action_value[action] == action_value.max():
new_pi[row, col, action] = 1 / count
else:
new_pi[row, col, action] = 0
return new_pi
# 循环迭代策略评估和策略提升,寻找最优解
for _ in range(10):
for _ in range(100):
values = get_values() # evaluate current policy
pi = get_pi() # improve policy
# 打印所有格子的动作倾向
for row in range(4):
line = ''
for col in range(12):
action = pi[row, col].argmax()
action = {0: '↑', 1: '↓', 2: '←', 3: '→'}[action]
line += action
print(line)
Value Iteration
Value Iteration combines evaluation and improvement into a single update rule: the value of each state is set directly to the maximum Q-value over all actions. Theget_pi() step is identical to Policy Iteration.
FrozenLake (Gym)
The third notebook applies the same algorithms to theFrozenLake-v1 Gym environment — a 4×4 grid where the agent must reach a goal tile without falling into holes.
Setup
env.P[state][action] — a list of (probability, next_state, reward, done) tuples — so we can use DP directly.
Q-Value with Stochastic Transitions
Because Gym encodes each action as a distribution over outcomes, we weight by transition probability:Shared Evaluation & Improvement
Both Policy Iteration and Value Iteration reuse the sameget_pi(). Only get_values() differs:
Display Policy
Summary
| Algorithm | Update rule | Convergence |
|---|---|---|
| Policy Evaluation | V(s) ← Σ_a π(a|s) Q(s,a) | After many sweeps |
| Policy Iteration | Alternate evaluation + greedy improvement | Finite steps (for finite MDP) |
| Value Iteration | V(s) ← max_a Q(s,a) each sweep | Typically faster than PI |
DP requires a complete model of the environment. For most real-world tasks this is unavailable, motivating model-free methods such as Temporal-Difference learning (Sarsa, Q-Learning) covered in later notebooks.