Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/lansinuote/Simple_Reinforcement_Learning/llms.txt

Use this file to discover all available pages before exploring further.

Dynamic Programming (DP) algorithms find the optimal policy for an MDP when the full model — the transition probabilities P and reward function R — is known. Two classic algorithms are Policy Iteration and Value Iteration.

Assumptions

  • The state space S and action space A are finite and enumerable.
  • The transition model P(s′ | s, a) and reward R(s, a, s′) are fully known.
  • We seek the optimal value function V*(s) and the optimal policy π*(s).

CliffWalking Grid World

The first two notebooks use a 4×12 grid with traps along the bottom row and a goal at the bottom-right corner.

Environment Helpers

# 获取一个格子的状态 — Get the type of a cell
def get_state(row, col):
    if row != 3:
        return 'ground'
    if row == 3 and col == 0:
        return 'ground'
    if row == 3 and col == 11:
        return 'terminal'
    return 'trap'


# 在一个格子里做一个动作 — Execute an action in a cell
def move(row, col, action):
    # 如果当前已经在陷阱或者终点,则不能执行任何动作,反馈都是0
    if get_state(row, col) in ['trap', 'terminal']:
        return row, col, 0

    if action == 0: row -= 1   # ↑
    if action == 1: row += 1   # ↓
    if action == 2: col -= 1   # ←
    if action == 3: col += 1   # →

    # 不允许走到地图外面去 — Clamp to grid bounds
    row = max(0, row); row = min(3, row)
    col = max(0, col); col = min(11, col)

    # 是陷阱的话,奖励是-100,否则都是-1
    reward = -1
    if get_state(row, col) == 'trap':
        reward = -100

    return row, col, reward

Initialisation

import numpy as np

# 初始化每个格子的价值 — Zero-initialise value table
values = np.zeros([4, 12])

# 初始化每个格子下采用动作的概率 — Uniform random policy
pi = np.ones([4, 12, 4]) * 0.25

Action-Value (Q) Function

# 计算在一个状态下执行动作的分数 — Compute Q(s,a)
def get_qsa(row, col, action):
    next_row, next_col, reward = move(row, col, action)

    # 计算下一个状态的分数,0.9是折扣因子
    value = values[next_row, next_col] * 0.9

    # 如果下个状态是终点或者陷阱,则下一个状态的分数是0
    if get_state(next_row, next_col) in ['trap', 'terminal']:
        value = 0

    return value + reward

Policy Iteration

Policy Iteration alternates between two steps until convergence:
  1. Policy Evaluation — compute V(s) under the current policy π.
  2. Policy Improvement — update π greedily with respect to V.
1
Policy Evaluation
2
Sweep over all states and update each cell’s value as the probability-weighted sum of Q-values.
3
# 策略评估 — Policy evaluation
def get_values():
    new_values = np.zeros([4, 12])

    for row in range(4):
        for col in range(12):
            action_value = np.zeros(4)

            for action in range(4):
                action_value[action] = get_qsa(row, col, action)

            # 每个动作的分数和它的概率相乘 — Weighted by policy probabilities
            action_value *= pi[row, col]

            # 最后这个格子的分数 = 所有动作分数求和
            new_values[row, col] = action_value.sum()

    return new_values
4
Policy Improvement
5
Update the policy to be greedy: assign equal probability to all actions that achieve the maximum Q-value.
6
# 策略提升 — Policy improvement
def get_pi():
    new_pi = np.zeros([4, 12, 4])

    for row in range(4):
        for col in range(12):
            action_value = np.zeros(4)

            for action in range(4):
                action_value[action] = get_qsa(row, col, action)

            # 计算当前状态下,达到最大分数的动作有几个
            count = (action_value == action_value.max()).sum()

            # 让这些动作均分概率 — Share probability equally
            for action in range(4):
                if action_value[action] == action_value.max():
                    new_pi[row, col, action] = 1 / count
                else:
                    new_pi[row, col, action] = 0

    return new_pi
7
Iterate Until Convergence
8
# 循环迭代策略评估和策略提升,寻找最优解
for _ in range(10):
    for _ in range(100):
        values = get_values()   # evaluate current policy
    pi = get_pi()               # improve policy
9
Visualise the Optimal Policy
10
# 打印所有格子的动作倾向
for row in range(4):
    line = ''
    for col in range(12):
        action = pi[row, col].argmax()
        action = {0: '↑', 1: '↓', 2: '←', 3: '→'}[action]
        line += action
    print(line)
11
Expected output:
12
↓↓↓↓↓↓↓↓↓↓↓↓
↓↓↓↓↓↓↓↓↓↓↓↓
→→→→→→→→→→→↓
↑↑↑↑↑↑↑↑↑↑↑↑
13
The agent hugs the top of the grid, then descends along the rightmost column — a safe path that avoids the trap cells.

Value Iteration

Value Iteration combines evaluation and improvement into a single update rule: the value of each state is set directly to the maximum Q-value over all actions. The get_pi() step is identical to Policy Iteration.
# 策略评估(价值迭代版) — Policy evaluation (value iteration)
def get_values():
    new_values = np.zeros([4, 12])

    for row in range(4):
        for col in range(12):
            action_value = np.zeros(4)

            for action in range(4):
                action_value[action] = get_qsa(row, col, action)

            # 和策略迭代算法唯一的不同点
            # The only difference: take the MAX instead of the weighted sum
            new_values[row, col] = action_value.max()

    return new_values


# 循环迭代策略评估和策略提升,寻找最优解
for _ in range(10):
    for _ in range(100):
        values = get_values()
    pi = get_pi()
Value Iteration converges to the same optimal policy but can be faster in practice because it does not need to fully evaluate a policy before improving it.

FrozenLake (Gym)

The third notebook applies the same algorithms to the FrozenLake-v1 Gym environment — a 4×4 grid where the agent must reach a goal tile without falling into holes.

Setup

import gym
from matplotlib import pyplot as plt
%matplotlib inline

# 创建环境 — Create the environment
env = gym.make('FrozenLake-v1',
               render_mode='rgb_array',
               is_slippery=False,
               map_name='4x4',
               desc=['SFFF', 'FHFH', 'FFFH', 'HFFG'])
env.reset()

# 解封装才能访问状态转移矩阵P
# Unwrap to access the transition model P
env = env.unwrapped
The environment exposes env.P[state][action] — a list of (probability, next_state, reward, done) tuples — so we can use DP directly.
# 查看冰湖这个游戏的状态列表
# 一共4*4=16个状态,每个状态下都可以执行4个动作
len(env.P), env.P[0]
# (16, {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 4, 0.0, False)], ...})

Q-Value with Stochastic Transitions

Because Gym encodes each action as a distribution over outcomes, we weight by transition probability:
import numpy as np

values = np.zeros(16)
pi = np.ones([16, 4]) * 0.25

algorithm = '价值迭代'  # or '策略迭代'


# 计算qsa — Compute Q(state, action)
def get_qsa(state, action):
    value = 0.0

    # 每个动作都会有三个不同的结果,这里要按概率加权求和
    for prop, next_state, reward, over in env.P[state][action]:
        next_value = values[next_state] * 0.9

        # 如果下个状态是终点或者陷阱,则下个状态的分数是0
        if over:
            next_value = 0

        next_value += reward
        next_value *= prop
        value += next_value

    return value

Shared Evaluation & Improvement

Both Policy Iteration and Value Iteration reuse the same get_pi(). Only get_values() differs:
def get_values():
    new_values = np.zeros([16])

    for state in range(16):
        action_value = np.zeros(4)

        for action in range(4):
            action_value[action] = get_qsa(state, action)

        if algorithm == '策略迭代':
            action_value *= pi[state]
            new_values[state] = action_value.sum()

        if algorithm == '价值迭代':
            new_values[state] = action_value.max()

    return new_values


def get_pi():
    new_pi = np.zeros([16, 4])

    for state in range(16):
        action_value = np.zeros(4)

        for action in range(4):
            action_value[action] = get_qsa(state, action)

        count = (action_value == action_value.max()).sum()

        for action in range(4):
            if action_value[action] == action_value.max():
                new_pi[state, action] = 1 / count
            else:
                new_pi[state, action] = 0

    return new_pi


# 循环迭代策略评估和策略提升,寻找最优解
for _ in range(10):
    for _ in range(100):
        values = get_values()
    pi = get_pi()

Display Policy

# 打印每个格子的策略 — Print policy for each cell
def print_pi():
    for row in range(4):
        line = ''
        for col in range(4):
            state = row * 4 + col

            if (row == 1 and col == 1) or (row == 1 and col == 3) or \
               (row == 2 and col == 3) or (row == 3 and col == 0):
                line += '○'   # hole
                continue

            if row == 3 and col == 3:
                line += '❤'   # goal
                continue

            line += '←↓→↑'[pi[state].argmax()]

        print(line)


print_pi()
Expected output:
↓→↓←
↓○↓○
→↓↓○
○→→❤

Summary

AlgorithmUpdate ruleConvergence
Policy EvaluationV(s) ← Σ_a π(a|s) Q(s,a)After many sweeps
Policy IterationAlternate evaluation + greedy improvementFinite steps (for finite MDP)
Value IterationV(s) ← max_a Q(s,a) each sweepTypically faster than PI
DP requires a complete model of the environment. For most real-world tasks this is unavailable, motivating model-free methods such as Temporal-Difference learning (Sarsa, Q-Learning) covered in later notebooks.

Build docs developers (and LLMs) love