Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/lansinuote/Simple_Reinforcement_Learning/llms.txt

Use this file to discover all available pages before exploring further.

A Markov Decision Process (MDP) is the mathematical backbone of most reinforcement learning algorithms. It formalises the interaction between an agent and an environment as a sequence of state transitions driven by rewards.

MDP Definition

An MDP is defined by the tuple (S, A, P, R, γ):
SymbolMeaning
SSet of states
ASet of actions
P(s′ | s)Transition probability from state s to s′
R(s, s′)Immediate reward received on transition s → s′
γDiscount factor ∈ [0, 1)
The Markov property says the future depends only on the current state, not on the history: P(s_{t+1} | s_t, s_{t-1}, …, s_0) = P(s_{t+1} | s_t).

Example: 5-State Chain

The notebooks use a 5-state chain (states 0–4) where state 4 is a terminal absorbing state.
import numpy as np

# 状态转移概率矩阵 — Transition probability matrix [5, 5]
P = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5],
    [0.0, 0.1, 0.2, 0.2, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0],  # terminal state
])

# 反馈矩阵,-100的位置是不可能走到的
# Reward matrix; -100 marks impossible transitions
R = np.array([
    [-1.0,    0.0, -100.0, -100.0, -100.0],
    [-1.0, -100.0,   -2.0, -100.0, -100.0],
    [-100.0, -100.0, -100.0,  -2.0,    0.0],
    [-100.0,   1.0,   1.0,   1.0,  10.0],
    [-100.0, -100.0, -100.0, -100.0, -100.0],
])

Monte Carlo Method

The Monte Carlo method estimates state values empirically by generating many episodes (chains) and averaging the discounted returns.

Generating Episodes

import numpy as np
import random


# 生成一个chain — Generate a single episode
def get_chain(max_lens):
    ss = []   # states visited
    rs = []   # rewards collected

    # 随机选择一个除4以外的状态作为起点 — Random start (not terminal)
    s = random.choice(range(4))
    ss.append(s)

    for _ in range(max_lens):
        # 按照P的概率,找到下一个状态 — Sample next state
        s_next = np.random.choice(np.arange(5), p=P[s])

        # 取到r — Get reward
        r = R[s, s_next]

        s = s_next
        ss.append(s)
        rs.append(r)

        # 如果状态到了4则结束 — Stop at terminal state
        if s == 4:
            break

    return ss, rs


# 生成N个chain — Generate N episodes
def get_chains(N, max_lens):
    ss = []
    rs = []
    for _ in range(N):
        s, r = get_chain(max_lens)
        ss.append(s)
        rs.append(r)
    return ss, rs

Computing Returns

# 给定一条链,计算回报 — Compute discounted return for an episode
def get_value(rs):
    total = 0
    for i, r in enumerate(rs):
        # 给每一步的反馈做一个系数,随着步数往后衰减
        # Apply discount: earlier steps matter more
        total += 0.5 ** i * r
    return total

Monte Carlo Value Estimation

# 蒙特卡洛法评估每个状态的价值 — MC value estimation
def get_values_by_monte_carlo(ss, rs):
    # 记录5个不同开头的价值 — Collect returns per starting state
    values = [[] for _ in range(5)]

    for s, r in zip(ss, rs):
        values[s[0]].append(get_value(r))

    # 求每个开头的平均价值 — Average return per state
    return [np.mean(i) for i in values]


# Run with 2000 episodes
# Reference: -1.23, -1.70, 0.48, 5.97, 0
get_values_by_monte_carlo(*get_chains(2000, 20))
As the number of episodes increases, the Monte Carlo estimates converge to the true state values.

Bellman Equation

The Bellman equation expresses the value of a state V(s) as the expected immediate reward plus the discounted value of the next state:
V(s) = Σ_{s'} P(s'|s) · [R(s,s') + γ · V(s')]
In matrix form this becomes:
V = R + γ P V
V - γ P V = R
(I - γ P) V = R
V = (I - γ P)⁻¹ R
The notebooks demonstrate two approaches to solving this system.

Iterative (Gradient Descent) Solution

The second notebook redefines the MDP as a 6-state chain with a per-state reward vector R:
# 梯度下降法计算贝尔曼矩阵 — Iterative Bellman solution
# 6-state MDP with per-state reward vector
P = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.7],
    [0.0, 0.2, 0.3, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])

R = np.array([-1, -2, -2, 10, 1, 0])


def get_bellman():
    # 初始化values — Initialise values
    value = np.ones([6])

    for _ in range(200):
        for i in range(6):
            # 每一行的概率 × 当前value,× gamma,+ 奖励
            # V(s) ← R(s) + γ Σ P(s'|s) V(s')
            value[i] = R[i] + 0.5 * P[i].dot(value)

    return value


get_bellman()
# array([-2.02, -2.21,  1.16, 10.54,  3.59,  6.22e-61])

Analytical (Matrix Inverse) Solution

# 解析解贝尔曼方程矩阵 — Analytical Bellman solution
def get_bellman():
    mat = np.eye(*P.shape)
    mat -= 0.5 * P           # (I - γP)
    mat = np.linalg.inv(mat) # (I - γP)^{-1}
    return mat.dot(R)        # V = (I - γP)^{-1} R


get_bellman()
# array([-2.02, -2.21,  1.16, 10.54,  3.59,  0.  ])
Both methods converge to the same values. The analytical solution is exact (given floating-point precision) but requires matrix inversion — which is expensive for large state spaces.

Comparison: Monte Carlo vs Bellman

MethodRequires model?Handles large spaces?Notes
Monte CarloNoYes (sampling)Requires full episodes
Bellman (iterative)Yes (P, R)ModerateConverges slowly
Bellman (analytic)Yes (P, R)No (O(n³) inversion)Exact for small MDPs
In practice, when the full transition model is unknown (as in most real tasks), model-free methods like Monte Carlo or Temporal-Difference learning are used instead.

Build docs developers (and LLMs) love