Documentation Index
Fetch the complete documentation index at: https://mintlify.com/lansinuote/Simple_Reinforcement_Learning/llms.txt
Use this file to discover all available pages before exploring further.
A Markov Decision Process (MDP) is the mathematical backbone of most reinforcement learning algorithms. It formalises the interaction between an agent and an environment as a sequence of state transitions driven by rewards.
MDP Definition
An MDP is defined by the tuple (S, A, P, R, γ):
| Symbol | Meaning |
|---|
| S | Set of states |
| A | Set of actions |
| P(s′ | s) | Transition probability from state s to s′ |
| R(s, s′) | Immediate reward received on transition s → s′ |
| γ | Discount factor ∈ [0, 1) |
The Markov property says the future depends only on the current state, not on the history: P(s_{t+1} | s_t, s_{t-1}, …, s_0) = P(s_{t+1} | s_t).
Example: 5-State Chain
The notebooks use a 5-state chain (states 0–4) where state 4 is a terminal absorbing state.
import numpy as np
# 状态转移概率矩阵 — Transition probability matrix [5, 5]
P = np.array([
[0.5, 0.5, 0.0, 0.0, 0.0],
[0.5, 0.0, 0.5, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.5, 0.5],
[0.0, 0.1, 0.2, 0.2, 0.5],
[0.0, 0.0, 0.0, 0.0, 0.0], # terminal state
])
# 反馈矩阵,-100的位置是不可能走到的
# Reward matrix; -100 marks impossible transitions
R = np.array([
[-1.0, 0.0, -100.0, -100.0, -100.0],
[-1.0, -100.0, -2.0, -100.0, -100.0],
[-100.0, -100.0, -100.0, -2.0, 0.0],
[-100.0, 1.0, 1.0, 1.0, 10.0],
[-100.0, -100.0, -100.0, -100.0, -100.0],
])
Monte Carlo Method
The Monte Carlo method estimates state values empirically by generating many episodes (chains) and averaging the discounted returns.
Generating Episodes
import numpy as np
import random
# 生成一个chain — Generate a single episode
def get_chain(max_lens):
ss = [] # states visited
rs = [] # rewards collected
# 随机选择一个除4以外的状态作为起点 — Random start (not terminal)
s = random.choice(range(4))
ss.append(s)
for _ in range(max_lens):
# 按照P的概率,找到下一个状态 — Sample next state
s_next = np.random.choice(np.arange(5), p=P[s])
# 取到r — Get reward
r = R[s, s_next]
s = s_next
ss.append(s)
rs.append(r)
# 如果状态到了4则结束 — Stop at terminal state
if s == 4:
break
return ss, rs
# 生成N个chain — Generate N episodes
def get_chains(N, max_lens):
ss = []
rs = []
for _ in range(N):
s, r = get_chain(max_lens)
ss.append(s)
rs.append(r)
return ss, rs
Computing Returns
# 给定一条链,计算回报 — Compute discounted return for an episode
def get_value(rs):
total = 0
for i, r in enumerate(rs):
# 给每一步的反馈做一个系数,随着步数往后衰减
# Apply discount: earlier steps matter more
total += 0.5 ** i * r
return total
Monte Carlo Value Estimation
# 蒙特卡洛法评估每个状态的价值 — MC value estimation
def get_values_by_monte_carlo(ss, rs):
# 记录5个不同开头的价值 — Collect returns per starting state
values = [[] for _ in range(5)]
for s, r in zip(ss, rs):
values[s[0]].append(get_value(r))
# 求每个开头的平均价值 — Average return per state
return [np.mean(i) for i in values]
# Run with 2000 episodes
# Reference: -1.23, -1.70, 0.48, 5.97, 0
get_values_by_monte_carlo(*get_chains(2000, 20))
As the number of episodes increases, the Monte Carlo estimates converge to the true state values.
Bellman Equation
The Bellman equation expresses the value of a state V(s) as the expected immediate reward plus the discounted value of the next state:
V(s) = Σ_{s'} P(s'|s) · [R(s,s') + γ · V(s')]
In matrix form this becomes:
V = R + γ P V
V - γ P V = R
(I - γ P) V = R
V = (I - γ P)⁻¹ R
The notebooks demonstrate two approaches to solving this system.
Iterative (Gradient Descent) Solution
The second notebook redefines the MDP as a 6-state chain with a per-state reward vector R:
# 梯度下降法计算贝尔曼矩阵 — Iterative Bellman solution
# 6-state MDP with per-state reward vector
P = np.array([
[0.9, 0.1, 0.0, 0.0, 0.0, 0.0],
[0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.6, 0.0, 0.4],
[0.0, 0.0, 0.0, 0.0, 0.3, 0.7],
[0.0, 0.2, 0.3, 0.5, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])
R = np.array([-1, -2, -2, 10, 1, 0])
def get_bellman():
# 初始化values — Initialise values
value = np.ones([6])
for _ in range(200):
for i in range(6):
# 每一行的概率 × 当前value,× gamma,+ 奖励
# V(s) ← R(s) + γ Σ P(s'|s) V(s')
value[i] = R[i] + 0.5 * P[i].dot(value)
return value
get_bellman()
# array([-2.02, -2.21, 1.16, 10.54, 3.59, 6.22e-61])
Analytical (Matrix Inverse) Solution
# 解析解贝尔曼方程矩阵 — Analytical Bellman solution
def get_bellman():
mat = np.eye(*P.shape)
mat -= 0.5 * P # (I - γP)
mat = np.linalg.inv(mat) # (I - γP)^{-1}
return mat.dot(R) # V = (I - γP)^{-1} R
get_bellman()
# array([-2.02, -2.21, 1.16, 10.54, 3.59, 0. ])
Both methods converge to the same values. The analytical solution is exact (given floating-point precision) but requires matrix inversion — which is expensive for large state spaces.
Comparison: Monte Carlo vs Bellman
| Method | Requires model? | Handles large spaces? | Notes |
|---|
| Monte Carlo | No | Yes (sampling) | Requires full episodes |
| Bellman (iterative) | Yes (P, R) | Moderate | Converges slowly |
| Bellman (analytic) | Yes (P, R) | No (O(n³) inversion) | Exact for small MDPs |
In practice, when the full transition model is unknown (as in most real tasks), model-free methods like Monte Carlo or Temporal-Difference learning are used instead.