Markov Decision Processes: Monte Carlo and Bellman Equations

A Markov Decision Process (MDP) is the mathematical backbone of most reinforcement learning algorithms. It formalises the interaction between an agent and an environment as a sequence of state transitions driven by rewards.

MDP Definition

An MDP is defined by the tuple (S, A, P, R, γ):

Symbol	Meaning
S	Set of states
A	Set of actions
P(s′ \| s)	Transition probability from state `s` to `s′`
R(s, s′)	Immediate reward received on transition `s → s′`
γ	Discount factor `∈ [0, 1)`

The Markov property says the future depends only on the current state, not on the history: P(s_{t+1} | s_t, s_{t-1}, …, s_0) = P(s_{t+1} | s_t).

Example: 5-State Chain

The notebooks use a 5-state chain (states 0–4) where state 4 is a terminal absorbing state.

import numpy as np

# 状态转移概率矩阵 — Transition probability matrix [5, 5]
P = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5],
    [0.0, 0.1, 0.2, 0.2, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0],  # terminal state
])

# 反馈矩阵，-100的位置是不可能走到的
# Reward matrix; -100 marks impossible transitions
R = np.array([
    [-1.0,    0.0, -100.0, -100.0, -100.0],
    [-1.0, -100.0,   -2.0, -100.0, -100.0],
    [-100.0, -100.0, -100.0,  -2.0,    0.0],
    [-100.0,   1.0,   1.0,   1.0,  10.0],
    [-100.0, -100.0, -100.0, -100.0, -100.0],
])

Monte Carlo Method

The Monte Carlo method estimates state values empirically by generating many episodes (chains) and averaging the discounted returns.

Generating Episodes

import numpy as np
import random


# 生成一个chain — Generate a single episode
def get_chain(max_lens):
    ss = []   # states visited
    rs = []   # rewards collected

    # 随机选择一个除4以外的状态作为起点 — Random start (not terminal)
    s = random.choice(range(4))
    ss.append(s)

    for _ in range(max_lens):
        # 按照P的概率，找到下一个状态 — Sample next state
        s_next = np.random.choice(np.arange(5), p=P[s])

        # 取到r — Get reward
        r = R[s, s_next]

        s = s_next
        ss.append(s)
        rs.append(r)

        # 如果状态到了4则结束 — Stop at terminal state
        if s == 4:
            break

    return ss, rs


# 生成N个chain — Generate N episodes
def get_chains(N, max_lens):
    ss = []
    rs = []
    for _ in range(N):
        s, r = get_chain(max_lens)
        ss.append(s)
        rs.append(r)
    return ss, rs

Computing Returns

# 给定一条链，计算回报 — Compute discounted return for an episode
def get_value(rs):
    total = 0
    for i, r in enumerate(rs):
        # 给每一步的反馈做一个系数，随着步数往后衰减
        # Apply discount: earlier steps matter more
        total += 0.5 ** i * r
    return total

Monte Carlo Value Estimation

# 蒙特卡洛法评估每个状态的价值 — MC value estimation
def get_values_by_monte_carlo(ss, rs):
    # 记录5个不同开头的价值 — Collect returns per starting state
    values = [[] for _ in range(5)]

    for s, r in zip(ss, rs):
        values[s[0]].append(get_value(r))

    # 求每个开头的平均价值 — Average return per state
    return [np.mean(i) for i in values]


# Run with 2000 episodes
# Reference: -1.23, -1.70, 0.48, 5.97, 0
get_values_by_monte_carlo(*get_chains(2000, 20))

As the number of episodes increases, the Monte Carlo estimates converge to the true state values.

Bellman Equation

The Bellman equation expresses the value of a state V(s) as the expected immediate reward plus the discounted value of the next state:

V(s) = Σ_{s'} P(s'|s) · [R(s,s') + γ · V(s')]

In matrix form this becomes:

V = R + γ P V
V - γ P V = R
(I - γ P) V = R
V = (I - γ P)⁻¹ R

The notebooks demonstrate two approaches to solving this system.

Iterative (Gradient Descent) Solution

The second notebook redefines the MDP as a 6-state chain with a per-state reward vector R:

# 梯度下降法计算贝尔曼矩阵 — Iterative Bellman solution
# 6-state MDP with per-state reward vector
P = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.7],
    [0.0, 0.2, 0.3, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])

R = np.array([-1, -2, -2, 10, 1, 0])


def get_bellman():
    # 初始化values — Initialise values
    value = np.ones([6])

    for _ in range(200):
        for i in range(6):
            # 每一行的概率 × 当前value，× gamma，+ 奖励
            # V(s) ← R(s) + γ Σ P(s'|s) V(s')
            value[i] = R[i] + 0.5 * P[i].dot(value)

    return value


get_bellman()
# array([-2.02, -2.21,  1.16, 10.54,  3.59,  6.22e-61])

Analytical (Matrix Inverse) Solution

# 解析解贝尔曼方程矩阵 — Analytical Bellman solution
def get_bellman():
    mat = np.eye(*P.shape)
    mat -= 0.5 * P           # (I - γP)
    mat = np.linalg.inv(mat) # (I - γP)^{-1}
    return mat.dot(R)        # V = (I - γP)^{-1} R


get_bellman()
# array([-2.02, -2.21,  1.16, 10.54,  3.59,  0.  ])

Both methods converge to the same values. The analytical solution is exact (given floating-point precision) but requires matrix inversion — which is expensive for large state spaces.

Comparison: Monte Carlo vs Bellman

Method	Requires model?	Handles large spaces?	Notes
Monte Carlo	No	Yes (sampling)	Requires full episodes
Bellman (iterative)	Yes (P, R)	Moderate	Converges slowly
Bellman (analytic)	Yes (P, R)	No (O(n³) inversion)	Exact for small MDPs

In practice, when the full transition model is unknown (as in most real tasks), model-free methods like Monte Carlo or Temporal-Difference learning are used instead.

Get Started

Foundations

Tabular & Model-Based Methods

Deep RL Algorithms

Advanced Topics

Markov Decision Processes: Monte Carlo and Bellman Equations

MDP Definition

Example: 5-State Chain

Monte Carlo Method

Generating Episodes

Computing Returns

Monte Carlo Value Estimation

Bellman Equation

Iterative (Gradient Descent) Solution

Analytical (Matrix Inverse) Solution

Comparison: Monte Carlo vs Bellman

Build docs developers (and LLMs) love

Get Started

Foundations

Tabular & Model-Based Methods

Deep RL Algorithms

Advanced Topics

Documentation Index

​MDP Definition

​Example: 5-State Chain

​Monte Carlo Method

​Generating Episodes

​Computing Returns

​Monte Carlo Value Estimation

​Bellman Equation

​Iterative (Gradient Descent) Solution

​Analytical (Matrix Inverse) Solution

​Comparison: Monte Carlo vs Bellman

Build docs developers (and LLMs) love

MDP Definition

Example: 5-State Chain

Monte Carlo Method

Generating Episodes

Computing Returns

Monte Carlo Value Estimation

Bellman Equation

Iterative (Gradient Descent) Solution

Analytical (Matrix Inverse) Solution

Comparison: Monte Carlo vs Bellman