Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/lansinuote/Simple_Reinforcement_Learning/llms.txt

Use this file to discover all available pages before exploring further.

The multi-armed bandit problem is one of the simplest settings for studying reinforcement learning. There is no state — you just pull one of K levers and collect a stochastic reward. The challenge is deciding which lever to pull next given only the history of past plays.

Problem Setup

We model 10 slot machines, each with an unknown win probability drawn uniformly from [0, 1].
import numpy as np

# 每个老虎机的中奖概率,0-1之间的均匀分布
# Win probability for each machine, uniform on [0, 1]
probs = np.random.uniform(size=10)

# 记录每个老虎机的返回值 — Record each machine's results
# Initialise with a dummy value of 1 to avoid division-by-zero in mean
rewards = [[1] for _ in range(10)]

probs, rewards

Exploration vs Exploitation

The fundamental tension in bandit problems:
  • Exploitation: pull the arm with the highest estimated reward so far.
  • Exploration: occasionally try other arms in case the estimate is wrong.
Too much exploitation leads to premature convergence on a sub-optimal arm. Too much exploration wastes pulls on arms that are already known to be bad.

Algorithms

The basic greedy algorithm picks the arm with the highest mean reward with probability 1 – ε, and a uniformly random arm with probability ε = 0.01.
import random


# 贪婪算法 — Greedy algorithm
def choose_one():
    # 有小概率随机选择一根拉杆 — Small probability of random arm
    if random.random() < 0.01:
        return random.randint(0, 9)

    # 计算每个老虎机的奖励平均 — Compute mean reward per machine
    rewards_mean = [np.mean(i) for i in rewards]

    # 选择期望奖励估值最大的拉杆 — Choose arm with highest estimated reward
    return np.argmax(rewards_mean)


def try_and_play():
    i = choose_one()

    # 玩老虎机,得到结果 — Play the machine, get result
    reward = 0
    if random.random() < probs[i]:
        reward = 1

    # 记录玩的结果 — Record the result
    rewards[i].append(reward)


def get_result():
    # 玩N次 — Play N times
    for _ in range(5000):
        try_and_play()

    # 期望的最好结果 — Best possible expected result
    target = probs.max() * 5000

    # 实际玩出的结果 — Actual result obtained
    result = sum([sum(i) for i in rewards])

    return target, result


get_result()
# Example output: (4150.02, 4077)
Because ε is fixed at 1 %, the algorithm keeps exploring even late in the game when there is little to learn.

Comparison

All four algorithms share the same try_and_play() and get_result() structure — only choose_one() changes.
def try_and_play():
    i = choose_one()

    # 玩老虎机,得到结果 — Play the machine
    reward = 0
    if random.random() < probs[i]:
        reward = 1

    # 记录玩的结果 — Record the result
    rewards[i].append(reward)


def get_result():
    # 玩N次 — Play N times
    for _ in range(5000):
        try_and_play()

    # 期望的最好结果 — Theoretical best
    target = probs.max() * 5000

    # 实际玩出的结果 — Actual total reward
    result = sum([sum(i) for i in rewards])

    return target, result
Typical results over 5 000 plays (exact values vary due to random probs):
AlgorithmTargetActualGap
Greedy (ε=0.01)~4150~4077~73
Decaying Greedy~4555~4540~15
UCB~4936~4553~383
Thompson Sampling~4698~4680~18
The UCB gap looks large in this run because UCB explores more aggressively during the burn-in phase. Over longer horizons it typically matches or outperforms greedy methods.

Key Takeaways

  • Greedy is simple but converges prematurely with a fixed ε.
  • Decaying Greedy adapts the exploration rate and often performs well.
  • UCB is theoretically motivated with provable regret bounds.
  • Thompson Sampling uses Bayesian posterior sampling and tends to perform well in practice with little hyperparameter tuning.

Build docs developers (and LLMs) love