Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/marshalharman/QLearning_and_MCTS-Reinforcement_Learning/llms.txt

Use this file to discover all available pages before exploring further.

The MCTS class implements a Monte Carlo Tree Search agent for Connect 4. It manages the full MCTS loop — selection, expansion, simulation, back-propagation — and returns the best action after a configurable number of playouts. When play_outs is set to 0, the agent falls back to a uniform random policy, making it easy to benchmark tree search against pure randomness without changing any other code.

Class: MCTS

class MCTS:
    def __init__(self, play_outs, player, C=2, r=6, c=5):
        ...

Constructor Parameters

play_outs
int
required
Number of MCTS simulations to run per move. Each simulation runs the full selection → expansion → simulation → back-propagation cycle. Set to 0 to bypass tree search entirely and fall back to random_action().
player
int
required
The player token this agent represents. Must be 1 or 2. The opponent’s token is derived automatically as player % 2 + 1.
C
float
default:"2"
UCB1 exploration constant. Higher values encourage wider exploration of unvisited nodes; lower values exploit already-promising branches more aggressively. The standard theoretical value is √2 ≈ 1.41, but 2 works well in practice for this board size.
r
int
default:"6"
Number of rows in the board. Defaults to 6 for a standard Connect 4 grid.
c
int
default:"5"
Number of columns in the board. Defaults to 5.

Methods

set_state(state)

Loads the current board into the agent before calling take_action(). Must be called at the start of every turn. Parameters
state
list[list[int]]
A 2D list of shape [r][c] representing the current board. Cells contain 0 (empty), 1 (player 1’s piece), or 2 (player 2’s piece).
Returns: None
Always call set_state() with the latest board before each call to take_action(). The agent does not retain state between turns automatically.

take_action()

Runs play_outs iterations of the full MCTS loop starting from the current state, then selects the most-visited child as the move. Returns When play_outs > 0:
state
list[list[int]]
Updated 2D board after the chosen move has been applied.
end
bool
True if the game is over (win or draw), False otherwise.
result
str
"win" if this agent just won, "draw" if the board is full with no winner, or ".." if the game is still in progress.
action
int
Column index (0-based) that the agent chose.
value
float
The win/visit ratio (res.win / res.N) of the selected child node. Reflects the tree’s confidence in the chosen move.
When play_outs=0, the method falls back to random_action() and returns only a 3-tuple: (state, end, result)without action or value. Ensure callers unpack accordingly when mixing playout counts.

random_action() -> np.int64

Selects a column uniformly at random from all non-full columns. Full columns receive probability weight 0; the remaining valid columns share equal probability. Used automatically by take_action() when play_outs=0. Returns: np.int64 — the chosen column index.
action = random.choices(actions, probabilities, k=1)
return np.int64(action[0])

construct_tree(root, depth)

Pre-builds the search tree to a fixed depth of 4 before the MCTS playout loop begins. Called internally by take_action() when play_outs > 0. For each valid column at each level, a child Node is created and appended to the parent’s children list, alternating player tokens by depth parity. Parameters
root
Node
The root node representing the current board state.
depth
int
Remaining depth to build. The initial call uses 4; recursion stops when depth <= 0.
Returns: None

selection(root, depth) -> tuple[Node, int]

Traverses the existing tree from root downward, choosing children via the UCB1 formula until a leaf node (one with no children) is reached. Parameters
root
Node
The node from which traversal begins — typically the root built by construct_tree().
depth
int
Current recursion depth, used to alternate player tokens during expansion. Pass 0 on the initial call.
Returns: (leaf: Node, depth: int) — the deepest unvisited or highest-UCB1 node and the depth at which it was found. UCB1 formula used internally:
ucb1 = node.get_ratio() + C * (math.log(root.N) / node.N) ** 0.5
Nodes with N == 0 are selected randomly among all unvisited siblings before UCB1 is evaluated.

expansion(leaf, depth) -> tuple[Node, int]

Expands leaf by adding one child node for every valid (non-full) column. Returns a new child node for simulation. Parameters
leaf
Node
The leaf node returned by selection().
depth
int
Current depth, used to determine which player’s token to place. Even depth → the MCTS agent’s own token; odd depth → the opponent’s token.
Returns: (child: Node, depth: int) — a new Node wrapping a copy of leaf.children[0]’s state and depth + 1.
If leaf has no valid moves (terminal state) or if leaf.N == 0 (unvisited leaf), expansion returns (leaf, depth) unchanged without incrementing depth. The early-return condition in source is:
if not (len(leaf.children) != 0 and leaf.N != 0):
    return leaf, depth

simulation(child, depth) -> str

Plays out a complete random game from child’s board state using two Random_Player instances, one for each player. Returns the outcome from the perspective of the MCTS agent. Parameters
child
Node
The newly expanded node whose board state is the rollout starting position.
depth
int
Current depth. Determines which Random_Player instance goes first in the rollout.
Returns: str"win" if the MCTS agent’s player wins the rollout, "loss" if the opponent wins, or "draw" if the board fills with no winner.

back_propagation(result, child)

Propagates the simulation result back through every node stored in self.parents, updating visit counts (N) and win scores (win). Parameters
result
str
The string returned by simulation(). The score added to each node’s win tally is: +1 for "win", -1 for "loss", 0 for "draw".
child
Node
The simulated child node. Its ancestors are stored in self.parents and are all updated in the loop.
Returns: None

is_terminal_state(next_state, action) -> tuple[bool, str]

Determines whether next_state is a terminal board position. Parameters
next_state
list[list[int]]
The board state after the most recent move.
action
int
Column index of the last piece placed. Used to anchor the win-check scan.
Returns:
  • (True, "win") — the agent’s last move created four in a row.
  • (True, "draw") — all columns are full with no winner.
  • (False, "..") — the game is still in progress.

is_winning_state(next_state, action) -> bool

Checks whether the piece just placed in column action completes four consecutive tokens belonging to self.player in any of four directions: diagonal [1,1], anti-diagonal [1,-1], horizontal [0,1], or vertical [1,0]. Parameters
next_state
list[list[int]]
Board state to evaluate.
action
int
Column index of the last piece placed. The method finds the row automatically by scanning down from the top.
Returns: boolTrue if four in a row exists, False otherwise.

Usage Example

import numpy as np
from MCTS import MCTS

# Initialise a blank 6×5 board
game = np.zeros((6, 5), dtype=int)

# Create an MCTS agent for player 1 with 10 playouts per move
player1 = MCTS(play_outs=10, player=1, C=2, r=6, c=5)

# Load the current board state and take a move
player1.set_state(game)
game, end, result, action, value = player1.take_action()

print(f"Player 1 chose column {action} (value={value:.3f})")
# e.g. → Player 1 chose column 2 (value=0.600)
Increase play_outs for stronger play at the cost of more compute time. Values between 10 and 100 are typical for interactive use; 1000+ are used for training comparisons.

Class: Node

The Node class represents a single board position in the MCTS search tree.
class Node:
    def __init__(self, state):
        ...

Constructor

state
list[list[int]]
A 2D board to snapshot. The constructor performs a copy.deepcopy so subsequent mutations to the source board do not affect this node.

Instance Fields

FieldTypeDescription
NintTotal number of times this node has been visited during back-propagation. Starts at 0.
winintCumulative win score. Updated by back_propagation(): +1 for a win, -1 for a loss, 0 for a draw. Starts at 0.
childrenlist[Node]Child nodes added during expansion. Empty until expansion() is called.
statelist[list[int]]Deep copy of the board at the time this node was created.

Methods

update(win) Increments self.N by 1. Also increments self.win by 1, but only if self.win is currently truthy (i.e., non-zero). Note that back_propagation() updates N and win directly on each node in self.parents rather than calling update(), so this method is not used in the main MCTS loop.
def update(self, win):
    if self.win == True:   # truthy check on self.win, not the `win` parameter
        self.win += 1
    self.N += 1
The win parameter is not used inside update(). The conditional checks self.win == True (the instance field), so self.win only increments when it is already 1. This is a quirk of the source implementation; direct field mutation via back_propagation() is the mechanism actually used during gameplay.
get_ratio() -> float Returns win / N. Returns 0 if N == 0 to avoid division by zero.
node = Node(board_state)
node.N = 5
node.win = 3
print(node.get_ratio())  # → 0.6

Build docs developers (and LLMs) love