TheDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/marshalharman/QLearning_and_MCTS-Reinforcement_Learning/llms.txt
Use this file to discover all available pages before exploring further.
MCTS class implements a Monte Carlo Tree Search agent for Connect 4. It manages the full MCTS loop — selection, expansion, simulation, back-propagation — and returns the best action after a configurable number of playouts. When play_outs is set to 0, the agent falls back to a uniform random policy, making it easy to benchmark tree search against pure randomness without changing any other code.
Class: MCTS
Constructor Parameters
Number of MCTS simulations to run per move. Each simulation runs the full
selection → expansion → simulation → back-propagation cycle. Set to
0 to
bypass tree search entirely and fall back to random_action().The player token this agent represents. Must be
1 or 2. The opponent’s
token is derived automatically as player % 2 + 1.UCB1 exploration constant. Higher values encourage wider exploration of
unvisited nodes; lower values exploit already-promising branches more
aggressively. The standard theoretical value is
√2 ≈ 1.41, but 2 works
well in practice for this board size.Number of rows in the board. Defaults to
6 for a standard Connect 4 grid.Number of columns in the board. Defaults to
5.Methods
set_state(state)
Loads the current board into the agent before calling take_action(). Must be called at the start of every turn.
Parameters
A 2D list of shape
[r][c] representing the current board. Cells contain 0
(empty), 1 (player 1’s piece), or 2 (player 2’s piece).None
Always call
set_state() with the latest board before each call to
take_action(). The agent does not retain state between turns automatically.take_action()
Runs play_outs iterations of the full MCTS loop starting from the current state, then selects the most-visited child as the move.
Returns
When play_outs > 0:
Updated 2D board after the chosen move has been applied.
True if the game is over (win or draw), False otherwise."win" if this agent just won, "draw" if the board is full with no winner,
or ".." if the game is still in progress.Column index (0-based) that the agent chose.
The win/visit ratio (
res.win / res.N) of the selected child node. Reflects
the tree’s confidence in the chosen move.When
play_outs=0, the method falls back to random_action() and returns
only a 3-tuple: (state, end, result) — without action or value.
Ensure callers unpack accordingly when mixing playout counts.random_action() -> np.int64
Selects a column uniformly at random from all non-full columns. Full columns receive probability weight 0; the remaining valid columns share equal probability. Used automatically by take_action() when play_outs=0.
Returns: np.int64 — the chosen column index.
construct_tree(root, depth)
Pre-builds the search tree to a fixed depth of 4 before the MCTS playout loop begins. Called internally by take_action() when play_outs > 0. For each valid column at each level, a child Node is created and appended to the parent’s children list, alternating player tokens by depth parity.
Parameters
The root node representing the current board state.
Remaining depth to build. The initial call uses
4; recursion stops when
depth <= 0.None
selection(root, depth) -> tuple[Node, int]
Traverses the existing tree from root downward, choosing children via the UCB1 formula until a leaf node (one with no children) is reached.
Parameters
The node from which traversal begins — typically the root built by
construct_tree().Current recursion depth, used to alternate player tokens during expansion.
Pass
0 on the initial call.(leaf: Node, depth: int) — the deepest unvisited or highest-UCB1 node and the depth at which it was found.
UCB1 formula used internally:
N == 0 are selected randomly among all unvisited siblings before UCB1 is evaluated.
expansion(leaf, depth) -> tuple[Node, int]
Expands leaf by adding one child node for every valid (non-full) column. Returns a new child node for simulation.
Parameters
The leaf node returned by
selection().Current depth, used to determine which player’s token to place. Even depth →
the MCTS agent’s own token; odd depth → the opponent’s token.
(child: Node, depth: int) — a new Node wrapping a copy of leaf.children[0]’s state and depth + 1.
If
leaf has no valid moves (terminal state) or if leaf.N == 0
(unvisited leaf), expansion returns (leaf, depth) unchanged without
incrementing depth. The early-return condition in source is:simulation(child, depth) -> str
Plays out a complete random game from child’s board state using two Random_Player instances, one for each player. Returns the outcome from the perspective of the MCTS agent.
Parameters
The newly expanded node whose board state is the rollout starting position.
Current depth. Determines which
Random_Player instance goes first in the
rollout.str — "win" if the MCTS agent’s player wins the rollout, "loss" if the opponent wins, or "draw" if the board fills with no winner.
back_propagation(result, child)
Propagates the simulation result back through every node stored in self.parents, updating visit counts (N) and win scores (win).
Parameters
The string returned by
simulation(). The score added to each node’s win
tally is: +1 for "win", -1 for "loss", 0 for "draw".The simulated child node. Its ancestors are stored in
self.parents and are
all updated in the loop.None
is_terminal_state(next_state, action) -> tuple[bool, str]
Determines whether next_state is a terminal board position.
Parameters
The board state after the most recent move.
Column index of the last piece placed. Used to anchor the win-check scan.
(True, "win")— the agent’s last move created four in a row.(True, "draw")— all columns are full with no winner.(False, "..")— the game is still in progress.
is_winning_state(next_state, action) -> bool
Checks whether the piece just placed in column action completes four consecutive tokens belonging to self.player in any of four directions: diagonal [1,1], anti-diagonal [1,-1], horizontal [0,1], or vertical [1,0].
Parameters
Board state to evaluate.
Column index of the last piece placed. The method finds the row automatically
by scanning down from the top.
bool — True if four in a row exists, False otherwise.
Usage Example
Class: Node
The Node class represents a single board position in the MCTS search tree.
Constructor
A 2D board to snapshot. The constructor performs a
copy.deepcopy so
subsequent mutations to the source board do not affect this node.Instance Fields
| Field | Type | Description |
|---|---|---|
N | int | Total number of times this node has been visited during back-propagation. Starts at 0. |
win | int | Cumulative win score. Updated by back_propagation(): +1 for a win, -1 for a loss, 0 for a draw. Starts at 0. |
children | list[Node] | Child nodes added during expansion. Empty until expansion() is called. |
state | list[list[int]] | Deep copy of the board at the time this node was created. |
Methods
update(win)
Increments self.N by 1. Also increments self.win by 1, but only if self.win is currently truthy (i.e., non-zero). Note that back_propagation() updates N and win directly on each node in self.parents rather than calling update(), so this method is not used in the main MCTS loop.
get_ratio() -> float
Returns win / N. Returns 0 if N == 0 to avoid division by zero.