The MCTS agent builds a game tree incrementally by running many short simulations (playouts). Each playout passes through four phases — selection, expansion, simulation, and back-propagation — that together balance exploring new moves against exploiting moves already known to be good.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/marshalharman/QLearning_and_MCTS-Reinforcement_Learning/llms.txt
Use this file to discover all available pages before exploring further.
Node Structure
Every position in the tree is represented by aNode object that stores the board state and accumulates statistics across all playouts that passed through it.
| Field | Type | Meaning |
|---|---|---|
N | int | Total number of times this node has been visited |
win | int | Cumulative score updated by back_propagation: +1 per win, −1 per loss, 0 per draw |
children | list[Node] | Child nodes, one per legal column in this position |
state | list[list] | Deep-copied board snapshot at this node |
UCB1 Formula
During selection the algorithm must decide which child to explore next. It uses the Upper Confidence Bound 1 (UCB1) formula to balance exploitation (high win rate) with exploration (low visit count):C (default 2) controls the exploration–exploitation trade-off. A larger C pushes the agent toward less-visited nodes; a smaller C makes it stick to nodes with the best observed win rate. Any child with N == 0 (never visited) is selected uniformly at random before UCB1 is applied, ensuring every child is sampled at least once.
The Four Phases
Selection
Starting from the root, the algorithm descends the tree by repeatedly choosing the child with the highest UCB1 score until it reaches a node with no children (a leaf).Every node visited during descent is appended to
self.parents so that back-propagation can update them all.Expansion
When a leaf is reached,
expansion() creates one child node for each valid column move. The current player is assigned based on the parity of depth: even depth means the MCTS agent’s own player; odd depth means the opponent.Depth parity controls turn-taking:
depth % 2 == 0 assigns self.player; depth % 2 == 1 assigns self.player % 2 + 1 (the opponent). This correctly models alternating moves through all levels of the tree.Simulation
From the newly expanded node, two After the random playout terminates, the result is translated into
Random_Player instances play the game out to completion by making random moves. The player assignment at simulation start again respects depth parity so the correct player moves first."win", "loss", or "draw" relative to the MCTS agent before being passed to back-propagation.Back-Propagation
Every node recorded in Applying the same
self.parents during selection and expansion is updated: visit count N increments by 1, and win shifts by +1 for a win, -1 for a loss, and 0 for a draw.x to all ancestors means nodes on winning paths accumulate higher win totals, making them more attractive to UCB1 in later iterations.Pre-Built Subtree
Before any playouts begin,construct_tree() pre-expands the first four levels of the game tree deterministically. This seeds the tree with a realistic set of positions so that early playouts are not wasted exploring trivially bad lines.
depth=4:
Action Selection
After all playouts, the agent picks the child of the root with the highest visit countN, not the highest win ratio:
N naturally favors well-explored, high-confidence moves.