Both the MCTS and Q-Learning agents share the same Connect 4 environment model: a NumPy 2D array where pieces fall under gravity, valid moves are checked by inspecting the top row, and terminal states are detected by scanning four-in-a-row patterns around the last-played piece.
Board Layout
The board is a NumPy 2D array of shape (rows, columns) initialized with np.zeros and immediately cast to int. The cell encoding is straightforward:
| Value | Meaning |
|---|---|
| 0 | Empty cell |
| 1 | Player 1’s piece |
| 2 | Player 2’s piece |
Board dimensions vary by game mode:
| Mode | Rows | Columns |
|---|---|---|
| MCTS vs MCTS | 6 | 5 |
| Q-Learning training | 4 | 5 |
| MCTS vs Q-Learning (testing) | 3 | 5 |
import numpy as np
# MCTS vs MCTS — default 6×5
game = np.zeros((6, 5))
game = game.astype(int)
# Print the board
print('\n'.join(' '.join(str(x) for x in row) for row in game))
Gravity Mechanic
When a player drops a piece into column i, the piece falls to the lowest unoccupied row. The code iterates from the bottom row upward until it finds a zero cell:
# From RandomPlayer.py, take_action()
action = self.random_action()
for x in range(self.r):
    if self.state[self.r-1-x][action] == 0:
        self.state[self.r-1-x][action] = self.player
        break
The same pattern appears in MCTS.py during expansion(), where each candidate child state is built by dropping the current player’s piece into an open column:
# From MCTS.py, expansion()
for x in range(self.r):
    if(new[self.r-1-x][i] == 0):
        if( depth%2 == 0):
            new[self.r-1-x][i] = self.player
        else:
            new[self.r-1-x][i] = self.player%2 + 1
        break
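The gravity rule and the depth-based choice of piece can be isolated into two small functions. This is a minimal sketch, not code from the repository: `drop_piece` and `piece_for_depth` are hypothetical names, and the board is assumed to be a NumPy integer array as described under Board Layout.

```python
import numpy as np

def piece_for_depth(player, depth):
    """Piece to place at a given tree depth (hypothetical helper).

    Even depths belong to `player`; odd depths belong to the opponent,
    computed as player % 2 + 1 (1 -> 2, 2 -> 1).
    """
    return player if depth % 2 == 0 else player % 2 + 1

def drop_piece(board, column, piece):
    """Drop `piece` into `column` in place; return the landing row.

    Returns None when the column is already full (hypothetical helper).
    """
    rows = board.shape[0]
    for offset in range(rows):
        row = rows - 1 - offset  # start at the bottom row and move up
        if board[row][column] == 0:
            board[row][column] = piece
            return row
    return None

board = np.zeros((6, 5), dtype=int)
first_row = drop_piece(board, 2, piece_for_depth(1, 0))   # player 1, bottom row
second_row = drop_piece(board, 2, piece_for_depth(1, 1))  # player 2, stacked on top
```

Returning the landing row is a design choice of this sketch; the original snippets only mutate the board and break out of the loop.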
Valid Actions
A column index i is a valid action if and only if the top cell of that column is empty:
# Valid if state[0][i] == 0
for i in range(self.c):
    if curr_state[0][i] == 0:
        pass  # column i is a legal move
The same check is used in the three classes (MCTS, Q_Learning, and Random_Player) and in the main game loop.
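The top-row rule can be expressed as a single function that returns every legal column. This is a sketch under the assumption that the board is a NumPy integer array; `valid_actions` is a hypothetical name, not a function from the repository.

```python
import numpy as np

def valid_actions(board):
    """Return the column indices whose top cell (row 0) is still empty."""
    return [col for col in range(board.shape[1]) if board[0][col] == 0]

board = np.zeros((3, 5), dtype=int)
board[:, 1] = [1, 2, 1]  # fill column 1 from top to bottom
actions = valid_actions(board)  # column 1 is excluded
```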
Terminal State Detection
The is_terminal_state method returns a (bool, str) tuple:
| Return value | Meaning |
|---|---|
| (True, "win") | The last move was a winning move |
| (True, "draw") | All top-row cells are occupied; no winner |
| (False, "..") | Game is still in progress |
# From MCTS.py, is_terminal_state()
def is_terminal_state(self, next_state, action):
    if self.is_winning_state(next_state, action):
        return True, "win"
    for y in range(self.c):
        if(next_state[0][y] == 0):
            return False, ".."
    return True, "draw"
Draw detection scans every column’s top cell (next_state[0][y]). If all are non-zero, the board is full and the game is a draw.
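The order of the checks (win first, then a full top row, otherwise still in progress) can be sketched outside the class. This is an illustration only: `top_row_full` and `terminal_status` are hypothetical names, the `last_move_won` flag stands in for the result of `is_winning_state`, and the vectorized `np.all` test is an alternative to the per-column loop rather than the repository's code.

```python
import numpy as np

def top_row_full(board):
    """True when no cell in row 0 is empty, so no column accepts a piece."""
    return bool(np.all(board[0] != 0))

def terminal_status(board, last_move_won):
    """Mirror the (bool, str) return convention described above."""
    if last_move_won:          # a win takes priority over a full board
        return True, "win"
    if top_row_full(board):
        return True, "draw"
    return False, ".."

# A full 3x5 board with alternating pieces and no four in a row.
full = np.array([[1, 2, 1, 2, 1],
                 [2, 1, 2, 1, 2],
                 [1, 2, 1, 2, 1]])
status = terminal_status(full, last_move_won=False)
```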
Win Detection Algorithm
is_winning_state finds the row of the last-placed piece by scanning downward from row 0 in the played column, then checks all four directional axes using direction vectors.
# From MCTS.py, is_winning_state()
def is_winning_state(self, next_state, action):
    y = action
    x = 0
    for i in range(self.r):
        if next_state[i][y] != 0:
            break
        x += 1
    directions = [ [1,1], [1,-1], [0,1], [1,0] ]
    for d in directions:
        for i in range(4):
            count = 0
            for j in range(i):
                x_dash = x + (j+1)*d[0]
                y_dash = y + (j+1)*d[1]
                if( self.out_of_bounds(x_dash,y_dash) or next_state[x_dash][y_dash] != self.player):
                    break
                count+=1
            for j in range(3-i):
                x_dash = x - (j+1)*d[0]
                y_dash = y - (j+1)*d[1]
                if( self.out_of_bounds(x_dash,y_dash) or next_state[x_dash][y_dash] != self.player):
                    break
                count+=1
            if(count == 3):
                return True
    return False
The four direction vectors cover:
| Vector | Axis |
|---|---|
| [1, 1] | Diagonal (↘) |
| [1, -1] | Anti-diagonal (↙) |
| [0, 1] | Horizontal (→) |
| [1, 0] | Vertical (↓) |
For each direction, the algorithm tries every split i (0–3) of the three neighboring cells it needs: it walks up to i cells in the positive direction and up to 3 − i cells in the negative direction from the played cell (x, y), stopping in each direction at the first cell that is out of bounds or does not hold self.player's piece. The piece at (x, y) itself is not added to count, so the win condition count == 3 means three additional same-color pieces were found in an unbroken line through (x, y), which makes four in a row once the placed piece is included.
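One way to read the split index is that each value of i selects one of the four length-4 windows along an axis that contain the played cell. The sketch below lists those windows as offsets from the played cell (offset 0); `window_offsets` is a hypothetical helper written for this illustration.

```python
def window_offsets(split):
    """Offsets along one axis covered by a split, with the played cell at 0.

    `split` cells lie in the positive direction, 3 - split in the negative.
    """
    return list(range(-(3 - split), split + 1))

windows = [window_offsets(i) for i in range(4)]
# windows[0] is [-3, -2, -1, 0]; windows[3] is [0, 1, 2, 3]
```

Multiplying each offset by a direction vector gives the board cells that one pass of the inner loops inspects.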
The out_of_bounds guard prevents index errors at board edges:
def out_of_bounds(self, x, y):
    return not(x >= 0 and x < self.r and y >= 0 and y < self.c)
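The win check receives only the action (column index) of the last move; it does not scan the entire board. The row is derived at call time as the topmost occupied cell of the played column, and neighboring cells are compared against self.player rather than against the value stored at (x, y). This means is_winning_state must be called immediately after each move with the correct action argument, and self.player must be the player who made that move.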
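For experimenting outside the class, the same test can be written as a standalone function. Note that this sketch swaps in a different but equivalent formulation: instead of trying each split, it measures the unbroken run of friendly pieces on both sides of the played cell and checks whether the two runs add up to at least three. All names here (`landing_row`, `run_length`, `is_winning_move`) are hypothetical, and the player is passed explicitly rather than read from `self.player`.

```python
import numpy as np

DIRECTIONS = [(1, 1), (1, -1), (0, 1), (1, 0)]  # same four axes as above

def landing_row(board, column):
    """Topmost occupied row in `column`, or None if the column is empty."""
    for row in range(board.shape[0]):
        if board[row][column] != 0:
            return row
    return None

def run_length(board, row, col, d_row, d_col, player):
    """Count consecutive `player` pieces starting one step from (row, col)."""
    rows, cols = board.shape
    length = 0
    r, c = row + d_row, col + d_col
    while 0 <= r < rows and 0 <= c < cols and board[r][c] == player:
        length += 1
        r += d_row
        c += d_col
    return length

def is_winning_move(board, column, player):
    """True if the top piece in `column` completes four in a row for `player`."""
    row = landing_row(board, column)
    if row is None or board[row][column] != player:
        return False
    for d_row, d_col in DIRECTIONS:
        total = (run_length(board, row, column, d_row, d_col, player)
                 + run_length(board, row, column, -d_row, -d_col, player))
        if total >= 3:  # three neighbors plus the played piece
            return True
    return False

board = np.zeros((6, 5), dtype=int)
board[5, 0:4] = 1  # four of player 1's pieces along the bottom row
horizontal_win = is_winning_move(board, 2, 1)
wrong_player = is_winning_move(board, 2, 2)
```

Unlike the method above, this version also checks that the topmost piece in the column belongs to the given player, which is an added safeguard of the sketch rather than behavior of the original code.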