The Q_Learning class implements a tabular Q-Learning agent for Connect 4. It maintains a dictionary-based Q-table keyed by serialized (state, action) strings and updates Q-values via the Bellman equation after each move. Mirror symmetry of the board is exploited automatically to halve the effective table size, and the agent supports both online training (against an MCTS opponent) and zero-update evaluation of a pre-trained Q-table.
Class: Q_Learning
Constructor Parameters
- player: Player token this agent represents, 1 or 2. Used when placing pieces on the board and when checking winning states.
- alpha: Learning rate controlling how aggressively Q-values are updated. Set to 0 during evaluation (no learning) or 0.6 during training. It is the step size of the Bellman update applied at each step.
- discount_factor: Gamma (γ), which sets how much future rewards are weighted relative to immediate rewards. A value of 0.8 is used during training. Any value may be passed when alpha=0, as it then has no effect.
- epsilon: Exploration rate for the epsilon-greedy policy. 0 means fully greedy (always pick the highest Q-value action); 0.1 during training allows occasional random exploration of non-greedy columns.
- r: Number of rows in the board. The training setup in main.py uses r=4 for faster convergence; the evaluation setup uses r=3.
- c: Number of columns in the board. Both training and evaluation use c=5.
Instance Attributes
| Attribute | Type | Initial Value | Description |
|---|---|---|---|
| player | int | from constructor | Player token (1 or 2). |
| alpha | float | from constructor | Learning rate. |
| epsilon | float | from constructor | Exploration rate. |
| discount_factor | float | from constructor | Gamma (γ) discount factor. |
| r | int | from constructor | Board row count. |
| c | int | from constructor | Board column count. |
| previous_state | list or None | None | Board state from the previous turn, used to compute the Bellman update. |
| previous_action | int or None | None | Column index chosen on the previous turn. |
| game_status | str | "running" | Outcome flag read during the terminal update call. Set to "win", "loss", or "draw" by the game loop before calling take_action() one final time. |
| total_rewards | float | 0 | Accumulated reward for the current episode. Reset externally between episodes. |
| q_values | dict | set via set_Qvalues() | Q-table. Keys are state-action strings; values are float Q-scores. Unseen pairs default to 0. |
Methods
set_Qvalues(q_values)
Loads a pre-existing Q-table into the agent. Must be called before take_action() in evaluation mode; during training, pass an empty {} dict to start fresh.
Parameters
- q_values: A dictionary mapping state-action strings to float Q-values. Typically loaded from a pickle file (see usage example below).
Returns: None.
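As a rough illustration of the pickle round trip, the sketch below saves and reloads a plain dict. The file name q_table.pkl and the example key are assumptions made for illustration and are not taken from the source; the reloaded dict is the kind of object that would be passed to set_Qvalues().

```python
import os
import pickle
import tempfile

# Hypothetical Q-table: keys are state-action strings, values are floats.
# The exact key format used by Q_Learning is an assumption here.
q_table = {"0000000000000214": 1.5}

# Hypothetical file name, written to the temp directory for this sketch.
path = os.path.join(tempfile.gettempdir(), "q_table.pkl")

with open(path, "wb") as f:
    pickle.dump(q_table, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# `loaded` is a plain dict that could then be handed to agent.set_Qvalues(loaded).
```

Only unpickle files you created yourself, since pickle.load can execute arbitrary code from an untrusted file.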
set_state(state)
Loads the current board so take_action() can read it. Must be called at the start of every turn.
Parameters
- state: A 2D list of shape [r][c]. Cells hold 0 (empty), 1, or 2.
Returns: None.
take_action() -> tuple
The primary action method. Its behaviour changes depending on self.game_status:
During normal play (game_status == "running"):
- Serialises the current board and looks up (or initialises) Q-values for every valid column.
- Applies the Bellman update for the previous move using the current state as s'.
- Selects an action via epislon_greedy_policy().
- Applies the chosen action to the board via gravity.
- If the resulting state is terminal, immediately updates Q(s, a) with the terminal reward.
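The update step in the list above can be sketched as follows, assuming the standard tabular Q-learning rule Q(s, a) ← Q(s, a) + α [R + γ max Q(s', a') − Q(s, a)]. The function name bellman_update and the key strings are hypothetical; only the roles of alpha and discount_factor and the default of 0 for unseen pairs come from this page.

```python
def bellman_update(q_values, prev_key, reward, next_q, alpha, discount_factor):
    """Apply one tabular Q-learning update to q_values[prev_key] and return it.

    next_q holds Q(s', a') for each valid action in the new state; pass an
    empty list for a terminal state, where no future value is bootstrapped.
    """
    old = q_values.get(prev_key, 0.0)           # unseen pairs default to 0
    best_next = max(next_q) if next_q else 0.0  # value of the best follow-up move
    target = reward + discount_factor * best_next
    q_values[prev_key] = old + alpha * (target - old)
    return q_values[prev_key]


q = {}
# Non-terminal move: reward -1, best follow-up value 10, alpha 0.6, gamma 0.8.
updated = bellman_update(q, "s0|a2", -1, [10.0, 4.0], 0.6, 0.8)
# With alpha = 0 (evaluation mode) the stored value is left unchanged.
frozen = bellman_update(q, "s0|a2", -1, [100.0], 0.0, 0.8)
```

In the first call the target is -1 + 0.8 × 10 = 7, so the stored value moves from 0 to 0.6 × 7 = 4.2.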
Returns a tuple of five values:
- Updated board after the chosen move.
- True if the game ended on this move.
- Status string: "win", "draw", or "..".
- Column index (0-based) chosen by the agent.
- Q(s, a) for the chosen action at the current state, reflecting the agent’s estimated value of the move.
During the terminal update (game_status != "running"):
When the game loop sets game_status to "win", "loss", or "draw" and calls take_action() one final time, the method updates the Q-value for the last move with the terminal reward and returns a 2-tuple:
- The board state at termination (unchanged).
- Always True, since the game is over.

Rewards used in these updates:

| Outcome | Reward (R) |
|---|---|
| Win (agent places winning piece) | +50 |
| Draw | -10 |
| Loss (game_status == "loss") | -50 |
| Every non-terminal move | -1 |
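The reward table can be expressed as a small lookup. This is only a sketch: the names REWARDS, STEP_REWARD, and reward_for are hypothetical, and treating every status other than "win", "draw", or "loss" as a non-terminal move is an assumption. The numeric values are the ones listed in the table.

```python
# Terminal rewards, taken from the table above.
REWARDS = {"win": 50, "draw": -10, "loss": -50}
STEP_REWARD = -1  # applied to every non-terminal move


def reward_for(game_status):
    """Return the reward for a status string; non-terminal statuses cost -1."""
    return REWARDS.get(game_status, STEP_REWARD)
```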
epislon_greedy_policy(next_q) -> np.int64
Selects a column using the epsilon-greedy strategy. Note: the method name preserves a typo from the source (epislon, not epsilon).
Parameters
- next_q: A list of Q-values, one per column. Full columns have their entry set to -math.inf to prevent illegal moves.
Returns: np.int64, the chosen column index.
With probability epsilon a valid column is chosen uniformly at random (exploration). Otherwise, the column with the highest Q-value is chosen greedily (exploitation). Ties among greedy actions are broken randomly.
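The selection rule described above might look like the following minimal sketch. It uses only the standard library and returns a plain int, whereas the documented method returns np.int64; the function name epsilon_greedy and the rng parameter are hypothetical, and the sketch assumes at least one column is still open.

```python
import math
import random


def epsilon_greedy(next_q, epsilon, rng=random):
    """Pick a column index from next_q, skipping columns marked -inf."""
    valid = [col for col, q in enumerate(next_q) if q != -math.inf]
    if rng.random() < epsilon:
        return rng.choice(valid)  # explore: uniform over open columns
    best_q = max(next_q[col] for col in valid)
    best = [col for col in valid if next_q[col] == best_q]
    return rng.choice(best)       # exploit: random tie-break among the best


q_row = [0.5, -math.inf, 2.0, 2.0, -1.0]
greedy_pick = epsilon_greedy(q_row, epsilon=0.0)  # column 2 or 3 (tied)
random_pick = epsilon_greedy(q_row, epsilon=1.0)  # any column except 1
```

Because random.random() returns values in [0, 1), epsilon=0 never explores and epsilon=1 always does.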
mirror_state_action(curr_state, action) -> str
Computes the Q-table key for the horizontally mirrored board and the correspondingly flipped column index. This symmetry means the agent never needs to store a state and its mirror separately — they always share the same Q-value entry.
Parameters
- curr_state: The board to mirror.
- action: The column index to flip; the mirrored action is c - 1 - action.
Returns: str, the concatenation of the mirrored board cells followed by the mirrored column index.
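A minimal sketch of the mirroring step is shown below. It assumes the board is serialised row by row from the top with one character per cell; the actual key layout in the source may differ, and this standalone function takes the column count from the board rather than from self.c.

```python
def mirror_state_action(curr_state, action):
    """Build a key string for the left-right mirrored board and column."""
    c = len(curr_state[0])
    mirrored_rows = [row[::-1] for row in curr_state]  # flip each row
    mirrored_action = c - 1 - action
    # Assumed layout: cells in row-major order, then the column index.
    cells = "".join(str(cell) for row in mirrored_rows for cell in row)
    return cells + str(mirrored_action)


board = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 2, 0, 0, 0],
]
key = mirror_state_action(board, 0)  # bottom row becomes 0 0 0 2 1, column 0 becomes 4
```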
is_terminal_state(next_state, action) -> tuple[bool, str]
Returns (True, "win") if the last piece created four in a row, (True, "draw") if all columns are full, or (False, "..") otherwise.
is_winning_state(next_state, action) -> bool
Checks four directions (diagonal, anti-diagonal, horizontal, vertical) from the cell where the last piece landed. Returns True if self.player’s token appears four times consecutively (the internal counter reaches 3, meaning 3 additional pieces in line beyond the placed piece, totalling 4).
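The four-direction check can be sketched as below. This is an illustrative standalone version, not the source implementation: it takes player as an argument instead of reading self.player, and it assumes row 0 is the top of the board so that the landed piece is the first non-empty cell in the column.

```python
def is_winning_state(board, action, player):
    """Return True if the top piece in column `action` completes four in a row."""
    rows, cols = len(board), len(board[0])
    # Locate the landed piece: the first non-empty cell from the top.
    row = next((r for r in range(rows) if board[r][action] != 0), None)
    if row is None or board[row][action] != player:
        return False
    # Directions: diagonal, anti-diagonal, horizontal, vertical.
    for dr, dc in ((1, 1), (1, -1), (0, 1), (1, 0)):
        count = 0  # additional pieces in line beyond the placed one
        for sign in (1, -1):  # walk both ways along the line
            r, c = row + sign * dr, action + sign * dc
            while 0 <= r < rows and 0 <= c < cols and board[r][c] == player:
                count += 1
                r, c = r + sign * dr, c + sign * dc
        if count >= 3:
            return True
    return False


board = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 1, 0],
]
player_one_wins = is_winning_state(board, 3, 1)  # four in the bottom row
player_two_wins = is_winning_state(board, 2, 2)  # only three in a row
```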