Q-Learning Agent API Reference for Connect 4

The Q_Learning class implements a tabular Q-Learning agent for Connect 4. It maintains a dictionary-based Q-table keyed by serialized (state, action) strings and updates Q-values via the Bellman equation after each move. Mirror symmetry of the board is exploited automatically to halve the effective table size, and the agent supports both online training (against an MCTS opponent) and zero-update evaluation of a pre-trained Q-table.

Class: `Q_Learning`

```python
class Q_Learning:
    def __init__(self, player, alpha, discount_factor, epsilon, r, c):
        ...
```

Constructor Parameters

player

int

required

Player token this agent represents — 1 or 2. Used when placing pieces on the board and when checking winning states.

alpha

float

required

Learning rate controlling how aggressively Q-values are updated. Set to 0 during evaluation (no learning) or 0.6 during training. The Bellman update applied each step is:

Q(s,a) ← Q(s,a) + α · (R + γ · max Q(s',a') − Q(s,a))
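
The update can be sketched on a plain dictionary Q-table as follows. This is a minimal illustration, not the class's actual code: the function name `bellman_update`, its arguments, and the `"s0|a1"` key format are assumptions made for the example. Unseen pairs default to 0, as described for `q_values`.

```python
def bellman_update(q_values, key, reward, next_keys, alpha, discount_factor):
    """Apply one tabular Q-learning update in place and return the new value.

    key       -- state-action key of the move being updated (hypothetical format)
    next_keys -- keys of every valid action in the successor state s'
    """
    old_q = q_values.get(key, 0.0)
    # max over a' of Q(s', a'); a terminal successor has no actions, so use 0
    best_next = max((q_values.get(k, 0.0) for k in next_keys), default=0.0)
    q_values[key] = old_q + alpha * (reward + discount_factor * best_next - old_q)
    return q_values[key]


# One non-terminal step (R = -1) with the training hyperparameters
q = {"s1|a2": 4.0}
new_q = bellman_update(q, "s0|a1", reward=-1, next_keys=["s1|a2", "s1|a3"],
                       alpha=0.6, discount_factor=0.8)
```

With alpha=0 the bracketed term is multiplied by zero, so the stored value is left unchanged. That is why evaluation runs use alpha=0.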

discount_factor

float

required

Gamma (γ): how much future rewards are weighted relative to immediate rewards. Training uses 0.8. When alpha=0 no update is applied, so this value has no effect and any number may be passed.

epsilon

float

required

Exploration rate for the epsilon-greedy policy. 0 means fully greedy (always pick the highest Q-value action); 0.1 during training allows occasional random exploration of non-greedy columns.

r

int

required

Number of rows in the board. The training setup in main.py uses r=4 for faster convergence; the evaluation setup uses r=3.

c

int

required

Number of columns in the board. Both training and evaluation use c=5.

Instance Attributes

| Attribute | Type | Initial Value | Description |
| --- | --- | --- | --- |
| `player` | `int` | from constructor | Player token (`1` or `2`). |
| `alpha` | `float` | from constructor | Learning rate. |
| `epsilon` | `float` | from constructor | Exploration rate. |
| `discount_factor` | `float` | from constructor | Gamma (γ) discount factor. |
| `r` | `int` | from constructor | Board row count. |
| `c` | `int` | from constructor | Board column count. |
| `previous_state` | `list \| None` | `None` | Board state from the previous turn, used to compute the Bellman update. |
| `previous_action` | `int \| None` | `None` | Column index chosen on the previous turn. |
| `game_status` | `str` | `"running"` | Outcome flag read during the terminal update call. Set to `"win"`, `"loss"`, or `"draw"` by the game loop before calling `take_action()` one final time. |
| `total_rewards` | `float` | `0` | Accumulated reward for the current episode. Reset externally between episodes. |
| `q_values` | `dict` | set via `set_Qvalues()` | Q-table. Keys are state-action strings; values are `float` Q-scores. Unseen pairs default to `0`. |

Methods

`set_Qvalues(q_values)`

Loads a pre-existing Q-table into the agent. Must be called before take_action() in evaluation mode; during training, pass an empty {} dict to start fresh.

Parameters

q_values

dict

A dictionary mapping state-action strings to float Q-values. Typically loaded from a pickle file (see usage example below).

Returns: None

`set_state(state)`

Loads the current board so take_action() can read it. Must be called at the start of every turn.

Parameters

state

list[list[int]]

A 2D list of shape [r][c]. Cells hold 0 (empty), 1, or 2.

Returns: None

`take_action() -> tuple`

The primary action method. Its behaviour depends on self.game_status.

During normal play (game_status == "running"):

1. Serialises the current board and looks up (or initialises) Q-values for every valid column.
2. Applies the Bellman update for the previous move using the current state as s'.
3. Selects an action via epislon_greedy_policy().
4. Applies the chosen action to the board via gravity.
5. If the resulting state is terminal, immediately updates Q(s, a) with the terminal reward.
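
The serialisation and gravity steps can be sketched as below. This is an illustration under stated assumptions rather than the class's code: the helper names `serialise` and `drop_piece` are hypothetical, the key layout (board cells row by row, then the column index) is inferred from the description of `mirror_state_action()`, and row 0 is assumed to be the top of the board.

```python
def serialise(state, action):
    """Flatten the board row by row and append the column index (assumed key layout)."""
    return "".join(str(cell) for row in state for cell in row) + str(action)


def drop_piece(state, action, player):
    """Return a copy of the board with `player`'s piece dropped into column `action`."""
    new_state = [list(row) for row in state]
    # Row 0 is assumed to be the top, so scan upwards from the bottom row
    # and fill the first empty cell found.
    for row in range(len(new_state) - 1, -1, -1):
        if new_state[row][action] == 0:
            new_state[row][action] = player
            return new_state
    raise ValueError(f"column {action} is full")


empty = [[0] * 5 for _ in range(3)]
after_first = drop_piece(empty, 2, player=2)
after_second = drop_piece(after_first, 2, player=1)
```

`drop_piece` copies the board instead of mutating it, so the previous state stays available for the next Bellman update.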

Returns (normal play) — 5-tuple:

next_state

list[list[int]]

Updated board after the chosen move.

is_terminal

bool

True if the game ended on this move.

result

str

"win", "draw", or "..".

action

int

Column index (0-based) chosen by the agent.

q_value

float

Q(s, a) for the chosen action at the current state — reflects the agent’s estimated value of the move.

Terminal reward update (game_status != "running"):

When the game loop sets game_status to "win", "loss", or "draw" and calls take_action() one final time, the method updates the Q-value for the last move with the terminal reward and returns a 2-tuple:

state

list[list[int]]

The board state at termination (unchanged).

True

bool

Always True — the game is over.

The terminal update path returns only 2 values (self.state, True), not 5. Callers that set game_status and invoke take_action() a final time must not attempt to unpack 5 values from that call.
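
One way for a caller to handle both return shapes is to normalise them before use. The helper below is a hypothetical sketch (`unpack_step` is not part of the class) that accepts either tuple and fills the missing fields with None.

```python
def unpack_step(step):
    """Normalise the 2-tuple and 5-tuple results of take_action() into one dict."""
    if len(step) == 2:
        # Terminal update path: (state, True)
        state, is_terminal = step
        result = action = q_value = None
    else:
        # Normal play: (next_state, is_terminal, result, action, q_value)
        state, is_terminal, result, action, q_value = step
    return {"state": state, "is_terminal": is_terminal, "result": result,
            "action": action, "q_value": q_value}


board = [[0] * 5 for _ in range(3)]
normal = unpack_step((board, False, "..", 3, 0.25))
terminal = unpack_step((board, True))
```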

Reward schedule:

| Outcome | Reward (R) |
| --- | --- |
| Win (agent places winning piece) | `+50` |
| Draw | `-10` |
| Loss (`game_status == "loss"`) | `-50` |
| Every non-terminal move | `-1` |
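
The schedule can be written as a small lookup, shown here as a sketch. The names `REWARDS`, `STEP_REWARD`, `reward_for`, and `episode_return` are illustrative and do not come from the source; any status other than a terminal outcome is treated as a non-terminal move.

```python
REWARDS = {"win": 50, "draw": -10, "loss": -50}
STEP_REWARD = -1  # applied to every non-terminal move


def reward_for(status):
    """Return R for a terminal outcome, or the step penalty for anything else."""
    return REWARDS.get(status, STEP_REWARD)


def episode_return(statuses):
    """Sum the rewards of one episode, given the status after each agent move."""
    return sum(reward_for(status) for status in statuses)


# Three non-terminal moves followed by a winning move: -1 - 1 - 1 + 50
total = episode_return(["..", "..", "..", "win"])
```

The step penalty means a quick win accumulates more reward than a slow one.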

`epislon_greedy_policy(next_q) -> np.int64`

Selects a column using the epsilon-greedy strategy. Note: the method name preserves a typo from the source (epislon, not epsilon).

Parameters

next_q

list[float]

A list of Q-values, one per column. Full columns have their entry set to -math.inf to prevent illegal moves.

Returns: np.int64 — the chosen column index. With probability epsilon a valid column is chosen uniformly at random (exploration). Otherwise, the column with the highest Q-value is chosen greedily (exploitation). Ties among greedy actions are broken randomly.
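
The selection rule can be sketched with the standard library alone. This is an illustrative stand-in, not the method itself: the function name `epsilon_greedy` and the `rng` parameter are assumptions, and it returns a plain `int` rather than `np.int64`.

```python
import math
import random


def epsilon_greedy(next_q, epsilon, rng=random):
    """Pick a column index from per-column Q-values; full columns hold -math.inf."""
    valid = [col for col, q in enumerate(next_q) if q != -math.inf]
    if not valid:
        raise ValueError("no legal columns")
    # Explore: with probability epsilon, pick any legal column uniformly.
    if rng.random() < epsilon:
        return rng.choice(valid)
    # Exploit: pick the best legal column, breaking ties at random.
    best = max(next_q[col] for col in valid)
    return rng.choice([col for col in valid if next_q[col] == best])


greedy_choice = epsilon_greedy([-math.inf, 1.0, 3.0, -math.inf, 2.0], epsilon=0)
```

Passing a seeded `random.Random` instance as `rng` makes the choices reproducible in tests.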

`mirror_state_action(curr_state, action) -> str`

Computes the Q-table key for the horizontally mirrored board and the correspondingly flipped column index. This symmetry means the agent never needs to store a state and its mirror separately; they always share the same Q-value entry.

Parameters

curr_state

list[list[int]]

The board to mirror.

action

int

The column index to flip: the mirrored action is c - 1 - action.

Returns: str — concatenation of the mirrored board cells followed by the mirrored column index.
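
A minimal sketch of the mirroring, assuming the key is the board cells read row by row followed by the column index. The function names `state_action_key` and `mirror_key` are hypothetical.

```python
def state_action_key(state, action):
    """Key for the board as given: cells row by row, then the column index."""
    return "".join(str(cell) for row in state for cell in row) + str(action)


def mirror_key(curr_state, action):
    """Key for the left-right mirrored board with the column flipped to c - 1 - action."""
    cols = len(curr_state[0])
    mirrored_cells = "".join(
        str(cell) for row in curr_state for cell in reversed(row)
    )
    return mirrored_cells + str(cols - 1 - action)


board = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 2, 0, 0, 0],
]
original = state_action_key(board, 0)
mirrored = mirror_key(board, 0)
```

Here the pieces in columns 0 and 1 move to columns 4 and 3, and action 0 becomes action 4. Mirroring twice gives back the original key.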

`is_terminal_state(next_state, action) -> tuple[bool, str]`

Returns (True, "win") if the last piece created four in a row, (True, "draw") if all columns are full, or (False, "..") otherwise.

`is_winning_state(next_state, action) -> bool`

Checks four directions (diagonal, anti-diagonal, horizontal, vertical) from the cell where the last piece landed. Returns True if self.player’s token appears four times consecutively (the internal counter reaches 3, meaning 3 additional pieces in line beyond the placed piece, totalling 4).
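
The four-direction check can be sketched as follows. This is an illustration of the counting scheme described above, not the method's code: `is_winning_move` is a hypothetical name, the player is passed in explicitly instead of read from self.player, and row 0 is assumed to be the top of the board.

```python
def is_winning_move(state, action, player, needed=4):
    """Return True if the top piece in column `action` completes `needed` in a row."""
    rows, cols = len(state), len(state[0])
    # The last piece landed on top of its column (row 0 assumed to be the top).
    row = next((r for r in range(rows) if state[r][action] != 0), None)
    if row is None or state[row][action] != player:
        return False
    # Diagonal, anti-diagonal, horizontal, vertical
    for d_row, d_col in ((1, 1), (1, -1), (0, 1), (1, 0)):
        count = 0  # additional pieces in line beyond the placed piece
        for sign in (1, -1):
            r, c = row + sign * d_row, action + sign * d_col
            while 0 <= r < rows and 0 <= c < cols and state[r][c] == player:
                count += 1
                r, c = r + sign * d_row, c + sign * d_col
        if count >= needed - 1:
            return True
    return False


board = [[0] * 5 for _ in range(4)]
board[3][:4] = [2, 2, 2, 2]          # bottom row: four of player 2's pieces
horizontal_win = is_winning_move(board, 3, player=2)
```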

Usage Example (Evaluation)

```python
import pickle
import gzip
import shutil
import numpy as np
from QLearning import Q_Learning

# Decompress and load the pre-trained Q-table
with gzip.open('q_data.dat.gz', 'rb') as f_in:
    with open('q_data.dat', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

with open('q_data.dat', 'rb') as handle:
    q_values = pickle.load(handle)

# Build agent: alpha=0 and epsilon=0 for pure greedy evaluation
game = np.zeros((3, 5), dtype=int)
agent = Q_Learning(player=2, alpha=0, discount_factor=0.8, epsilon=0, r=3, c=5)
agent.set_Qvalues(q_values)
agent.set_state(game)

next_state, end, result, action, value = agent.take_action()
print(f"Q-Learning agent chose column {action} (Q={value:.4f})")
```

Always set alpha=0 and epsilon=0 when evaluating a pre-trained agent. With a non-zero alpha, take_action() keeps updating the loaded Q-table in place; with a non-zero epsilon, the agent plays random exploratory moves. Either one distorts results across evaluation episodes.

Usage Example (Training)

```python
import numpy as np
from MCTS import MCTS
from QLearning import Q_Learning

q_values = {}   # start with an empty Q-table; it is updated in place across episodes
rows, cols = 4, 5

for episode in range(50000):
    game = np.zeros((rows, cols), dtype=int)
    player1 = MCTS(play_outs=40, player=1, C=2, r=rows, c=cols)
    player2 = Q_Learning(player=2, alpha=0.6, discount_factor=0.8,
                          epsilon=0.1, r=rows, c=cols)
    player2.set_Qvalues(q_values)

    turn = 0   # 0: MCTS to move, 1: Q-Learning agent to move
    while True:
        if turn == 0:
            player1.set_state(game)
            game, end, result, last_action, value = player1.take_action()
            turn = 1
        else:
            player2.set_state(game)
            game, end, result, last_action, value = player2.take_action()
            turn = 0

        if end:
            if turn == 1 and result == "win":
                player2.game_status = "loss"   # MCTS made the winning move
            elif result == "draw":
                player2.game_status = "draw"
            # Make the final call only when a terminal status was set. If the agent
            # itself won, take_action() already applied the win reward on that move.
            if player2.game_status != "running":
                player2.take_action()   # terminal Q-value update, returns a 2-tuple
            break
```

Training is done on a smaller board (r=4, c=5) for faster convergence. The resulting Q-table is then tested on a 3-row board via MCTS_vs_Q() in main.py. Larger boards can be trained but require significantly more episodes.
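
After training, the Q-table has to be written to disk before the evaluation example can load it. The sketch below shows one way to do that with gzip and pickle, matching the gzip-compressed pickle format that the evaluation example reads. The helper names `save_q_table` and `load_q_table` are hypothetical and the file path is up to the caller.

```python
import gzip
import pickle


def save_q_table(q_values, path):
    """Write the Q-table dictionary to a gzip-compressed pickle file."""
    with gzip.open(path, "wb") as handle:
        pickle.dump(q_values, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_q_table(path):
    """Read a Q-table dictionary back from a gzip-compressed pickle file."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)
```

`load_q_table` reads the compressed file directly, so no intermediate uncompressed copy is needed. Only unpickle files from a trusted source, since pickle can execute arbitrary code on load.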
