Q-Learning maintains a lookup table, the Q-table, that maps every (state, action) pair the agent has encountered to an estimated expected cumulative reward. By repeatedly playing games and updating those estimates with the Bellman equation, the agent gradually learns which moves lead to wins and which lead to losses.
State-Action Encoding
Board states are serialized to plain strings by concatenating every cell value row by row, left to right. The target column index is then appended directly to that string to form the dictionary key. For example, a board whose 15 cells all hold 0 serializes to "000000000000000", and choosing column 2 produces the key "0000000000000002". This flat string representation lets the Q-table use a standard Python dictionary with O(1) average-case lookups.
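The encoding described above can be sketched as follows. This is a minimal illustration, not the project's code: the helper names encode_state and make_key are hypothetical, the board is assumed to be a list of rows of single-digit integers, and the 3-row by 5-column shape is assumed only because it yields the 15 cells seen in the example key.

```python
def encode_state(board):
    """Concatenate every cell row by row, left to right, into one string."""
    return "".join(str(cell) for row in board for cell in row)


def make_key(board, action):
    """Append the target column index to the serialized board."""
    return encode_state(board) + str(action)


# Assumed 3-row, 5-column board (15 cells) with every cell set to 0.
empty_board = [[0] * 5 for _ in range(3)]

# The Q-table is an ordinary dictionary keyed by the combined string.
q_table = {}
q_table[make_key(empty_board, 2)] = 0.0
```

Because the action is appended as a single character, this scheme relies on column indices staying below 10, which holds for a 5-column board.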
Q-Value Update (Bellman Equation)
After every move, the agent updates the Q-value of the previous state-action pair using the Bellman equation. Three quantities control the update:
- alpha (learning rate) controls how much new information overrides old estimates.
- discount_factor (gamma) weights future rewards relative to the immediate reward.
- max_q_sDash_a is the maximum Q-value over all valid actions in the next state s'.
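The update can be sketched with the standard tabular Q-learning rule. This is a minimal illustration under stated assumptions, not the project's implementation: the function q_update, its signature, and the default of 0 for unseen keys are hypothetical, while alpha, discount_factor, and max_q_sDash_a follow the names used in this section.

```python
def q_update(q_table, key, reward, max_q_sDash_a, alpha, discount_factor):
    """Apply one standard Q-learning update to q_table[key] and return it.

    Q(s, a) <- Q(s, a) + alpha * (R + gamma * max_a' Q(s', a') - Q(s, a))
    """
    old_q = q_table.get(key, 0.0)  # assumption: unseen pairs start at 0
    td_target = reward + discount_factor * max_q_sDash_a
    q_table[key] = old_q + alpha * (td_target - old_q)
    return q_table[key]


q_table = {}
new_q = q_update(
    q_table,
    "0000000000000002",
    reward=-1,
    max_q_sDash_a=10.0,
    alpha=0.5,
    discount_factor=0.9,
)
```

With alpha=0.5 the stored value moves halfway from the old estimate toward the target R + gamma * max_q_sDash_a; with alpha=1 it would jump all the way to the target.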
Reward Structure
| Outcome | Reward R |
|---|---|
| Win | +50 |
| Loss | −50 |
| Draw | −10 |
| Per-step penalty | −1 |
A step penalty of −1 is applied on every non-terminal move, encouraging the agent to win quickly rather than dragging games out. When the game ends, the terminal reward (win, loss, or draw) replaces the step penalty.
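The reward table can be expressed as a small function. This is a sketch only: the function name reward_for and the string labels for outcomes are assumptions made for illustration, and only the numeric values come from the table above.

```python
WIN_REWARD = 50
LOSS_REWARD = -50
DRAW_REWARD = -10
STEP_PENALTY = -1


def reward_for(outcome):
    """Return the reward for a move given the game status after it.

    outcome is assumed to be "win", "loss", "draw", or None while the
    game is still in progress.
    """
    terminal_rewards = {
        "win": WIN_REWARD,
        "loss": LOSS_REWARD,
        "draw": DRAW_REWARD,
    }
    # A terminal reward replaces the step penalty rather than adding to it.
    return terminal_rewards.get(outcome, STEP_PENALTY)
```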
Epsilon-Greedy Policy
At each turn the agent uses an epsilon-greedy policy to balance exploration and exploitation. With probability epsilon a random valid action is chosen; otherwise the action with the highest Q-value is chosen.
Invalid actions are masked in actions_list by assigning them a value of -math.inf before epislon_greedy_policy is called. When multiple actions tie for the greedy maximum, one is chosen uniformly at random from greedy_actions.
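A policy of this shape might be sketched as below. The function name epsilon_greedy, its parameters, and the optional rng argument are assumptions for illustration and do not reproduce the project's own function; actions_list is taken to be a list of Q-values indexed by column, with at least one entry that is not -math.inf.

```python
import math
import random


def epsilon_greedy(actions_list, epsilon, rng=random):
    """Pick a column index from actions_list, a list of Q-values per column.

    Invalid columns are expected to hold -math.inf already.
    """
    valid_actions = [a for a, q in enumerate(actions_list) if q != -math.inf]
    if rng.random() < epsilon:
        return rng.choice(valid_actions)  # explore: any valid column
    best_q = max(actions_list[a] for a in valid_actions)
    greedy_actions = [a for a in valid_actions if actions_list[a] == best_q]
    return rng.choice(greedy_actions)  # exploit, breaking ties at random


# Column 1 is masked; columns 2 and 3 tie for the greedy maximum.
q_values = [1.0, -math.inf, 3.0, 3.0, 0.5]
choice = epsilon_greedy(q_values, epsilon=0.0)
```

Masking with -math.inf means an invalid column can never win the greedy comparison, so no separate validity check is needed on the exploit path.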
Mirror Symmetry
Connect 4 on a 5-column board is horizontally symmetric: the mirror image of any position is strategically equivalent. The agent exploits this by sharing Q-values between a state-action pair and its horizontal reflection, roughly halving the state space. Remapping each column index n to self.c-1-n mirrors the board, and remapping the action (self.c - 1 - action) maps, for example, column 0 to column 4 on a 5-column board.
When a new state-action key is first encountered, the mirror key is checked first; the Q-value defaults to zero only if the mirror is also absent.
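The mirror lookup can be sketched on the string keys directly. This is an illustration under assumptions, not the project's code: mirror_key and get_q are hypothetical helpers, each cell and the action are assumed to occupy exactly one character, and c stands in for the column count that the text calls self.c.

```python
def mirror_key(key, c):
    """Return the key of the horizontally reflected state-action pair.

    key is the serialized board followed by a single-digit column index;
    c is the number of columns.
    """
    state, action = key[:-1], int(key[-1])
    rows = [state[i:i + c] for i in range(0, len(state), c)]
    mirrored_state = "".join(row[::-1] for row in rows)  # column n -> c-1-n
    return mirrored_state + str(c - 1 - action)


def get_q(q_table, key, c):
    """Look up a Q-value, falling back to the mirror, then to zero."""
    if key in q_table:
        return q_table[key]
    return q_table.get(mirror_key(key, c), 0.0)


# A piece in the top-left cell, action 0, has a learned value.
q_table = {"100000000000000" + "0": 7.5}
# Its reflection (piece in the top-right cell, action 4) reuses that value.
mirrored_value = get_q(q_table, "000010000000000" + "4", 5)
```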
Q-Table Persistence
After training completes, the Q-table is serialized with pickle and then gzip-compressed to keep file size manageable.
Before any test games begin, the .gz file is decompressed and the dictionary is loaded.
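Saving and loading can be sketched with the standard library alone. The function names save_q_table and load_q_table and the file name are placeholders chosen for this example; the sketch streams the pickle through gzip.open, which may differ from how the project sequences the two steps.

```python
import gzip
import os
import pickle
import tempfile


def save_q_table(q_table, path):
    """Pickle the dictionary and write it through a gzip stream."""
    with gzip.open(path, "wb") as f:
        pickle.dump(q_table, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_q_table(path):
    """Decompress the file and unpickle the dictionary."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Round trip through a temporary file; the file name is a placeholder.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "q_table.pkl.gz")
    save_q_table({"0000000000000002": 4.0}, path)
    restored = load_q_table(path)
```

String keys made mostly of repeated digits compress well, which is one reason gzip noticeably shrinks a table of this kind.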
During evaluation, the Q_Learning agent is instantiated with alpha=0 and epsilon=0 to disable learning and exploration entirely. This ensures the agent plays purely from its trained Q-table without modifying it.
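Why alpha=0 freezes the table follows directly from the standard Q-learning update, shown here for a single value. The function q_update and the numbers are hypothetical and serve only to illustrate the arithmetic; the Q_Learning constructor itself is not reproduced.

```python
def q_update(old_q, reward, max_next_q, alpha, discount_factor):
    """Standard Q-learning update for a single value."""
    return old_q + alpha * (reward + discount_factor * max_next_q - old_q)


trained_q = 12.5
# With alpha=0 the correction term is multiplied by zero.
after_eval_move = q_update(
    trained_q, reward=-1, max_next_q=40.0, alpha=0, discount_factor=0.9
)
```

Similarly, with epsilon=0 a check of the form random.random() < epsilon can never succeed, because random.random() returns values in [0, 1), so every move is the greedy one.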