This project implements two reinforcement learning agents that play Connect 4 against each other on a configurable grid. Built entirely in Python, it lets researchers and students experiment with Monte Carlo Tree Search (MCTS) and Q-Learning side by side — training Q-Learning from scratch against an MCTS opponent, evaluating the trained agent, or watching two MCTS agents compete head-to-head with different playout budgets.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/marshalharman/QLearning_and_MCTS-Reinforcement_Learning/llms.txt
Use this file to discover all available pages before exploring further.
Project Overview
The project contains two distinct AI strategies, each packaged as a self-contained agent class:| Agent | Strategy | Key hyperparameters |
|---|---|---|
MCTS | Tree search with random rollouts guided by UCB1 | play_outs, C (exploration constant) |
Q_Learning | Tabular Q-Learning with ε-greedy exploration | alpha, discount_factor, epsilon |
Architecture Overview
The project is organized across four Python modules, each with a focused responsibility:MCTS.py
Defines the
Node class (visit count, win score, children) and the MCTS agent. Implements the full four-phase loop — selection, expansion, simulation, and back-propagation — plus a lightweight construct_tree pre-builder that seeds the first four plies before playouts begin.QLearning.py
Defines the
Q_Learning agent. Stores Q-values in a plain Python dictionary keyed by a flattened board-plus-action string. Applies mirror symmetry to halve the effective state space and uses an ε-greedy policy for exploration during training.RandomPlayer.py
Defines
Random_Player, a uniform-random baseline agent. Used internally by MCTS during the simulation (rollout) phase — every playout is completed by two Random_Player instances playing to the end of the game.main.py
Entry point and orchestration layer. Exposes three top-level routines —
MCTS_vs_MCTS, MCTS_vs_Q, and train_qlearning — and presents a simple numbered CLI menu so you can choose a matchup without editing code.Quickstart & Further Reading
Quickstart
Install dependencies, clone the repo, and run your first agent matchup in minutes.
MCTS Concepts
Deep-dive into selection, expansion, simulation, back-propagation, and the UCB1 formula.
Q-Learning Concepts
Understand the Bellman update, ε-greedy exploration, and mirror-symmetry state encoding.
MCTS Agent Reference
Full constructor signature, method descriptions, and configuration examples for the
MCTS class.Shared Agent Interface
BothMCTS and Q_Learning expose the same two-method interface, which lets main.py swap agents in and out of the game loop without any conditional logic.
Random_Player, used only inside MCTS rollouts, follows a slightly simpler variant — its take_action() returns four values (game, end, result, action) and does not include a value estimate, because it does not maintain any value function.
Key Design Choices
Configurable Board (r × c)
Every agent constructor acceptsr (rows) and c (columns) as explicit arguments. There is no hardcoded board size anywhere in the agent classes themselves — they derive all loop bounds from self.r and self.c. This makes it straightforward to train on a smaller board and then test on a larger one.
Mirror Symmetry in Q-Learning
TheQ_Learning agent stores a Q-value for every (state, action) pair as a string key. Whenever a new (state, action) pair is first encountered, take_action() automatically looks up its horizontal mirror — reversing the column order of the board and inverting the action column index — and shares the same Q-value entry. This effectively halves the number of unique states the table must learn, accelerating convergence.
UCB1 in MCTS
During the selection phase, the MCTS agent picks the child node with the highest UCB1 score. Unvisited children are always explored first (chosen at random from the pool of zero-visit nodes). Once all children have been visited at least once, UCB1 balances exploitation of high win-rate nodes against exploration of infrequently visited ones:C defaults to 2 in all matchups.
Persistent Q-Table Storage
After training,train_qlearning() serializes the Q-value dictionary with pickle and compresses it with gzip, producing q_data.dat.gz. The evaluation routine MCTS_vs_Q() decompresses the file at runtime and injects the loaded table into a fresh Q_Learning instance via player2.set_Qvalues(q_values). This means you can train once and evaluate many times without re-running the 50,000-episode loop.
Reward Schedule
The Q-Learning agent uses a fixed reward schedule during training:| Outcome | Reward |
|---|---|
| Win | +50 |
| Draw | −10 |
| Loss | −50 |
| Every other step | −1 |
−1 encourages the agent to win as quickly as possible rather than prolonging games.
The board size differs across the three routines in
main.py. train_qlearning() uses a 4 × 5 board (rows=4, cols=5), MCTS_vs_Q() uses a 3 × 5 board (rows=3, cols=5), and MCTS_vs_MCTS() uses a 6 × 5 board. If you evaluate a Q-table trained on one board size against an MCTS agent running on a different board size, the state keys will not match and performance will degrade significantly.