Get Started with Q-Learning & MCTS for Connect 4

This guide walks you through everything needed to run agent matchups and train a Q-Learning agent against an MCTS opponent. By the end you will have watched two MCTS agents play a full game of Connect 4, evaluated a pre-trained Q-Learning agent against MCTS, and optionally kicked off a fresh 50,000-episode training run that produces a new q_data.dat.gz Q-table and a reward convergence plot.

Prerequisites

You need Python 3.8 or higher and three third-party packages: NumPy, pandas, and Matplotlib. Everything else the project uses comes from the standard library.

pip install numpy pandas matplotlib

Verify that the packages are available:

python -c "import numpy, pandas, matplotlib; print('All dependencies found.')"
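If the one-liner fails and you want to know which package is missing, a small standard-library check can report it. This is a minimal sketch; the helper name missing_packages is hypothetical and not part of the repository.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["numpy", "pandas", "matplotlib"])
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```

find_spec only locates a top-level package without importing it, so the check stays fast even for large libraries.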

Clone the Repository

Clone the project from GitHub and change into the project directory:

git clone https://github.com/marshalharman/QLearning_and_MCTS-Reinforcement_Learning.git
cd QLearning_and_MCTS-Reinforcement_Learning

The directory should contain the following files:

.
├── main.py
├── MCTS.py
├── QLearning.py
├── RandomPlayer.py
└── q_data.dat.gz     ← pre-trained Q-table (required for evaluation)

Run MCTS vs MCTS

Start the program and choose option 1 to pit two MCTS agents against each other on a 6 × 5 board. You will be prompted to set the playout budget for each player independently.

python main.py

The CLI interaction looks like this:

Choose from the following:
 1. MCTS agent vs MCTS agent
 2. MCTS agent vs Q-Learning agent
1
Enter number of playouts for player 1:
50
Enter number of playouts for player 2:
50

After every move, the agent prints a brief summary. When the game ends, the program prints the result followed by the final board:

0
Player1(MCTS with50 playouts)
Action selected : 2
Total playouts for next state: 50
Value of next state according to MCTS : 0.34

Player2(MCTS with50 playouts)
Action selected : 3
Total playouts for next state: 50
Value of next state according to MCTS : 0.41

...

PLAYER 1 WINS :
0 0 0 0 0
0 0 0 0 0
0 0 2 0 0
0 1 2 0 0
0 1 2 0 0
1 1 1 1 0

Each move summary contains:

Action selected — the column index (0-indexed) where the piece was dropped.
Total playouts for next state — the playout budget the other agent will use on the resulting board.
Value of next state according to MCTS — the win ratio (win / N) of the chosen child node.
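To make the win ratio concrete, the sketch below computes win / N for a node and combines it with one common form of the UCB1 selection score. The NodeStats class and both function names are hypothetical, and the exact formula used in MCTS.py may differ; treat this as an illustration of the idea only.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Hypothetical per-node counters; not the repository's node class."""
    wins: float = 0.0  # accumulated wins from playouts through this node
    visits: int = 0    # N: number of playouts through this node


def win_ratio(node):
    """win / N, or 0.0 for a node that has never been visited."""
    return node.wins / node.visits if node.visits else 0.0


def ucb1(node, parent_visits, c=2.0):
    """Win ratio plus an exploration bonus scaled by the constant c."""
    if node.visits == 0:
        return math.inf  # unvisited children are tried first
    bonus = c * math.sqrt(math.log(parent_visits) / node.visits)
    return win_ratio(node) + bonus
```

For example, a child with 17 wins out of 50 playouts has a win ratio of 0.34, which is the kind of number shown in the move summary.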

More playouts give MCTS a stronger search but increase per-move time roughly linearly. A budget of 10–20 playouts is fast enough for interactive experimentation; 100+ starts to show noticeably stronger play. For a fair benchmark, set both players to the same value first, then try asymmetric budgets (e.g., 10 vs 100) to observe the skill gap.

Evaluate Q-Learning Against MCTS

From the top-level menu choose option 2, then option 2 again to run the pre-trained Q-Learning agent against an MCTS opponent for 100 test episodes on a 3 × 5 board.

Choose from the following:
 1. MCTS agent vs MCTS agent
 2. MCTS agent vs Q-Learning agent
2
Choose from the following:
 1. Train Q-Learning against MCTS agent
 2. Test Q-Learning against MCTS
2

This routine loads q_data.dat.gz from the current directory, decompresses it into q_data.dat, and injects the Q-table into a fresh Q_Learning instance. After all 100 episodes it prints aggregate statistics:
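If you want to inspect the Q-table yourself, the following sketch reads and writes a gzip-compressed pickle directly, without the intermediate q_data.dat file. It assumes the file holds a pickled dictionary; the function names and the key layout in the usage note are hypothetical. Only unpickle files you trust, since pickle can execute arbitrary code on load.

```python
import gzip
import pickle


def save_q_table(q_table, path="q_data.dat.gz"):
    """Pickle a dictionary and gzip-compress it in one step."""
    with gzip.open(path, "wb") as f:
        pickle.dump(q_table, f)


def load_q_table(path="q_data.dat.gz"):
    """Read a gzip-compressed pickle back into a Python object."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

After loading, len(load_q_table()) gives a quick sense of how many entries the agent learned during training.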

Draws: 0.12
Accuracy of P1, MCTS10: 0.54
Accuracy of P2, Q_Learning: 0.34
Convergence till n = 10 can be tested as q learning is trained against MC10.
Higher values can be tested but the results will show slightly less win percentage
due to lack of training for higher values.
Currently, the number rows is set to: 3
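The aggregate figures above are fractions of the test episodes. A minimal sketch of that calculation, assuming each episode is recorded under one of three hypothetical labels ("p1", "p2", "draw"), could look like this:

```python
from collections import Counter


def summarize(outcomes):
    """Return the fraction of episodes that ended in each outcome."""
    if not outcomes:
        raise ValueError("no episodes recorded")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {label: counts[label] / total for label in ("draw", "p1", "p2")}


if __name__ == "__main__":
    # Illustrative data shaped like the sample output, not a real run.
    episodes = ["p1"] * 54 + ["p2"] * 34 + ["draw"] * 12
    print(summarize(episodes))
```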

During each episode, every move is also logged:

Player 1 (MCTS with10 playouts
Action selected : 1
Value of next state according to MCTS : 0.28

Player 2 (Q-learning)
Action selected : 2
Value of next state : -1.0

The Player 1 line has no closing parenthesis and no space before the playout count — that is the exact string the source code prints: 'Player 1 (MCTS with' + str(n) + ' playouts'.

This step requires q_data.dat.gz to be present in the working directory. The file is included in the repository. If it is missing or has been overwritten (for example after running train_qlearning() with a different board size), re-clone the repo or follow the "Train Q-Learning from Scratch" section below to generate a new one.

Train Q-Learning from Scratch

From the top-level menu choose option 2, then option 1 to start a full training run. The agent plays 50,000 episodes as Player 2 against an MCTS opponent (40 playouts) on a 4 × 5 board.

Choose from the following:
 1. MCTS agent vs MCTS agent
 2. MCTS agent vs Q-Learning agent
2
Choose from the following:
 1. Train Q-Learning against MCTS agent
 2. Test Q-Learning against MCTS
1

Each episode number is printed to stdout, followed by the outcome of that episode, so you can track progress:

0
P1 win
1
P2 win
2
Draw
...
49999
P2 win

Training uses the following hyperparameters (set in main.py):

num_iterations    = 50000   # total training episodes
n                 = 40      # MCTS playout budget for the opponent
C_MCTS            = 2       # UCB1 exploration constant
r, c              = 4, 5    # board dimensions (rows × columns)

# Q-Learning agent hyperparameters:
alpha             = 0.6     # learning rate
discount_factor   = 0.8     # future-reward discount (γ)
epsilon           = 0.1     # random-exploration probability
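To show how alpha, discount_factor, and epsilon interact, here is a sketch of the textbook tabular Q-learning update and epsilon-greedy action choice. The function names and the (state, action) dictionary layout are assumptions for illustration; QLearning.py may organize its table differently.

```python
import random

ALPHA = 0.6    # learning rate
GAMMA = 0.8    # discount factor
EPSILON = 0.1  # probability of a random exploratory move


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=ALPHA, gamma=GAMMA):
    """Apply Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    # Terminal states have no next actions, so the future term is 0.
    best_next = max((q.get((next_state, a), 0.0) for a in next_actions),
                    default=0.0)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


def choose_action(q, state, actions, epsilon=EPSILON, rng=random):
    """Explore with probability epsilon, otherwise pick the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))
```

With alpha = 0.6, a single terminal reward of +50 moves an unseen entry from 0 to 30, so a few repeated wins are enough to dominate the initial estimate.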

When training finishes, two files are written to the working directory:

File	Contents
`q_data.dat.gz`	Compressed, pickled Q-value dictionary — load this for evaluation
`MCTSvsQ.jpg`	1000-episode moving-average reward convergence plot
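The plot smooths per-episode rewards with a 1000-episode moving average. The sketch below shows that smoothing step in plain Python; the function name is hypothetical and the repository may compute it differently (for example with NumPy or pandas).

```python
def moving_average(values, window=1000):
    """Return the mean of every full window of consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the new value, drop the oldest one.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages
```

For 50,000 episodes and a window of 1000, this yields 49,001 points, each summarizing the most recent 1000 episodes.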

Training overwrites any existing q_data.dat.gz in the current directory. If you want to preserve a previous Q-table, move or rename it before starting a new training run. The reward plot MCTSvsQ.jpg is also overwritten on each run.
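One way to keep earlier results is to copy both files aside before training. This is a small sketch using only the standard library; the helper name backup_if_exists and the backup naming scheme are hypothetical.

```python
import shutil
import time
from pathlib import Path


def backup_if_exists(path, stamp=None):
    """Copy `path` to `<name>.<stamp>.bak` next to it; return the new path or None."""
    src = Path(path)
    if not src.exists():
        return None
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"{src.name}.{stamp}.bak")
    shutil.copy2(src, dst)  # copy2 also preserves file timestamps
    return dst


if __name__ == "__main__":
    for name in ("q_data.dat.gz", "MCTSvsQ.jpg"):
        print(name, "->", backup_if_exists(name))
```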

Next Steps

Once you have the basics running, here are a few directions to explore:

Tune MCTS Playouts

Try asymmetric playout budgets in MCTS_vs_MCTS (e.g., 10 vs 200) to quantify how much each additional playout is worth.

Adjust the Reward Schedule

Experiment with the win/draw/loss/step rewards in QLearning.py (+50, −10, −50, −1) to see whether different incentives change convergence speed or final win rate.
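When experimenting, it can help to keep the schedule in one place so a single edit changes every reward. The sketch below is a hypothetical arrangement, not how QLearning.py is structured; it uses the values listed above.

```python
# Values from this guide: win, draw, loss, and per-move step reward.
REWARDS = {"win": 50, "draw": -10, "loss": -50, "step": -1}


def reward_for(outcome, rewards=REWARDS):
    """Reward for a move; outcome is 'win', 'draw', 'loss', or None mid-game."""
    return rewards["step"] if outcome is None else rewards[outcome]


if __name__ == "__main__":
    # Example variation: make draws neutral instead of mildly penalized.
    neutral_draw = {**REWARDS, "draw": 0}
    print(reward_for("draw"), reward_for("draw", neutral_draw))
```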

Change the Board Size

Pass different r and c values in train_qlearning() and MCTS_vs_Q() to explore how board dimensions affect Q-table size and training time.
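A rough way to see why board size matters: each cell is empty or holds one of two pieces, so an r × c board has at most 3^(r·c) configurations. That bound is loose (many configurations are unreachable), but it shows how quickly the table can grow. The function name below is hypothetical.

```python
def state_upper_bound(rows, cols):
    """Loose upper bound on board configurations: 3 options per cell."""
    return 3 ** (rows * cols)


if __name__ == "__main__":
    for rows, cols in [(3, 5), (4, 5), (6, 5)]:
        print(f"{rows} x {cols}: at most {state_upper_bound(rows, cols):,} boards")
```

Moving from 3 × 5 to 4 × 5 multiplies the bound by 3^5 = 243, which is why even one extra row can noticeably lengthen training.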

Swap in a Random Baseline

Replace one agent with a Random_Player instance directly in main.py to establish a lower-bound baseline for both MCTS and Q-Learning win rates.
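For reference, a random baseline only needs to pick uniformly among columns that still have room. The sketch below assumes the board is a list of rows with the top row first and 0 marking an empty cell, matching the printed boards in this guide; it is an illustration and not the repository's Random_Player class.

```python
import random


def valid_columns(board):
    """Columns whose top cell is empty (assumes board[0] is the top row)."""
    return [col for col, cell in enumerate(board[0]) if cell == 0]


def random_move(board, rng=random):
    """Pick a uniformly random playable column."""
    columns = valid_columns(board)
    if not columns:
        raise ValueError("board is full")
    return rng.choice(columns)
```

Passing a seeded random.Random instance as rng makes baseline games reproducible across runs.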
