This guide walks you through everything needed to run agent matchups and train a Q-Learning agent against an MCTS opponent. By the end you will have watched two MCTS agents play a full game of Connect 4, evaluated a pre-trained Q-Learning agent against MCTS, and optionally kicked off a fresh 50,000-episode training run that produces a new q_data.dat.gz Q-table and a reward convergence plot.
Prerequisites
You need Python 3.8 or higher and three third-party packages: NumPy, pandas, and Matplotlib. No other dependencies are required; everything else comes from the standard library. Verify that all three packages are importable before you continue.
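One way to check the prerequisites from Python itself is sketched below. It relies only on the standard library. The helper names `missing_packages` and `python_ok` are made up for this example and are not part of the project.

```python
import importlib.util
import sys

# Import names of the three third-party packages listed above.
REQUIRED = ("numpy", "pandas", "matplotlib")


def missing_packages(names=REQUIRED):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_ok(minimum=(3, 8)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If the script reports a missing package, install it with your usual package manager and run the check again.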
Clone the Repository
Clone the project from GitHub and change into the project directory. Among other files, the directory should contain main.py, QLearning.py, and the pre-trained Q-table q_data.dat.gz, which the steps below rely on.
Run MCTS vs MCTS
Start the program and choose option 1 to pit two MCTS agents against each other on a 6 × 5 board. You will be prompted to set the playout budget for each player independently. After every move the agent prints a brief summary, and the final board is displayed when the game ends. Each line of output contains:
- Action selected: the column index (0-indexed) where the piece was dropped.
- Total playouts for next state: the playout budget the other agent will use on the resulting board.
- Value of next state according to MCTS: the win ratio (win / N) of the chosen child node.
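As a worked example of the last item, the sketch below computes the win ratio for a single child node. The `ChildStats` class and its field names are hypothetical and only mirror the `win` and `N` quantities named above; the project's own node class may be organised differently. Returning 0.0 for an unvisited node is also an assumption made here to avoid dividing by zero.

```python
from dataclasses import dataclass


@dataclass
class ChildStats:
    """Hypothetical per-child statistics kept by an MCTS search."""
    action: int   # column index (0-indexed) that leads to this child
    win: float    # wins accumulated over playouts through this child
    N: int        # number of playouts through this child

    def value(self) -> float:
        """Win ratio win / N; 0.0 for an unvisited child (assumption)."""
        return self.win / self.N if self.N else 0.0


if __name__ == "__main__":
    child = ChildStats(action=2, win=12, N=40)
    print("Action selected:", child.action)
    print("Value of next state:", child.value())
```

A child that won 12 of its 40 playouts therefore has a value of 12 / 40.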
Evaluate Q-Learning Against MCTS
From the top-level menu choose option 2, then option 2 again to run the pre-trained Q-Learning agent against an MCTS opponent for 100 test episodes on a 3 × 5 board.
This routine loads q_data.dat.gz from the current directory, decompresses it into q_data.dat, and injects the Q-table into a fresh Q_Learning instance. During each episode, every move is also logged. After all 100 episodes it prints aggregate statistics.
In the aggregate statistics, the Player 1 line has no closing parenthesis and no space before the playout count; that is the exact string the source code prints: 'Player 1 (MCTS with' + str(n) + ' playouts'.
This step requires q_data.dat.gz to be present in the working directory. The file is included in the repository. If it is missing (for example after running train_qlearning() with a different board size), re-clone the repo or complete the Train Q-Learning from Scratch step below to generate a new one.
Train Q-Learning from Scratch
From the top-level menu choose option 2, then option 1 to start a full training run. The agent plays 50,000 episodes as Player 2 against an MCTS opponent (40 playouts) on a 4 × 5 board. The episode counter prints to stdout so you can track progress. The training hyperparameters are set in main.py.
When training finishes, two files are written to the working directory:
| File | Contents |
|---|---|
| q_data.dat.gz | Compressed, pickled Q-value dictionary; load this for evaluation |
| MCTSvsQ.jpg | 1000-episode moving-average reward convergence plot |
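To make the Q-table file format concrete, here is a minimal sketch of writing and reading a gzip-compressed, pickled dictionary with the standard library. The function names, the demo file name, and the `(state, action)` key layout are assumptions for illustration. The project's own loader decompresses to q_data.dat first, whereas this sketch reads the compressed file directly.

```python
import gzip
import os
import pickle
import tempfile


def save_q_table(q_table, path):
    """Pickle `q_table` and write it gzip-compressed to `path`."""
    with gzip.open(path, "wb") as fh:
        pickle.dump(q_table, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_q_table(path):
    """Read a gzip-compressed pickle from `path` and return the object."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    # Hypothetical key layout: (board state as a string, column) -> Q-value.
    demo = {("0" * 20, 2): 1.5, ("0" * 20, 3): -0.25}
    # Use a temporary file so the real q_data.dat.gz is never touched.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_q.dat.gz")
        save_q_table(demo, path)
        print("Round trip matches:", load_q_table(path) == demo)
```

Because pickle can execute arbitrary code while loading, only open Q-table files that you created yourself or obtained from a source you trust.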
Training overwrites any existing q_data.dat.gz in the current directory. If you want to preserve a previous Q-table, move or rename it before starting a new training run. The reward plot MCTSvsQ.jpg is also overwritten on each run.
Next Steps
Once you have the basics running, here are a few directions to explore:
Tune MCTS Playouts
Try asymmetric playout budgets in MCTS_vs_MCTS (e.g., 10 vs 200) to quantify how much each additional playout is worth.
Adjust the Reward Schedule
Experiment with the win/draw/loss/step rewards in QLearning.py (+50, −10, −50, −1) to see whether different incentives change convergence speed or final win rate.
Change the Board Size
Pass different r and c values in train_qlearning() and MCTS_vs_Q() to explore how board dimensions affect Q-table size and training time.
Swap in a Random Baseline
Replace one agent with a Random_Player instance directly in main.py to establish a lower-bound baseline for both MCTS and Q-Learning win rates.
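For orientation, the sketch below shows what a uniform-random move policy for a Connect 4 style board can look like. It is not the project's Random_Player, whose interface is not shown in this guide. The board representation (a list of rows with 0 for an empty cell and row 0 at the top) and the function names are assumptions made for this example.

```python
import random

EMPTY = 0  # assumed marker for an empty cell


def legal_columns(board):
    """Return the columns whose top cell is empty (board[0] is the top row)."""
    return [col for col, cell in enumerate(board[0]) if cell == EMPTY]


def random_move(board, rng=random):
    """Pick one legal column uniformly at random."""
    columns = legal_columns(board)
    if not columns:
        raise ValueError("board is full, no legal move")
    return rng.choice(columns)


if __name__ == "__main__":
    # 4 x 5 board with columns 0, 2, and 4 already filled to the top.
    board = [
        [1, 0, 2, 0, 1],
        [2, 0, 1, 0, 2],
        [1, 0, 2, 0, 1],
        [2, 1, 1, 2, 2],
    ]
    print("Legal columns:", legal_columns(board))
    print("Random move:", random_move(board, random.Random(0)))
```

Passing a seeded `random.Random` instance keeps baseline runs reproducible, which makes win-rate comparisons against MCTS and Q-Learning easier to repeat.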