The project provides two evaluation modes: MCTS vs MCTS and MCTS vs Q-Learning. Both modes print per-turn details: the action chosen, the MCTS visit count or Q-value, and the game outcome. The MCTS vs Q-Learning mode additionally reports aggregate win, draw, and loss statistics at the end of the 100-game run.
Mode 1: MCTS vs MCTS
The `MCTS_vs_MCTS(x, y)` function pits two MCTS agents against each other on a 6×5 board for a single game (`num_iterations = 1`). Both agents share the same UCB1 constant (`C_MCTS = 2`) but can use different playout budgets.
Function signature: `MCTS_vs_MCTS(x, y)`

Each turn prints:
- Which player just moved and their playout budget
- The column index selected (`Action selected`)
- Total playouts available for the next state
- The win-ratio value (`win / N`) of the chosen child node according to MCTS

The board is printed with `print_grid()` so you can visually confirm the winning four-in-a-row.
How to Run MCTS vs MCTS
Mode 2: MCTS vs Q-Learning
The `MCTS_vs_Q()` function evaluates a pre-trained Q-Learning agent against an MCTS opponent over 100 test games. It requires `q_data.dat.gz` in the working directory.
Key parameters used in MCTS_vs_Q():
| Parameter | Value | Notes |
|---|---|---|
| `testing_iter` | 100 | Number of evaluation games |
| `n` (MCTS playouts) | 10 | MCTS opponent strength |
| `C_MCTS` | 2 | UCB1 exploration constant |
| `rows` | 3 | Board rows (smaller than training’s 4) |
| `cols` | 5 | Board columns |
| `alpha` | 0 | Learning disabled during evaluation |
| `epsilon` | 0 | Fully greedy policy at test time |
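To see why `alpha = 0` and `epsilon = 0` amount to a frozen, fully greedy agent, the following minimal sketch uses the textbook tabular Q-learning update and epsilon-greedy rule. The function names, the dictionary-based Q-table, and the discount factor `gamma` are assumptions made for illustration; they are not taken from the project's code.

```python
import random

def q_update(q, state, action, reward, next_max, alpha, gamma=0.9):
    """Textbook tabular Q-learning update; returns the new Q(s, a)."""
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * next_max - old)
    return q[(state, action)]

def choose_action(q, state, actions, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, else pick argmax Q."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))

# Hypothetical Q-table with three actions in one state.
q_table = {("s0", 0): 1.5, ("s0", 1): 4.0, ("s0", 2): -2.0}

# alpha = 0: the update term is multiplied by zero, so the stored value stays put.
unchanged = q_update(q_table, "s0", 1, reward=10.0, next_max=5.0, alpha=0.0)

# epsilon = 0: random() is never below 0, so the greedy action is always chosen.
greedy = choose_action(q_table, "s0", [0, 1, 2], epsilon=0.0)
```

With these settings the evaluation measures the learned policy as it stands, without the Q-table drifting during the test games.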
How to Run MCTS vs Q-Learning
Evaluation first decompresses `q_data.dat.gz` to `q_data.dat` and loads the Q-table before starting the evaluation loop.
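The decompression step can be reproduced by hand with Python's standard library, for example to inspect the file before running the evaluation. This is a sketch only: the helper name `decompress_gz` is hypothetical, and how the project subsequently parses `q_data.dat` is not covered here.

```python
import gzip
import shutil
from pathlib import Path

def decompress_gz(src="q_data.dat.gz", dst="q_data.dat"):
    """Stream-decompress src into dst and return the output path."""
    src_path = Path(src)
    if not src_path.is_file():
        raise FileNotFoundError(f"{src} not found (cwd: {Path.cwd()})")
    # copyfileobj streams in chunks, so large Q-tables need not fit in memory.
    with gzip.open(src_path, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return Path(dst)
```

Calling `decompress_gz()` with no arguments expects `q_data.dat.gz` in the current working directory, matching the requirement stated above.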
Interpreting Results
MCTS value (win / N)
Each MCTS child node tracks win (cumulative score across simulations, ranging from −N to +N) and N (visit count). The reported value is win / N:
- Positive (e.g., `0.6`): the subtree rooted at this move tends to produce wins for the current player.
- Near zero: roughly neutral, with wins and losses cancelling out across simulations.
- Negative (e.g., `−0.3`): the subtree tends to produce losses; the agent avoided this move via UCB1.
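The sketch below shows how `win / N` is computed and how a UCB1 score with `C = 2` weighs it against an exploration bonus. It uses one common form of UCB1, `win/N + C * sqrt(ln(N_parent) / N)`; the exact formula and field names in the project may differ, and the child statistics are made up for illustration.

```python
import math

C_MCTS = 2  # exploration constant quoted in this page

def node_value(win, n):
    """Average outcome of a child: win / N, in [-1, 1] when win is in [-N, N]."""
    return win / n

def ucb1(win, n, parent_n, c=C_MCTS):
    """One common UCB1 form: exploitation term plus exploration bonus."""
    if n == 0:
        return math.inf  # unvisited children are tried first
    return win / n + c * math.sqrt(math.log(parent_n) / n)

# Hypothetical children of one node, stored as (win, N).
children = {"col 0": (6, 10), "col 1": (0, 10), "col 2": (-3, 10)}
parent_n = sum(n for _, n in children.values())

for move, (win, n) in children.items():
    print(move, node_value(win, n), round(ucb1(win, n, parent_n), 3))
```

Because all three children have the same visit count here, their exploration bonuses are equal and the ranking is decided by `win / N` alone.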
Q-Learning action value Q(s, a)
The value printed for Q-Learning moves is the stored Q(s, a) for the chosen state-action pair:
- Large positive (e.g., `40`): the agent strongly expects future rewards, likely because it is near a win or has seen this position lead to wins frequently.
- Near zero or slightly negative: the agent has little information about this state, or the expected return is marginal.
- Large negative (e.g., `−45`): the agent expects this line of play to end in a loss or draw.
Sample aggregate statistics
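As a rough illustration of how end-of-run statistics of this kind can be derived, the sketch below turns a list of per-game outcomes into win, draw, and loss fractions. The outcome labels and the `summarize` helper are hypothetical, and the sample outcomes are not results from the project.

```python
from collections import Counter

def summarize(outcomes):
    """Return (win, draw, loss) fractions for a list of per-game outcomes."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no games played")
    counts = Counter(outcomes)  # missing labels count as 0
    return tuple(counts[k] / total for k in ("win", "draw", "loss"))

# Made-up outcomes for four games.
games = ["win", "win", "draw", "loss"]
win_frac, draw_frac, loss_frac = summarize(games)
```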
The three numbers printed at the end (the win, draw, and loss fractions) sum to 1.0.

The bundled `q_data.dat.gz` was trained against MCTS with n=10 (MC10), as the source notes: “Convergence till n=10 can be tested as Q-Learning is trained against MC10. Higher values can be tested but the results will show slightly less win percentage due to lack of training for higher values.” The evaluation function uses a 3×5 board while `train_qlearning()` trains on a 4×5 board (`r=4`), so evaluation runs on a slightly smaller state space than training.

Extending Evaluation
Comparing multiple playout counts (MCTS vs MCTS)
To benchmark MCTS strength across playout budgets, wrap `MCTS_vs_MCTS` in a loop and increase `num_iterations` inside the function. Collect the `winsP1` and `draws` lists returned (or printed) by each call to build a performance curve.
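One way to structure such a sweep is sketched below. It assumes the match function can be called with two playout budgets and returns its results; since it is not certain that `MCTS_vs_MCTS` returns rather than prints them, the example uses a placeholder `fake_match` with made-up output. Both `sweep_playouts` and `fake_match` are hypothetical names.

```python
def sweep_playouts(play_match, budgets, opponent_budget):
    """Call play_match(x, opponent_budget) for each budget x and collect results."""
    curve = {}
    for x in budgets:
        curve[x] = play_match(x, opponent_budget)
    return curve

def fake_match(x, y):
    """Placeholder for MCTS_vs_MCTS; returns a made-up (winsP1, draws) pair."""
    return ([x > y], [x == y])

# Swap fake_match for the real function once it returns its result lists.
curve = sweep_playouts(fake_match, budgets=[5, 10, 20], opponent_budget=10)
```

Keeping the match function as a parameter lets the same loop be reused for other pairings without editing the sweep itself.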
Evaluating Q-Learning on a larger board
The 3×5 test board is smaller than the 4×5 training board. To evaluate on the full training board or the 6×5 MCTS board:
- Open `main.py` and change `rows` and `cols` in `MCTS_vs_Q()` to match your desired dimensions.
- Retrain on the same board dimensions so the Q-table keys correspond to the new state encoding.
- Re-run evaluation with the updated `q_data.dat.gz`.