The project provides two evaluation modes: MCTS vs MCTS and MCTS vs Q-Learning. Both modes print per-turn details (the action chosen and the MCTS statistics or Q-value for that move), followed by the game outcome. The MCTS vs Q-Learning mode additionally reports aggregate win, draw, and loss statistics at the end of the 100-game run.

Mode 1: MCTS vs MCTS

The MCTS_vs_MCTS(x, y) function pits two MCTS agents against each other on a 6×5 board for a single game (num_iterations = 1). Both agents share the same UCB1 constant (C_MCTS = 2) but can use different playout budgets. Function signature:
MCTS_vs_MCTS(x, y)
# x = number of playouts for player 1
# y = number of playouts for player 2
Per-turn output printed to the console each move:
  • Which player just moved and their playout budget
  • The column index selected (Action selected)
  • Total playouts available for the next state
  • The average simulation score (win / N) of the chosen child node according to MCTS
Terminal output at game end:
PLAYER 1 WINS :   (or PLAYER 2 WINS : / DRAW :)
0 0 0 0 0
0 0 0 0 0
0 0 1 0 0
0 2 1 0 0
0 2 1 0 0
0 2 1 0 0
The final board state is printed via print_grid() so you can visually confirm the winning four-in-a-row.
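To double-check a printed board programmatically rather than by eye, a four-in-a-row scan can be sketched as follows. This is a minimal illustration, not the project's own win check: the function name has_four_in_a_row and the list-of-rows board layout are assumptions, and the board is copied from the sample output above.

```python
# Sample final board from the output above: 6 rows x 5 columns.
# 0 = empty, 1 = player 1, 2 = player 2.
BOARD = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 2, 1, 0, 0],
    [0, 2, 1, 0, 0],
    [0, 2, 1, 0, 0],
]


def has_four_in_a_row(board, player):
    """Return True if `player` owns four consecutive cells in any direction.

    Hypothetical helper for illustration; the project may check wins differently.
    """
    rows, cols = len(board), len(board[0])
    # right, down, down-right diagonal, down-left diagonal
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in directions:
                end_r, end_c = r + 3 * dr, c + 3 * dc
                if not (0 <= end_r < rows and 0 <= end_c < cols):
                    continue  # the 4-cell window would leave the board
                if all(board[r + i * dr][c + i * dc] == player for i in range(4)):
                    return True
    return False
```

On the sample board, player 1 holds four stacked cells in column index 2, while player 2 holds only three in column index 1.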

How to Run MCTS vs MCTS

python main.py
Choose from the following:
 1. MCTS agent vs MCTS agent
 2. MCTS agent vs Q-Learning agent
1
Enter number of playouts for player 1:
10
Enter number of playouts for player 2:
10
You can enter any positive integer for each player’s playout count. Giving one player a larger budget (e.g., 100 vs 10) generally makes that player stronger, at the cost of longer compute per move.

Mode 2: MCTS vs Q-Learning

The MCTS_vs_Q() function evaluates a pre-trained Q-Learning agent against an MCTS opponent over 100 test games. It requires q_data.dat.gz in the working directory. Key parameters used in MCTS_vs_Q():
| Parameter | Value | Notes |
| --- | --- | --- |
| testing_iter | 100 | Number of evaluation games |
| n (MCTS playouts) | 10 | MCTS opponent strength |
| C_MCTS | 2 | UCB1 exploration constant |
| rows | 3 | Board rows (smaller than training’s 4) |
| cols | 5 | Board columns |
| alpha | 0 | Learning disabled during evaluation |
| epsilon | 0 | Fully greedy policy at test time |
The Q-Learning agent’s learning rate and exploration are both set to zero so the Q-table is frozen: no updates occur during testing. Per-game console output:
Player 1 (MCTS with10 playouts
Action selected : 2
Value of next state according to MCTS : 0.43

Player 2 (Q-learning)
Action selected : 3
Value of next state : 12.5
Aggregate output after all 100 games:
Draws: 0.12
Accuracy of P1, MCTS10: 0.67
Accuracy of P2, Q_Learning: 0.21
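Why alpha = 0 and epsilon = 0 freeze the agent can be seen from the standard tabular Q-learning update and an epsilon-greedy selector. The sketch below is a generic textbook version, not the project's code: the function names, the dictionary-based Q-table, and the discount factor gamma = 0.9 are all assumptions made for illustration.

```python
import random


def q_update(q, state, action, reward, next_state, next_actions, alpha, gamma=0.9):
    """Standard tabular Q-learning update (gamma is an assumed value)."""
    best_next = max((q.get((next_state, a), 0.0) for a in next_actions), default=0.0)
    old = q.get((state, action), 0.0)
    # With alpha = 0 the correction term is multiplied away and the entry is unchanged.
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


def select_action(q, state, actions, epsilon, rng):
    """Epsilon-greedy selection; epsilon = 0 always picks the highest Q-value."""
    if rng.random() < epsilon:  # never true when epsilon is 0
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))


# Hypothetical two-entry Q-table for a single state "s".
q_table = {("s", 0): 12.5, ("s", 1): -3.0}
frozen = q_update(q_table, "s", 0, reward=-50, next_state="t",
                  next_actions=[0, 1], alpha=0)
choice = select_action(q_table, "s", [0, 1], epsilon=0, rng=random.Random(0))
```

Even a large negative reward leaves the stored value untouched when alpha is 0, and the selector never explores when epsilon is 0.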

How to Run MCTS vs Q-Learning

python main.py
Choose from the following:
 1. MCTS agent vs MCTS agent
 2. MCTS agent vs Q-Learning agent
2
Choose from the following:
 1. Train Q-Learning against MCTS agent
 2. Test Q-Learning against MCTS
2
The script automatically decompresses q_data.dat.gz to q_data.dat and loads the Q-table before starting the evaluation loop.
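If you want to inspect the Q-table file outside the script, the decompression step can be reproduced with the Python standard library. This is a sketch only: the helper name decompress_gz is hypothetical, and the on-disk format of q_data.dat is not described here, so the example stops at producing the decompressed file.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz(src, dst):
    """Stream-decompress the gzip file `src` into `dst` and return the output path."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return Path(dst)


# Example usage (assumes q_data.dat.gz is in the working directory):
# decompress_gz("q_data.dat.gz", "q_data.dat")
```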

Interpreting Results

MCTS value (win / N)

Each MCTS child node tracks win (cumulative score across simulations, ranging from −N to +N) and N (visit count). The reported value is win / N:
  • Positive (e.g., 0.6): the subtree rooted at this move tends to produce wins for the current player.
  • Near zero: roughly neutral — wins and losses cancel out across simulations.
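  • Negative (e.g., −0.3): the subtree tends to produce losses. Because the value is reported for the move actually chosen, a negative number suggests the search found no more promising alternative within its playout budget.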
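The relationship between the reported value and UCB1 selection can be sketched as below. This uses the common textbook form of UCB1 with the exploration constant C_MCTS = 2; the exact expression used inside the project may differ, and the function names are hypothetical.

```python
import math

C_MCTS = 2  # exploration constant quoted in this page


def node_value(win, n):
    """Average simulation score of a node; lies in [-1, 1] when win is in [-n, n]."""
    return win / n


def ucb1(win, n, parent_n, c=C_MCTS):
    """Textbook UCB1 score: average score plus an exploration bonus (assumed form)."""
    if n == 0:
        return math.inf  # unvisited children are tried first
    return win / n + c * math.sqrt(math.log(parent_n) / n)


# Two children with the same average score but different visit counts.
rarely_visited = ucb1(win=1, n=2, parent_n=20)
often_visited = ucb1(win=5, n=10, parent_n=20)
```

Both children have win / N = 0.5, but the rarely visited one receives the larger UCB1 score, which is how the search keeps exploring moves it has little data on.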

Q-Learning action value Q(s, a)

The value printed for Q-Learning moves is the stored Q(s, a) for the chosen state-action pair:
  • Large positive (e.g., 40): the agent strongly expects future rewards — likely near a win or has seen this position lead to wins frequently.
  • Near zero or slightly negative: the agent has little information about this state, or the expected return is marginal.
  • Large negative (e.g., −45): the agent expects this line of play to end in a loss or draw.

Sample aggregate statistics

The three numbers printed at the end sum to 1.0:
Draws + P1 accuracy + P2 accuracy = 1.0
A well-trained Q-Learning agent should push its accuracy above the draw rate and, ideally, above the MCTS win rate over 100 games.
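The three rates are simple fractions of the game count, which is why they sum to 1.0. A minimal sketch, using hypothetical outcome labels ("p1", "p2", "draw") and the sample figures shown earlier on this page:

```python
from collections import Counter


def summarize(outcomes):
    """Return the fraction of games ending in a draw, a P1 win, or a P2 win."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {label: counts[label] / total for label in ("draw", "p1", "p2")}


# 100 games matching the sample output: 67 MCTS wins, 21 Q-Learning wins, 12 draws.
outcomes = ["p1"] * 67 + ["p2"] * 21 + ["draw"] * 12
stats = summarize(outcomes)
total_rate = sum(stats.values())
```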
The bundled q_data.dat.gz was trained against MCTS with n=10 (MC10), as the source notes: “Convergence till n=10 can be tested as Q-Learning is trained against MC10. Higher values can be tested but the results will show slightly less win percentage due to lack of training for higher values.” The evaluation function uses a 3×5 board while train_qlearning() trains on a 4×5 board (r=4), so evaluation runs on a slightly smaller state space than training.

Extending Evaluation

Comparing multiple playout counts (MCTS vs MCTS)

To benchmark MCTS strength across playout budgets, wrap MCTS_vs_MCTS in a loop and increase num_iterations inside the function:
for x in [5, 10, 25, 50]:
    MCTS_vs_MCTS(x, 10)
Collect the winsP1 and draws lists returned (or printed) by each call to build a performance curve.
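One way to organize the collected results is a small benchmarking helper. The sketch below assumes a callable play_game(x, y) that plays one game and returns 1 if player 1 wins, 2 if player 2 wins, and 0 for a draw; that return convention is an assumption, so MCTS_vs_MCTS would need a thin wrapper (or a modified return value) to fit it.

```python
def benchmark(play_game, budgets, opponent_budget=10, games=20):
    """Play `games` games per budget and record player 1's win and draw rates.

    `play_game(x, y)` is a hypothetical callable returning 1, 2, or 0
    (player 1 win, player 2 win, draw).
    """
    curve = {}
    for x in budgets:
        results = [play_game(x, opponent_budget) for _ in range(games)]
        curve[x] = {
            "p1_win_rate": results.count(1) / games,
            "draw_rate": results.count(0) / games,
        }
    return curve


# Example usage with a wrapper around the project's function (assumed to exist):
# curve = benchmark(my_wrapped_mcts_game, budgets=[5, 10, 25, 50])
```

Plotting p1_win_rate against the budget then gives the performance curve.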

Evaluating Q-Learning on a larger board

The 3×5 test board is smaller than the 4×5 training board. To evaluate on the full training board or the 6×5 MCTS board:
  1. Open main.py and change rows and cols in MCTS_vs_Q() to match your desired dimensions.
  2. Retrain on the same board dimensions so the Q-table keys correspond to the new state encoding.
  3. Re-run evaluation with the updated q_data.dat.gz.
