The train_qlearning() function runs 50,000 training episodes in which a Q-Learning agent (player 2) learns by playing against an MCTS agent (player 1). Each episode starts from a clean board, and the Q-table grows incrementally across all episodes. Once all episodes complete, the Q-table is serialised to disk in both raw and compressed form.
Training Setup
All key hyperparameters are declared at the top of train_qlearning() in main.py:
| Parameter | Value | Role |
|---|---|---|
| r | 4 | Board rows |
| c | 5 | Board columns |
| num_iterations | 50,000 | Total training episodes |
| n (MCTS playouts) | 40 | Strength of the MCTS opponent per move |
| C_MCTS | 2 | UCB1 exploration constant for MCTS |
| alpha | 0.6 | Q-Learning rate |
| discount_factor | 0.8 | Gamma, the discount on future rewards |
| epsilon | 0.1 | ε-greedy exploration probability |
The q_values dict is passed by reference, so Q-table updates persist across the entire run.
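To make the roles of alpha, discount_factor, and epsilon concrete, here is a minimal sketch of a standard tabular Q-Learning update with ε-greedy action selection, using the values from the table. It is not taken from main.py: the function names, the `(state, action)` tuple keys, and the default of 0.0 for unseen pairs are all assumptions made for illustration.

```python
import random

ALPHA = 0.6            # learning rate (alpha in the table)
DISCOUNT_FACTOR = 0.8  # gamma
EPSILON = 0.1          # exploration probability


def choose_action(q_values, state, legal_actions, epsilon=EPSILON, rng=random):
    """ε-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(legal_actions)
    # Assumption: unseen (state, action) pairs are treated as 0.0.
    return max(legal_actions, key=lambda a: q_values.get((state, a), 0.0))


def q_update(q_values, state, action, reward, next_state, next_actions,
             alpha=ALPHA, gamma=DISCOUNT_FACTOR):
    """Standard update: Q <- Q + alpha * (r + gamma * max Q' - Q)."""
    old = q_values.get((state, action), 0.0)
    # A terminal state has no next actions, so its future value is 0.0.
    best_next = max(
        (q_values.get((next_state, a), 0.0) for a in next_actions),
        default=0.0,
    )
    q_values[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q_values[(state, action)]


# The dict is mutated in place, which is why updates persist across episodes.
q_values = {}
q_update(q_values, "s0", 2, reward=50, next_state="terminal", next_actions=[])
print(q_values)  # {('s0', 2): 30.0}
```

With alpha = 0.6, a single +50 terminal reward moves a fresh entry from 0 to 30, so it takes repeated visits for a Q-value to approach the full reward.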
The Training Loop
The loop in main.py follows a standard alternating-turn structure: MCTS moves on turn == 0, Q-Learning moves on turn == 1. After every episode, the cumulative reward earned by player2 is appended to the rewards list for later plotting.
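The alternating-turn structure described above can be sketched roughly as follows. Every callable passed in (`mcts_move`, `q_move`, `new_board`, `game_over`, `episode_reward`) is a hypothetical placeholder rather than a name from main.py; the sketch shows only the control flow, not the actual game or agents.

```python
def run_training(num_iterations, mcts_move, q_move, new_board, game_over,
                 episode_reward):
    """Alternate MCTS (turn 0) and Q-Learning (turn 1) until each game ends."""
    rewards = []
    for _ in range(num_iterations):
        board, turn = new_board(), 0           # each episode starts from a clean board
        while not game_over(board):
            board = mcts_move(board) if turn == 0 else q_move(board)
            turn = 1 - turn                    # hand the move to the other player
        rewards.append(episode_reward(board))  # player2's reward for this episode
    return rewards


# Toy usage: the "board" is just a move counter and a game ends after 4 moves.
rewards = run_training(
    num_iterations=3,
    mcts_move=lambda b: b + 1,
    q_move=lambda b: b + 1,
    new_board=lambda: 0,
    game_over=lambda b: b >= 4,
    episode_reward=lambda b: 0,
)
print(rewards)  # [0, 0, 0]
```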
Terminal Reward Handling
When a game ends, the end flag is set to True and a terminal reward must be pushed back into the Q-table before the episode closes:
- MCTS wins (turn == 1, result == "win"): set player2.game_status = "loss" and call player2.take_action() once more. This triggers the Q-update path that applies the −50 loss penalty to the last Q-Learning state-action pair.
- Q-Learning wins (turn == 0, result == "win"): no extra call is needed, because the winning move is already processed with the +50 reward inside take_action() itself.
- Draw (either turn): set player2.game_status = "draw" and call player2.take_action() to propagate the −10 draw penalty.
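The three cases above can be expressed as a small dispatch function. This is a sketch under stated assumptions: `StubQAgent` and `handle_terminal` are hypothetical names, and the stub's take_action() only books the penalty, whereas the real method presumably also performs the Q-table update.

```python
class StubQAgent:
    """Hypothetical stand-in for the Q-Learning agent (player2)."""

    def __init__(self):
        self.game_status = None
        self.total_reward = 0

    def take_action(self):
        # Only the terminal penalties are modelled in this stub.
        if self.game_status == "loss":
            self.total_reward += -50
        elif self.game_status == "draw":
            self.total_reward += -10


def handle_terminal(player2, turn, result):
    """Push the terminal outcome back to the Q-Learning agent."""
    if result == "draw":
        player2.game_status = "draw"
        player2.take_action()          # propagates the -10 draw penalty
    elif result == "win" and turn == 1:
        player2.game_status = "loss"   # MCTS made the winning move
        player2.take_action()          # applies the -50 loss penalty
    # result == "win" and turn == 0: Q-Learning won, and the +50 reward was
    # already applied by the take_action() call that made the winning move.


agent = StubQAgent()
handle_terminal(agent, turn=1, result="win")
print(agent.game_status, agent.total_reward)  # loss -50
```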
Convergence Monitoring
After all episodes finish, train_qlearning() computes a moving average of the rewards with mAverage(rewards, 1000), using a window of 1,000 episodes, and saves the resulting plot as MCTSvsQ.jpg.
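The implementation of mAverage is not shown here, so the following is only one plausible reading of it: a simple moving average that emits one value per full window. The name `moving_average` and the choice to skip partial windows are assumptions.

```python
def moving_average(values, window):
    """Return the mean of each full `window`-sized slice of `values`."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]   # drop the value that left the window
        if i >= window - 1:
            averages.append(running / window)
    return averages


# Small window so the arithmetic is easy to follow by hand.
print(moving_average([50, -50, -10, 50], 2))  # [0.0, -30.0, 20.0]
```

With 50,000 episodes and a window of 1,000, this version would yield 49,001 points, each smoothing out the win/loss noise of individual games.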
MCTSvsQ.jpg is saved automatically at the end of training. Open it to assess convergence: the moving-average reward should trend upward over episodes as the agent learns to avoid losses. A flat or declining curve indicates that the learning rate or epsilon may need adjustment.
Artifact Outputs
Two files are written to the working directory on completion:
| File | Format | Purpose |
|---|---|---|
| q_data.dat | Raw pickle | Python dict of state-action → Q-value mappings |
| q_data.dat.gz | gzip-compressed pickle | Portable, space-efficient copy for distribution |
The .gz version is what MCTS_vs_Q() loads during evaluation.
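For orientation, writing a dict in both forms might look like the sketch below. The helper name `save_q_table` is hypothetical, and the exact calls used in main.py may differ; only the standard-library pickle and gzip modules are used.

```python
import gzip
import pickle


def save_q_table(q_values, raw_path="q_data.dat"):
    """Write the Q-table as a raw pickle and as a gzip-compressed copy."""
    with open(raw_path, "wb") as f:
        pickle.dump(q_values, f)
    # gzip.open returns a file-like object, so pickle can write to it directly.
    with gzip.open(raw_path + ".gz", "wb") as f:
        pickle.dump(q_values, f)
```

Calling `save_q_table(q_values)` from the working directory would produce both q_data.dat and q_data.dat.gz.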
How to Run Training
Launch main.py and follow the interactive prompts.
Training prints the episode index (i) for each completed game, along with P1 win, P2 win, or Draw results.
Resuming or Extending Training
The Q-table is passed by reference between episodes, so any values accumulated during a previous run can be loaded and continued. The mechanism is already wired into the source:
- At the start of train_qlearning(), q_values is initialised as an empty dict.
- Before the loop, player2.set_Qvalues(q_values) binds the agent to that dict.
- To resume from a saved checkpoint, decompress q_data.dat.gz, unpickle the dict, and assign it to q_values before the loop begins.
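A minimal sketch of the resume step, assuming the checkpoint was written as a gzip-compressed pickle as described under Artifact Outputs. The helper `load_q_table` is a hypothetical name, and falling back to an empty dict when no checkpoint exists is a design choice of this sketch, not something main.py is known to do.

```python
import gzip
import os
import pickle


def load_q_table(path="q_data.dat.gz"):
    """Return the saved Q-table, or an empty dict when no checkpoint exists."""
    if not os.path.exists(path):
        return {}
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


q_values = load_q_table()   # would replace the empty-dict initialisation
# player2.set_Qvalues(q_values) would then bind the agent to the restored table.
```

Because pickle can execute arbitrary code while loading, only load checkpoint files that come from a trusted source.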