

Reinforcement learning (RL) frames intelligence as an agent interacting with an environment: at each step the agent observes a state, selects an action, and receives a reward signal. Over many episodes the agent learns a policy that maximises cumulative reward—without any labelled training data. This repository implements six RL projects spanning classic arcade games, procedurally generated mazes, and open-ended task automation, giving a practical progression from simple Q-learning grids to pixel-based deep Q-networks running inside real game emulators.
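
The loop below is a minimal illustration of that interaction cycle, using Gymnasium's CartPole-v1 and a random policy purely as an example; it does not correspond to any specific project in this repository.

import gymnasium as gym

# One episode of the observe -> act -> reward loop with a random policy.
env = gym.make("CartPole-v1")
state, _ = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()          # a learned policy would choose here
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated
env.close()
print(f"Return of a random policy: {total_reward:.0f}")
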
80 – Flappy Bird Agent

Objective: Train an agent to play Flappy Bird autonomously by learning when to flap to navigate through pipe gaps, maximising the distance traveled.
Algorithm: Deep Q-Network (DQN). The agent observes a compact state vector (bird y-position, vertical velocity, distance to next pipe, gap position) and outputs a binary action: flap or do nothing (see the state-encoding sketch after the run commands).
Environment: Custom Flappy Bird simulation (e.g., pygame-based or flappy-bird-gym).
Framework: PyTorch or TensorFlow with a replay buffer and target network for stable Q-value updates.
Key Technique: Experience replay + epsilon-greedy exploration annealing.
How to Run:
cd 80_Flappy_bird_Agent
pip install -r requirements.txt
python SRC/App.py
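
As a rough illustration of the compact state encoding described above, the sketch below assembles a 4-value state vector and queries a small Q-network for the flap decision. The field names, example values, and network sizes are assumptions for illustration, not code from the repository.

import torch
import torch.nn as nn

# Hypothetical compact state: [bird_y, bird_velocity, dist_to_next_pipe, gap_center_y]
state = torch.tensor([0.45, -0.02, 0.30, 0.55], dtype=torch.float32)

q_net = nn.Sequential(                 # 2 actions: 0 = do nothing, 1 = flap
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 2),
)

with torch.no_grad():
    action = q_net(state).argmax().item()
print("flap" if action == 1 else "do nothing")
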
81 – Mario-Playing RL Agent

Objective: Train an agent to complete levels of Super Mario Bros by moving right, jumping over enemies, and collecting rewards.
Algorithm: Proximal Policy Optimization (PPO) or DQN operating on raw pixel frames pre-processed into grayscale stacks.
Environment: gym-super-mario-bros wrapping the NES emulator via nes-py. Observation space is an 84×84×4 stacked grayscale frame tensor.
Framework: stable-baselines3 (PPO) or custom PyTorch DQN with convolutional feature extractor.
Key Technique: Frame stacking (4 consecutive frames) to encode motion; reward shaping based on x-position delta and time penalty (a preprocessing sketch follows the run commands).
How to Run:
cd 81_Mario-playing_RL_Agent
pip install -r requirements.txt
python SRC/App.py
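
A sketch of the frame pre-processing described above (grayscale conversion, 84×84 resize, 4-frame stack). It assumes OpenCV (cv2) for the image operations and works on any RGB frame array; it is not taken from gym-super-mario-bros or the repository itself.

from collections import deque

import cv2
import numpy as np

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Convert an RGB frame to an 84x84 grayscale image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

# Keep the last 4 processed frames; stacking them encodes motion.
frames = deque(maxlen=4)

def stack(frame: np.ndarray) -> np.ndarray:
    frames.append(preprocess(frame))
    while len(frames) < 4:                 # pad at episode start
        frames.append(frames[-1])
    return np.stack(frames, axis=0)        # shape: (4, 84, 84)

# Example with a dummy 240x256 RGB frame (NES resolution).
obs = stack(np.zeros((240, 256, 3), dtype=np.uint8))
print(obs.shape)
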
83 – Pong DDQN

Objective: Train an agent to defeat the built-in opponent in Atari Pong using a Double DQN to reduce Q-value overestimation.
Algorithm: Double DQN (DDQN). Unlike standard DQN, action selection and Q-value evaluation use separate networks (online and target), decoupling these two correlated operations and improving convergence stability (contrasted in the sketch after the run commands).
Environment: ALE/Pong-v5 via gymnasium[atari]. Observations are 210×160×3 RGB frames, pre-processed to 84×84 grayscale stacks of 4.
Framework: PyTorch. Replay buffer stores (state, action, reward, next_state, done) tuples; target network weights are synced every N steps.
Key Technique: Double Q-learning update rule, prioritised or uniform experience replay.
How to Run:
cd 83_Pong_DDQN
pip install -r requirements.txt
python SRC/App.py
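
The snippet below contrasts the standard DQN target with the Double DQN target on dummy tensors. The stand-in linear networks and the 8-dimensional state are placeholders for the convolutional networks and stacked frames used in the project.

import torch
import torch.nn as nn

GAMMA, batch, action_dim = 0.99, 4, 6      # ALE Pong exposes 6 discrete actions

online_net = nn.Linear(8, action_dim)      # stand-ins for the online/target CNNs
target_net = nn.Linear(8, action_dim)

next_states = torch.randn(batch, 8)
rewards     = torch.randn(batch, 1)
dones       = torch.zeros(batch, 1)

with torch.no_grad():
    # Standard DQN: the target network both selects and evaluates the best action.
    dqn_target = rewards + GAMMA * (1 - dones) * \
        target_net(next_states).max(dim=1, keepdim=True).values

    # Double DQN: the online network selects the action, the target network
    # evaluates it, which reduces systematic overestimation of Q-values.
    best_a      = online_net(next_states).argmax(dim=1, keepdim=True)
    ddqn_target = rewards + GAMMA * (1 - dones) * \
        target_net(next_states).gather(1, best_a)
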
84 – Breakout DQN

Objective: Train an agent to play Atari Breakout, learning to bounce the ball to break bricks and maximise the score across multiple lives.
Algorithm: DQN with a convolutional neural network (CNN) as the Q-function approximator—the canonical architecture from the DeepMind 2015 Nature paper (sketched after the run commands).
Environment: ALE/Breakout-v5 via gymnasium[atari]. Four-frame grayscale stacks at 84×84 resolution.
Framework: PyTorch. Replay memory of 100 k–1 M transitions; epsilon decays from 1.0 to 0.01 over the first million steps.
Key Technique: CNN feature extraction (3 conv layers + 2 FC layers), frame skipping (action repeated every 4 frames), reward clipping to ±1.
How to Run:
cd 84_Breakout_DQN
pip install -r requirements.txt
python SRC/App.py
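
A sketch of the Nature-paper convolutional Q-network described above (three conv layers plus two fully connected layers, operating on a 4×84×84 frame stack). Layer sizes follow the 2015 paper; this is our illustration rather than the repository's exact module.

import torch
import torch.nn as nn

class NatureCNN(nn.Module):
    """DeepMind 2015 DQN architecture for a 4x84x84 grayscale frame stack."""
    def __init__(self, action_dim: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x / 255.0))   # scale uint8 pixels to [0, 1]

q_net = NatureCNN(action_dim=4)                       # Breakout has 4 discrete actions
q_values = q_net(torch.zeros(1, 4, 84, 84))
print(q_values.shape)                                 # torch.Size([1, 4])
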
85 – Maze Solver RL

Objective: Train an agent to navigate from a start cell to a goal cell in a grid maze using the shortest possible path, without being given the maze layout in advance.
Algorithm: Tabular Q-learning for small discrete mazes; DQN for larger or procedurally generated mazes where the state space is too large for a Q-table (see the tabular update sketch after the run commands).
Environment: Custom grid-world environment. States are (row, col) coordinates; actions are the four moves up, down, left, and right. Reward: +10 on reaching the goal, −1 per step, −5 for hitting a wall.
Framework: NumPy (tabular) or PyTorch (DQN variant).
Key Technique: Epsilon-greedy exploration; for the DQN variant, the state is encoded as a flattened one-hot grid or a 2-D occupancy map passed through a small CNN.
How to Run:
cd 85_Maze_Solver_RL
pip install -r requirements.txt
python SRC/App.py
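
A minimal sketch of the tabular Q-learning variant, assuming a 10×10 grid with integer-encoded states and the reward values listed above; the maze environment itself is omitted, and the grid size and hyperparameters are illustrative.

import numpy as np

N_STATES, N_ACTIONS = 10 * 10, 4          # 10x10 grid (assumed); actions: up/down/left/right
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

Q = np.zeros((N_STATES, N_ACTIONS))

def choose_action(state: int) -> int:
    """Epsilon-greedy action selection over the Q-table."""
    if np.random.rand() < EPSILON:
        return np.random.randint(N_ACTIONS)
    return int(Q[state].argmax())

def q_update(state: int, action: int, reward: float, next_state: int, done: bool) -> None:
    """One Q-learning update: move Q(s, a) toward the Bellman target."""
    target = reward if done else reward + GAMMA * Q[next_state].max()
    Q[state, action] += ALPHA * (target - Q[state, action])

# Example transition: step from cell 0 to cell 1 with the per-step penalty of -1.
q_update(state=0, action=3, reward=-1.0, next_state=1, done=False)
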
86 – AI Personal Agent

Objective: Build an autonomous agent that can break down a high-level user goal into sub-tasks, call tools (web search, file I/O, code execution), and iterate until the goal is complete.
Algorithm: LLM-based policy (e.g., GPT-4 or an open-source equivalent) wrapped in a ReAct (Reasoning + Acting) loop. The agent alternates between a Thought step (chain-of-thought reasoning), an Action step (tool call), and an Observation step (tool result) until it outputs a final answer (see the loop sketch after the run commands).
Environment: Open-ended task space defined by the user’s prompt. Tools available to the agent may include web search, a Python REPL, a file reader, and API callers.
Framework: LangChain or a custom agent loop; tool results are appended to the context window at each step.
Key Technique: ReAct prompting, tool use via function calling, memory management to stay within context limits.
How to Run:
cd 86_AI_Personal_Agent
pip install -r requirements.txt
python SRC/App.py
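
A bare-bones sketch of the ReAct loop described above, using a placeholder llm callable and stub tools. The "Action: tool[input]" format, the tool names, and the parsing are illustrative assumptions, not the repository's or LangChain's actual interfaces; a real run would pass an actual model call as llm.

def web_search(query: str) -> str:          # stub tool (hypothetical)
    return f"[stub] search results for {query!r}"

def run_python(code: str) -> str:           # stub tool (hypothetical)
    return "[stub] code output"

TOOLS = {"web_search": web_search, "run_python": run_python}

def react_agent(goal: str, llm, max_steps: int = 10) -> str:
    """Alternate Thought -> Action -> Observation until the model emits a final answer."""
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        reply = llm(transcript)                          # model returns Thought + Action text
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        if "Action:" not in reply:
            continue
        # Parse a line of the form "Action: tool_name[tool input]" (format assumed).
        action_line = reply.split("Action:", 1)[1].strip().splitlines()[0]
        tool_name, _, tool_input = action_line.partition("[")
        tool = TOOLS.get(tool_name.strip())
        observation = tool(tool_input.rstrip("]").strip()) if tool else "unknown tool"
        transcript += f"Observation: {observation}\n"
    return "Stopped: step limit reached"

# Example with a canned "model" that answers immediately.
print(react_agent("What is 2 + 2?", llm=lambda t: "Thought: trivial.\nFinal Answer: 4"))
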

DQN training loop

The following snippet shows a standard DQN training loop—the core pattern shared by Projects 80, 83, and 84. It covers environment stepping, replay buffer sampling, the Bellman update (written here with the Double DQN target used in Project 83), and target network synchronisation.
import random
from collections import deque

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# --- 1. Q-Network ---
class DQN(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128),       nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# --- 2. Hyperparameters ---
ENV_NAME    = "CartPole-v1"   # swap for Atari env + wrappers
GAMMA       = 0.99
LR          = 1e-3
BATCH_SIZE  = 64
BUFFER_SIZE = 10_000
SYNC_EVERY  = 500             # steps between target network updates
EPSILON_START, EPSILON_END, EPSILON_DECAY = 1.0, 0.05, 5_000

env = gym.make(ENV_NAME)
state_dim  = env.observation_space.shape[0]
action_dim = env.action_space.n

online_net = DQN(state_dim, action_dim)
target_net = DQN(state_dim, action_dim)
target_net.load_state_dict(online_net.state_dict())
target_net.eval()

optimizer  = optim.Adam(online_net.parameters(), lr=LR)
replay     = deque(maxlen=BUFFER_SIZE)
total_steps = 0

# --- 3. Training loop ---
for episode in range(500):
    state, _ = env.reset()
    done = False
    ep_reward = 0.0

    while not done:
        # Epsilon-greedy action selection
        epsilon = EPSILON_END + (EPSILON_START - EPSILON_END) * \
                  np.exp(-total_steps / EPSILON_DECAY)
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                q = online_net(torch.tensor(state, dtype=torch.float32))
                action = q.argmax().item()

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        replay.append((state, action, reward, next_state, float(done)))
        state = next_state
        ep_reward += reward
        total_steps += 1

        # Learn once the buffer has enough samples
        if len(replay) >= BATCH_SIZE:
            batch = random.sample(replay, BATCH_SIZE)
            s, a, r, ns, d = zip(*batch)

            s  = torch.tensor(np.array(s),  dtype=torch.float32)
            a  = torch.tensor(a,             dtype=torch.long).unsqueeze(1)
            r  = torch.tensor(r,             dtype=torch.float32).unsqueeze(1)
            ns = torch.tensor(np.array(ns), dtype=torch.float32)
            d  = torch.tensor(d,             dtype=torch.float32).unsqueeze(1)

            # Bellman target (Double DQN variant: select action with online, evaluate with target)
            with torch.no_grad():
                best_actions = online_net(ns).argmax(1, keepdim=True)
                target_q     = r + GAMMA * (1 - d) * target_net(ns).gather(1, best_actions)

            current_q = online_net(s).gather(1, a)
            loss = nn.functional.mse_loss(current_q, target_q)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Sync target network
        if total_steps % SYNC_EVERY == 0:
            target_net.load_state_dict(online_net.state_dict())

    print(f"Episode {episode+1:4d} | reward: {ep_reward:7.1f} | ε: {epsilon:.3f}")

env.close()

Project comparison

Project | Algorithm | Environment | Framework | Key Technique
80 – Flappy Bird | DQN | Custom pygame / flappy-bird-gym | PyTorch | Replay buffer, target network
81 – Mario RL Agent | PPO / DQN | gym-super-mario-bros | stable-baselines3 / PyTorch | Frame stacking, reward shaping
83 – Pong DDQN | Double DQN | ALE/Pong-v5 (Gymnasium) | PyTorch | Decoupled action selection & evaluation
84 – Breakout DQN | DQN (CNN) | ALE/Breakout-v5 (Gymnasium) | PyTorch | Conv feature extractor, frame skip
85 – Maze Solver | Q-learning / DQN | Custom grid-world | NumPy / PyTorch | Tabular Q-table or deep Q-network, step penalty
86 – AI Personal Agent | ReAct (LLM policy) | Open-ended task space | LangChain | Tool use, chain-of-thought reasoning
The game-playing projects (80, 81, 83, 84) depend on specific environment packages. Install them before running:
pip install gymnasium[atari] ale-py          # Atari games (Pong, Breakout)
pip install gym-super-mario-bros nes-py      # Mario
pip install stable-baselines3                # PPO and other algorithms
pip install flappy-bird-gym                  # Flappy Bird
Atari environments additionally require the Atari ROM files. Follow the ale-py documentation to import ROMs legally using ale-import-roms.
Training pixel-based RL agents (Projects 80, 81, 83, 84) is computationally intensive. A CUDA-enabled GPU reduces training time from days to hours. If a GPU is not available, reduce the replay buffer size, lower the target resolution, or use a pre-built stable-baselines3 checkpoint as a starting point. For Projects 85 (grid maze) and 86 (personal agent), a CPU is sufficient—tabular Q-learning and LLM API calls do not require local GPU resources.
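
If a GPU is available, the training loop shown earlier only needs the networks and each sampled batch moved to the device. A minimal sketch, with a stand-in linear layer in place of the loop's online and target networks:

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")

# In the loop above: move both networks once after construction ...
net = nn.Linear(4, 2).to(device)              # stand-in for online_net / target_net
# ... and move each sampled batch to the same device before the forward pass:
batch = torch.randn(64, 4).to(device)
q_values = net(batch)
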
