Reinforcement learning (RL) is the branch of machine learning in which an agent learns to make decisions by interacting with an environment, collecting rewards, and gradually improving its strategy. Unlike supervised learning, there are no labelled examples: the agent discovers what works through trial and error. RL is behind breakthroughs such as AlphaGo and advances in robotic control, and it is used to fine-tune large language models. The Simple Reinforcement Learning series was created to make these ideas accessible. Every concept is implemented from scratch in a self-contained Jupyter notebook, so you can read the theory, run the code, and watch the agent improve in one place.

What This Course Covers

The series is organised as 18 numbered topics that build on each other progressively. You start with the simplest possible problem, choosing between slot machines, and finish with agents that must cooperate or compete with each other. Every notebook is written in plain Python, using PyTorch for the neural-network layers and OpenAI Gym for the simulation environments. The two environments you will encounter most often are CartPole-v1 (balance a pole on a moving cart, a classic discrete-action benchmark) and Pendulum-v1 (swing a pendulum upright, a continuous-action benchmark).

Note: All notebooks were validated against Python 3.9, PyTorch 1.12.1, and Gym 0.26.2. Using different versions may produce API mismatches. See the Setup guide for exact installation instructions.
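
Because version mismatches are the most likely source of errors, a quick check of the local environment can save time. The following is a minimal sketch, not part of the course repository: the function names `installed_version` and `check_environment` are hypothetical, and it assumes the packages are installed under the distribution names `torch` and `gym`.

```python
import sys
from importlib import metadata

# Versions the notebooks were validated against, as stated in the note above.
EXPECTED_PYTHON = (3, 9)
EXPECTED_PACKAGES = {"torch": "1.12.1", "gym": "0.26.2"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment():
    """Return a list of mismatch messages (empty if everything matches)."""
    problems = []
    found_python = sys.version_info[:2]
    if found_python != EXPECTED_PYTHON:
        problems.append(
            f"Python {found_python[0]}.{found_python[1]} found, "
            f"expected {EXPECTED_PYTHON[0]}.{EXPECTED_PYTHON[1]}"
        )
    for package, expected in EXPECTED_PACKAGES.items():
        found = installed_version(package)
        if found is None:
            problems.append(f"{package} is not installed (expected {expected})")
        # Builds may carry a local suffix such as "+cu116"; ignore it.
        elif found.split("+")[0] != expected:
            problems.append(f"{package} {found} found, expected {expected}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["Environment matches the validated versions."]:
        print(line)
```

A non-empty result does not mean the notebooks will fail, only that you are outside the configuration the note describes.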

Tech Stack

| Component | Version |
| --- | --- |
| Python | 3.9 |
| PyTorch | 1.12.1 |
| Gym | 0.26.2 |

Course Structure

The topics are stored in numbered folders inside the repository. The sections below group them into four phases of the curriculum.

Foundations

Topics 1–4 — Multi-armed bandits (folder 1), Markov Decision Processes (folder 2), dynamic programming (folder 3), and temporal-difference learning (folder 4). You learn how value functions and policies are defined before any neural network is involved.
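
To give a flavour of the first topic, here is a minimal sketch of a multi-armed bandit solved with an epsilon-greedy strategy in plain Python. It is an illustration under stated assumptions, not the code from folder 1: the function name `run_epsilon_greedy`, the Bernoulli reward model, and the default parameters are all hypothetical choices, and epsilon-greedy is only one of several strategies a bandit chapter might cover.

```python
import random


def run_epsilon_greedy(true_probs, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit and return (estimates, counts, total_reward).

    true_probs: hidden payout probability of each slot machine (arm).
    epsilon:    probability of exploring a random arm instead of the best one.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms        # how often each arm was pulled
    estimates = [0.0] * n_arms   # running mean reward of each arm
    total_reward = 0

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit

        # The arm pays 1 with its hidden probability, otherwise 0.
        reward = 1 if rng.random() < true_probs[arm] else 0

        # Incremental mean: new = old + (reward - old) / n
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, counts, total_reward


if __name__ == "__main__":
    estimates, counts, total = run_epsilon_greedy([0.2, 0.5, 0.8])
    print("estimated payout per arm:", [round(e, 2) for e in estimates])
    print("pulls per arm:", counts)
    print("total reward:", total)
```

The running-mean update is a simple value estimate: the agent keeps one number per action and acts on it, with no neural network involved.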

Classic Control

Topics 5–8 — DynaQ model-based planning (folder 5), Deep Q-Networks including Double DQN and Dueling DQN (folder 6), REINFORCE policy gradient (folder 7), and Actor-Critic (folder 8). CartPole-v1 is solved for the first time here.

Policy Gradient Methods

Topics 10–14 — Proximal Policy Optimization (folder 10), DDPG (folder 11), Soft Actor-Critic (folder 12), imitation learning (folder 13), and offline learning (folder 14). Pendulum-v1 is introduced as the continuous-action testbed.

Advanced Topics

Topics 15–18 — Model Predictive Control (folder 15), MBPO (folder 16), goal-oriented RL (folder 17), and multi-agent systems (folder 18). These chapters connect the fundamentals to modern research directions.

Why Build From Scratch?

Many RL libraries hide the interesting parts behind high-level APIs. This course takes the opposite approach: every update rule, every replay buffer, and every neural-network architecture is written out explicitly so you can see exactly what is happening at each step. The goal is not just to run a pre-built agent, but to understand the mechanics well enough to modify them, debug them, and apply them to new problems.
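
As an example of what "written out explicitly" can look like, the following is a minimal sketch of a replay buffer in plain Python. It is not taken from the notebooks: the class name `ReplayBuffer`, its method names, and the transition layout `(state, action, reward, next_state, done)` are assumptions made for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return five tuples: states, actions, rewards, next_states, dones."""
        batch = self.rng.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=3, seed=0)
    for step in range(5):
        buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)
    print(len(buffer))       # only the three most recent transitions remain
    print(buffer.sample(2))
```

Sampling uniformly from past transitions breaks the correlation between consecutive steps, which is the usual reason off-policy methods such as DQN keep a buffer like this.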

If you prefer to watch before you read, the full companion video course is available on Bilibili. The videos walk through each notebook step by step and are a great complement to the written explanations.

Repository

All source notebooks, utility scripts, and saved checkpoints live in the GitHub repository. You can browse the numbered folders to jump directly to a topic, or clone the whole repository locally and run everything in Jupyter.

Next Steps

Once you are familiar with what the course covers, head to the Setup guide to create your Python environment and install the exact package versions needed to run every notebook without errors.