Reinforcement learning (RL) is the branch of machine learning where an agent learns to make decisions by interacting with an environment, collecting rewards, and gradually improving its strategy over time. Unlike supervised learning, there are no labelled examples; the agent discovers what works through trial and error. This makes RL the foundation behind breakthroughs such as AlphaGo, robotic control, and large language model fine-tuning. The Simple Reinforcement Learning series was created to make these ideas accessible: every concept is implemented from scratch in a self-contained Jupyter notebook, so you can read the theory, run the code, and watch the agent improve, all in one place.
What This Course Covers
The series is organised as 18 numbered topics that build on each other progressively. You start with the simplest possible problem, choosing between slot machines, and finish with agents that must cooperate or compete with each other. Every notebook is written in plain Python, using PyTorch for the neural-network layers and OpenAI Gym for the simulation environments. The two environments you will encounter most often are CartPole-v1 (balance a pole on a moving cart, a classic discrete-action benchmark) and Pendulum-v1 (swing a pendulum upright, a continuous-action benchmark).
All notebooks were validated against Python 3.9, PyTorch 1.12.1, and Gym 0.26.2. Using different versions may produce API mismatches. See the Setup guide for exact installation instructions.
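
To give a feel for how an agent interacts with these environments, here is a minimal sketch of one episode played with random actions. It assumes the Gym 0.26-style API, where `reset()` returns `(observation, info)` and `step()` returns `(observation, reward, terminated, truncated, info)`. The function name `run_random_episode` and its parameters are illustrative and are not taken from the notebooks.

```python
def run_random_episode(env, seed=0, max_steps=500):
    """Play one episode with uniformly random actions and return the total reward.

    Assumes the Gym 0.26-style API: reset() returns (observation, info) and
    step() returns (observation, reward, terminated, truncated, info).
    """
    observation, info = env.reset(seed=seed)
    total_reward = 0.0
    for _ in range(max_steps):
        # Random policy: no learning happens here, this only shows the loop.
        action = env.action_space.sample()
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += float(reward)
        # terminated: the task ended (e.g. the pole fell);
        # truncated: the environment's time limit was reached.
        if terminated or truncated:
            break
    return total_reward
```

With Gym installed, passing `gym.make("CartPole-v1")` or `gym.make("Pendulum-v1")` to this function should work unchanged, because the loop never inspects whether the action space is discrete or continuous.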
Tech Stack
| Component | Version |
|---|---|
| Python | 3.9 |
| PyTorch | 1.12.1 |
| Gym | 0.26.2 |
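
Before opening a notebook, it can help to confirm that the local environment matches the table above. The following is a small sketch using only the standard library; the helper names `check_versions` and `python_version_ok` are hypothetical, and the distribution names `torch` and `gym` are the names these packages are published under on PyPI.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Versions the notebooks were validated against (from the table above).
EXPECTED = {"torch": "1.12.1", "gym": "0.26.2"}


def python_version_ok(expected=(3, 9)):
    """Return True if the running interpreter matches the expected major.minor."""
    return tuple(sys.version_info[:2]) == tuple(expected)


def check_versions(expected=EXPECTED):
    """Return {package: (installed, wanted)} for every missing or mismatched package."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Local build tags such as "1.12.1+cu113" still count as a match.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[package] = (installed, wanted)
    return mismatches
```

An empty dictionary from `check_versions()` means both packages match; any entry shows the installed version (or `None`) next to the expected one.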
Course Structure
The topics are stored in numbered folders inside the repository. The groups below arrange them into four phases of the curriculum.
Foundations
Topics 1–4 — Multi-armed bandits (folder 1), Markov Decision Processes (folder 2), dynamic programming (folder 3), and temporal-difference learning (folder 4). You learn how value functions and policies are defined before any neural network is involved.
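
As an illustration of the slot-machine problem that opens this phase, here is a minimal epsilon-greedy bandit sketch in plain Python. It is not the notebook's implementation: the function name `run_bandit`, the win probabilities, and the default `epsilon=0.1` are all assumptions chosen for the example.

```python
import random


def run_bandit(win_probabilities, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy play on Bernoulli slot machines.

    Returns (estimates, counts): the estimated win rate and the number of
    pulls for each arm.
    """
    rng = random.Random(seed)
    n_arms = len(win_probabilities)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick any arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest estimated win rate.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < win_probabilities[arm] else 0.0
        counts[arm] += 1
        # Incremental mean: new = old + (reward - old) / n
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts
```

Calling `run_bandit([0.2, 0.5, 0.8])` should leave the third arm with by far the most pulls, since the agent's estimate of its win rate approaches 0.8 and the greedy choice settles on it.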
Classic Control
Topics 5–8 — DynaQ model-based planning (folder 5), Deep Q-Networks including Double DQN and Dueling DQN (folder 6), REINFORCE policy gradient (folder 7), and Actor-Critic (folder 8). CartPole-v1 is solved for the first time here.
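
REINFORCE weights each action by the discounted return that followed it, so computing those returns is a building block worth seeing in isolation. The sketch below is a generic version of that computation, not code from the notebooks; the name `discounted_returns` and the default `gamma=0.98` are illustrative assumptions.

```python
def discounted_returns(rewards, gamma=0.98):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For example, `discounted_returns([1, 1, 1], gamma=0.5)` gives `[1.75, 1.5, 1.0]`: the last step keeps its own reward, and each earlier step adds half of the return that follows it.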
Policy Gradient Methods
Topics 10–14 — Proximal Policy Optimization (folder 10), DDPG (folder 11), Soft Actor-Critic (folder 12), imitation learning (folder 13), and offline learning (folder 14). Pendulum-v1 is introduced as the continuous-action testbed.
Advanced Topics
Topics 15–18 — Model Predictive Control (folder 15), MBPO (folder 16), goal-oriented RL (folder 17), and multi-agent systems (folder 18). These chapters connect the fundamentals to modern research directions.