The GPU Mode lecture series is built on top of Programming Massively Parallel Processors (PMPP), the canonical textbook for GPU programming. Lecture 2 by Andreas Koepf recaps chapters 1–3, covering the conceptual foundations (thread hierarchy, memory spaces, and the SIMT execution model), while Lecture 3 by Jeremy Howard translates those foundations into working code for Python programmers.
The PMPP book
PMPP (Programming Massively Parallel Processors) by Hwu, Kirk, and El Hajj is the standard reference for learning CUDA. Chapters 1–3 establish the mental model you need for everything else:
- Chapter 1: Why GPUs exist, and the divergence between CPU and GPU design goals (latency vs. throughput)
- Chapter 2: CUDA programming model (kernels, threads, blocks, and grids)
- Chapter 3: Memory architecture (global, shared, local, constant, and texture memory)
The GPU Mode lectures assume you have access to the PMPP book or are following along with the lecture slides. The Lecture 2 slides are available publicly.
CUDA thread hierarchy
CUDA organizes threads into a three-level hierarchy: threads → blocks → grids. This hierarchy is the single most important concept in CUDA programming.
Thread
The smallest unit of execution. Each thread runs the same kernel function but operates on different data via its unique index.
Block
A group of threads that execute together on the same Streaming Multiprocessor (SM). Threads in the same block can communicate via shared memory and synchronize with __syncthreads().
Grid
A collection of blocks launched by a single kernel call. Blocks in a grid are independent — they cannot communicate directly and may execute in any order.
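One way to picture the hierarchy is as nested loops on the CPU. The sketch below is a plain-Python emulation, not CUDA: run_grid and its parameters are hypothetical names chosen for illustration, the shared list only stands in for per-block shared memory, and the block order is shuffled to mimic the fact that blocks may execute in any order.

```python
import random


def run_grid(num_blocks, threads_per_block, data, seed=0):
    """Emulate a 1D grid in which each block sums its own slice of `data`."""
    out = [0] * num_blocks
    block_ids = list(range(num_blocks))
    # Blocks are independent, so any execution order must give the same result.
    random.Random(seed).shuffle(block_ids)
    for block_id in block_ids:
        # Stand-in for shared memory: visible only to threads of this block.
        shared = [0] * threads_per_block
        for thread_id in range(threads_per_block):
            i = block_id * threads_per_block + thread_id
            if i < len(data):  # threads past the end of the data do nothing
                shared[thread_id] = data[i]
        # Every thread of the block has written its value by this point,
        # which is the guarantee __syncthreads() provides on a real GPU.
        out[block_id] = sum(shared)
    return out


data = list(range(10))
result_a = run_grid(3, 4, data, seed=1)
result_b = run_grid(3, 4, data, seed=2)
```

Both calls produce the same per-block sums even though the blocks run in different orders, because no block reads anything another block wrote.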
Thread indexing
Every thread has a unique position described by built-in variables:

| Variable | Description |
|---|---|
| threadIdx.x/y/z | Thread’s position within its block |
| blockIdx.x/y/z | Block’s position within the grid |
| blockDim.x/y/z | Number of threads per block in each dimension |
| gridDim.x/y/z | Number of blocks in the grid in each dimension |
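These variables are usually combined into a global index, for example a row and column when the data is two-dimensional. The following Python sketch emulates that arithmetic on the CPU; Dim3, global_index_2d, and covered_cells are hypothetical helpers written for this illustration, and the mapping of x to columns and y to rows is a common convention rather than a requirement.

```python
from collections import namedtuple

# Minimal stand-in for CUDA's dim3: unspecified dimensions default to 1.
Dim3 = namedtuple("Dim3", ["x", "y", "z"], defaults=[1, 1, 1])


def global_index_2d(block_idx, block_dim, thread_idx):
    """Combine block and thread coordinates into a (row, col) pair."""
    col = block_idx.x * block_dim.x + thread_idx.x
    row = block_idx.y * block_dim.y + thread_idx.y
    return row, col


def covered_cells(grid_dim, block_dim, height, width):
    """List the (row, col) cells that in-bounds threads would process."""
    cells = []
    for by in range(grid_dim.y):
        for bx in range(grid_dim.x):
            for ty in range(block_dim.y):
                for tx in range(block_dim.x):
                    row, col = global_index_2d(
                        Dim3(bx, by, 0), block_dim, Dim3(tx, ty, 0)
                    )
                    # Bounds guard: the grid usually overshoots the data.
                    if row < height and col < width:
                        cells.append((row, col))
    return cells


height, width = 3, 5
block_dim = Dim3(4, 2)  # 4 threads along x, 2 along y
grid_dim = Dim3(2, 2)   # 2 x 2 blocks, so 32 threads for 15 cells
cells = covered_cells(grid_dim, block_dim, height, width)
```

The grid launches more threads than there are cells, so the bounds check is what keeps the extra threads from touching memory outside the array.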
SIMT execution model
GPUs execute threads in groups of 32 called warps. All threads in a warp execute the same instruction simultaneously. This is the Single Instruction, Multiple Threads (SIMT) model, analogous to SIMD on CPUs. SIMT differs from SIMD in one key way: each SIMT thread has its own registers and its own logical program counter, so divergence is handled automatically, though at a performance cost. When threads in a warp take different branches, the hardware runs each path in turn with the inactive threads masked off, then re-converges at the end of the divergent region.
Memory spaces
CUDA provides several distinct memory spaces, each with different scope, lifetime, and performance characteristics.
Global memory
The main GPU memory (device RAM). All threads in all blocks can read and write it. It has the largest capacity (gigabytes) but also the highest latency (roughly 400–800 cycles). Coalesced access patterns, in which neighboring threads touch neighboring addresses, are critical for performance. This is the memory you get from cudaMalloc, and it is where PyTorch tensors on a CUDA device are stored.
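To see why coalescing matters, it can help to count how many memory segments one warp touches for a given access pattern. The Python sketch below is a toy model, not a description of any particular GPU: the 128-byte segment size and 4-byte element size are assumptions made for illustration, and segments_touched is a hypothetical helper.

```python
WARP_SIZE = 32       # threads per warp
SEGMENT_BYTES = 128  # assumed size of one memory transaction (toy model)
ELEM_BYTES = 4       # assumed element size, e.g. a 32-bit float


def segments_touched(stride, warp_size=WARP_SIZE):
    """Count distinct segments hit when lane k reads element k * stride."""
    addresses = [lane * stride * ELEM_BYTES for lane in range(warp_size)]
    return len({addr // SEGMENT_BYTES for addr in addresses})


coalesced = segments_touched(stride=1)   # neighboring lanes, neighboring elements
strided = segments_touched(stride=32)    # each lane lands in its own segment
```

Under these assumptions the contiguous pattern needs a single segment for the whole warp, while the strided pattern needs one segment per thread for the same amount of useful data.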
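Shared memory
On-chip memory visible to all threads in one block. As described under Block above, threads in the same block use it to communicate, synchronizing with __syncthreads().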
Registers
Per-thread private storage. The fastest memory — effectively free to access. Declared as ordinary local variables in kernels. There is a finite number per SM; spilling registers to local memory (which maps to global memory) hurts performance significantly.
Constant memory
Read-only memory cached specifically for broadcast access patterns (all threads reading the same address). Declared with __constant__. Good for kernel parameters, lookup tables, and configuration that all threads need.
Texture memory
Read-only memory with a spatial locality cache. Originally designed for graphics textures, it can provide better cache performance for 2D spatial access patterns. Less commonly used in modern compute kernels.
Local memory
Per-thread storage that physically lives in global memory. The compiler uses it automatically when a thread’s register usage exceeds the hardware limit (register spilling), or for per-thread arrays indexed with values that are not known at compile time. Avoid it when possible, because it has global memory latency.
Kernel launch syntax
A CUDA kernel is launched from host (CPU) code using the <<<grid, block>>> syntax.
The <<<num_blocks, threads_per_block>>> syntax sets the grid dimension (number of blocks) and block dimension (threads per block). Both can be 1D, 2D, or 3D using dim3.
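The launch syntax itself only runs under a CUDA toolchain, so the sketch below emulates a 1D launch on the CPU in Python to show the usual launch-configuration arithmetic. The names launch, cdiv, and vector_add are hypothetical, and the block size of 256 is an arbitrary choice for the example.

```python
def cdiv(a, b):
    """Ceiling division: the fewest blocks of size b that cover a items."""
    return (a + b - 1) // b


def launch(kernel, num_blocks, threads_per_block, *args):
    """CPU stand-in for kernel<<<num_blocks, threads_per_block>>>(args...)."""
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, thread_idx, threads_per_block, *args)


def vector_add(block_idx, thread_idx, block_dim, a, b, out, n):
    """Body executed once per emulated thread."""
    i = block_idx * block_dim + thread_idx
    if i < n:  # the last block may have more threads than remaining elements
        out[i] = a[i] + b[i]


n = 1000
threads_per_block = 256
num_blocks = cdiv(n, threads_per_block)  # round up so every element is covered

a = list(range(n))
b = [2 * x for x in a]
out = [0] * n
launch(vector_add, num_blocks, threads_per_block, a, b, out, n)
```

Rounding the block count up means the final block is only partly used, which is why the kernel body checks its index against n before writing.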
Key qualifiers
| Qualifier | Runs on | Called from |
|---|---|---|
| __global__ | GPU | CPU (or GPU with dynamic parallelism) |
| __device__ | GPU | GPU only |
| __host__ | CPU | CPU only |
Lecture references
Lecture 2: PMPP Chapters 1–3 Recap
Slides by Andreas Koepf covering the PMPP foundations
Lecture 3: Getting Started with CUDA
Colab notebook by Jeremy Howard — run your first CUDA kernel