Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/gpu-mode/lectures/llms.txt

Use this file to discover all available pages before exploring further.

The GPU Mode lecture series is built on top of Programming Massively Parallel Processors (PMPP), the canonical textbook for GPU programming. Lecture 2 by Andreas Koepf recaps chapters 1–3, covering the conceptual foundations — thread hierarchy, memory spaces, and the SIMT execution model — while Lecture 3 by Jeremy Howard translates those foundations into working code for Python programmers.

The PMPP book

PMPP (Programming Massively Parallel Processors) by Hwu, Kirk, and Hajj is the standard reference for learning CUDA. Chapters 1–3 establish the mental model you need for everything else:
  • Chapter 1 — Why GPUs exist: the divergence between CPU and GPU design goals (latency vs. throughput)
  • Chapter 2 — CUDA programming model: kernels, threads, blocks, and grids
  • Chapter 3 — Memory architecture: global, shared, local, constant, and texture memory
The GPU Mode lectures assume you have access to the PMPP book or are following along with the lecture slides. The Lecture 2 slides are available publicly.

CUDA thread hierarchy

CUDA organizes threads into a three-level hierarchy: threads → blocks → grids. Understanding this hierarchy is the single most important concept in CUDA programming.

Thread

The smallest unit of execution. Each thread runs the same kernel function but operates on different data via its unique index.

Block

A group of threads that execute together on the same Streaming Multiprocessor (SM). Threads in the same block can communicate via shared memory and synchronize with __syncthreads().

Grid

A collection of blocks launched by a single kernel call. Blocks in a grid are independent — they cannot communicate directly and may execute in any order.

Thread indexing

Every thread has a unique position described by built-in variables:
VariableDescription
threadIdx.x/y/zThread’s position within its block
blockIdx.x/y/zBlock’s position within the grid
blockDim.x/y/zNumber of threads per block in each dimension
gridDim.x/y/zNumber of blocks in the grid in each dimension
The canonical formula for a 1D global thread index is:
int i = blockIdx.x * blockDim.x + threadIdx.x;
For 2D problems (e.g., matrix operations), you use both x and y dimensions:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

SIMT execution model

GPUs execute threads in groups of 32 called warps. All threads in a warp execute the same instruction simultaneously — this is the Single Instruction, Multiple Threads (SIMT) model, analogous to SIMD on CPUs.
Warp divergence occurs when threads in the same warp take different code paths (e.g., different branches of an if statement). The GPU serializes the divergent paths, reducing parallelism. Minimize divergence within warps for best performance.
SIMT differs from SIMD in one key way: SIMT threads have their own registers and program counters, so divergence is handled automatically (though at a performance cost). The hardware masks off inactive threads and re-converges at the end of the divergent region.

Memory spaces

CUDA provides several distinct memory spaces, each with different scope, lifetime, and performance characteristics.
The main GPU memory (device RAM). All threads in all blocks can read and write it. It has the highest capacity (gigabytes) but the highest latency (~400–800 cycles). Coalesced access patterns are critical for performance. This is what you get from cudaMalloc and PyTorch tensors.
On-chip memory shared by all threads in the same block. Much faster than global memory (~5 ns vs ~200+ ns) with explicit programmer control. Declared with __shared__. Capacity is limited (typically 48–96 KB per SM). Used for tiling, reductions, and any pattern requiring inter-thread communication within a block.
Per-thread private storage. The fastest memory — effectively free to access. Declared as ordinary local variables in kernels. There is a finite number per SM; spilling registers to local memory (which maps to global memory) hurts performance significantly.
Read-only memory cached specifically for broadcast access patterns (all threads reading the same address). Declared with __constant__. Good for kernel parameters, lookup tables, and configuration that all threads need.
Read-only memory with a spatial locality cache. Originally designed for graphics textures, it can provide better cache performance for 2D spatial access patterns. Less commonly used in modern compute kernels.
Per-thread storage that physically lives in global memory. Used automatically by the compiler when a thread’s register usage exceeds the hardware limit (register spilling), or for variable-length arrays. Avoid when possible — it has global memory latency.

Kernel launch syntax

A CUDA kernel is launched from host (CPU) code using the <<<grid, block>>> syntax.
// Kernel definition — runs on the GPU
__global__ void square_kernel(float* input, float* output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        output[i] = input[i] * input[i];
    }
}

// Host code — launches the kernel on the GPU
int n = 1024;
int threads_per_block = 256;
int num_blocks = (n + threads_per_block - 1) / threads_per_block; // ceiling division

square_kernel<<<num_blocks, threads_per_block>>>(d_input, d_output, n);
The <<<num_blocks, threads_per_block>>> syntax sets the grid dimension (number of blocks) and block dimension (threads per block). Both can be 1D, 2D, or 3D using dim3:
dim3 block(16, 16);                      // 16×16 = 256 threads per block
dim3 grid(n_cols / 16, n_rows / 16);     // 2D grid for a matrix
matrix_kernel<<<grid, block>>>(...);

Key qualifiers

QualifierRuns onCalled from
__global__GPUCPU (or GPU with dynamic parallelism)
__device__GPUGPU only
__host__CPUCPU only

Lecture references

Lecture 2: PMPP Chapters 1–3 Recap

Slides by Andreas Koepf covering the PMPP foundations

Lecture 3: Getting Started with CUDA

Colab notebook by Jeremy Howard — run your first CUDA kernel

Build docs developers (and LLMs) love