CUDA Programming: PMPP Book Chapter Guide

The GPU Mode lecture series is built on top of Programming Massively Parallel Processors (PMPP), the canonical textbook for GPU programming. Lecture 2 by Andreas Koepf recaps chapters 1–3, covering the conceptual foundations — thread hierarchy, memory spaces, and the SIMT execution model — while Lecture 3 by Jeremy Howard translates those foundations into working code for Python programmers.

The PMPP book

PMPP (Programming Massively Parallel Processors) by Hwu, Kirk, and Hajj is the standard reference for learning CUDA. Chapters 1–3 establish the mental model you need for everything else:

Chapter 1 — Why GPUs exist: the divergence between CPU and GPU design goals (latency vs. throughput)
Chapter 2 — CUDA programming model: kernels, threads, blocks, and grids
Chapter 3 — Memory architecture: global, shared, local, constant, and texture memory

The GPU Mode lectures assume you have access to the PMPP book or are following along with the lecture slides. The Lecture 2 slides are available publicly.

CUDA thread hierarchy

CUDA organizes threads into a three-level hierarchy: threads → blocks → grids. Understanding this hierarchy is the single most important concept in CUDA programming.

Thread

The smallest unit of execution. Each thread runs the same kernel function but operates on different data via its unique index.

Block

A group of threads that execute together on the same Streaming Multiprocessor (SM). Threads in the same block can communicate via shared memory and synchronize with __syncthreads().

Grid

A collection of blocks launched by a single kernel call. Blocks in a grid are independent — they cannot communicate directly and may execute in any order.

Thread indexing

Every thread has a unique position described by built-in variables:

Variable	Description
`threadIdx.x/y/z`	Thread’s position within its block
`blockIdx.x/y/z`	Block’s position within the grid
`blockDim.x/y/z`	Number of threads per block in each dimension
`gridDim.x/y/z`	Number of blocks in the grid in each dimension

The canonical formula for a 1D global thread index is:

int i = blockIdx.x * blockDim.x + threadIdx.x;

For 2D problems (e.g., matrix operations), you use both x and y dimensions:

int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

SIMT execution model

GPUs execute threads in groups of 32 called warps. All threads in a warp execute the same instruction simultaneously — this is the Single Instruction, Multiple Threads (SIMT) model, analogous to SIMD on CPUs.

Warp divergence occurs when threads in the same warp take different code paths (e.g., different branches of an if statement). The GPU serializes the divergent paths, reducing parallelism. Minimize divergence within warps for best performance.

SIMT differs from SIMD in one key way: SIMT threads have their own registers and program counters, so divergence is handled automatically (though at a performance cost). The hardware masks off inactive threads and re-converges at the end of the divergent region.

Memory spaces

CUDA provides several distinct memory spaces, each with different scope, lifetime, and performance characteristics.

Global memory

The main GPU memory (device RAM). All threads in all blocks can read and write it. It has the highest capacity (gigabytes) but the highest latency (~400–800 cycles). Coalesced access patterns are critical for performance. This is what you get from cudaMalloc and PyTorch tensors.

Shared memory

On-chip memory shared by all threads in the same block. Much faster than global memory (~5 ns vs ~200+ ns) with explicit programmer control. Declared with __shared__. Capacity is limited (typically 48–96 KB per SM). Used for tiling, reductions, and any pattern requiring inter-thread communication within a block.

Registers

Per-thread private storage. The fastest memory — effectively free to access. Declared as ordinary local variables in kernels. There is a finite number per SM; spilling registers to local memory (which maps to global memory) hurts performance significantly.

Constant memory

Read-only memory cached specifically for broadcast access patterns (all threads reading the same address). Declared with __constant__. Good for kernel parameters, lookup tables, and configuration that all threads need.

Texture memory

Read-only memory with a spatial locality cache. Originally designed for graphics textures, it can provide better cache performance for 2D spatial access patterns. Less commonly used in modern compute kernels.

Local memory

Per-thread storage that physically lives in global memory. Used automatically by the compiler when a thread’s register usage exceeds the hardware limit (register spilling), or for variable-length arrays. Avoid when possible — it has global memory latency.

Kernel launch syntax

A CUDA kernel is launched from host (CPU) code using the <<<grid, block>>> syntax.

// Kernel definition — runs on the GPU
__global__ void square_kernel(float* input, float* output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        output[i] = input[i] * input[i];
    }
}

// Host code — launches the kernel on the GPU
int n = 1024;
int threads_per_block = 256;
int num_blocks = (n + threads_per_block - 1) / threads_per_block; // ceiling division

square_kernel<<<num_blocks, threads_per_block>>>(d_input, d_output, n);

The <<<num_blocks, threads_per_block>>> syntax sets the grid dimension (number of blocks) and block dimension (threads per block). Both can be 1D, 2D, or 3D using dim3:

dim3 block(16, 16);                      // 16×16 = 256 threads per block
dim3 grid(n_cols / 16, n_rows / 16);     // 2D grid for a matrix
matrix_kernel<<<grid, block>>>(...);

Key qualifiers

Qualifier	Runs on	Called from
`__global__`	GPU	CPU (or GPU with dynamic parallelism)
`__device__`	GPU	GPU only
`__host__`	CPU	CPU only

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

CUDA Programming: PMPP Book Chapter Guide

The PMPP book

CUDA thread hierarchy

Thread

Block

Grid

Thread indexing

SIMT execution model

Memory spaces

Kernel launch syntax

Key qualifiers

Lecture references

Lecture 2: PMPP Chapters 1–3 Recap

Lecture 3: Getting Started with CUDA

Build docs developers (and LLMs) love

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

Documentation Index

​The PMPP book

​CUDA thread hierarchy

Thread

Block

Grid

​Thread indexing

​SIMT execution model

​Memory spaces

​Kernel launch syntax

​Key qualifiers

​Lecture references

Lecture 2: PMPP Chapters 1–3 Recap

Lecture 3: Getting Started with CUDA

Build docs developers (and LLMs) love

The PMPP book

CUDA thread hierarchy

Thread indexing

SIMT execution model

Memory spaces

Kernel launch syntax

Key qualifiers

Lecture references