Reduction is one of the most fundamental parallel primitives: combine N values into one using an associative operator (sum, max, product, etc.). Lecture 9 by Mark Saroufim uses reduction as a vehicle for teaching the full spectrum of CUDA optimization techniques — from a naive implementation to a highly optimized version using shared memory and thread coarsening. The code lives inDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/gpu-mode/lectures/llms.txt
Use this file to discover all available pages before exploring further.
lecture_009/.
The reduction problem
Given an array of N floats, compute their sum. Sequentially this is O(N). In parallel, a tree reduction achieves O(log N) steps using O(N) work.Simple tree reduction
The first implementation inlecture_009/simple_reduce.cu is a direct translation of the tree structure into CUDA:
size / 2 threads per block:
Fixing divergence
lecture_009/control_divergence_reduce.cu restructures the indexing so that active threads are always at the beginning of the warp:
Shared memory reduction
Both kernels above read and write to global memory at every tree step.lecture_009/shared_reduce.cu fixes this by loading data into shared memory once:
input_s[t] = input[t] + input[t + BLOCK_DIM] loads from global memory exactly once per thread. All subsequent additions happen in shared memory (~5 ns latency vs. ~200+ ns for global). The comment in the source is honest: “L1 throughput is dramatically increased though.”
__syncthreads() is placed at the top of the loop body (before the conditional), not after it. This ensures all threads have written their result before any thread reads its neighbors — a common correctness requirement in shared memory reductions.Thread coarsening for better arithmetic intensity
The shared memory version still launches one thread per two input elements.lecture_009/reduce_coarsening.cu goes further: each thread reduces a COARSE_FACTOR number of elements in registers before contributing to the shared memory tree:
- Using multiple blocks, each reducing a chunk
- Using
atomicAddto safely accumulate partial block results into the final output
Why coarsening helps
WithCOARSE_FACTOR = 2, each thread does twice as much arithmetic work per byte loaded from global memory — arithmetic intensity doubles. The register-level accumulation is free compared to the memory access cost. Increasing COARSE_FACTOR further can push a memory-bound reduction toward the compute roof.
Evolution of the implementations
SimpleSumReductionKernel
Direct tree reduction in global memory. Divergence due to
threadIdx.x % stride == 0. High global memory traffic.FixDivergenceKernel
Reordered indexing so active threads form a contiguous prefix. Eliminates warp divergence. Still global-memory bound.
SharedMemoryReduction
Load once into shared memory, reduce entirely on-chip. Dramatically reduces global memory transactions. Limited to one block (2048 elements).
Non-determinism and accuracy
Floating-point addition is not associative:(a + b) + c ≠ a + (b + c) in general due to rounding. Parallel reductions change the order of operations relative to a serial sum, producing slightly different results. This is expected and unavoidable.
The lecture_009/nondeterminism.py and lecture_009/accuracy.py files in the lecture repository explore this. For most ML training purposes, the difference is negligible. For numerical analysis or testing exact bit-equality, use double precision or Kahan summation.
Warp shuffle intrinsics
Modern CUDA also provides warp-level reduction using__shfl_down_sync, which avoids shared memory entirely for the final warp:
__shfl_down_sync(mask, val, delta) exchanges val between lanes within a warp without going through memory — the fastest possible reduction for 32 or fewer elements.
Lecture references
Lecture 9 code
simple_reduce.cu, shared_reduce.cu, reduce_coarsening.cu, and more
Lecture 9 slides
Reductions lecture slides by Mark Saroufim