Parallel Reductions in CUDA: Patterns and Optimization

Reduction is one of the most fundamental parallel primitives: combine N values into one using an associative operator (sum, max, product, etc.). Lecture 9 by Mark Saroufim uses reduction as a vehicle for teaching the full spectrum of CUDA optimization techniques — from a naive implementation to a highly optimized version using shared memory and thread coarsening. The code lives in lecture_009/.

The reduction problem

Given an array of N floats, compute their sum. Sequentially this is O(N). In parallel, a tree reduction achieves O(log N) steps using O(N) work.

Input:  [1, 2, 3, 4, 5, 6, 7, 8]

Step 1: [3, _, 7, _, 11, _, 15, _]   (add pairs)
Step 2: [10, _, _, _, 26, _, _, _]   (add pairs of pairs)
Step 3: [36, _, _, _, _, _, _, _]    (final sum)

The challenge in CUDA is mapping this tree structure onto the thread/block hierarchy efficiently, minimizing global memory traffic, and avoiding warp divergence.

Simple tree reduction

The first implementation in lecture_009/simple_reduce.cu is a direct translation of the tree structure into CUDA:

__global__ void SimpleSumReductionKernel(float* input, float* output) {
    unsigned int i = 2 * threadIdx.x;
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        if (threadIdx.x % stride == 0) {
            input[i] += input[i + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        *output = input[0];
    }
}

Launch with size / 2 threads per block:

SimpleSumReductionKernel<<<1, size / 2>>>(d_input, d_output);

This kernel has two problems. First, threadIdx.x % stride == 0 causes severe warp divergence: in later iterations, fewer and fewer threads do useful work, but the warp still executes all branches. Second, it reads and writes directly to global memory at every step.

Fixing divergence

lecture_009/control_divergence_reduce.cu restructures the indexing so that active threads are always at the beginning of the warp:

__global__ void FixDivergenceKernel(float* input, float* output) {
    unsigned int i = threadIdx.x;  // threads start next to each other
    for (unsigned int stride = blockDim.x; stride >= 1; stride /= 2) {
        if (threadIdx.x < stride) {  // active threads form a contiguous prefix
            input[i] += input[i + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        *output = input[0];
    }
}

Now thread 0 is always active, thread 1 is active for the first half of iterations, and so on. Whole warps go idle cleanly rather than half-executing on every iteration.

Shared memory reduction

Both kernels above read and write to global memory at every tree step. lecture_009/shared_reduce.cu fixes this by loading data into shared memory once:

#define BLOCK_DIM 1024

__global__ void SharedMemoryReduction(float* input, float* output) {
    __shared__ float input_s[BLOCK_DIM];
    unsigned int t = threadIdx.x;

    // Load two elements from global memory, sum them into shared memory
    input_s[t] = input[t] + input[t + BLOCK_DIM];

    // Tree reduction entirely in on-chip shared memory
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (threadIdx.x < stride) {
            input_s[t] += input_s[t + stride];
        }
    }

    if (threadIdx.x == 0) {
        *output = input_s[0];
    }
}

The key change: input_s[t] = input[t] + input[t + BLOCK_DIM] loads from global memory exactly once per thread. All subsequent additions happen in shared memory (~5 ns latency vs. ~200+ ns for global). The comment in the source is honest: “L1 throughput is dramatically increased though.”

__syncthreads() is placed at the top of the loop body (before the conditional), not after it. This ensures all threads have written their result before any thread reads its neighbors — a common correctness requirement in shared memory reductions.

Thread coarsening for better arithmetic intensity

The shared memory version still launches one thread per two input elements. lecture_009/reduce_coarsening.cu goes further: each thread reduces a COARSE_FACTOR number of elements in registers before contributing to the shared memory tree:

#define BLOCK_DIM 1024
#define COARSE_FACTOR 2

__global__ void CoarsenedReduction(float* input, float* output, int size) {
    __shared__ float input_s[BLOCK_DIM];

    unsigned int i = blockIdx.x * blockDim.x * COARSE_FACTOR + threadIdx.x;
    unsigned int t = threadIdx.x;
    float sum = 0.0f;

    // Phase 1: reduce COARSE_FACTOR elements per thread in registers
    for (unsigned int tile = 0; tile < COARSE_FACTOR; ++tile) {
        unsigned int index = i + tile * blockDim.x;
        if (index < size) {
            sum += input[index];
        }
    }

    // Phase 2: contribute to shared memory reduction
    input_s[t] = sum;
    __syncthreads();

    // Phase 3: tree reduction in shared memory
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride) {
            input_s[t] += input_s[t + stride];
        }
        __syncthreads();
    }

    // Phase 4: one thread per block writes partial result to global memory
    if (t == 0) {
        atomicAdd(output, input_s[0]);
    }
}

Launch configuration with coarsening:

int numBlocks = (size + BLOCK_DIM * COARSE_FACTOR - 1) / (BLOCK_DIM * COARSE_FACTOR);
CoarsenedReduction<<<numBlocks, BLOCK_DIM>>>(d_input, d_output, size);

This version handles arbitrarily large arrays (not just 2048 elements) by:

Using multiple blocks, each reducing a chunk
Using atomicAdd to safely accumulate partial block results into the final output

Why coarsening helps

With COARSE_FACTOR = 2, each thread does twice as much arithmetic work per byte loaded from global memory — arithmetic intensity doubles. The register-level accumulation is free compared to the memory access cost. Increasing COARSE_FACTOR further can push a memory-bound reduction toward the compute roof.

Evolution of the implementations

SimpleSumReductionKernel

Direct tree reduction in global memory. Divergence due to threadIdx.x % stride == 0. High global memory traffic.

FixDivergenceKernel

Reordered indexing so active threads form a contiguous prefix. Eliminates warp divergence. Still global-memory bound.

SharedMemoryReduction

Load once into shared memory, reduce entirely on-chip. Dramatically reduces global memory transactions. Limited to one block (2048 elements).

CoarsenedReduction

Multi-block, handles arbitrary sizes. Thread coarsening in registers before shared memory phase. atomicAdd for inter-block accumulation.

Non-determinism and accuracy

Floating-point addition is not associative: (a + b) + c ≠ a + (b + c) in general due to rounding. Parallel reductions change the order of operations relative to a serial sum, producing slightly different results. This is expected and unavoidable. The lecture_009/nondeterminism.py and lecture_009/accuracy.py files in the lecture repository explore this. For most ML training purposes, the difference is negligible. For numerical analysis or testing exact bit-equality, use double precision or Kahan summation.

Use torch.allclose(a, b, rtol=1e-5, atol=1e-8) rather than torch.equal(a, b) when validating custom reduction kernels against PyTorch’s reference implementation.

Warp shuffle intrinsics

Modern CUDA also provides warp-level reduction using __shfl_down_sync, which avoids shared memory entirely for the final warp:

// Warp-level reduction using shuffle — no shared memory needed
__device__ float warpReduceSum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;
}

__shfl_down_sync(mask, val, delta) exchanges val between lanes within a warp without going through memory — the fastest possible reduction for 32 or fewer elements.

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

Parallel Reductions in CUDA: Patterns and Optimization

The reduction problem

Simple tree reduction

Fixing divergence

Shared memory reduction

Thread coarsening for better arithmetic intensity

Why coarsening helps

Evolution of the implementations

Non-determinism and accuracy

Warp shuffle intrinsics

Lecture references

Lecture 9 code

Lecture 9 slides

Build docs developers (and LLMs) love

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

Documentation Index

​The reduction problem

​Simple tree reduction

​Fixing divergence

​Shared memory reduction

​Thread coarsening for better arithmetic intensity

​Why coarsening helps

​Evolution of the implementations

​Non-determinism and accuracy

​Warp shuffle intrinsics

​Lecture references

Lecture 9 code

Lecture 9 slides

Build docs developers (and LLMs) love

The reduction problem

Simple tree reduction

Fixing divergence

Shared memory reduction

Thread coarsening for better arithmetic intensity

Why coarsening helps

Evolution of the implementations

Non-determinism and accuracy

Warp shuffle intrinsics

Lecture references