Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/gpu-mode/lectures/llms.txt

Use this file to discover all available pages before exploring further.

The prefix scan (prefix sum) is one of the fundamental building blocks of parallel computing. Nearly every GPU algorithm that requires data-dependent indexing — stream compaction, sorting, histogram construction, recurrent neural networks — relies on an efficient scan implementation. This page covers Lectures 20 and 21 by Izzat El Haj and Lecture 24 by Jake Hemstad and Georgii Evtushenko.

The scan problem

A prefix scan takes an array of values and an associative binary operator (typically addition) and produces an output array where each element is the accumulated result of all preceding elements: Inclusive scan (each output includes the current element):
Input:   [3, 1, 7, 0, 4, 1, 6, 3]
Output:  [3, 4, 11, 11, 15, 16, 22, 25]
Exclusive scan (each output excludes the current element, shifted by one):
Input:   [3, 1, 7, 0, 4, 1, 6, 3]
Output:  [0, 3, 4, 11, 11, 15, 16, 22]
Scan generalizes beyond addition: any associative operator works (max, min, logical AND, bitwise OR, matrix multiplication). The GPU-Mode lectures focus on additive scan as the canonical case.

Sequential scan

A sequential scan is straightforward: iterate over the array, accumulating as you go.
void sequential_scan(int *input, int *output, int n) {
    output[0] = input[0];
    for (int i = 1; i < n; i++) {
        output[i] = output[i-1] + input[i];
    }
}
This runs in O(N) time with O(N) work. Parallelizing scan is non-trivial because each output depends on all previous outputs — a data dependency chain that limits naive parallelism.

Hillis-Steele (naive parallel) scan

The Hillis-Steele algorithm (1986) achieves O(log N) span (parallel steps) but uses O(N log N) total work — more work than the sequential algorithm. Algorithm: In each step d (from 1 to log₂N), each element i ≥ 2^(d-1) adds the element 2^(d-1) positions to its left:
Step 1 (d=1): Each element adds element 1 position left
Step 2 (d=2): Each element adds element 2 positions left
Step 3 (d=4): Each element adds element 4 positions left
...
__global__ void hillis_steele_scan(int *input, int *output, int n) {
    extern __shared__ int temp[];
    int tid = threadIdx.x;

    // Load input into shared memory
    temp[tid] = (tid < n) ? input[tid] : 0;
    __syncthreads();

    for (int stride = 1; stride < n; stride *= 2) {
        int val = temp[tid];
        if (tid >= stride) {
            val += temp[tid - stride];
        }
        __syncthreads();
        temp[tid] = val;
        __syncthreads();
    }

    if (tid < n) output[tid] = temp[tid];
}
Hillis-Steele does O(N log N) work, not O(N). For large arrays on a GPU with limited parallelism, this extra work adds up. Blelloch’s work-efficient algorithm solves this.

Blelloch work-efficient scan

Blelloch’s algorithm (1990) achieves both O(log N) span and O(N) work — optimal in both dimensions. It proceeds in two phases.

Phase 1: Reduce (up-sweep)

Build a reduction tree from leaves to root. Each level halves the active elements:
Level 0 (input): [3,  1,  7,  0,  4,  1,  6,  3]
Level 1:         [3,  4,  7,  7,  4,  5,  6,  9]
Level 2:         [3,  4,  7, 11,  4,  5,  6, 14]
Level 3:         [3,  4,  7, 11,  4,  5,  6, 25]

Phase 2: Down-sweep

Set the root to 0, then propagate partial sums down the tree. Each node passes its value to its left child and the sum of itself and its left child to its right child.
__global__ void blelloch_scan(int *data, int n) {
    extern __shared__ int temp[];
    int tid = threadIdx.x;
    temp[tid] = data[tid];
    __syncthreads();

    // Up-sweep (reduce)
    for (int stride = 1; stride < n; stride *= 2) {
        int idx = (tid + 1) * stride * 2 - 1;
        if (idx < n) {
            temp[idx] += temp[idx - stride];
        }
        __syncthreads();
    }

    // Set root to identity (0 for addition)
    if (tid == 0) temp[n - 1] = 0;
    __syncthreads();

    // Down-sweep
    for (int stride = n / 2; stride >= 1; stride /= 2) {
        int idx = (tid + 1) * stride * 2 - 1;
        if (idx < n) {
            int left = temp[idx - stride];
            temp[idx - stride] = temp[idx];       // pass value to left child
            temp[idx] += left;                     // right child gets sum
        }
        __syncthreads();
    }

    data[tid] = temp[tid];
}
Work and span:
AlgorithmWorkSpan
SequentialO(N)O(N)
Hillis-SteeleO(N log N)O(log N)
BlellochO(N)O(log N)
Blelloch’s algorithm is work-optimal and achieves the same O(log N) parallel span as Hillis-Steele.
In practice, the lectures show that Blelloch requires careful attention to shared memory bank conflicts. The up-sweep and down-sweep access patterns can cause serialization unless padding is added to the shared memory array.

Multi-level scan for large arrays

A single kernel with shared memory can only scan arrays up to the shared memory size (typically 48–96 KB, or a few thousand elements). Scanning large arrays requires a hierarchical approach:
1

Block-level scan

Divide the array into blocks. Each block scans its segment using the single-block kernel and writes both the scan result and the block’s total sum to separate arrays.
2

Scan the block sums

Run a second scan pass on the array of block totals. This gives the global prefix sum up to each block boundary.
3

Add offsets

Add each block’s global prefix sum (from step 2) to every element in that block’s output (from step 1).
void large_array_scan(int *input, int *output, int n) {
    int num_blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    int *block_sums;
    cudaMalloc(&block_sums, num_blocks * sizeof(int));

    // Step 1: Scan each block, record partial sums
    scan_blocks<<<num_blocks, BLOCK_SIZE, BLOCK_SIZE * sizeof(int)>>>(
        input, output, block_sums, n
    );

    // Step 2: Scan the partial sums (recursively, if needed)
    int *scanned_block_sums;
    cudaMalloc(&scanned_block_sums, num_blocks * sizeof(int));
    scan_blocks<<<1, num_blocks, num_blocks * sizeof(int)>>>(
        block_sums, scanned_block_sums, nullptr, num_blocks
    );

    // Step 3: Add block offsets to each block's output
    add_block_offsets<<<num_blocks, BLOCK_SIZE>>>(
        output, scanned_block_sums, n
    );
}

Warp-level and block-level primitives

Modern CUDA provides intrinsics that implement scan within a warp (32 threads) in a single instruction, eliminating explicit shared memory and synchronization overhead.

Warp-level shuffle scan

__device__ int warp_scan(int val) {
    // Inclusive scan within a warp using shuffle
    for (int offset = 1; offset < 32; offset *= 2) {
        int neighbor = __shfl_up_sync(0xffffffff, val, offset);
        if (threadIdx.x % 32 >= offset) {
            val += neighbor;
        }
    }
    return val;
}
__shfl_up_sync moves values between threads in the same warp without going through shared memory. This gives warp-level scan at register speed.

CUB device-wide scan

NVIDIA’s CUB library (included with CUDA) provides highly optimized scan for production use:
#include <cub/cub.cuh>

// Determine temporary storage size
void *d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceScan::InclusiveSum(
    d_temp_storage, temp_storage_bytes, d_in, d_out, num_items
);

// Allocate and run
cudaMalloc(&d_temp_storage, temp_storage_bytes);
cub::DeviceScan::InclusiveSum(
    d_temp_storage, temp_storage_bytes, d_in, d_out, num_items
);
Lecture 24 by Jake Hemstad and Georgii Evtushenko covers achieving scan at near-peak memory bandwidth using CUB and CCCL (CUDA Core Compute Libraries). CUB’s scan implementation uses decoupled look-back, an algorithm that avoids the three-pass hierarchical approach entirely.

Decoupled look-back

The state-of-the-art approach for large-array scan avoids the three-pass hierarchy. Each thread block writes a status flag plus its partial sum to a small status array as soon as it finishes. Later blocks spin on the status array of earlier blocks and can accumulate partial results as they become available, without waiting for a separate global synchronization step. This enables a single-pass scan with near-optimal memory bandwidth utilization.

Practical uses of scan

Scan is a primitive that appears throughout GPU algorithm design: Stream compaction — select elements matching a predicate, write them to a compact output array:
Input:    [7, 0, 3, 0, 5, 0, 8, 1]
Mask:     [1, 0, 1, 0, 1, 0, 1, 1]
Prefix:   [0, 1, 1, 2, 2, 3, 3, 4]  (exclusive scan of mask)
Output:   [7, 3, 5, 8, 1]           (gather using prefix as index)
Radix sort — each pass of a radix sort requires a scan to compute write positions. Histogram construction — the final gather step uses scan-derived indices. Variable-length data — scan over sizes array to compute offsets into a packed output buffer. Attention and recurrences — prefix sums generalize to matrix products, enabling linear attention and selective state-space models (S4, Mamba) to be computed in parallel rather than sequentially.

Further reading

Lecture 20 slides

Izzat El Haj’s slides on scan algorithm fundamentals

PMPP Book Ch. 11

Chapter 11 of Programming Massively Parallel Processors covers parallel scan in depth

CUB DeviceScan

NVIDIA CUB’s production-quality scan implementation

Decoupled look-back paper

Merrill & Garland, 2016 — single-pass parallel scan

Build docs developers (and LLMs) love