Use this file to discover all available pages before exploring further.
The prefix scan (prefix sum) is one of the fundamental building blocks of parallel computing. Nearly every GPU algorithm that requires data-dependent indexing — stream compaction, sorting, histogram construction, recurrent neural networks — relies on an efficient scan implementation. This page covers Lectures 20 and 21 by Izzat El Haj and Lecture 24 by Jake Hemstad and Georgii Evtushenko.
A prefix scan takes an array of values and an associative binary operator (typically addition) and produces an output array where each element is the accumulated result of all preceding elements:Inclusive scan (each output includes the current element):
Scan generalizes beyond addition: any associative operator works (max, min, logical AND, bitwise OR, matrix multiplication). The GPU-Mode lectures focus on additive scan as the canonical case.
A sequential scan is straightforward: iterate over the array, accumulating as you go.
void sequential_scan(int *input, int *output, int n) { output[0] = input[0]; for (int i = 1; i < n; i++) { output[i] = output[i-1] + input[i]; }}
This runs in O(N) time with O(N) work. Parallelizing scan is non-trivial because each output depends on all previous outputs — a data dependency chain that limits naive parallelism.
The Hillis-Steele algorithm (1986) achieves O(log N) span (parallel steps) but uses O(N log N) total work — more work than the sequential algorithm.Algorithm:In each step d (from 1 to log₂N), each element i ≥ 2^(d-1) adds the element 2^(d-1) positions to its left:
Step 1 (d=1): Each element adds element 1 position leftStep 2 (d=2): Each element adds element 2 positions leftStep 3 (d=4): Each element adds element 4 positions left...
__global__ void hillis_steele_scan(int *input, int *output, int n) { extern __shared__ int temp[]; int tid = threadIdx.x; // Load input into shared memory temp[tid] = (tid < n) ? input[tid] : 0; __syncthreads(); for (int stride = 1; stride < n; stride *= 2) { int val = temp[tid]; if (tid >= stride) { val += temp[tid - stride]; } __syncthreads(); temp[tid] = val; __syncthreads(); } if (tid < n) output[tid] = temp[tid];}
Hillis-Steele does O(N log N) work, not O(N). For large arrays on a GPU with limited parallelism, this extra work adds up. Blelloch’s work-efficient algorithm solves this.
Set the root to 0, then propagate partial sums down the tree. Each node passes its value to its left child and the sum of itself and its left child to its right child.
__global__ void blelloch_scan(int *data, int n) { extern __shared__ int temp[]; int tid = threadIdx.x; temp[tid] = data[tid]; __syncthreads(); // Up-sweep (reduce) for (int stride = 1; stride < n; stride *= 2) { int idx = (tid + 1) * stride * 2 - 1; if (idx < n) { temp[idx] += temp[idx - stride]; } __syncthreads(); } // Set root to identity (0 for addition) if (tid == 0) temp[n - 1] = 0; __syncthreads(); // Down-sweep for (int stride = n / 2; stride >= 1; stride /= 2) { int idx = (tid + 1) * stride * 2 - 1; if (idx < n) { int left = temp[idx - stride]; temp[idx - stride] = temp[idx]; // pass value to left child temp[idx] += left; // right child gets sum } __syncthreads(); } data[tid] = temp[tid];}
Work and span:
Algorithm
Work
Span
Sequential
O(N)
O(N)
Hillis-Steele
O(N log N)
O(log N)
Blelloch
O(N)
O(log N)
Blelloch’s algorithm is work-optimal and achieves the same O(log N) parallel span as Hillis-Steele.
In practice, the lectures show that Blelloch requires careful attention to shared memory bank conflicts. The up-sweep and down-sweep access patterns can cause serialization unless padding is added to the shared memory array.
A single kernel with shared memory can only scan arrays up to the shared memory size (typically 48–96 KB, or a few thousand elements). Scanning large arrays requires a hierarchical approach:
1
Block-level scan
Divide the array into blocks. Each block scans its segment using the single-block kernel and writes both the scan result and the block’s total sum to separate arrays.
2
Scan the block sums
Run a second scan pass on the array of block totals. This gives the global prefix sum up to each block boundary.
3
Add offsets
Add each block’s global prefix sum (from step 2) to every element in that block’s output (from step 1).
void large_array_scan(int *input, int *output, int n) { int num_blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE; int *block_sums; cudaMalloc(&block_sums, num_blocks * sizeof(int)); // Step 1: Scan each block, record partial sums scan_blocks<<<num_blocks, BLOCK_SIZE, BLOCK_SIZE * sizeof(int)>>>( input, output, block_sums, n ); // Step 2: Scan the partial sums (recursively, if needed) int *scanned_block_sums; cudaMalloc(&scanned_block_sums, num_blocks * sizeof(int)); scan_blocks<<<1, num_blocks, num_blocks * sizeof(int)>>>( block_sums, scanned_block_sums, nullptr, num_blocks ); // Step 3: Add block offsets to each block's output add_block_offsets<<<num_blocks, BLOCK_SIZE>>>( output, scanned_block_sums, n );}
Modern CUDA provides intrinsics that implement scan within a warp (32 threads) in a single instruction, eliminating explicit shared memory and synchronization overhead.
Lecture 24 by Jake Hemstad and Georgii Evtushenko covers achieving scan at near-peak memory bandwidth using CUB and CCCL (CUDA Core Compute Libraries). CUB’s scan implementation uses decoupled look-back, an algorithm that avoids the three-pass hierarchical approach entirely.
The state-of-the-art approach for large-array scan avoids the three-pass hierarchy. Each thread block writes a status flag plus its partial sum to a small status array as soon as it finishes. Later blocks spin on the status array of earlier blocks and can accumulate partial results as they become available, without waiting for a separate global synchronization step.This enables a single-pass scan with near-optimal memory bandwidth utilization.
Scan is a primitive that appears throughout GPU algorithm design:Stream compaction — select elements matching a predicate, write them to a compact output array:
Radix sort — each pass of a radix sort requires a scan to compute write positions.Histogram construction — the final gather step uses scan-derived indices.Variable-length data — scan over sizes array to compute offsets into a packed output buffer.Attention and recurrences — prefix sums generalize to matrix products, enabling linear attention and selective state-space models (S4, Mamba) to be computed in parallel rather than sequentially.