Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/gpu-mode/lectures/llms.txt

Use this file to discover all available pages before exploring further.

Lecture 4 by Thomas Viehmann covers the hardware reality that underpins every performance decision you make in CUDA. Most kernel optimization work comes down to one thing: getting data to the compute units fast enough. This page covers the GPU memory hierarchy, bandwidth vs. compute throughput, and the practical patterns that exploit memory locality.

The GPU memory hierarchy

Modern GPUs expose a multi-level memory hierarchy. Each level is faster, smaller, and closer to the compute units than the one below it.

Registers

Per-thread. ~0 cycles access latency. Finite: typically 255 registers per thread, ~65,536 per SM. Spilling to local memory maps back to global memory.

Shared memory / L1 cache

Per-block. ~5 ns access. 32–96 KB per SM (configurable split between L1 and shared). Programmer-managed. Used for tiling and reductions.

L2 cache

Per-GPU (shared across all SMs). ~100–200 ns. Several MB (e.g., 40 MB on H100). Managed by hardware. Reduces repeated global memory traffic.

Global memory (HBM/GDDR)

The main GPU DRAM. 16–80+ GB. ~400–800 cycles latency. 1–3 TB/s bandwidth (HBM). All threads in all blocks can access it. The bottleneck for most kernels.

Constant & texture memory

Specialized read-only caches. Constant memory is optimized for broadcast (all threads read the same address). Texture memory has a spatial locality cache.

Memory bandwidth vs. compute throughput

Most machine learning kernels are memory-bound: the GPU spends more time waiting for data than computing. The ratio of arithmetic operations to bytes loaded from memory is called arithmetic intensity.
Arithmetic intensity = FLOPs / bytes transferred
A simple element-wise operation like output = input * 2.0 loads 4 bytes and performs 1 FLOP — arithmetic intensity of 0.25 FLOP/byte. A matrix multiplication of large matrices approaches hundreds of FLOPs per byte and is compute-bound.
The roofline model plots achievable performance as a function of arithmetic intensity. Below the “ridge point” you are memory-bound; above it you are compute-bound. Understanding where your kernel sits on the roofline tells you which resource to optimize.

Roofline intuition

Peak performance (TFLOP/s)
        |              /‾‾‾‾‾‾‾‾‾‾‾ compute bound
        |            /
        |          /
        |        /   ← ridge point
        |      /
        |    /  memory bound
        |  /
        |/________________________
              arithmetic intensity (FLOP/byte)
For an A100:
  • Peak FP32 throughput: ~19.5 TFLOP/s
  • Peak memory bandwidth: ~2 TB/s
  • Ridge point: ~9.75 FLOP/byte
If your kernel’s arithmetic intensity is below ~9.75 FLOP/byte, you cannot saturate the compute units — more memory bandwidth or better locality is the fix.

Coalesced memory access

Global memory is accessed in 128-byte cache lines. When threads in a warp access memory, the hardware coalesces adjacent accesses into as few cache line transactions as possible. Non-coalesced access wastes bandwidth.
__global__ void copyDataCoalesced(float *in, float *out, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        out[index] = in[index];  // thread 0 → in[0], thread 1 → in[1], etc.
    }
}
In the coalesced version, 32 consecutive threads access 32 consecutive floats (128 bytes) — exactly one cache line per warp. In the non-coalesced version, the strided pattern touches twice as many cache lines for the same amount of useful data.
For 2D data like matrices stored in row-major order, column-major access patterns (iterating down a column) are non-coalesced. Transposing a matrix naively is a classic example — shared memory tiling is the standard fix.

Shared memory: on-chip scratch space

Shared memory is the key tool for avoiding repeated global memory loads. The pattern is:
  1. Cooperatively load a tile of data from global memory into shared memory
  2. Synchronize with __syncthreads()
  3. Compute using the fast on-chip copy
  4. Write results back to global memory

Shared memory reduction example

#define BLOCK_DIM 1024

__global__ void SharedMemoryReduction(float* input, float* output) {
    __shared__ float input_s[BLOCK_DIM];
    unsigned int t = threadIdx.x;

    // Load from global into shared memory
    input_s[t] = input[t] + input[t + BLOCK_DIM];

    // Tree reduction entirely in shared memory
    for (unsigned int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (threadIdx.x < stride) {
            input_s[t] += input_s[t + stride];
        }
    }

    if (threadIdx.x == 0) {
        *output = input_s[0];
    }
}
This kernel loads data once from global memory into input_s, then performs the entire reduction in shared memory — far faster than going back to global memory at every step.

Shared memory bank conflicts

Shared memory is divided into 32 banks (one per warp lane). Simultaneous accesses to the same bank by different threads in a warp are serialized — this is a bank conflict.
Each thread in the warp accesses a different bank. E.g., thread t accesses shared[t] — adjacent threads hit adjacent banks. Full bandwidth.
Two threads access the same bank. The access is serialized into 2 sub-transactions — half bandwidth.
All threads access the same address. The hardware broadcasts the value to all threads in a single transaction — no penalty.
The typical fix for bank conflicts in matrix transpose and reduction kernels is to add a padding column: __shared__ float tile[BLOCK][BLOCK + 1]. The +1 shifts each row to a different set of banks.

Occupancy and register pressure

Occupancy is the ratio of active warps on an SM to the maximum number of warps the SM can support. Higher occupancy gives the scheduler more warps to hide latency with. Occupancy is limited by three resources, whichever is most constraining:
  1. Registers per thread — more registers per kernel = fewer threads can be in flight
  2. Shared memory per block — larger shared memory allocations = fewer blocks per SM
  3. Block size — blocks must be a multiple of 32 (warp size); small blocks reduce occupancy
// Query occupancy for a kernel at runtime
int minGridSize, blockSize;
cudaOccupancyMaxPotentialBlockSize(
    &minGridSize,
    &blockSize,
    myKernel,
    0,   // dynamic shared memory per block
    0    // block size limit (0 = no limit)
);
Use NVIDIA’s Nsight Compute (ncu) to see the occupancy of your kernel and exactly which resource is the bottleneck. The “Occupancy” section will show register usage, shared memory usage, and the theoretical vs. achieved occupancy.

Practical memory optimization checklist

Check that consecutive threads in a warp access consecutive memory addresses. Use ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum to count load sectors and compare against ideal.
If the same global memory location is read more than once (by different threads, or in a loop), load it into shared memory first. The break-even point is roughly when each byte is read 2+ times.
Check your indexing patterns. Add padding (+1 to the second dimension) if threads in a warp access the same bank.
Use cudaOccupancyMaxPotentialBlockSize or the CUDA Occupancy Calculator to find the block size that maximizes occupancy for your kernel’s register and shared memory usage.
Data transfers over PCIe are ~10–50× slower than global memory bandwidth. Batch transfers, use pinned memory (cudaMallocHost), or keep data on the GPU between kernel calls.

Lecture references

Lecture 4 materials

Thomas Viehmann’s notebook and slides on compute and memory architecture

Reductions deep-dive

See shared memory in action in a full parallel reduction implementation

Build docs developers (and LLMs) love