Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/gpu-mode/lectures/llms.txt

Use this file to discover all available pages before exploring further.

Lecture 8 by Mark Saroufim distills GPU kernel optimization into a concrete checklist. The core principle is simple: always profile first, never guess. Every item on this checklist has a measurable signal — a tool output, a counter, or a metric that tells you whether you have a problem and how severe it is. The code for this lecture lives in the lecture_008/ folder of the GPU Mode lectures repository.

Understand your hardware first

Before you can reason about performance, you need to know what your GPU is capable of. The print_cuda_info() function from utils.py gives you the baseline:
import torch
import subprocess

def print_cuda_info():
    print("=== PyTorch CUDA Info ===")
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"cuDNN version: {torch.backends.cudnn.version()}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")

    for i in range(torch.cuda.device_count()):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
        print(f"    Current device: {torch.cuda.current_device()}")
        print(f"    Memory allocated: {torch.cuda.memory_allocated(i)/1e6:.2f} MB")
        print(f"    Memory cached   : {torch.cuda.memory_reserved(i)/1e6:.2f} MB")

    print("\n=== nvidia-smi Info (if available) ===")
    try:
        subprocess.run(["nvidia-smi"], check=True)
    except Exception as e:
        print(f"nvidia-smi not available: {e}")
Call this before profiling any kernel. You need to know your GPU’s peak FLOP/s and memory bandwidth to interpret profiler outputs — the roofline model requires both numbers.

Compiling and profiling

The lecture’s README.md shows the minimal workflow:
nvcc -o benchmark file.cu
ncu benchmark
ncu (Nsight Compute) gives you per-kernel metrics: achieved bandwidth, achieved throughput, occupancy, stall reasons, and more. Always use ncu or nsys (Nsight Systems) before drawing any conclusions about a kernel’s performance.
Never optimize without profiling. Intuition about GPU bottlenecks is frequently wrong. A kernel that “looks” slow because of a nested loop may actually be memory-bound, and restructuring the loop will change nothing.

The performance checklist

Run nsys for a system-level timeline (which kernels take the most time, PCIe transfers, synchronization gaps) and ncu for per-kernel hardware metrics.
# System profiling — timeline view
nsys profile --stats=true python train.py

# Kernel profiling — hardware counters
ncu --set full python train.py
Key questions to answer before touching any code:
  • Which kernel takes the most wall-clock time?
  • Is it memory-bound or compute-bound?
  • What is the achieved vs. theoretical bandwidth/throughput?
Non-coalesced global memory access is one of the most common and impactful bottlenecks. From lecture_008/coalesce.cu:
// Coalesced: thread i reads in[i] — adjacent threads, adjacent addresses
__global__ void copyDataCoalesced(float *in, float *out, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        out[index] = in[index];
    }
}

// Non-coalesced: thread i reads in[(i*2) % n] — strided access pattern
__global__ void copyDataNonCoalesced(float *in, float *out, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        out[index] = in[(index * 2) % n];
    }
}
In ncu, look at l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum — a high sector count relative to the number of requests indicates poor coalescing.
Low occupancy means the GPU doesn’t have enough warps to hide memory latency. From lecture_008/occupancy.cu, the CUDA API can suggest an optimal block size:
int minGridSize, blockSize;
cudaOccupancyMaxPotentialBlockSize(
    &minGridSize, &blockSize, copyDataCoalesced, 0, 0
);
// Prints recommended block size and minimum grid size
std::cout << "Recommended block size: " << blockSize
          << ", Minimum grid size: " << minGridSize << std::endl;
Common causes of low occupancy:
  • Too many registers per thread — reduce register usage or use __launch_bounds__
  • Too much shared memory per block — fewer blocks can fit per SM
  • Block size too small — use at least 128 threads per block (4 warps)
Compute your kernel’s arithmetic intensity:
arithmetic intensity = FLOPs / bytes_transferred
Compare it to your GPU’s ridge point (peak FLOPs ÷ peak bandwidth). If you are below the ridge point, more compute won’t help — you need to reduce memory traffic or increase reuse via tiling and shared memory.For an A100 SXM4:
  • FP32 peak: 19.5 TFLOP/s
  • Memory bandwidth: 2.0 TB/s
  • Ridge point: ~9.75 FLOP/byte
Branches where threads in the same warp take different paths are serialized. From lecture_008/divergence.cu:
// Divergent: half the warp takes each branch
__global__ void processArrayWithDivergence(int *data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        if (data[idx] % 2 == 0) {
            data[idx] = data[idx] * 2;       // even branch
        } else {
            data[idx] = data[idx] + 1;       // odd branch
        }
    }
}

// No divergence: branchless equivalent
__global__ void processArrayWithoutDivergence(int *data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        int isEven = !(data[idx] % 2);
        data[idx] = isEven * (data[idx] * 2) + (!isEven) * (data[idx] + 1);
    }
}
In ncu, look at smsp__thread_inst_executed_pred_on.avg.pct_of_peak_sustained_active — low values indicate divergence.
Each kernel launch has fixed overhead (~5–10 µs on CPU side). If your pipeline launches many small kernels, the launch overhead accumulates. Look for opportunities to fuse kernels: combine two or more sequential operations into one kernel, sharing the data load.Common fusion opportunities:
  • Activation function + bias add after a linear layer
  • Layer norm + dropout
  • Any element-wise operation after a compute-heavy kernel
Use nsys to identify sequences of short kernels with large CPU gaps between them.
PCIe bandwidth (~32 GB/s) is ~60–90× slower than GPU memory bandwidth (~2 TB/s on HBM). Every unnecessary cudaMemcpy is expensive.Strategies to minimize transfers:
  • Keep tensors on the GPU between operations
  • Use torch.cuda.Stream to overlap computation and transfers
  • Use pinned (page-locked) memory for faster transfers: cudaMallocHost
  • Profile with nsys — PCIe transfers show as DMA events in the timeline
# Bad: unnecessary round-trip
x = model(x.cuda()).cpu()
y = another_model(x.cuda())

# Good: stay on GPU
x = x.cuda()
x = model(x)
y = another_model(x)
Thread coarsening lets each thread do more work, reducing the overhead of block/thread management and improving arithmetic intensity. From lecture_008/coarsening.cu:
// Original: one thread per element
__global__ void VecAdd(float* A, float* B, float* C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

// Coarsened: one thread handles two elements
__global__ void VecAddCoarsened(float* A, float* B, float* C) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
    if (i < N)     C[i]     = A[i]     + B[i];
    if (i + 1 < N) C[i + 1] = A[i + 1] + B[i + 1];
}
Coarsening is most effective when a kernel is latency-bound due to too many threads competing for shared resources.

Quick reference table

IssueSignalFix
Memory bandwidth bottleneckncu achieved BW < theoreticalImprove coalescing, use shared memory tiling
Low occupancyncu occupancy < 50%Reduce registers, reduce shared memory, increase block size
Warp divergencencu low thread utilizationRewrite branches as branchless arithmetic
Kernel launch overheadnsys many short kernelsFuse kernels
PCIe transfer overheadnsys DMA events between kernelsKeep data on GPU, use async transfers
Compute-boundncu SM utilization near 100%Use tensor cores, reduce FLOPs, use lower precision

Lecture references

Lecture 8 code

coalesce.cu, divergence.cu, occupancy.cu, coarsening.cu, and more

Lecture 8 slides

CUDA Performance Checklist slides by Mark Saroufim

Build docs developers (and LLMs) love