Fusing GPU kernels means combining multiple operations that would normally run as separate CUDA kernels into a single kernel that keeps intermediate results in fast registers or shared memory. The reward can be dramatic: a chain of operations that is memory-bandwidth-bound when run as separate kernels moves far less data when fused, often delivering 2–5× speedups while matching the unfused results up to floating-point rounding. This page is based on Lecture 18 by Kapil Sharma.
Why kernel fusion matters
Modern GPUs have far more arithmetic throughput than memory bandwidth. An A100 can perform ~312 TFLOPS of FP16 matrix math, but its HBM bandwidth is ~2 TB/s. For a simple elementwise operation like ReLU on a 1 GB activation tensor:
- Without fusion: read 1 GB from HBM, apply ReLU, write 1 GB back to HBM
- With fusion into the preceding matmul: the matmul’s output never leaves registers before ReLU is applied
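The numbers above can be turned into a quick back-of-the-envelope estimate. The sketch below assumes 1 GB means 10^9 bytes, FP16 elements (2 bytes each), and counts ReLU as one operation per element; the constant and function names are illustrative only.

```python
# Rough roofline arithmetic for a standalone ReLU on an A100,
# using the approximate figures quoted in the text.
PEAK_FP16_FLOPS = 312e12   # ~312 TFLOPS (matrix math peak, an upper bound)
HBM_BYTES_PER_S = 2e12     # ~2 TB/s
TENSOR_BYTES = 1e9         # assumption: 1 GB = 10^9 bytes
BYTES_PER_ELEM = 2         # assumption: FP16 activations


def standalone_relu_seconds(tensor_bytes, bandwidth):
    """Lower bound on runtime: one full read plus one full write."""
    return 2 * tensor_bytes / bandwidth


def ridge_point(peak_flops, bandwidth):
    """FLOPs per byte needed before a kernel stops being bandwidth-bound."""
    return peak_flops / bandwidth


# ReLU does ~1 operation per element while moving 2 bytes in and 2 bytes out.
relu_intensity = 1 / (2 * BYTES_PER_ELEM)

relu_seconds = standalone_relu_seconds(TENSOR_BYTES, HBM_BYTES_PER_S)
ridge = ridge_point(PEAK_FP16_FLOPS, HBM_BYTES_PER_S)

print(f"standalone ReLU: at least {relu_seconds * 1e3:.1f} ms of HBM traffic")
print(f"ReLU intensity:  {relu_intensity} op/byte")
print(f"ridge point:     {ridge:.0f} FLOP/byte")
```

Under these assumptions the standalone ReLU sits orders of magnitude below the ridge point, so its runtime is set almost entirely by the 2 GB of HBM traffic. Fusing it into the preceding matmul removes that traffic.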
The lecture covers this in the context of recommendation models (DLRM) rather than transformer attention. The fusion principles are the same, but the dominant operations differ: embedding lookups, dense linear layers, and interaction features rather than QKV projections.
Common fusion patterns
Pointwise + pointwise
The simplest fusion: chain multiple elementwise operations into a single kernel pass.
Matmul + bias + activation
One of the most impactful fusion targets in neural networks. Instead of:
- Launch GEMM kernel → write output to HBM
- Launch bias-add kernel → read/write output from HBM
- Launch activation kernel → read/write output from HBM
a fused kernel applies the bias and the activation to the GEMM output while it is still in registers, then writes the result to HBM once.
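To make the cost of the three-kernel sequence concrete, the following sketch counts the HBM bytes moved for the output tensor alone. It is a simplified accounting model, not a measurement: it assumes FP16 values, ignores caches, and leaves out the GEMM's reads of its inputs, which are the same in both cases. The function name `epilogue_hbm_bytes` is made up for this example.

```python
def epilogue_hbm_bytes(m, n, bytes_per_elem=2, fused=False):
    """Estimate HBM traffic for the (m x n) output of GEMM + bias + activation."""
    out_bytes = m * n * bytes_per_elem   # one full pass over the output
    bias_bytes = n * bytes_per_elem      # the bias vector is read once

    if fused:
        # Bias and activation are applied in registers; output written once.
        return out_bytes + bias_bytes

    gemm_write = out_bytes                              # GEMM writes its output
    bias_pass = out_bytes + bias_bytes + out_bytes      # read, add bias, write
    activation_pass = out_bytes + out_bytes             # read, activate, write
    return gemm_write + bias_pass + activation_pass


unfused = epilogue_hbm_bytes(4096, 4096, fused=False)
fused = epilogue_hbm_bytes(4096, 4096, fused=True)
print(f"unfused: {unfused / 1e6:.1f} MB, fused: {fused / 1e6:.1f} MB, "
      f"ratio: {unfused / fused:.2f}x")
```

In this model the unfused sequence touches the output tensor five times (one write, then two read/write round trips), while the fused kernel touches it once.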
Normalization + scale + shift
Layer norm and RMS norm each require a statistics pass (mean and variance for layer norm, mean of squares for RMS norm) and a normalization pass. When followed by a learned scale and shift (gamma, beta), the scale and shift can be fused into the normalization pass, so the normalized tensor is never written to HBM and read back.
torch.compile and kernel fusion
torch.compile is PyTorch’s primary interface for automatic kernel fusion. It traces the computation graph using TorchDynamo, then applies optimization passes (including fusion) through TorchInductor, which generates Triton kernels for GPUs or C++ code for CPUs.
To inspect the code torch.compile generated, set the TORCH_LOGS environment variable, for example TORCH_LOGS=output_code.
The lecture's code includes output_triton_code/ and torch_compile_generated_triton.py as examples of this output.
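As a minimal sketch of this workflow, the snippet below defines a small pointwise chain (bias add, GELU, scale), which is the kind of pattern TorchInductor typically fuses, and asks for the generated code via TORCH_LOGS. The function names `bias_gelu_scale` and `compare` are illustrative and not taken from the lecture; whether the three operations end up in one kernel depends on your PyTorch version and backend.

```python
import os

# TORCH_LOGS is read when torch is imported, so set it first.
os.environ.setdefault("TORCH_LOGS", "output_code")

import torch
import torch.nn.functional as F


def bias_gelu_scale(x, bias, scale):
    """Three pointwise ops; in eager mode each one is a separate kernel."""
    return F.gelu(x + bias) * scale


def compare(device="cpu"):
    """Compile the chain and check it against eager execution.

    The first call to the compiled function triggers compilation and,
    with TORCH_LOGS=output_code, logs the generated code.
    """
    x = torch.randn(1024, 1024, device=device)
    bias = torch.randn(1024, device=device)
    compiled = torch.compile(bias_gelu_scale)
    eager_out = bias_gelu_scale(x, bias, 0.5)
    compiled_out = compiled(x, bias, 0.5)
    return torch.allclose(eager_out, compiled_out, atol=1e-5)
```

Calling `compare("cuda")` on a GPU machine should log Triton source. On CPU the backend emits C++ instead and needs a working C++ compiler.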
Writing custom fused kernels in Triton
When torch.compile’s automatic fusion does not cover your pattern, or you need precise control over tile sizes, you can write fused kernels directly in Triton. Triton operates at the tile level, making fusion natural: compute one tile’s worth of operation A, then immediately apply operation B to that tile before moving on.
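The tile-level structure can be illustrated without a GPU. The sketch below is not Triton code: it is a plain PyTorch emulation of how a fused matmul + bias + ReLU kernel walks over output tiles, with Python loops standing in for the parallel program instances. The function name and block sizes are hypothetical.

```python
import torch


def tiled_linear_bias_relu(x, w, bias, block_m=32, block_n=32, block_k=32):
    """Emulate a fused tile-level kernel: relu(x @ w + bias), tile by tile."""
    M, K = x.shape
    K_w, N = w.shape
    assert K == K_w, "inner dimensions must match"
    out = torch.empty(M, N, dtype=x.dtype)

    for m0 in range(0, M, block_m):          # one "program" per output tile
        for n0 in range(0, N, block_n):
            m1, n1 = min(m0 + block_m, M), min(n0 + block_n, N)

            # Operation A: accumulate the matmul for this tile.
            acc = torch.zeros(m1 - m0, n1 - n0, dtype=x.dtype)
            for k0 in range(0, K, block_k):
                k1 = min(k0 + block_k, K)
                acc += x[m0:m1, k0:k1] @ w[k0:k1, n0:n1]

            # Operation B: bias and ReLU on the tile before it is stored.
            acc += bias[n0:n1]
            acc = torch.relu(acc)

            out[m0:m1, n0:n1] = acc          # the only write of this tile
    return out
```

The important property is that each output tile is written exactly once, after the bias and activation have been applied to the accumulator.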
In a real Triton kernel of this kind, the accumulator tile (acc) lives in registers throughout the entire kernel. The bias add and ReLU happen at register speed, not HBM speed.
LoRA fusion example
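The lecture uses LoRA (Low-Rank Adaptation) as a motivating example. Standard LoRA computes the frozen base projection plus a low-rank update, y = W x + B A x (usually with a scaling factor on the update). Run as separate kernels, the base path and the adapter path each read x from HBM. A fused approach loads x once and computes both paths in the same kernel or with a fused CUDA graph.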
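One way to see the "read x once" idea in plain PyTorch is to stack W and A so that a single matmul produces both the base output and the low-rank projection. This is an illustration of the data-reuse principle, not the lecture's kernel; the function names, the shapes, and the `scale` parameter (commonly alpha / r) are assumptions made for this sketch.

```python
import torch


def lora_two_paths(x, W, A, B, scale):
    """Reference: base path and adapter path each consume x separately."""
    base = x @ W.T                 # (batch, out)
    update = (x @ A.T) @ B.T       # (batch, r) -> (batch, out)
    return base + scale * update


def lora_shared_input(x, W, A, B, scale):
    """Read x once: one matmul against [W; A], then the small B projection."""
    out_features = W.shape[0]
    WA = torch.cat([W, A], dim=0)          # (out + r, in); precompute in practice
    both = x @ WA.T                        # (batch, out + r)
    base = both[:, :out_features]
    low_rank = both[:, out_features:]
    return base + scale * (low_rank @ B.T)


torch.manual_seed(0)
batch, in_features, out_features, r = 8, 64, 32, 4
x = torch.randn(batch, in_features, dtype=torch.float64)
W = torch.randn(out_features, in_features, dtype=torch.float64)
A = torch.randn(r, in_features, dtype=torch.float64)
B = torch.randn(out_features, r, dtype=torch.float64)

y_ref = lora_two_paths(x, W, A, B, scale=0.5)
y_shared = lora_shared_input(x, W, A, B, scale=0.5)
print(torch.allclose(y_ref, y_shared))
```

A hand-written fused kernel goes further by keeping the rank-r intermediate in registers or shared memory, but the stacked-weight form already removes the second read of x.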
The lecture’s lora_on_simple_mlp.py trains a small MLP on the Criteo click prediction dataset, using LoRA adapters on the dense layers. The fusion target here is recommendation model inference where the LoRA path adds minimal latency.
Measuring speedup with profiling
Use PyTorch’s built-in profiler or NVIDIA’s Nsight Compute to verify fusion is happening and quantify the speedup. Compared with an unfused run, a fused run should differ as follows:
| Metric | Unfused | Fused |
|---|---|---|
| Number of CUDA kernel launches | Many small kernels | Fewer, larger kernels |
| HBM bytes read/written | High (intermediate tensors) | Low (only final outputs) |
| Memory bandwidth utilization | Near 100% (bound) | Reduced |
| Arithmetic intensity | Low | Higher |
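The first row of the table can be checked with torch.profiler. The sketch below runs on CPU so it works anywhere, which means it counts ATen operator calls instead of CUDA kernel launches; on a GPU you would add `ProfilerActivity.CUDA` to the activities list. The helper `count_ops` and the two example functions are illustrative names, and `torch.addmm` stands in here as a simple case where the bias add is folded into the matmul call.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def count_ops(fn, *args):
    """Run fn under the profiler and return {operator name: call count}."""
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        fn(*args)
    return {evt.key: evt.count for evt in prof.key_averages()}


def unfused(x, w, b):
    # matmul, add, and relu are issued as three separate operations
    return torch.relu(x @ w + b)


def partly_fused(x, w, b):
    # addmm computes b + x @ w in a single operator call
    return torch.relu(torch.addmm(b, x, w))


x = torch.randn(256, 256)
w = torch.randn(256, 256)
b = torch.randn(256)

unfused_ops = count_ops(unfused, x, w, b)
fused_ops = count_ops(partly_fused, x, w, b)

for name, ops in [("unfused", unfused_ops), ("partly fused", fused_ops)]:
    aten = sorted(k for k in ops if k.startswith("aten::"))
    print(f"{name}: {aten}")
```

For HBM bytes and bandwidth utilization, the remaining rows of the table, a hardware-counter tool such as Nsight Compute is the more direct source.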
Further reading
- Lecture 18 code: Kapil Sharma’s fused kernel examples including LoRA and DLRM
- Lecture 29: Triton Internals: how Triton compiles your kernel code down to PTX and SASS
- Liger Kernel (Lecture 28): production-quality fused kernels for LLM training (RMSNorm, cross-entropy)
- GPUs go brrr: Horace He’s guide to bandwidth vs. compute bottlenecks