SASS (Shader Assembly) is NVIDIA’s native GPU instruction set — the actual binary instructions that execute on the hardware. When profiling tools show that your kernel is slower than the theoretical peak, SASS analysis reveals why: stalls from instruction latency, bank conflicts, suboptimal instruction scheduling, or missed opportunities for instruction-level parallelism. This page is based on Lecture 37 by Arun Demeure.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/gpu-mode/lectures/llms.txt
Use this file to discover all available pages before exploring further.
What SASS is
Every CUDA or Triton kernel you write compiles to PTX (a virtual ISA), which is then compiled by the driver to SASS for the target GPU microarchitecture. SASS is:- Architecture-specific: Ampere SASS differs from Hopper SASS, unlike PTX which is forward-compatible
- The ground truth: if you want to know exactly what instructions execute and how many cycles they take, SASS is the answer
- Not usually written by hand: most engineers use SASS to read and analyze generated code, not to author kernels at this level
PTX is CUDA’s portable assembly — it targets a virtual machine and is forward-compatible across GPU generations. SASS is the final, architecture-specific machine code. The JIT compiler (NVVM/PTXAS) translates PTX → SASS at kernel launch time if a compiled version is not already cached.
Reading SASS from cuobjdump
The primary tool for inspecting SASS iscuobjdump, included with the CUDA toolkit:
.cubin from a CUDA source file:
Reading SASS on Godbolt
The Compiler Explorer (godbolt.org) supports CUDA SASS output directly. Enter a CUDA kernel, select the NVCC compiler, and add-cubin to the flags. This is useful for quick experiments and for sharing SASS annotations with colleagues without needing a local CUDA toolkit.
A typical SASS snippet for a floating-point multiply-accumulate looks like:
Key SASS instructions
Understanding a handful of core instructions covers most of what you encounter in compute kernels:| Instruction | Description |
|---|---|
FFMA | FP32 fused multiply-add: dst = a * b + c |
HFMA2 | FP16 paired fused multiply-add (two FP16 ops in parallel) |
HMMA / WMMA | Tensor Core MMA operation |
LDG | Load from global memory (HBM) |
LDS | Load from shared memory |
STG | Store to global memory |
STS | Store to shared memory |
LDGDEPBAR | Load-global dependency barrier (for async copies) |
BAR | Block-level barrier (equivalent to __syncthreads()) |
SHFL | Warp shuffle — register exchange between threads |
VOTE | Warp vote instructions (ballot, all, any) |
IADD3 | 3-way integer add (used in address arithmetic) |
IMAD | Integer multiply-add |
S2R | Special register to register move (reads %tid, %ctaid, etc.) |
.E on loads/stores means “extended” (64-bit address), .SYS indicates cache policy, and .128 on a load means a 128-byte wide load (four consecutive float4 values).
Instruction-level parallelism and latency hiding
GPU performance depends on keeping functional units occupied. Each instruction has a latency (cycles from issue to result available) and a throughput (instructions per cycle). The GPU schedules four warp schedulers per SM, each selecting a ready warp every cycle. Typical latencies on Ampere:| Instruction | Latency (cycles) |
|---|---|
FFMA (FP32) | ~4 |
LDS (shared memory) | ~20–30 |
LDG (global memory, cache hit) | ~30–40 |
LDG (global memory, cache miss) | ~200–800 |
HMMA (Tensor Core) | ~16 |
--print-instruction-stats flag in Nsight Compute shows, for each SASS instruction, how many cycles warps stalled waiting for it. An LDG with many Long Scoreboard stalls indicates data arriving from HBM is on the critical path.
Register file and bank conflicts at the SASS level
The register file is the fastest storage on the GPU. But it is divided into banks, and multiple threads accessing the same bank in the same cycle cause conflicts — reducing effective register bandwidth. Register bank conflicts appear in SASS as repeated reads from the same register number across instructions scheduled in the same cycle slot. The SASS scheduler sometimes reorders or splits instructions to avoid conflicts; when it cannot, you see bank conflict stalls. To inspect register allocation, look for the.REG_ALLOC_BANKS annotation in Nsight Compute’s source-correlated view, or count how many registers each warp uses:
Using SASS insights to guide optimization
SASS analysis is most useful after profiling identifies a bottleneck. Common findings and their fixes: ManyLDG stalls → memory access pattern needs improvement
- Verify global loads are coalesced (128-byte aligned, contiguous across threads)
- Use
__ldg()orconst __restrict__to enable the read-only cache path - Prefetch data into shared memory with async copies (
cp.async)
LDS stalls → shared memory bank conflicts
- Add padding to shared memory arrays:
__shared__ float smem[ROWS][COLS + 1] - Rearrange access patterns so threads in the same warp access different banks (banks are 4-byte-aligned, 32 banks on Volta+)
- The ratio of FFMA to LDG instructions tells you whether you are compute or bandwidth bound
- An FFMA-heavy kernel with many stalls suggests occupancy is too low to hide latency
__syncthreads overhead
- Each
BAR.SYNCin SASS blocks a warp until all threads in the block arrive - Minimize synchronization points; eliminate unnecessary
__syncthreads()calls
Tools: ncu, cuobjdump, Nsight Compute
Profile with ncu
Run
ncu --set full ./my_app to collect a comprehensive set of metrics. The full set includes memory throughput, compute throughput, warp stall reasons, and instruction statistics.Identify the bottleneck
Look at the “Speed of Light” section of the Nsight Compute report. It shows what percentage of peak memory bandwidth and compute throughput your kernel achieves.
Drill into SASS
Use the Source page in Nsight Compute’s GUI to see SASS instructions correlated with stall counts and warp divergence. Hover over an instruction to see its latency contribution.
Validate with cuobjdump
Use
cuobjdump --dump-sass to inspect the raw SASS and verify that the compiler is generating the instruction sequence you expect (e.g., vectorized loads, correct unroll factors).Further reading
Lecture 37 slides
Arun Demeure’s slides on SASS and GPU microarchitecture
Lecture 8: CUDA Performance Checklist
Mark Saroufim’s high-level checklist before going to SASS-level analysis
Nsight Compute documentation
NVIDIA’s profiler for kernel-level GPU performance analysis
CUDA C++ Best Practices Guide
NVIDIA’s canonical guide to performance optimization patterns