Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/gpu-mode/lectures/llms.txt

Use this file to discover all available pages before exploring further.

SASS (Shader Assembly) is NVIDIA’s native GPU instruction set — the actual binary instructions that execute on the hardware. When profiling tools show that your kernel is slower than the theoretical peak, SASS analysis reveals why: stalls from instruction latency, bank conflicts, suboptimal instruction scheduling, or missed opportunities for instruction-level parallelism. This page is based on Lecture 37 by Arun Demeure.

What SASS is

Every CUDA or Triton kernel you write compiles to PTX (a virtual ISA), which is then compiled by the driver to SASS for the target GPU microarchitecture. SASS is:
  • Architecture-specific: Ampere SASS differs from Hopper SASS, unlike PTX which is forward-compatible
  • The ground truth: if you want to know exactly what instructions execute and how many cycles they take, SASS is the answer
  • Not usually written by hand: most engineers use SASS to read and analyze generated code, not to author kernels at this level
Understanding SASS closes the gap between “my kernel should be fast” and “my kernel is actually fast.” A CUDA kernel that looks efficient in C++ may have hidden inefficiencies visible only at the SASS level.
PTX is CUDA’s portable assembly — it targets a virtual machine and is forward-compatible across GPU generations. SASS is the final, architecture-specific machine code. The JIT compiler (NVVM/PTXAS) translates PTX → SASS at kernel launch time if a compiled version is not already cached.

Reading SASS from cuobjdump

The primary tool for inspecting SASS is cuobjdump, included with the CUDA toolkit:
# Disassemble a compiled CUDA binary
cuobjdump --dump-sass my_kernel.cubin

# Disassemble from an executable or shared library
cuobjdump --dump-sass my_app

# Dump both PTX and SASS
cuobjdump --dump-ptx --dump-sass my_kernel.cubin

# Filter to a specific function
cuobjdump --dump-sass my_app | grep -A 100 "Function: my_kernel_name"
To get a .cubin from a CUDA source file:
nvcc -arch=sm_80 -cubin my_kernel.cu -o my_kernel.cubin

Reading SASS on Godbolt

The Compiler Explorer (godbolt.org) supports CUDA SASS output directly. Enter a CUDA kernel, select the NVCC compiler, and add -cubin to the flags. This is useful for quick experiments and for sharing SASS annotations with colleagues without needing a local CUDA toolkit. A typical SASS snippet for a floating-point multiply-accumulate looks like:
/*0050*/  FFMA R4, R6, R8, R10;
/*0058*/  LDG.E.SYS R12, [R0];
/*0060*/  STG.E.SYS [R2], R4;
/*0068*/  IADD3 R0, R0, 0x4, RZ;

Key SASS instructions

Understanding a handful of core instructions covers most of what you encounter in compute kernels:
InstructionDescription
FFMAFP32 fused multiply-add: dst = a * b + c
HFMA2FP16 paired fused multiply-add (two FP16 ops in parallel)
HMMA / WMMATensor Core MMA operation
LDGLoad from global memory (HBM)
LDSLoad from shared memory
STGStore to global memory
STSStore to shared memory
LDGDEPBARLoad-global dependency barrier (for async copies)
BARBlock-level barrier (equivalent to __syncthreads())
SHFLWarp shuffle — register exchange between threads
VOTEWarp vote instructions (ballot, all, any)
IADD33-way integer add (used in address arithmetic)
IMADInteger multiply-add
S2RSpecial register to register move (reads %tid, %ctaid, etc.)
Instruction modifiers are important: .E on loads/stores means “extended” (64-bit address), .SYS indicates cache policy, and .128 on a load means a 128-byte wide load (four consecutive float4 values).

Instruction-level parallelism and latency hiding

GPU performance depends on keeping functional units occupied. Each instruction has a latency (cycles from issue to result available) and a throughput (instructions per cycle). The GPU schedules four warp schedulers per SM, each selecting a ready warp every cycle. Typical latencies on Ampere:
InstructionLatency (cycles)
FFMA (FP32)~4
LDS (shared memory)~20–30
LDG (global memory, cache hit)~30–40
LDG (global memory, cache miss)~200–800
HMMA (Tensor Core)~16
The GPU hides these latencies by interleaving instructions from multiple independent warps. If your kernel has too few active warps (low occupancy), stalls from long-latency instructions are exposed as idle cycles. Identifying stalls in SASS: The --print-instruction-stats flag in Nsight Compute shows, for each SASS instruction, how many cycles warps stalled waiting for it. An LDG with many Long Scoreboard stalls indicates data arriving from HBM is on the critical path.
# Profile with instruction-level stats
ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum \
          smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct \
          ./my_kernel

Register file and bank conflicts at the SASS level

The register file is the fastest storage on the GPU. But it is divided into banks, and multiple threads accessing the same bank in the same cycle cause conflicts — reducing effective register bandwidth. Register bank conflicts appear in SASS as repeated reads from the same register number across instructions scheduled in the same cycle slot. The SASS scheduler sometimes reorders or splits instructions to avoid conflicts; when it cannot, you see bank conflict stalls. To inspect register allocation, look for the .REG_ALLOC_BANKS annotation in Nsight Compute’s source-correlated view, or count how many registers each warp uses:
# Show register count and shared memory per kernel
cuobjdump --dump-resource-usage my_kernel.cubin
High register usage limits the number of active warps (occupancy). The trade-off: more registers per thread enables more instruction-level parallelism within a thread, but reduces the parallelism available across threads.
Reducing register usage to increase occupancy does not always improve performance. If your kernel is compute-bound rather than latency-bound, high occupancy adds no benefit. Profile first — then tune.

Using SASS insights to guide optimization

SASS analysis is most useful after profiling identifies a bottleneck. Common findings and their fixes: Many LDG stalls → memory access pattern needs improvement
  • Verify global loads are coalesced (128-byte aligned, contiguous across threads)
  • Use __ldg() or const __restrict__ to enable the read-only cache path
  • Prefetch data into shared memory with async copies (cp.async)
Many LDS stalls → shared memory bank conflicts
  • Add padding to shared memory arrays: __shared__ float smem[ROWS][COLS + 1]
  • Rearrange access patterns so threads in the same warp access different banks (banks are 4-byte-aligned, 32 banks on Volta+)
Low FFMA throughput → instruction mix is off
  • The ratio of FFMA to LDG instructions tells you whether you are compute or bandwidth bound
  • An FFMA-heavy kernel with many stalls suggests occupancy is too low to hide latency
Unexpected __syncthreads overhead
  • Each BAR.SYNC in SASS blocks a warp until all threads in the block arrive
  • Minimize synchronization points; eliminate unnecessary __syncthreads() calls

Tools: ncu, cuobjdump, Nsight Compute

1

Profile with ncu

Run ncu --set full ./my_app to collect a comprehensive set of metrics. The full set includes memory throughput, compute throughput, warp stall reasons, and instruction statistics.
2

Identify the bottleneck

Look at the “Speed of Light” section of the Nsight Compute report. It shows what percentage of peak memory bandwidth and compute throughput your kernel achieves.
3

Drill into SASS

Use the Source page in Nsight Compute’s GUI to see SASS instructions correlated with stall counts and warp divergence. Hover over an instruction to see its latency contribution.
4

Validate with cuobjdump

Use cuobjdump --dump-sass to inspect the raw SASS and verify that the compiler is generating the instruction sequence you expect (e.g., vectorized loads, correct unroll factors).
5

Iterate on the CUDA/Triton source

Use SASS insights to guide changes to your high-level kernel: adjust tile sizes, add prefetching, change shared memory layout, or hint the compiler with #pragma unroll.
# Collect a full Nsight Compute report
ncu --set full --export report.ncu-rep ./my_app

# Open the report in the Nsight Compute GUI
ncu-ui report.ncu-rep

# Quick command-line summary: top stall reasons
ncu --metrics smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct,\
smsp__warp_issue_stalled_short_scoreboard_per_warp_active.pct,\
smsp__warp_issue_stalled_wait_per_warp_active.pct \
./my_app
Lecture 37’s slides in the lecture_037/ folder include annotated SASS examples from real kernels showing exactly which stall types to look for and how to interpret the Nsight Compute warp stall breakdown.

Further reading

Lecture 37 slides

Arun Demeure’s slides on SASS and GPU microarchitecture

Lecture 8: CUDA Performance Checklist

Mark Saroufim’s high-level checklist before going to SASS-level analysis

Nsight Compute documentation

NVIDIA’s profiler for kernel-level GPU performance analysis

CUDA C++ Best Practices Guide

NVIDIA’s canonical guide to performance optimization patterns

Build docs developers (and LLMs) love