AMD’s GPU stack has matured considerably with ROCm, HIP, and the Composable Kernel (CK) library. For practitioners coming from CUDA, the transition is mostly mechanical, because HIP is deliberately API-compatible, but squeezing peak performance on AMD Instinct accelerators requires understanding the differences in memory hierarchy, matrix units, and the tile-based programming model that CK exposes. This page accompanies Lecture 25 by Haocong Wang.
Lecture 25 slides are available in the lecture repository at lecture_025/AMD_ROCm_Speaking_Composable_Kernel_July_20_2024.pdf.
AMD GPU architecture: RDNA vs. CDNA
AMD produces two distinct GPU microarchitecture families with very different design goals:
RDNA (consumer/gaming)
Optimized for rasterization throughput and display output. Found in Radeon RX series. Has limited FP64 and no MFMA matrix units — not the primary target for ML training workloads.
CDNA (compute/datacenter)
Optimized for HPC and ML. Found in AMD Instinct MI100, MI200, MI300 series. Features large register files, high-bandwidth HBM memory, and the MFMA (Matrix Fused Multiply-Add) instruction set.
MI200 and MI300 series highlights
| Feature | MI200 (CDNA2) | MI300X (CDNA3) |
|---|---|---|
| Architecture | CDNA2 | CDNA3 |
| HBM capacity | 128 GB (across 2 GCDs) | 192 GB |
| HBM bandwidth | 3.2 TB/s | 5.3 TB/s |
| Peak FP16 TFLOPS | 383 | 1307 |
| Matrix units | MFMA | MFMA |
| NVLink equivalent | Infinity Fabric | Infinity Fabric |
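
To put the table's peak numbers in context, the sketch below computes the roofline "ridge point" (peak FLOP/s divided by peak bandwidth) for each accelerator and compares it with the ideal arithmetic intensity of a square FP16 GEMM. The figures are the theoretical peaks from the table; the helper names and the assumption that each matrix touches HBM exactly once are illustrative simplifications, not measurements.

```python
# Peak figures copied from the table above (theoretical FP16 peaks).
ACCELERATORS = {
    "MI200 (CDNA2)": {"fp16_tflops": 383.0, "hbm_tb_per_s": 3.2},
    "MI300X (CDNA3)": {"fp16_tflops": 1307.0, "hbm_tb_per_s": 5.3},
}


def ridge_point(fp16_tflops, hbm_tb_per_s):
    """FLOPs per byte a kernel needs before it stops being bandwidth-bound."""
    return fp16_tflops / hbm_tb_per_s  # (TFLOP/s) / (TB/s) = FLOP/byte


def gemm_intensity(m, n, k, bytes_per_element=2):
    """Ideal FLOP/byte of C = A @ B, assuming A, B, and C each touch HBM once."""
    flops = 2 * m * n * k
    traffic = bytes_per_element * (m * k + k * n + m * n)
    return flops / traffic


for name, spec in ACCELERATORS.items():
    print(f"{name}: ridge at about {ridge_point(**spec):.0f} FLOP/byte")
print(f"4096^3 FP16 GEMM: about {gemm_intensity(4096, 4096, 4096):.0f} FLOP/byte")
```

A large GEMM sits well above both ridge points, which is why its performance depends on how well the kernel feeds the matrix units rather than on HBM bandwidth alone; a matrix-vector product (m = 1) falls far below them.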
ROCm: AMD’s GPU computing platform
ROCm (Radeon Open Compute) is the open-source software stack that sits between hardware and user code, analogous to CUDA Toolkit. It includes:
- HIP runtime: CUDA-compatible API layer
- rocBLAS / hipBLAS: BLAS for AMD GPUs (equivalent to cuBLAS)
- MIOpen: DNN primitives (equivalent to cuDNN)
- rocprof: GPU profiler (comparable to NVIDIA's Nsight tools)
- Composable Kernel: high-performance kernel library (AMD-specific; the closest NVIDIA analogue is CUTLASS)
HIP: CUDA-compatible programming model
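HIP (Heterogeneous-computing Interface for Portability) mirrors the CUDA API almost exactly. Most CUDA code can be mechanically converted with the hipify-perl tool, which rewrites cuda* runtime calls and headers to their hip* counterparts (for example, cudaMalloc becomes hipMalloc) while kernel code using __global__, threadIdx, and blockIdx is left unchanged.
The same HIP source can then be built for either vendor; the target is selected with the HIP_PLATFORM environment variable (amd or nvidia).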
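To make "mechanical conversion" concrete, here is a toy sketch of the kind of textual renaming a hipify-style tool performs. It is not hipify-perl itself: the function name `toy_hipify` and the seven-entry mapping are illustrative assumptions, and the real tool covers a far larger table of APIs and edge cases.

```python
import re

# Hand-picked subset of CUDA -> HIP renames; the real tool's table is much larger.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Word boundaries ensure only exact, known identifiers are rewritten.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in CUDA_TO_HIP) + r")\b"
)


def toy_hipify(source: str) -> str:
    """Rename the known CUDA identifiers in `source`; leave everything else untouched."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)


cuda_source = """#include <cuda_runtime.h>
cudaMalloc(&d_x, bytes);
cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
scale<<<blocks, threads>>>(d_x, n);
cudaDeviceSynchronize();
cudaFree(d_x);
"""
print(toy_hipify(cuda_source))
```

Note that the kernel launch line passes through unchanged: HIP accepts the same triple-chevron launch syntax, so only host-side runtime calls and headers need renaming in this example.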
The wavefront (AMD’s equivalent of a CUDA warp) is 64 threads wide on CDNA architecture (compared to NVIDIA’s 32). This is a meaningful architectural difference — occupancy calculations and register pressure analysis differ from CUDA accordingly. RDNA uses 32-thread wavefronts.
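One consequence of the wider wavefront can be shown with simple arithmetic: a thread block whose size is not a multiple of the wave width leaves lanes idle. The sketch below compares 32-wide and 64-wide scheduling for a few block sizes; the helper name `lane_utilization` is illustrative.

```python
import math


def lane_utilization(block_size: int, wave_width: int) -> float:
    """Fraction of SIMD lanes doing useful work for one thread block."""
    waves = math.ceil(block_size / wave_width)
    return block_size / (waves * wave_width)


for block_size in (32, 96, 256):
    warp32 = lane_utilization(block_size, 32)
    wave64 = lane_utilization(block_size, 64)
    print(f"block={block_size:4d}  32-wide={warp32:.0%}  64-wide={wave64:.0%}")
```

Block sizes that were tuned as multiples of 32 on NVIDIA hardware are therefore worth revisiting as multiples of 64 when targeting CDNA.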
Composable Kernel (CK) overview
Composable Kernel is AMD’s high-performance kernel library for ML workloads on ROCm. Unlike closed-source cuBLAS, CK is open-source, and unlike the fixed BLAS interface of rocBLAS it is designed to be composed: library users can mix and match operation types, data types, and layout configurations without forking kernel code. CK’s key design principles:
- Tile-based programming: all work is expressed as operations on tiles that map to the GPU memory hierarchy (global → LDS → registers).
- Template metaprogramming: kernel configurations (tile sizes, pipeline stages, instruction types) are compile-time template parameters, not runtime switches.
- MFMA-first: the inner loop targets AMD’s MFMA matrix instructions directly, rather than relying on the compiler to discover them.
CK’s tile-based programming model
CK decomposes a GEMM into a three-level tile hierarchy (threadblock tile, wavefront tile, per-thread MFMA accumulators) that maps directly onto AMD GPU memory levels (global memory → LDS → registers). The tile shapes are compile-time parameters of the GridwiseGemm template. You specify:
- BlockSize: threads per threadblock
- MPerBlock, NPerBlock, KPerBlock: tile dimensions at the threadblock level
- MPerWave, NPerWave: tile dimensions at the wavefront level
- MRepeat, NRepeat: how many MFMA results each thread accumulates
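The effect of these parameters can be estimated before compiling anything. The sketch below is a back-of-the-envelope model, not CK code: the `TileConfig` class, its method names, and the chosen tile values are illustrative assumptions, and the LDS estimate ignores padding and multi-stage buffering that a real kernel may add.

```python
from dataclasses import dataclass

WAVEFRONT_SIZE = 64           # CDNA wavefront width
LDS_BYTES_PER_CU = 64 * 1024  # LDS capacity per compute unit


@dataclass(frozen=True)
class TileConfig:
    # Fields mirror the template parameters listed above; values are illustrative.
    block_size: int
    m_per_block: int
    n_per_block: int
    k_per_block: int

    def waves_per_block(self) -> int:
        return self.block_size // WAVEFRONT_SIZE

    def workgroups(self, m: int, n: int) -> int:
        """Threadblocks launched: one per MPerBlock x NPerBlock output tile."""
        def ceil_div(a, b):
            return -(-a // b)
        return ceil_div(m, self.m_per_block) * ceil_div(n, self.n_per_block)

    def lds_bytes(self, bytes_per_element: int = 2) -> int:
        """LDS needed to stage one A tile and one B tile (single-buffered)."""
        a_tile = self.m_per_block * self.k_per_block
        b_tile = self.k_per_block * self.n_per_block
        return (a_tile + b_tile) * bytes_per_element


cfg = TileConfig(block_size=256, m_per_block=256, n_per_block=128, k_per_block=32)
print("wavefronts per block:", cfg.waves_per_block())
print("workgroups for 4096 x 4096 output:", cfg.workgroups(4096, 4096))
print("LDS bytes per block (FP16):", cfg.lds_bytes())
```

A check such as `cfg.lds_bytes() <= LDS_BYTES_PER_CU` is a cheap way to reject tile shapes that cannot fit before spending time compiling a template instance.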
Writing a GEMM with Composable Kernel
CK provides both a high-level “client” API and direct access to building-block templates. The client API is the recommended starting point.
The MFMA instruction
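At the hardware level, CK targets the v_mfma_* instructions, each of which performs a small matrix multiply-accumulate (D = A·B + C) cooperatively across a wavefront. For FP16 inputs, the products are accumulated in FP32.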
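As a numerical reference for what a single MFMA computes, the sketch below evaluates D = A·B + C for a 32×32×8 tile, the shape encoded in the mnemonic v_mfma_f32_32x32x8f16 (the exact spelling varies between ISA generations). It also shows how many values each of the 64 lanes holds when a tile is spread across a wavefront. This is plain Python arithmetic, not an emulation of FP16 rounding or of the hardware's lane layout, and the function names are illustrative.

```python
WAVEFRONT_SIZE = 64


def mfma_reference(a, b, c):
    """Return D = A @ B + C for an M x K tile A, a K x N tile B, and an M x N tile C."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [c[i][j] + sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


def values_per_lane(rows, cols):
    """Elements each lane holds when a rows x cols tile is split over one wavefront."""
    return rows * cols // WAVEFRONT_SIZE


M, N, K = 32, 32, 8
a = [[1.0] * K for _ in range(M)]   # A tile: FP16 inputs on hardware
b = [[2.0] * N for _ in range(K)]   # B tile: FP16 inputs on hardware
c = [[0.5] * N for _ in range(M)]   # accumulator: FP32 on hardware
d = mfma_reference(a, b, c)

print("d[0][0] =", d[0][0])
print("A values per lane:", values_per_lane(M, K))
print("accumulator values per lane:", values_per_lane(M, N))
```

The per-lane counts explain why MFMA-heavy kernels are register-hungry: every lane carries 16 FP32 accumulators for each 32×32 output tile it participates in, multiplied by however many tiles MRepeat and NRepeat assign to it.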
Key differences from CUDA and cuBLAS
Understanding these differences avoids subtle performance bugs when porting CUDA kernels to ROCm:

| Aspect | CUDA / NVIDIA | HIP / AMD CDNA |
|---|---|---|
| Warp / wavefront width | 32 threads | 64 threads (CDNA), 32 (RDNA) |
| Shared memory | 48–228 KB (configurable) | LDS: 64 KB per CU |
| Matrix instruction | WMMA / MMA (Tensor Core) | MFMA (v_mfma_*) |
| Matrix fragment / tile size | 16×16 | 16×16 or 32×32 |
| L1 cache | Per SM, configurable | L1 per CU, less configurable |
| Occupancy tuning | cudaFuncSetCacheConfig | LDS and register pressure govern occupancy directly |
| Profiling | ncu, Nsight Systems | rocprof, Omniperf |
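
The "LDS governs occupancy" row can be made concrete with one division. The sketch below computes the upper bound on resident workgroups per compute unit imposed by LDS alone, using the 64 KB figure from the table. The function name is illustrative, and register usage and hardware wave-slot limits (not modeled here) can lower the real number further.

```python
LDS_BYTES_PER_CU = 64 * 1024  # from the table above


def workgroups_per_cu_by_lds(lds_bytes_per_workgroup: int,
                             lds_bytes_per_cu: int = LDS_BYTES_PER_CU) -> int:
    """Upper bound on resident workgroups per CU imposed by LDS alone."""
    if lds_bytes_per_workgroup <= 0:
        raise ValueError("workgroup must use a positive amount of LDS")
    return lds_bytes_per_cu // lds_bytes_per_workgroup


for kib in (8, 16, 24, 48):
    limit = workgroups_per_cu_by_lds(kib * 1024)
    print(f"{kib:2d} KiB per workgroup -> at most {limit} workgroups per CU")
```

Growing a tile from 16 KiB to 24 KiB of LDS halves this bound, which is the kind of cliff that makes tile-size tuning on CDNA sensitive to small changes.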
Profiling with rocprof and Omniperf
SYCL mode: Intel GPU portability (Lecture 26)
Lecture 26 by Patric Zhao covers SYCL MODE: running the same compute kernels on Intel GPUs using SYCL (via Intel’s oneAPI DPC++ compiler). SYCL is a Khronos-standard C++ abstraction layer whose implementations can target OpenCL, Level Zero (Intel), CUDA, and HIP (the latter two through, for example, Codeplay’s oneAPI plugins).
Lecture 26: SYCL MODE
Patric Zhao’s slides on Intel GPU programming with SYCL and oneAPI
Intel oneAPI docs
Official Intel oneAPI developer documentation and DPC++ reference
Cross-platform portability considerations
Writing portable GPU kernels that run on NVIDIA, AMD, and Intel hardware is achievable with the right abstraction layer:
Use HIP as the primary layer
HIP compiles to both CUDA and ROCm. For NVIDIA targets, set HIP_PLATFORM=nvidia; for AMD, use the default ROCm backend. This covers ~90% of use cases with minimal code changes.
Abstract matrix instructions behind compile-time dispatch
Use #ifdef __HIP_PLATFORM_AMD__ to select the MFMA path and #ifdef __CUDA_ARCH__ to select the WMMA path. CK and CUTLASS handle this internally, so prefer library APIs over raw intrinsics when possible.
Profile on each target separately
Optimal tile sizes and pipeline depths differ between MI300X and H100. A kernel tuned for one will likely underperform on the other by 20–40%. Use autotuning frameworks (CK profiler, CUTLASS profiler) rather than hardcoding tile sizes.
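The idea behind autotuning instead of hardcoding can be sketched in a few lines: enumerate tile shapes that fit the hardware, time each one, and keep the fastest. The code below is a generic illustration and not the API of the CK or CUTLASS profilers; `candidate_tiles`, `autotune`, and especially `fake_benchmark` (a synthetic stand-in for a real on-device timing run) are hypothetical names introduced here.

```python
import itertools

LDS_BYTES_PER_CU = 64 * 1024


def candidate_tiles(bytes_per_element=2):
    """Yield (MPerBlock, NPerBlock, KPerBlock) tiles whose A and B stages fit in LDS."""
    for m, n, k in itertools.product((64, 128, 256), (64, 128, 256), (32, 64, 128)):
        if (m * k + k * n) * bytes_per_element <= LDS_BYTES_PER_CU:
            yield (m, n, k)


def autotune(benchmark, candidates):
    """Return the candidate with the lowest measured time, plus that time."""
    timings = {tile: benchmark(tile) for tile in candidates}
    best = min(timings, key=timings.get)
    return best, timings[best]


def fake_benchmark(tile):
    """Synthetic cost model standing in for a real timing run on the target GPU."""
    m, n, k = tile
    return abs(m - 256) + abs(n - 128) + abs(k - 32) + 1.0


best_tile, best_time = autotune(fake_benchmark, candidate_tiles())
print("best tile:", best_tile, "time:", best_time)
```

In practice `benchmark` would launch the compiled kernel instance on each target and return a median runtime, so the same search loop yields different winners on MI300X and H100.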
Composable Kernel on GitHub
Source code, examples, and instance profiler for AMD’s CK library
Lecture 25 slides
Haocong Wang’s full slide deck: Speaking Composable Kernel (July 2024)
ROCm documentation
AMD’s official ROCm programming guide and API reference
HIP porting guide
Official guide for porting CUDA code to HIP/ROCm