Use this file to discover all available pages before exploring further.
Apple Silicon brought a fundamentally different GPU architecture to developers — unified memory, a tightly integrated neural engine, and a GPU programming model built on Metal rather than CUDA. This page is a companion to Lecture 31 by Nikita Shulga, which walks through writing GPU compute kernels in Metal Shading Language (MSL) and benchmarks three progressively optimized GEMM implementations on Apple Silicon.
All code examples on this page are drawn from gemm_perf_studies.mm in the lecture repository. Compile with: clang++ --std=c++17 gemm_perf_studies.mm -framework Metal -framework Foundation
Apple Silicon (M1/M2/M3/M4 families) integrates CPU, GPU, and Neural Engine on a single die with a shared memory fabric — there is no separate “device memory” to copy to and from.Key characteristics relevant to kernel writers:
Unified memory architecture (UMA): CPU and GPU share the same physical RAM. Allocating a buffer with MTLResourceStorageModeShared makes it readable and writable by both without any memcpy.
Tile memory: fast on-chip scratchpad analogous to CUDA shared memory; called threadgroup memory in MSL.
SIMD groups: groups of 32 threads that execute in lockstep (equivalent to CUDA warps). SIMD-group operations like simd_sum and simd_shuffle are first-class in MSL.
Apple Neural Engine (ANE): a separate fixed-function accelerator for ML inference. Not accessible via Metal compute shaders — the ANE is used automatically by Core ML and MLX.
Unified memory
No PCIe bus to cross. CPU and GPU buffers share the same address space — ideal for small batches and latency-sensitive workloads.
Neural Engine
Dedicated matrix-multiply hardware. Accessible via Core ML or the MLX framework, not directly from Metal compute shaders.
Metal Shading Language is a C++14-based shading language with GPU-specific qualifiers. If you already know CUDA, the table below maps the most important concepts.
newBufferWithLength:options: (no copy needed for shared)
dim3 gridDim, blockDim
MTLSize passed to dispatchThreads:threadsPerThreadgroup:
MSL uses attribute syntax to bind inputs to a kernel function. [[buffer(0)]] binds to the first buffer slot, [[thread_position_in_grid]] injects the thread’s position automatically.
Before dispatching a kernel you need a compute pipeline state — Metal’s compiled version of your shader. The pipeline is created from a compiled MTLLibrary, which you can either precompile offline or compile from source at runtime.
1
Get a Metal device
id<MTLDevice> getMetalDevice() { NSArray *devices = [MTLCopyAllDevices() autorelease]; if (devices.count == 0) { throw std::runtime_error("Metal is not supported"); } return devices[0];}
Declare threadgroup memory inside a kernel function using the threadgroup address space qualifier. It behaves identically to CUDA __shared__: all threads in a threadgroup see the same allocation, and you must use a barrier before reading data written by another thread.
kernel void tiled_gemm( constant float *A [[buffer(0)]], constant float *B [[buffer(1)]], device float *C [[buffer(2)]], threadgroup float *tile_A [[threadgroup(0)]], threadgroup float *tile_B [[threadgroup(1)]], uint2 tgid [[threadgroup_position_in_grid]], uint2 tid [[thread_position_in_threadgroup]]) { // ... load tile into tile_A and tile_B ... threadgroup_barrier(mem_flags::mem_threadgroup); // ... compute using tile data ...}
Threadgroup memory size is declared at dispatch time in Metal (not in the shader), using setThreadgroupMemoryLength:atIndex: on the encoder. This differs from CUDA, where extern __shared__ arrays are sized via the kernel launch triple-chevron syntax.
Each thread now produces four output values by accumulating a float4x4 × float4 matrix-vector product. The grid is dispatched with N/4 columns and 8×8 threadgroups:
The benchmark harness measures mean dispatch time over 200 iterations and converts to GFLOP/s:
auto gflops = (M * N * K * 1e-9) / measure_time(200, do_compute);std::cout << "Perf of " << shader_name << " dim " << M << "x" << N << "x" << K << " is " << gflops << " GFLOPs" << std::endl;
For a 32×4128×4096 problem on Apple Silicon, the progression from naive → vec4 → mat4 shows meaningful GFLOPs gains at each step, demonstrating the impact of vectorization and output-reuse.
Open the resulting .gputrace file in Xcode to inspect per-shader execution time, occupancy, and memory traffic.
Run the binary under Instruments → Metal System Trace to see GPU timeline, command buffer boundaries, and CPU/GPU synchronization gaps.
xcrun xctrace record --template "Metal System Trace" \ --launch -- ./a.out
In Xcode, attach to a running process or open a .gputrace file. The Shader Profiler view shows per-line ALU and memory utilization, making it easy to spot bottlenecks in the inner loop.
MLX is Apple’s open-source array framework for machine learning on Apple Silicon. It exposes the same unified-memory model as Metal, making it zero-copy with NumPy and PyTorch arrays.
import mlx.core as mx# Arrays live in unified memory — no .to("mps") neededa = mx.random.normal(shape=(1024, 1024))b = mx.random.normal(shape=(1024, 1024))c = a @ b # dispatches an optimized Metal GEMM kernelmx.eval(c) # lazy evaluation — commit the computation graph
MLX ships hand-tuned Metal kernels for common ops (matmul, attention, convolutions) and lets you write custom Metal kernels through its C++ extension API.
MLX documentation
Official MLX API reference and guides for writing custom kernels
Lecture 31 slides
Nikita Shulga’s slides: Beginner’s Guide to Metal Kernels
Metal compute shaders do not have access to the Apple Neural Engine. For ANE acceleration, use Core ML or MLX, which dispatch operations to the ANE automatically when the operation and data type are supported.