Writing Metal Kernels for Apple Silicon GPUs

Apple Silicon brought a fundamentally different GPU architecture to developers — unified memory, a tightly integrated neural engine, and a GPU programming model built on Metal rather than CUDA. This page is a companion to Lecture 31 by Nikita Shulga, which walks through writing GPU compute kernels in Metal Shading Language (MSL) and benchmarks three progressively optimized GEMM implementations on Apple Silicon.

All code examples on this page are drawn from gemm_perf_studies.mm in the lecture repository. Compile with: clang++ --std=c++17 gemm_perf_studies.mm -framework Metal -framework Foundation

Apple Silicon GPU architecture

Apple Silicon (M1/M2/M3/M4 families) integrates CPU, GPU, and Neural Engine on a single die with a shared memory fabric — there is no separate “device memory” to copy to and from. Key characteristics relevant to kernel writers:

Unified memory architecture (UMA): CPU and GPU share the same physical RAM. Allocating a buffer with MTLResourceStorageModeShared makes it readable and writable by both without any memcpy.
Tile memory: fast on-chip scratchpad analogous to CUDA shared memory; called threadgroup memory in MSL.
SIMD groups: groups of 32 threads that execute in lockstep (equivalent to CUDA warps). SIMD-group operations like simd_sum and simd_shuffle are first-class in MSL.
Apple Neural Engine (ANE): a separate fixed-function accelerator for ML inference. Not accessible via Metal compute shaders — the ANE is used automatically by Core ML and MLX.

Unified memory

No PCIe bus to cross. CPU and GPU buffers share the same address space — ideal for small batches and latency-sensitive workloads.

Neural Engine

Dedicated matrix-multiply hardware. Accessible via Core ML or the MLX framework, not directly from Metal compute shaders.

Metal Shading Language vs. CUDA

Metal Shading Language is a C++14-based shading language with GPU-specific qualifiers. If you already know CUDA, the table below maps the most important concepts.

CUDA	Metal (MSL)
`__global__ void kernel(...)`	`kernel void kernel(...)`
`threadIdx`, `blockIdx`	`thread_position_in_threadgroup`, `threadgroup_position_in_grid`
`thread_position_in_grid` (2D)	`uint2 tid [[thread_position_in_grid]]`
`__shared__`	`threadgroup` address space
Warp (32 threads)	SIMD group (32 threads)
`__syncthreads()`	`threadgroup_barrier(mem_flags::mem_threadgroup)`
`cudaMalloc` / `cudaMemcpy`	`newBufferWithLength:options:` (no copy needed for shared)
`dim3 gridDim`, `blockDim`	`MTLSize` passed to `dispatchThreads:threadsPerThreadgroup:`

MSL uses attribute syntax to bind inputs to a kernel function. [[buffer(0)]] binds to the first buffer slot, [[thread_position_in_grid]] injects the thread’s position automatically.

Setting up a Metal compute pipeline

Before dispatching a kernel you need a compute pipeline state — Metal’s compiled version of your shader. The pipeline is created from a compiled MTLLibrary, which you can either precompile offline or compile from source at runtime.

Get a Metal device

id<MTLDevice> getMetalDevice() {
  NSArray *devices = [MTLCopyAllDevices() autorelease];
  if (devices.count == 0) {
    throw std::runtime_error("Metal is not supported");
  }
  return devices[0];
}

Compile a shader library from source

id<MTLLibrary> compileLibraryFromSource(id<MTLDevice> device,
                                        const std::string &source) {
  NSError *error = nil;
  MTLCompileOptions *options = [[MTLCompileOptions new] autorelease];
  [options setLanguageVersion:MTLLanguageVersion3_1];
  id<MTLLibrary> library = [device
      newLibraryWithSource:[NSString stringWithUTF8String:source.c_str()]
                   options:options
                     error:&error];
  if (library == nil) {
    throw std::runtime_error(std::string("Failed to compile: ") +
                             error.description.UTF8String);
  }
  return library;
}

Build the compute pipeline state

id<MTLFunction> func = [lib newFunctionWithName:@"gemm"];
NSError *error = nil;
auto cpl = [lib.device newComputePipelineStateWithFunction:func error:&error];

Allocate shared buffers

id<MTLBuffer> allocSharedBuffer(id<MTLDevice> device, unsigned length) {
  id<MTLBuffer> rc = [device newBufferWithLength:length
                                         options:MTLResourceStorageModeShared];
  if (rc == nil) {
    throw std::runtime_error("Can't allocate " + std::to_string(length) +
                             " bytes on GPU");
  }
  return rc;
}

Encode and dispatch

auto cmdBuffer = [queue commandBuffer];
auto encoder  = [cmdBuffer computeCommandEncoder];
[encoder setComputePipelineState:cpl];
[encoder setBuffer:buf_A offset:0 atIndex:0];
[encoder setBuffer:buf_B offset:0 atIndex:1];
[encoder setBuffer:buf_C offset:0 atIndex:2];
[encoder setBytes:sizes.data()
           length:sizeof(uint32_t) * sizes.size()
          atIndex:3];
[encoder dispatchThreads:MTLSizeMake(N, M, 1)
    threadsPerThreadgroup:group_size];
[encoder endEncoding];
[cmdBuffer commit];
[cmdBuffer waitUntilCompleted];

Thread groups and SIMD groups

Metal’s execution hierarchy maps directly onto how Apple’s GPU hardware schedules work:

Thread: the smallest unit of execution — one invocation of the kernel function.
Threadgroup: a block of threads that share threadgroup (tile) memory and can synchronize with threadgroup_barrier. Analogous to a CUDA thread block.
SIMD group: 32 threads within a threadgroup that execute in lockstep. Analogous to a CUDA warp.
Grid: the full set of threadgroups dispatched for a kernel call.

cpl.maxTotalThreadsPerThreadgroup queries the hardware limit for your compiled pipeline — a good default for 1D dispatches:

const auto maxTpG = [cpl maxTotalThreadsPerThreadgroup];
MTLSize group_size = MTLSizeMake(std::min(static_cast<decltype(M)>(maxTpG), M), 1, 1);

Threadgroup (shared) memory

Declare threadgroup memory inside a kernel function using the threadgroup address space qualifier. It behaves identically to CUDA __shared__: all threads in a threadgroup see the same allocation, and you must use a barrier before reading data written by another thread.

kernel void tiled_gemm(
    constant float *A [[buffer(0)]],
    constant float *B [[buffer(1)]],
    device float   *C [[buffer(2)]],
    threadgroup float *tile_A [[threadgroup(0)]],
    threadgroup float *tile_B [[threadgroup(1)]],
    uint2 tgid  [[threadgroup_position_in_grid]],
    uint2 tid   [[thread_position_in_threadgroup]]) {

  // ... load tile into tile_A and tile_B ...
  threadgroup_barrier(mem_flags::mem_threadgroup);
  // ... compute using tile data ...
}

Threadgroup memory size is declared at dispatch time in Metal (not in the shader), using setThreadgroupMemoryLength:atIndex: on the encoder. This differs from CUDA, where extern __shared__ arrays are sized via the kernel launch triple-chevron syntax.

GEMM kernels: three levels of optimization

The lecture demonstrates three progressively optimized GEMM kernels for multiplying a row-major matrix A (M×K) by a column-major matrix B (K×N).

Naive: one thread per output element

// Naive — one thread per output element
kernel void gemm(constant float *A [[buffer(0)]],
                 constant float *B [[buffer(1)]],
                 device float *outputData [[buffer(2)]],
                 constant uint3 &sizes [[buffer(3)]],
                 uint2 thread_index [[thread_position_in_grid]]) {
  const uint lda = sizes.y;
  const uint ldc = sizes.z;
  const uint m = thread_index.y; // 0..sizes.x-1
  const uint n = thread_index.x; // 0..sizes.z-1
  constant auto *A_ptr = A + m * lda;
  constant auto *B_ptr = B + n * lda;

  float rc = 0.0;
  for (uint k = 0; k < sizes.y; k++) {
    const auto a_val = A_ptr[k];
    const auto b_val = B_ptr[k];
    rc += a_val * b_val;
  }
  outputData[m * ldc + n] = rc;
}

Each thread iterates over the full K dimension. Memory accesses are scalar — no vectorization.

vec4: SIMD vectorization with `float4`

// SIMD (vec4) — one thread per output element, 4-wide dot products
using namespace metal;

kernel void gemm(constant float *A [[buffer(0)]],
                 constant float *B [[buffer(1)]],
                 device float *outputData [[buffer(2)]],
                 constant uint3 &sizes [[buffer(3)]],
                 uint2 thread_index [[thread_position_in_grid]]) {
  const uint lda = sizes.y;
  const uint ldc = sizes.z;
  const uint m = thread_index.y;
  const uint n = thread_index.x;
  constant auto *A_ptr = reinterpret_cast<constant float4 *>(A + m * lda);
  constant auto *B_ptr = reinterpret_cast<constant float4 *>(B + n * lda);

  float rc = 0.0;
  for (uint k = 0; k < sizes.y / 4; k++) {
    rc += dot(A_ptr[k], B_ptr[k]);
  }
  outputData[m * ldc + n] = rc;
}

reinterpret_cast to float4* lets MSL’s dot() built-in compute four multiply-adds per instruction. The loop trip count drops by 4×.

mat4: matrix-vector products for 4 outputs per thread

// SIMD (mat4×vec4) — one thread computes 4 output elements, 8×8 threadgroups
using namespace metal;

kernel void gemm(constant float *A [[buffer(0)]],
                 constant float *B [[buffer(1)]],
                 device float *outputData [[buffer(2)]],
                 constant uint3 &sizes [[buffer(3)]],
                 uint2 thread_index [[thread_position_in_grid]]) {
  const uint lda = sizes.y;
  const uint ldc = sizes.z;
  const uint m = thread_index.y;
  const uint n = thread_index.x; // 0..sizes.z/4-1
  constant auto *A_ptr = reinterpret_cast<constant float4 *>(A + m * lda);
  constant auto *B_ptr = reinterpret_cast<constant float4 *>(B + n * 4 * lda);

  float4 rc = 0.0;
  for (uint k = 0; k < sizes.y / 4; k++) {
    float4x4 b_mat;
    for (int j = 0; j < 4; ++j) {
      b_mat[j] = B_ptr[k + j * lda / 4];
    }
    rc += transpose(b_mat) * A_ptr[k];
  }
  reinterpret_cast<device float4*>(outputData + m * ldc)[n] = rc;
}

Each thread now produces four output values by accumulating a float4x4 × float4 matrix-vector product. The grid is dispatched with N/4 columns and 8×8 threadgroups:

MTLSize group_size = MTLSizeMake(8, 8, 1);
[encoder dispatchThreads:MTLSizeMake(N / 4, M, 1)
    threadsPerThreadgroup:group_size];

Benchmarking the kernels

The benchmark harness measures mean dispatch time over 200 iterations and converts to GFLOP/s:

auto gflops = (M * N * K * 1e-9) / measure_time(200, do_compute);
std::cout << "Perf of " << shader_name << " dim "
          << M << "x" << N << "x" << K
          << " is " << gflops << " GFLOPs" << std::endl;

For a 32×4128×4096 problem on Apple Silicon, the progression from naive → vec4 → mat4 shows meaningful GFLOPs gains at each step, demonstrating the impact of vectorization and output-reuse.

Profiling with Instruments and Metal Debugger

Metal provides first-class tooling inside Xcode for GPU performance analysis.

GPU trace capture
Instruments
Metal Debugger

The benchmark code captures a GPU trace automatically when MTL_CAPTURE_ENABLED is set in the environment:

auto captureManager    = [MTLCaptureManager sharedCaptureManager];
auto captureDescriptor = [MTLCaptureDescriptor new];
auto gpuTraceString    = [NSString stringWithFormat:@"%s.gputrace",
                                                     shader_name.c_str()];
captureDescriptor.captureObject   = queue;
captureDescriptor.destination     = MTLCaptureDestinationGPUTraceDocument;
captureDescriptor.outputURL       = [NSURL fileURLWithPath:gpuTraceString];
[captureManager startCaptureWithDescriptor:captureDescriptor error:nil];
do_compute();
[captureManager stopCapture];

Open the resulting .gputrace file in Xcode to inspect per-shader execution time, occupancy, and memory traffic.

Run the binary under Instruments → Metal System Trace to see GPU timeline, command buffer boundaries, and CPU/GPU synchronization gaps.

xcrun xctrace record --template "Metal System Trace" \
    --launch -- ./a.out

In Xcode, attach to a running process or open a .gputrace file. The Shader Profiler view shows per-line ALU and memory utilization, making it easy to spot bottlenecks in the inner loop.

MLX: ML on Apple Silicon

MLX is Apple’s open-source array framework for machine learning on Apple Silicon. It exposes the same unified-memory model as Metal, making it zero-copy with NumPy and PyTorch arrays.

import mlx.core as mx

# Arrays live in unified memory — no .to("mps") needed
a = mx.random.normal(shape=(1024, 1024))
b = mx.random.normal(shape=(1024, 1024))
c = a @ b          # dispatches an optimized Metal GEMM kernel
mx.eval(c)         # lazy evaluation — commit the computation graph

MLX ships hand-tuned Metal kernels for common ops (matmul, attention, convolutions) and lets you write custom Metal kernels through its C++ extension API.

MLX documentation

Official MLX API reference and guides for writing custom kernels

Lecture 31 slides

Nikita Shulga’s slides: Beginner’s Guide to Metal Kernels

Metal compute shaders do not have access to the Apple Neural Engine. For ANE acceleration, use Core ML or MLX, which dispatch operations to the ANE automatically when the operation and data type are supported.

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

Writing Metal Kernels for Apple Silicon GPUs

Apple Silicon GPU architecture

Unified memory

Neural Engine

Metal Shading Language vs. CUDA

Setting up a Metal compute pipeline

Thread groups and SIMD groups

Threadgroup (shared) memory

GEMM kernels: three levels of optimization

Naive: one thread per output element

vec4: SIMD vectorization with `float4`

mat4: matrix-vector products for 4 outputs per thread

Benchmarking the kernels

Profiling with Instruments and Metal Debugger

MLX: ML on Apple Silicon

MLX documentation

Lecture 31 slides

Build docs developers (and LLMs) love

Getting Started

CUDA Fundamentals

Advanced GPU Programming

Triton & High-Level Frameworks

Quantization & Optimization

Multi-GPU & Systems

Hardware Targets

ScaleML Series

Documentation Index

​Apple Silicon GPU architecture

Unified memory

Neural Engine

​Metal Shading Language vs. CUDA

​Setting up a Metal compute pipeline

​Thread groups and SIMD groups

​Threadgroup (shared) memory

​GEMM kernels: three levels of optimization

​Naive: one thread per output element

​vec4: SIMD vectorization with float4

​mat4: matrix-vector products for 4 outputs per thread

​Benchmarking the kernels

​Profiling with Instruments and Metal Debugger

​MLX: ML on Apple Silicon

MLX documentation

Lecture 31 slides

Build docs developers (and LLMs) love

Apple Silicon GPU architecture

Metal Shading Language vs. CUDA

Setting up a Metal compute pipeline

Thread groups and SIMD groups

Threadgroup (shared) memory

GEMM kernels: three levels of optimization

Naive: one thread per output element

vec4: SIMD vectorization with `float4`

mat4: matrix-vector products for 4 outputs per thread

Benchmarking the kernels

Profiling with Instruments and Metal Debugger

MLX: ML on Apple Silicon