ARM CPUs power the majority of the world’s mobile devices, edge accelerators, and a growing number of cloud servers. Optimizing matrix operations for ARM means understanding SIMD instruction sets (NEON for fixed-width 128-bit vectors, SVE for scalable-width vectors) and exploiting low-bit quantization formats (INT8, INT4) that increase arithmetic throughput while reducing memory traffic. This page accompanies Lecture 38 by Scott Roy.
Lecture 38 slides are available in the lecture repository at lecture_038/lowbit_kernels.pdf.
Why ARM CPU optimization matters
GPU availability is not universal. A large fraction of ML inference workloads run on ARM CPUs: in smartphones, IoT devices, on-premises edge servers, and increasingly in ARM-based cloud instances (AWS Graviton, Ampere Altra). Even when a GPU is present, the CPU often handles data preprocessing, tokenization, and small-batch inference where GPU launch overhead dominates.
Mobile and edge
Nearly all smartphones, and a large share of microcontrollers, run ARM cores. On-device LLM inference on the CPU (e.g., llama.cpp) depends heavily on NEON/SVE performance.
Cloud servers
AWS Graviton3, Ampere Altra, and Apple M-series chips are ARM-based. Cost per inference can be lower than x86 for many workloads.
Memory bandwidth
Low-bit quantization reduces memory traffic — the dominant bottleneck for transformer inference — making INT8 and INT4 essential for throughput on ARM.
ARM NEON SIMD: 128-bit vector registers
NEON is ARM’s Advanced SIMD extension, present on every modern ARM application processor (Cortex-A, Apple Silicon, Graviton). Each NEON register is 128 bits wide and can hold:
- 16 × INT8 (or UINT8)
- 8 × INT16
- 4 × INT32
- 4 × FP32
- 8 × FP16
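The v prefix denotes a NEON operation. The type suffix encodes the element type and width: s8 = signed 8-bit, u8 = unsigned 8-bit, s32 = signed 32-bit. A q in the name (as in vaddq_s8) marks an operation on a full 128-bit (quad) register; without it the intrinsic operates on a 64-bit register.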
Key NEON intrinsics for matrix multiply
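As a rough illustration of the intrinsics that usually form a GEMM inner loop, the sketch below computes an FP32 dot product with vld1q_f32 (load four lanes), vfmaq_f32 (fused multiply-add per lane), and vaddvq_f32 (horizontal sum). The function name dot_f32 and the requirement that n be a multiple of 4 are assumptions made for this example; the scalar branch is only there so the code also builds on non-ARM hosts.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Dot product of two FP32 arrays. Assumption for this sketch: n % 4 == 0. */
float dot_f32(const float *a, const float *b, size_t n) {
    assert(n % 4 == 0);
#if defined(__aarch64__) && defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);       /* four FP32 accumulator lanes */
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);     /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);     /* load 4 floats from b */
        acc = vfmaq_f32(acc, va, vb);          /* acc[l] += va[l] * vb[l] */
    }
    return vaddvq_f32(acc);                    /* sum the four lanes */
#else
    /* Portable fallback that keeps four independent partial sums. */
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < n; i += 4)
        for (size_t l = 0; l < 4; ++l)
            acc[l] += a[i + l] * b[i + l];
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
#endif
}
```

A real GEMM microkernel would keep several such accumulators live at once (one per output element of a register tile) so that the loads are amortized over more multiply-adds.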
ARM SVE: scalable vector extension
SVE (Scalable Vector Extension) generalizes NEON by making the vector length an implementation-defined parameter that code discovers at runtime. Instead of hardcoding 128-bit registers, SVE uses scalable vectors whose width ranges from 128 to 2048 bits in 128-bit increments depending on the hardware implementation (e.g., 512-bit on Fujitsu A64FX, 256-bit on AWS Graviton3). This means SVE code written once runs correctly on all SVE implementations; the compiler does not need to know the vector length at compile time.
svcntb() returns the number of bytes in one vector at runtime, so a loop that advances by svcntb() and masks its tail with a predicate is entirely hardware-agnostic.
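A minimal sketch of such a vector-length-agnostic loop, assuming a simple element-wise UINT8 add: the loop steps by svcntb() and uses svwhilelt_b8 to build a predicate that switches off the lanes past the end of the array, so no scalar tail loop is needed. The function name add_u8 is hypothetical, and the scalar branch exists only so the example also compiles where SVE is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* dst[i] = a[i] + b[i] (wrapping modulo 256) for any n, on any SVE width. */
void add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    assert(dst != NULL && a != NULL && b != NULL);
#if defined(__ARM_FEATURE_SVE)
    for (size_t i = 0; i < n; i += svcntb()) {
        /* Lanes with index i + lane < n are active; the rest are masked off. */
        svbool_t pg = svwhilelt_b8((uint64_t)i, (uint64_t)n);
        svuint8_t va = svld1_u8(pg, a + i);
        svuint8_t vb = svld1_u8(pg, b + i);
        svst1_u8(pg, dst + i, svadd_u8_x(pg, va, vb));
    }
#else
    /* Portable fallback with the same wrapping semantics. */
    for (size_t i = 0; i < n; ++i)
        dst[i] = (uint8_t)(a[i] + b[i]);
#endif
}
```

The same binary processes 16 bytes per iteration on a 128-bit implementation and 64 bytes per iteration on a 512-bit one, without recompilation.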
SVE2 (introduced with ARMv9) extends SVE with additional instructions. Where the I8MM extension is implemented, SVE also provides SMMLA and UMMLA for INT8 matrix multiply-accumulate (a 2×8 tile times an 8×2 tile into a 2×2 INT32 block), directly targeting the same use cases as NVIDIA’s tensor core instructions.
Low-bit quantization on ARM: INT8 and INT4
Quantization reduces weight and activation precision to shrink memory footprint and improve throughput. On ARM:
- INT8: fully supported in NEON/SVE via SDOT/UDOT instructions (see below). Typical accuracy loss < 0.5% on most LLMs with per-channel weight quantization.
- INT4: requires dequantization before arithmetic on current NEON hardware. Weights stored as INT4 (2 values per byte) are unpacked to INT8 before being fed to SDOT, halving the weight read volume relative to INT8.
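The INT4 unpacking step can be sketched as follows. The packing convention here is an assumption for illustration only (low nibble first, each nibble a two's-complement value in [-8, 7]); real libraries use a variety of layouts and often fold a zero point or scale into this step. The function name unpack_int4_to_int8 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Sign-extend a 4-bit two's-complement value held in the low nibble. */
static inline int8_t sext4(uint8_t nib) {
    return (int8_t)((nib ^ 8) - 8);
}

/* Unpack n_bytes packed bytes into 2 * n_bytes INT8 values.
 * Assumed layout: out[2k] = low nibble of byte k, out[2k+1] = high nibble. */
void unpack_int4_to_int8(const uint8_t *packed, int8_t *out, size_t n_bytes) {
    assert(packed != NULL && out != NULL);
    size_t i = 0;
#if defined(__aarch64__) && defined(__ARM_NEON)
    for (; i + 16 <= n_bytes; i += 16) {
        uint8x16_t p = vld1q_u8(packed + i);
        /* Shift the low nibble to the top, then arithmetic-shift back down. */
        int8x16_t lo = vshrq_n_s8(vreinterpretq_s8_u8(vshlq_n_u8(p, 4)), 4);
        /* An arithmetic shift right by 4 sign-extends the high nibble. */
        int8x16_t hi = vshrq_n_s8(vreinterpretq_s8_u8(p), 4);
        /* Interleave so the output order is lo0, hi0, lo1, hi1, ... */
        vst1q_s8(out + 2 * i, vzip1q_s8(lo, hi));
        vst1q_s8(out + 2 * i + 16, vzip2q_s8(lo, hi));
    }
#endif
    for (; i < n_bytes; ++i) {   /* scalar tail, or the whole array off ARM */
        out[2 * i] = sext4(packed[i] & 0x0F);
        out[2 * i + 1] = sext4(packed[i] >> 4);
    }
}
```

Kernels tuned for throughput often skip the interleave by pre-arranging the weights offline so that the low and high nibbles already correspond to the lanes the dot-product instruction expects.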
SDOT and UDOT: INT8 matrix multiply instructions
The SDOT (signed dot product) and UDOT (unsigned dot product) instructions, introduced as an optional ARMv8.2-A extension, are the key to efficient INT8 GEMM on ARM. For each 32-bit lane, the instruction takes a group of 4 INT8 values from each operand and accumulates their dot product into that lane's 32-bit accumulator, so one 128-bit instruction performs 16 multiply-adds.
In C, the corresponding ACLE intrinsics are vdotq_s32 (SDOT) and vdotq_u32 (UDOT). Each vdotq_s32 call processes 16 INT8 multiplications in one instruction, compared to 8 with vmull_s8 plus a widening accumulate.
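A minimal sketch of an INT8 dot product built on the SDOT intrinsic, whose ACLE name is vdotq_s32. The function name dot_s8 and the restriction that n be a multiple of 16 are assumptions for this example. The scalar branch mirrors the instruction's semantics (four lanes, each summing four products) and lets the code build without the dot-product extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
#endif

/* INT8 dot product with INT32 accumulation. Assumption: n % 16 == 0. */
int32_t dot_s8(const int8_t *a, const int8_t *b, size_t n) {
    assert(n % 16 == 0);
#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
    int32x4_t acc = vdupq_n_s32(0);
    for (size_t i = 0; i < n; i += 16) {
        /* Each 32-bit lane accumulates the dot product of 4 INT8 pairs. */
        acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
    }
    return vaddvq_s32(acc);                 /* sum the four lanes */
#else
    /* Scalar model of SDOT: lane l covers elements 4l .. 4l+3 of the block. */
    int32_t acc[4] = {0, 0, 0, 0};
    for (size_t i = 0; i < n; i += 16)
        for (size_t lane = 0; lane < 4; ++lane)
            for (size_t k = 0; k < 4; ++k)
                acc[lane] += (int32_t)a[i + 4 * lane + k] *
                             (int32_t)b[i + 4 * lane + k];
    return acc[0] + acc[1] + acc[2] + acc[3];
#endif
}
```

With GCC or Clang the intrinsic path is typically enabled by a flag such as -march=armv8.2-a+dotprod; without it the fallback is compiled instead.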
I8MM: INT8 matrix multiply extension
The I8MM (Int8 Matrix Multiply) extension, part of ARMv8.6-A, adds SMMLA and UMMLA: instructions that multiply a 2×8 INT8 tile by an 8×2 INT8 tile in a single instruction, accumulating into a 2×2 block of INT32 accumulators.
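To make the operand layout concrete, the sketch below wraps the SMMLA intrinsic vmmlaq_s32 and pairs it with a scalar model of the same operation. The first operand holds two rows of 8 INT8 values; the second holds the two columns of the 8×2 tile, each stored as 8 contiguous values; the result is a row-major 2×2 INT32 block. The function name mmla_2x8x2 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__) && defined(__ARM_FEATURE_MATMUL_INT8)
#include <arm_neon.h>
#endif

/* c (2x2, row-major) += a (2x8, row-major) * b (8x2, stored column by column).
 * Equivalently: c[2*i + j] += sum over k of a[8*i + k] * b[8*j + k]. */
void mmla_2x8x2(int32_t c[4], const int8_t a[16], const int8_t b[16]) {
    assert(c != NULL && a != NULL && b != NULL);
#if defined(__aarch64__) && defined(__ARM_FEATURE_MATMUL_INT8)
    int32x4_t acc = vld1q_s32(c);
    acc = vmmlaq_s32(acc, vld1q_s8(a), vld1q_s8(b));   /* one SMMLA */
    vst1q_s32(c, acc);
#else
    /* Scalar model: 2 * 2 * 8 = 32 multiply-adds per call. */
    for (size_t i = 0; i < 2; ++i)
        for (size_t j = 0; j < 2; ++j) {
            int32_t sum = 0;
            for (size_t k = 0; k < 8; ++k)
                sum += (int32_t)a[8 * i + k] * (int32_t)b[8 * j + k];
            c[2 * i + j] += sum;
        }
#endif
}
```

One call performs 32 multiply-adds, twice as many as a single 128-bit SDOT, which is where the throughput gain of I8MM comes from.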
I8MM is available on ARMv9 Cortex cores such as Cortex-A710 and Cortex-X2, on AWS Graviton3, and on Apple Silicon from M2 onward. Detect it at runtime on Linux by testing the HWCAP2_I8MM bit in getauxval(AT_HWCAP2) (getauxval is declared in sys/auxv.h), or at compile time with #ifdef __ARM_FEATURE_MATMUL_INT8.
Memory layout for ARM cache hierarchy
Cache efficiency is as important as instruction throughput on ARM. ARM Cortex and Neoverse cores typically use:
- L1 cache: 32–64 KB, 4-cycle latency
- L2 cache: 256 KB – 1 MB, 12–15-cycle latency
- L3 / system cache: 4–32 MB, 30–50-cycle latency
Choose tile sizes so that the combined working set of the A, B, and C tiles fits in L1 or L2.
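As a rough sketch of that idea, the code below estimates the working set of one tile triple and runs a simple cache-blocked FP32 GEMM. The function names, the row-major layout, and the use of a single cubic tile size T are assumptions made for brevity; production kernels pick separate tile sizes per dimension and pack the tiles into contiguous buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes touched by one (mc x kc) A tile, one (kc x nc) B tile,
 * and one (mc x nc) C tile, all with the same element size. */
size_t tile_working_set_bytes(size_t mc, size_t nc, size_t kc, size_t elem_bytes) {
    return (mc * kc + kc * nc + mc * nc) * elem_bytes;
}

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C (M x N) += A (M x K) * B (K x N), row-major, processed in T x T x T tiles.
 * The caller is expected to zero-initialize C when a plain product is wanted. */
void gemm_blocked(const float *A, const float *B, float *C,
                  size_t M, size_t N, size_t K, size_t T) {
    assert(T > 0);
    for (size_t i0 = 0; i0 < M; i0 += T)
        for (size_t k0 = 0; k0 < K; k0 += T)
            for (size_t j0 = 0; j0 < N; j0 += T)
                /* Everything below reuses the same three tiles. */
                for (size_t i = i0; i < min_sz(i0 + T, M); ++i)
                    for (size_t k = k0; k < min_sz(k0 + T, K); ++k) {
                        const float a = A[i * K + k];
                        for (size_t j = j0; j < min_sz(j0 + T, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

For example, tile_working_set_bytes(32, 32, 32, sizeof(float)) is 12288 bytes, which fits within the 32 KB lower bound quoted for L1 above; doubling T to 64 gives 49152 bytes, which only fits the larger L1 configurations.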
Prefetching
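A small sketch of a software prefetch hint in an indirect (gather) loop, the kind of access pattern a stride-based hardware prefetcher cannot predict. It uses the GCC/Clang builtin __builtin_prefetch, which on AArch64 is normally lowered to a PRFM instruction. The function name gather_sum and the prefetch distance of 8 iterations are assumptions for illustration; the right distance depends on the memory latency and the work per iteration, and should be tuned by measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum table[idx[i]] for i in [0, n), prefetching a few iterations ahead. */
int64_t gather_sum(const int32_t *table, const uint32_t *idx, size_t n) {
    assert(table != NULL && idx != NULL);
    enum { AHEAD = 8 };   /* assumed prefetch distance, tune per platform */
    int64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        /* Arguments: address, 0 = read access, 3 = keep in all cache levels. */
        if (i + AHEAD < n)
            __builtin_prefetch(&table[idx[i + AHEAD]], 0, 3);
#endif
        sum += table[idx[i]];
    }
    return sum;
}
```

A prefetch is only a hint and never changes the result, so the function returns the same sum with or without it; only the timing differs.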
ARM CPUs have hardware prefetchers that detect sequential strides, but software prefetch hints help for indirect or strided access patterns.
Benchmarking on ARM hardware
Measure wall-clock time with clock_gettime(CLOCK_MONOTONIC) on Linux, or mach_absolute_time() on macOS/Apple Silicon. Keep timer calls (including std::chrono) out of tight loops: time a batch of iterations per measurement so that timer overhead does not distort the result.
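A minimal timing harness along those lines, assuming a POSIX system with clock_gettime. The names now_seconds, time_per_call, and sum_workload are hypothetical, and sum_workload is only a stand-in for the kernel under test. The harness runs a few warm-up calls first, then times a whole batch with two timer reads.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* expose clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic wall-clock time in seconds. */
double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Average seconds per call of fn(arg), timed over a batch of iters calls. */
double time_per_call(void (*fn)(void *), void *arg, int warmup, int iters) {
    assert(iters > 0);
    for (int i = 0; i < warmup; ++i)   /* warm caches and branch predictors */
        fn(arg);
    const double t0 = now_seconds();
    for (int i = 0; i < iters; ++i)
        fn(arg);
    const double t1 = now_seconds();
    return (t1 - t0) / iters;
}

/* Stand-in workload: writes its result through arg so it is not discarded. */
void sum_workload(void *arg) {
    volatile uint64_t *sink = arg;
    uint64_t s = 0;
    for (uint64_t i = 0; i < 100000; ++i)
        s += i;
    *sink = s;
}
```

A call such as time_per_call(sum_workload, &sink, 3, 100) returns the mean time per invocation. For stable numbers it usually also helps to pin the thread to one core and to report the minimum or median over several batches rather than a single mean.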
Useful profiling tools
- Linux perf
- ARM Streamline
- Valgrind / cachegrind
Reference implementations
Production-quality ARM INT8 kernels are in these open-source libraries:
XNNPACK
Google’s optimized neural network operators for ARM. Used by TensorFlow Lite and PyTorch Mobile. Hand-written NEON and SVE assembly for GEMM, convolution, and depthwise ops.
llama.cpp
LLM inference on CPU. Contains highly optimized ARM NEON kernels for INT4 and INT8 quantized GEMM, targeting Apple Silicon and mobile Cortex-A devices.
torchao
PyTorch quantization and sparsity library. Includes ARM-specific lowbit kernel backends contributed by the PyTorch team.
Lecture 38 slides
Scott Roy’s full slide deck covering low-bit kernels for ARM CPUs.