What is MLA?
Multi-Head Latent Attention (MLA) is the attention mechanism used in DeepSeek-V2, DeepSeek-V3, DeepSeek-R1, and similar models. Instead of storing full key-value heads in the KV cache, MLA compresses them into a low-rank latent representation. This dramatically reduces the KV cache memory footprint — at the cost of requiring a matrix decomposition step during inference.

ik_llama.cpp implements MLA natively and adds a Flash Attention-style kernel on top of it, called FlashMLA. The current version, FlashMLA-3, is the recommended implementation and is available for both CPU and CUDA.
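The scale of the saving can be seen with some back-of-the-envelope arithmetic. The dimensions below are illustrative DeepSeek-V3-like values chosen for this sketch, not figures taken from this document:

```python
# Back-of-the-envelope KV cache sizes (fp16) for a 32k-token context.
# All dimensions are illustrative, DeepSeek-V3-like assumptions.
n_layers = 61
n_heads = 128
head_dim_k = 192        # per-head key dim (content part + RoPE part)
head_dim_v = 128
latent_dim = 512 + 64   # compressed KV latent + shared RoPE key part

tokens = 32_768
bytes_fp16 = 2

# Standard cache stores full K and V vectors for every head.
std = tokens * n_layers * n_heads * (head_dim_k + head_dim_v) * bytes_fp16
# An MLA cache stores a single latent vector per token per layer.
mla = tokens * n_layers * latent_dim * bytes_fp16

print(f"standard: {std / 2**30:.1f} GiB")
print(f"MLA:      {mla / 2**30:.1f} GiB")
print(f"ratio:    {std / mla:.0f}x")
```

Under these assumptions the latent cache is roughly seventy times smaller, which is why MLA makes long contexts feasible on modest hardware.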
The -mla flag
Use `-mla` (or `--mla-use`) to select the MLA implementation:
| Value | Behaviour |
|---|---|
| 0 | MLA disabled — standard KV cache |
| 1 | MLA enabled, original transposed cache |
| 2 | MLA enabled, non-transposed cache |
| 3 | FlashMLA-3 — recommended (default) |
The default is 3. You rarely need to change this unless you are debugging or comparing implementations. For all DeepSeek models, leave it at the default.
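When comparing implementations, the different `-mla` values can be swept in a single benchmark run. A sketch, assuming ik_llama.cpp's `llama-bench` accepts `-mla` and comma-separated value lists like its upstream counterpart (the model path is a placeholder):

```shell
# Benchmark MLA variants 0, 2, and 3 on the same model.
# -fa 1 enables flash attention; the model path is a placeholder.
./llama-bench -m deepseek-v3.gguf -mla 0,2,3 -fa 1
```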
CUDA requirements
FlashMLA on CUDA requires an Ampere or newer GPU (compute capability 8.0+, e.g. RTX 30xx, A100, H100). Turing GPUs (RTX 20xx) support the earlier flash attention path for DeepSeek models but not FlashMLA-3.

Performance
FlashMLA-3 delivers the fastest CPU-only DeepSeek inference currently available. On CUDA, it reduces KV cache memory and improves token generation speed compared to standard MHA by exploiting the compressed latent representation. See the DeepSeek guide for benchmarks and configuration examples.

KV cache with MLA
MLA works with quantized KV caches. The recommended cache type is `q8_0`.
Use `--k-cache-hadamard` to apply a Hadamard transform to the K cache before quantization. This typically recovers quality lost at low bit widths.
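Put together, a quantized-cache configuration might look like the following sketch. The model path is a placeholder, and `-ctk` (the standard llama.cpp flag for the K cache type) is assumed to apply to the MLA cache here:

```shell
# Quantize the K cache to q8_0 and apply the Hadamard transform
# before quantization (model path is a placeholder).
./llama-server -m deepseek-v3.gguf -mla 3 -ctk q8_0 --k-cache-hadamard
```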
Example command
A complete command for running DeepSeek-V3 with FlashMLA on GPU combines `-mla 3` with `-ngl` to offload layers to the GPU.
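As a sketch — the model path, context size, and port are placeholders, and `-fa`, `-ngl`, `-c`, and `-ctk` are standard llama.cpp options assumed to behave the same in ik_llama.cpp:

```shell
# Serve DeepSeek-V3 with FlashMLA-3, flash attention, a q8_0 K cache,
# and all layers offloaded to the GPU (paths and ports are placeholders).
./llama-server -m deepseek-v3.gguf \
  -mla 3 -fa \
  -ctk q8_0 \
  -ngl 99 \
  -c 32768 --port 8080
```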
Smart Expert Reduction (SER)
SER lets you reduce the number of active experts in a MoE model at runtime, trading a small amount of quality for faster inference. Use the `-ser Kmin,t` flag:

- `Kmin` — the minimum number of experts to keep active
- `t` — threshold mode; set to `1` to use a fixed count of exactly `Kmin` experts
Higher values of `Kmin` preserve more quality; lower values give more speed. Experiment to find the right balance for your workload.
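For example, the following sketch assumes a DeepSeek-style model with 8 active routed experts and reduces that to a fixed 6 (the model path and prompt are placeholders):

```shell
# Smart Expert Reduction: t=1 selects a fixed count of exactly Kmin
# experts, so this runs with exactly 6 active experts per token.
./llama-cli -m deepseek-v3.gguf -mla 3 -ser 6,1 -p "Hello"
```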