ik_llama.cpp is a fork of llama.cpp focused on maximising CPU and hybrid GPU/CPU inference performance. It ships new state-of-the-art quantization types, first-class DeepSeek support via FlashMLA, fused MoE operations, and fine-grained tensor placement controls — all while staying compatible with standard GGUF model files.
Key differences from mainline llama.cpp
New quantization types: IQK and Trellis

ik_llama.cpp introduces two families of novel quantization types not present in mainline llama.cpp:

- IQK quants (IQ2_K, IQ3_K, IQ4_K, IQ5_K, IQ6_K, IQ4_KS, IQ5_KS, IQ2_KL, and _R4 row-interleaved variants) deliver better quality per bit than standard k-quants, with hand-optimised GEMM/GEMV kernels for AVX2, Zen4, ARM NEON, and CUDA.
- Trellis quants (IQ1_KT, IQ2_KT, IQ3_KT, IQ4_KT) use a novel integer-based trellis structure to achieve strong compression ratios while maintaining reasonable CPU performance.
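As a concrete, hedged example, producing an IQK-quantized model should follow the same workflow as mainline's llama-quantize tool; the binary name assumes this fork keeps mainline's layout, and the file names are placeholders:

```shell
# Quantize an f16 GGUF into the fork's IQ4_K type.
# Usage mirrors mainline llama-quantize: <input> <output> <type>.
# File names are placeholders; verify the available type list with --help.
./llama-quantize model-f16.gguf model-IQ4_K.gguf IQ4_K
```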
FlashMLA and MLA for DeepSeek models

Multi-head Latent Attention (MLA) is implemented with Flash Attention support (FlashMLA), giving significantly faster prompt processing and token generation for DeepSeek-V2/V3/R1 and similar models. FlashMLA-3 is available for both CPU and CUDA (Ampere or newer).
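A minimal launch sketch for a DeepSeek model with FlashMLA-3; the `-mla 3` and `-fa` flag spellings are assumptions based on this fork's conventions, so confirm them with `--help` on your build (the model path is a placeholder):

```shell
# Run a DeepSeek GGUF with MLA variant 3 (FlashMLA-3) and flash attention.
# Flag spellings (-mla, -fa) are assumptions; check ./llama-cli --help.
./llama-cli -m deepseek-v3-IQ4_K.gguf -mla 3 -fa -p "Hello"
```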
Fused MoE operations
Mixture-of-Experts (MoE) layers are computed with fused FFN operations, reducing overhead and improving throughput for models like DeepSeek, Qwen3-MoE, and Command-A.
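Fused MoE is typically toggled at runtime; the `-fmoe` spelling below is an assumption, so verify it against `--help` (the model path is a placeholder):

```shell
# Enable fused MoE FFN kernels for a MoE model such as Qwen3-MoE.
# The -fmoe flag spelling is an assumption; confirm with --help.
./llama-cli -m qwen3-moe.gguf -fmoe -p "Hello"
```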
Tensor overrides and hybrid GPU/CPU inference

The -ot / --override-tensor flag lets you selectively route individual tensors (e.g. large FFN weight blocks) to CPU RAM while keeping attention tensors in VRAM. This enables running large MoE models on setups where VRAM alone is insufficient.
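A hedged sketch of hybrid placement: expert FFN tensors (whose GGUF names typically end in `_exps`) stay in CPU RAM while everything else is offloaded. The regex and model path are illustrative:

```shell
# Offload all layers (-ngl 99), then override placement for the large
# MoE expert tensors: the pattern syntax is REGEX=BUFFER_TYPE.
# "ffn_.*_exps" matches tensors like blk.3.ffn_gate_exps.weight.
./llama-server -m deepseek-v3.gguf -ngl 99 -ot "ffn_.*_exps=CPU"
```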
Smart Expert Reduction (SER)
SER skips low-activation experts at runtime, reducing compute for MoE models with minimal quality impact — particularly effective for DeepSeek inference.
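SER is controlled from the command line; the `-ser k,t` form below (consider at most k experts, with threshold t) reflects usage seen in ik_llama.cpp discussions but should be verified with `--help` (the model path is a placeholder):

```shell
# Smart Expert Reduction: use at most 7 experts per token.
# The -ser k,t spelling and semantics are assumptions; verify locally.
./llama-cli -m deepseek-r1.gguf -ser 7,1 -p "Hello"
```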
Multi-GPU graph split mode

A new --split-mode graph option distributes the compute graph across multiple GPUs, complementing the existing layer and row split modes for better multi-GPU utilisation.
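Assuming the fork reuses mainline's `--split-mode` flag and simply adds the `graph` value, a multi-GPU launch might look like this (the model path is a placeholder):

```shell
# Split the compute graph itself across the visible GPUs,
# instead of splitting by layer or by row.
./llama-server -m big-moe.gguf -ngl 99 --split-mode graph
```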
Quantization quality improvements
Existing llama.cpp quantization types (Q2_K through Q6_K, IQ1_M, IQ2_XS, IQ4_NL, etc.) are re-quantized with improved algorithms, producing higher-quality models at the same bit-width.

Supported backends
The only fully functional and performant compute backends are CPU (AVX2 or better, ARM NEON or better) and CUDA. ROCm, Vulkan, and Metal are inherited from llama.cpp but are not actively maintained in this fork. Do not open issues for those backends unless you are prepared to contribute a fix.
| Backend | Status | Notes |
|---|---|---|
| CPU (x86, AVX2/AVX-512) | Fully supported | Best-in-class kernel optimisations for Zen4, AVX2 |
| CPU (ARM NEON / Apple Silicon) | Fully supported | Optimised NEON kernels |
| CUDA (Nvidia) | Fully supported | Ampere or newer recommended for FlashMLA |
| Metal (macOS) | Limited | Enabled by default on macOS; not actively maintained |
| ROCm / hipBLAS (AMD) | Limited | Not actively maintained |
| Vulkan | Limited | Not actively maintained |
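Since only CPU and CUDA are first-class, a typical GPU build enables CUDA explicitly. This sketch assumes the fork follows mainline llama.cpp's CMake option names; verify against the build documentation for your checkout:

```shell
# Configure with the CUDA backend; a CPU-only build needs no extra flags.
cmake -B build -DGGML_CUDA=ON
# Build in Release mode using all available cores.
cmake --build build --config Release -j
```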
Where to go next

- Quickstart: build and run ik_llama.cpp in minutes on CPU or GPU.
- Building from source: detailed build instructions for all platforms and backends.
- Quantization: learn about IQK, Trellis, and other quantization types unique to ik_llama.cpp.
- Supported models: browse the full list of supported model architectures.