ik_llama.cpp is a fork of llama.cpp focused on maximising CPU and hybrid GPU/CPU inference performance. It ships new state-of-the-art quantization types, first-class DeepSeek support via FlashMLA, fused MoE operations, and fine-grained tensor placement controls — all while staying compatible with standard GGUF model files.

Key differences from mainline llama.cpp

ik_llama.cpp introduces two families of novel quantization types not present in mainline llama.cpp:

IQK quants (IQ2_K, IQ3_K, IQ4_K, IQ5_K, IQ6_K, IQ4_KS, IQ5_KS, IQ2_KL, and _R4 row-interleaved variants) deliver better quality-per-bit than standard k-quants, with hand-optimised GEMM/GEMV kernels for AVX2, Zen4, ARM NEON, and CUDA.

Trellis quants (IQ1_KT, IQ2_KT, IQ3_KT, IQ4_KT) use a novel integer-based trellis structure to achieve strong compression ratios while maintaining reasonable CPU performance.
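The new types plug into the usual GGUF quantization workflow. A minimal sketch, assuming a `llama-quantize` binary built from this fork and an f16 GGUF as input (file names are illustrative):

```shell
# Quantize an f16 GGUF to the fork's IQ4_K type.
./llama-quantize Model-f16.gguf Model-IQ4_K.gguf IQ4_K

# Row-interleaved variants for faster CPU inference carry the _R4 suffix.
./llama-quantize Model-f16.gguf Model-IQ4_K_R4.gguf IQ4_K_R4
```

Run `./llama-quantize --help` on your build for the full list of type names it accepts.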
Multi-head Latent Attention (MLA) is implemented with Flash Attention support (FlashMLA), giving significantly faster prompt processing and token generation for DeepSeek-V2/V3/R1 and similar models. FlashMLA-3 is available for both CPU and CUDA (Ampere or newer).
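As a sketch of how MLA is typically enabled on the command line (the `-mla` level and `-fa` flash-attention flags are assumptions based on recent ik_llama.cpp builds; model path and values are illustrative — check `--help` on your binary):

```shell
# Run a DeepSeek model with FlashMLA-3 and flash attention enabled.
./llama-cli -m DeepSeek-V3-IQ4_K.gguf -mla 3 -fa -ngl 99 -p "Hello"
```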
Mixture-of-Experts (MoE) layers are computed with fused FFN operations, reducing overhead and improving throughput for models like DeepSeek, Qwen3-MoE, and Command-A.
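A hedged sketch of turning on the fused MoE path (the `-fmoe` flag is an assumption from recent ik_llama.cpp versions; the model file is illustrative):

```shell
# Enable fused MoE FFN operations for an MoE model.
./llama-server -m Qwen3-30B-A3B-IQ4_K.gguf -fmoe -ngl 99
```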
The --override-tensor / -ot flag lets you selectively route individual tensors (e.g. large FFN weight blocks) to CPU RAM while keeping attention tensors in VRAM. This enables running large MoE models on setups where VRAM alone is insufficient.
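A sketch of the typical MoE offload pattern, assuming tensor names follow the usual GGUF convention where routed-expert FFN weights contain `exps` (the regex and model file are illustrative):

```shell
# Offload all layers to GPU, but override routed-expert FFN tensors
# (matched by regex on tensor name) back onto CPU RAM.
./llama-server -m DeepSeek-R1-IQ2_K.gguf -ngl 99 \
  -ot "ffn_.*_exps=CPU"
```

The part before `=` is a regular expression matched against tensor names; everything matching it is placed on the named device.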
Smart Expert Reduction (SER) skips low-activation experts at runtime, reducing compute for MoE models with minimal quality impact — particularly effective for DeepSeek inference.
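A hedged sketch of enabling SER (the `-ser N,eps` syntax — maximum experts to consider plus a skip threshold — is an assumption based on the fork's discussions; values are illustrative):

```shell
# Consider at most 7 experts per token, skipping low-activation ones.
./llama-cli -m DeepSeek-R1-IQ2_K.gguf -ser 7,1 -ngl 99 -p "Hello"
```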
A new --split-mode graph distributes the compute graph across multiple GPUs, complementing the existing layer and row split modes for better multi-GPU utilisation.
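Selecting the new split mode looks like the existing layer/row modes; a sketch with an illustrative model file:

```shell
# Distribute the compute graph across all visible GPUs.
./llama-server -m Model-IQ4_K.gguf -ngl 99 --split-mode graph
```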
Existing llama.cpp quantization types (Q2_K through Q6_K, IQ1_M, IQ2_XS, IQ4_NL, etc.) are produced with improved quantization algorithms, yielding higher-quality models at the same bit-width.

Supported backends

The only fully functional and performant compute backends are CPU (AVX2 or better, ARM NEON or better) and CUDA. ROCm, Vulkan, and Metal are inherited from llama.cpp but are not actively maintained in this fork. Do not open issues for those backends unless you are prepared to contribute a fix.
| Backend | Status | Notes |
| --- | --- | --- |
| CPU (x86, AVX2/AVX-512) | Fully supported | Best-in-class kernel optimisations for Zen4, AVX2 |
| CPU (ARM NEON / Apple Silicon) | Fully supported | Optimised NEON kernels |
| CUDA (Nvidia) | Fully supported | Ampere or newer recommended for FlashMLA |
| Metal (macOS) | Limited | Enabled by default on macOS; not actively maintained |
| ROCm / hipBLAS (AMD) | Limited | Not actively maintained |
| Vulkan | Limited | Not actively maintained |

Where to go next

Quickstart

Build and run ik_llama.cpp in minutes on CPU or GPU.

Building from source

Detailed build instructions for all platforms and backends.

Quantization

Learn about IQK, Trellis, and other quantization types unique to ik_llama.cpp.

Supported models

Browse the full list of supported model architectures.