ik_llama.cpp is a fork of llama.cpp focused on maximising CPU and hybrid GPU/CPU inference performance. It ships new state-of-the-art quantization types, first-class DeepSeek support via FlashMLA, fused MoE operations, and fine-grained tensor placement controls — all while staying compatible with standard GGUF model files.
Key differences from mainline llama.cpp
New quantization types: IQK and Trellis

ik_llama.cpp introduces two families of novel quantization types not present in mainline llama.cpp:

- IQK quants (IQ2_K, IQ3_K, IQ4_K, IQ5_K, IQ6_K, IQ4_KS, IQ5_KS, IQ2_KL, and _R4 row-interleaved variants) deliver better quality per bit than standard k-quants, with hand-optimised GEMM/GEMV kernels for AVX2, Zen4, ARM NEON, and CUDA.
- Trellis quants (IQ1_KT, IQ2_KT, IQ3_KT, IQ4_KT) use a novel integer-based trellis structure to achieve strong compression ratios while maintaining reasonable CPU performance.
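As a concrete, hedged example, producing an IQK-quantized model should follow the same workflow as mainline's llama-quantize tool; the binary name assumes this fork keeps mainline's layout, and the file names are placeholders:

```shell
# Quantize an f16 GGUF into the fork's IQ4_K type.
# Usage mirrors mainline llama-quantize: <input> <output> <type>.
# File names are placeholders; verify the available type list with --help.
./llama-quantize model-f16.gguf model-IQ4_K.gguf IQ4_K
```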
FlashMLA and MLA for DeepSeek models

Multi-head Latent Attention (MLA) is implemented with Flash Attention support (FlashMLA), giving significantly faster prompt processing and token generation for DeepSeek-V2/V3/R1 and similar models. FlashMLA-3 is available for both CPU and CUDA (Ampere or newer).
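A minimal launch sketch for a DeepSeek model with FlashMLA-3; the `-mla 3` and `-fa` flag spellings are assumptions based on this fork's conventions, so confirm them with `--help` on your build (the model path is a placeholder):

```shell
# Run a DeepSeek GGUF with MLA variant 3 (FlashMLA-3) and flash attention.
# Flag spellings (-mla, -fa) are assumptions; check ./llama-cli --help.
./llama-cli -m deepseek-v3-IQ4_K.gguf -mla 3 -fa -p "Hello"
```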
Fused MoE operations
Mixture-of-Experts (MoE) layers are computed with fused FFN operations, reducing overhead and improving throughput for models like DeepSeek, Qwen3-MoE, and Command-A.
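Fused MoE is typically toggled at runtime; the `-fmoe` spelling below is an assumption, so verify it against `--help` (the model path is a placeholder):

```shell
# Enable fused MoE FFN kernels for a MoE model such as Qwen3-MoE.
# The -fmoe flag spelling is an assumption; confirm with --help.
./llama-cli -m qwen3-moe.gguf -fmoe -p "Hello"
```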
Tensor overrides and hybrid GPU/CPU inference

The -ot / --override-tensor flag lets you selectively route individual tensors (e.g. large FFN weight blocks) to CPU RAM while keeping attention tensors in VRAM. This enables running large MoE models on setups where VRAM alone is insufficient.
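A hedged sketch of hybrid placement: expert FFN tensors (whose GGUF names typically end in `_exps`) stay in CPU RAM while everything else is offloaded. The regex and model path are illustrative:

```shell
# Offload all layers (-ngl 99), then override placement for the large
# MoE expert tensors: the pattern syntax is REGEX=BUFFER_TYPE.
# "ffn_.*_exps" matches tensors like blk.3.ffn_gate_exps.weight.
./llama-server -m deepseek-v3.gguf -ngl 99 -ot "ffn_.*_exps=CPU"
```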
Smart Expert Reduction (SER)
SER skips low-activation experts at runtime, reducing compute for MoE models with minimal quality impact — particularly effective for DeepSeek inference.
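SER is controlled from the command line; the `-ser k,t` form below (consider at most k experts, with threshold t) reflects usage seen in ik_llama.cpp discussions but should be verified with `--help` (the model path is a placeholder):

```shell
# Smart Expert Reduction: use at most 7 experts per token.
# The -ser k,t spelling and semantics are assumptions; verify locally.
./llama-cli -m deepseek-r1.gguf -ser 7,1 -p "Hello"
```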
Multi-GPU graph split mode

A new --split-mode graph option distributes the compute graph across multiple GPUs, complementing the existing layer and row split modes for better multi-GPU utilisation.
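Assuming the fork reuses mainline's `--split-mode` flag and simply adds the `graph` value, a multi-GPU launch might look like this (the model path is a placeholder):

```shell
# Split the compute graph itself across the visible GPUs,
# instead of splitting by layer or by row.
./llama-server -m big-moe.gguf -ngl 99 --split-mode graph
```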
Quantization quality improvements
Existing llama.cpp quantization types (Q2_K through Q6_K, IQ1_M, IQ2_XS, IQ4_NL, etc.) are re-quantized with improved algorithms, producing higher-quality models at the same bit-width.

Supported backends
The only fully functional and performant compute backends are CPU (AVX2 or better, ARM NEON or better) and CUDA. ROCm, Vulkan, and Metal are inherited from llama.cpp but are not actively maintained in this fork. Do not open issues for those backends unless you are prepared to contribute a fix.
| Backend | Status | Notes |
|---|---|---|
| CPU (x86, AVX2/AVX-512) | Fully supported | Best-in-class kernel optimisations for Zen4, AVX2 |
| CPU (ARM NEON / Apple Silicon) | Fully supported | Optimised NEON kernels |
| CUDA (Nvidia) | Fully supported | Ampere or newer recommended for FlashMLA |
| Metal (macOS) | Limited | Enabled by default on macOS; not actively maintained |
| ROCm / hipBLAS (AMD) | Limited | Not actively maintained |
| Vulkan | Limited | Not actively maintained |
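Since only CPU and CUDA are first-class, a typical GPU build enables CUDA explicitly. This sketch assumes the fork follows mainline llama.cpp's CMake option names; verify against the build documentation for your checkout:

```shell
# Configure with the CUDA backend; a CPU-only build needs no extra flags.
cmake -B build -DGGML_CUDA=ON
# Build in Release mode using all available cores.
cmake --build build --config Release -j
```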
Where to go next

- Quickstart: build and run ik_llama.cpp in minutes on CPU or GPU.
- Building from source: detailed build instructions for all platforms and backends.
- Quantization: learn about IQK, Trellis, and other quantization types unique to ik_llama.cpp.
- Supported models: browse the full list of supported model architectures.