ik_llama.cpp is a fork of llama.cpp focused on pushing the performance envelope for local LLM inference. It delivers new state-of-the-art quantization types, first-class DeepSeek support via FlashMLA, fused Mixture-of-Experts operations, and fine-grained GPU/CPU hybrid offloading — all while maintaining full compatibility with GGUF model files.

Quickstart

Download a model and start the server in minutes — no GPU required.

Building from source

Build with CPU, CUDA, Metal, or ROCm support.

GPU offloading

Maximize performance by offloading layers to one or more GPUs.

Quantization types

Explore IQK, Trellis, and other SOTA quant formats unique to ik_llama.cpp.

Why ik_llama.cpp?

ik_llama.cpp delivers measurable improvements over mainline llama.cpp across the key dimensions of local inference — speed, memory footprint, and quantization quality:

SOTA quantization

New IQK and Trellis quantization families provide higher quality at lower bit-widths than standard k-quants, letting larger models fit in the same memory with minimal quality loss.
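Converting a model to one of these formats follows the usual GGUF quantization flow. A minimal sketch, assuming the `llama-quantize` tool and the `IQ4_K` type name from the IQK family (check `./build/bin/llama-quantize --help` for the full list available in your build):

```shell
# Requantize an F16 GGUF to IQ4_K (an IQK quant type — name assumed, verify with --help).
# Supplying an importance matrix (--imatrix) generally improves low-bit quality.
./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    model-F16.gguf model-IQ4_K.gguf IQ4_K
```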

FlashMLA for DeepSeek

Optimized Multi-Head Latent Attention kernels for DeepSeek models deliver industry-leading CPU-only and hybrid inference throughput.
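A hedged sketch of enabling MLA-optimized attention when serving a DeepSeek model — the `-mla` level and `-fa` (flash attention) flags are assumptions, and the model filename is a placeholder; confirm the exact spelling with `./build/bin/llama-server --help`:

```shell
# Serve a DeepSeek model with MLA attention and flash attention enabled.
# -mla selects the MLA implementation level; -fa enables flash attention.
./build/bin/llama-server --model DeepSeek-V3-IQ4_K.gguf \
    -mla 3 -fa \
    --ctx-size 32768
```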

Hybrid CPU/GPU inference

Tensor overrides and MoE-specific offload controls let you precisely place model weights across VRAM and RAM for maximum efficiency.
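A common hybrid layout keeps the large MoE expert tensors in system RAM while everything else goes to VRAM. A sketch using the `-ot` / `--override-tensor` flag, which matches tensor names by regex — the `"exps"` pattern is an assumption about expert tensor naming, so inspect your GGUF's tensor names before relying on it:

```shell
# Offload all layers to GPU, but force tensors whose names match "exps"
# (MoE expert weights) to stay on the CPU/RAM backend.
./build/bin/llama-server --model /path/to/model.gguf \
    -ngl 999 \
    -ot "exps=CPU"
```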

OpenAI-compatible server

Drop-in replacement for OpenAI API with chat completions, embeddings, function calling, and a built-in WebUI.
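With the server running (see the quickstart below), any OpenAI-style client can talk to it. A minimal request against the default address, assuming the standard `/v1/chat/completions` route:

```shell
# Send a chat completion request to the local server.
# The "model" field is largely cosmetic — the server uses the loaded GGUF.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
```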

Speculative decoding

Multiple speculative decoding strategies — draft model, n-gram, and ngram-mod — for faster token generation.
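In the draft-model strategy, a small model proposes candidate tokens that the large model verifies in one batch, trading a little extra memory for faster generation. A sketch — the flag names (`--model-draft`, `--draft-max`) and model filenames are assumptions; verify with `./build/bin/llama-server --help`:

```shell
# Draft-model speculative decoding: small draft model + large target model.
./build/bin/llama-server \
    --model Qwen3-32B-IQ4_K.gguf \
    --model-draft Qwen3-0.6B-IQ4_NL.gguf \
    -ngl 999 --draft-max 16
```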

Broad model support

Supports DeepSeek, Qwen3, LLaMA-4, Gemma3, GLM-4, BitNet, and dozens more architectures.

Get started

1. Clone and build

Clone the repository and build for your platform (CPU or GPU).
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)
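The commands above produce a CPU-only build. For GPU support, enable the matching backend flag at configure time — a sketch for CUDA, using the `GGML_CUDA` option inherited from llama.cpp's build system:

```shell
# CUDA build (requires the NVIDIA CUDA toolkit installed):
cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
```

Swap in `-DGGML_METAL=ON` or `-DGGML_HIP=ON` for Metal or ROCm respectively, keeping in mind those backends have limited support here.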
2. Download a model

Download any GGUF model from HuggingFace. IQK quantizations from bartowski or ubergarm are recommended for best quality/size tradeoffs.
# Example: Qwen3 0.6B in IQ4_NL format (~400MB)
# Download from: https://huggingface.co/bartowski/Qwen_Qwen3-0.6B-GGUF
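One way to fetch the file from the command line, assuming the `huggingface_hub` CLI is installed — the exact GGUF filename is a guess based on bartowski's naming convention, so browse the repo's file listing to confirm:

```shell
# Install the Hugging Face CLI, then download a single GGUF into ./models
pip install -U huggingface_hub
huggingface-cli download bartowski/Qwen_Qwen3-0.6B-GGUF \
    Qwen_Qwen3-0.6B-IQ4_NL.gguf --local-dir ./models
```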
3. Start the server

Launch the inference server and open the built-in WebUI in your browser.
# CPU inference
./build/bin/llama-server --model /path/to/model.gguf --ctx-size 4096

# GPU inference (offload all layers)
./build/bin/llama-server --model /path/to/model.gguf --ctx-size 4096 -ngl 999
Open http://127.0.0.1:8080 to start chatting.
The fully supported, performance-optimized backends are CPU (AVX2 or newer on x86, NEON or newer on ARM) and CUDA; ROCm, Vulkan, and Metal have limited support.
