Quickstart
Download a model and start the server in minutes — no GPU required.
Building from source
Build with CPU, CUDA, Metal, or ROCm support.
GPU offloading
Maximize performance by offloading layers to one or more GPUs.
Quantization types
Explore IQK, Trellis, and other SOTA quant formats unique to ik_llama.cpp.
Why ik_llama.cpp?
ik_llama.cpp delivers measurable improvements over mainline llama.cpp across every dimension of local inference:
SOTA quantization
New IQK and Trellis quantization families provide higher quality at lower bit-widths than standard k-quants, enabling larger models to fit in memory without quality loss.
FlashMLA for DeepSeek
Optimized Multi-Head Latent Attention kernels for DeepSeek models deliver industry-leading CPU-only and hybrid inference throughput.
Hybrid CPU/GPU inference
Tensor overrides and MoE-specific offload controls let you precisely place model weights across VRAM and RAM for maximum efficiency.
OpenAI-compatible server
Drop-in replacement for OpenAI API with chat completions, embeddings, function calling, and a built-in WebUI.
Speculative decoding
Multiple speculative decoding strategies — draft model, n-gram, and ngram-mod — for faster token generation.
Broad model support
Supports DeepSeek, Qwen3, LLaMA-4, Gemma3, GLM-4, BitNet, and dozens more architectures.
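The hybrid CPU/GPU placement described above is driven from the command line. A minimal sketch, assuming flag names that follow ik_llama.cpp/llama.cpp server conventions (confirm with `./llama-server --help` on your build; the model filename and tensor-name pattern are illustrative):

```shell
# -ngl 99 offloads all layers to the GPU by default; -ot then overrides
# placement for tensors whose names match a pattern, here routing MoE
# expert weights ("exps") to CPU RAM while attention stays in VRAM.
./llama-server -m DeepSeek-V3-IQ4_K.gguf -ngl 99 -ot "exps=CPU"
```

This split works well for large MoE models: the small, frequently used attention tensors live in VRAM while the bulk of the expert weights sit in system RAM.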
Get started
Download a model
Download any GGUF model from HuggingFace. IQK quantizations from bartowski or ubergarm are recommended for best quality/size tradeoffs.
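A minimal sketch of the download step, assuming the `huggingface-cli` tool is installed (`pip install huggingface_hub`); the repository name and quantization pattern are illustrative, not a specific recommendation:

```shell
# Fetch only the quantization variant you want into a local models dir.
huggingface-cli download ubergarm/DeepSeek-R1-GGUF \
  --include "*IQ4_K*" \
  --local-dir ./models
```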
Start the server
Launch the inference server and open the built-in WebUI in your browser. Open http://127.0.0.1:8080 to start chatting.
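A minimal launch sketch, assuming the `llama-server` binary produced by the build and common llama.cpp-style flags (verify with `./llama-server --help`; the model path is illustrative):

```shell
# Serve the model on the default WebUI/API address with an 8k context.
./llama-server -m ./models/model.gguf --host 127.0.0.1 --port 8080 -c 8192
```

The same endpoint serves the OpenAI-compatible API, so existing clients can point at http://127.0.0.1:8080 with no code changes.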
The fully supported, performance-optimized backends are CPU (x86 with AVX2 or newer, ARM with NEON or newer) and CUDA. ROCm, Vulkan, and Metal have limited support.
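For reference, a build from source can be sketched as follows, assuming a CMake setup in the style of llama.cpp; the exact option name for enabling CUDA may differ by version, so check the repository's build documentation:

```shell
# Clone and configure; drop the CUDA flag for a CPU-only build.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```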