The fully supported backends are CPU (AVX2 or better, ARM NEON or better) and CUDA. ROCm, Vulkan, and Metal are available but not actively maintained in this fork.
-ngl 999 offloads all layers to VRAM. Reduce the number if the model does not fit entirely in VRAM.

Open http://127.0.0.1:8080 to start chatting.
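As a minimal sketch, a server launch with full GPU offload might look like the following (the binary name and model path are assumptions, not taken from this document; adjust them to your build and model):

```shell
# Offload all layers to VRAM; lower -ngl (e.g. -ngl 30) if the model
# does not fit. Model path is a placeholder.
./llama-server -m ./models/model.gguf -ngl 999 --host 127.0.0.1 --port 8080
```

With the server running, the chat UI is served at the host/port given above.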
FlashMLA (for DeepSeek models) requires an Ampere or newer NVIDIA GPU. For DeepSeek inference, also add -mla 3 -fa to your command.
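Combining the flags above, a DeepSeek launch might be sketched as follows (binary name and model filename are placeholders; -mla 3 -fa are the flags named in this document):

```shell
# DeepSeek inference with MLA mode 3 and flash attention enabled.
# Requires an Ampere or newer NVIDIA GPU for FlashMLA.
./llama-server -m ./models/deepseek-model.gguf -ngl 999 -mla 3 -fa
```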
For the best quantization quality, look for models with IQK quants (IQ4_KS, IQ5_K, IQ3_K) or Trellis quants (IQ2_KT, IQ3_KT) on HuggingFace. These are exclusive to ik_llama.cpp and outperform standard k-quants at the same bit-width.
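To fetch such a quant, one option is the Hugging Face CLI; the repository and filename below are purely illustrative placeholders, not real model names:

```shell
# Download a single GGUF file from a HuggingFace repo into the current
# directory. Replace the repo id and filename with a real IQK/Trellis quant.
huggingface-cli download some-user/some-model-GGUF some-model-IQ4_KS.gguf --local-dir .
```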