The following model families are supported. Download `.gguf` files from HuggingFace and pass them to `--model`.
| Model | Notes |
|---|---|
| LLaMA-4 | Vision and text variants |
| LLaMA-3-Nemotron | NVIDIA reasoning model |
| Qwen3 | Dense and instruct variants |
| Qwen3-VL | Vision-language variant |
| Qwen3-Next | Recurrent hybrid; use `--ctx-checkpoints` |
| Qwen3.5-MoE | MoE variant; use `--ctx-checkpoints` |
| GLM-4 / 4.5 / 5 | Includes AIR, Flash, and MoE sub-variants |
| Command-A | Cohere Command-A |
| BitNet b1.58 | 1-bit quantization natively supported |
| DeepSeek-V3 / R1 | MLA-based MoE models |
| Gemma 3 | Google Gemma 3 family |
| Kimi-2 | Moonshot Kimi-2 |
| grok-2 | xAI grok-2 |
| SmolLM3 | HuggingFace compact model |
| ministral3 | Mistral 3B |
| Hunyuan | Tencent Hunyuan |
DeepSeek-V3 and DeepSeek-R1 use Multi-head Latent Attention (MLA). Pass `-mla 3` (the default) for best performance. Lower values reduce memory use at a speed cost.
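As a minimal sketch of how these flags fit together (the `llama-server` binary name and the model filename are assumptions; only `--model` and `-mla` come from this page), a DeepSeek launch might look like:

```shell
# Hypothetical invocation: binary name and .gguf path are placeholders.
./llama-server \
  --model ./DeepSeek-R1-Q4_K_M.gguf \
  -mla 3   # default MLA mode; lower values trade speed for less memory
```

Adjust `-mla` downward only if the model does not fit in memory at the default setting.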
Do not use Unsloth models with `_XL` in their name that contain f16 tensors; these are incompatible with ik_llama.cpp. Unsloth `_XL` models without f16 tensors work fine.
