The fully supported backends are CPU (AVX2 or better, ARM NEON or better) and CUDA. ROCm, Vulkan, and Metal are available but not actively maintained in this fork.
1. Clone the repository

git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
2. Install dependencies (Debian/Ubuntu)

apt-get update && apt-get install build-essential git libcurl4-openssl-dev curl libgomp1 cmake
On other distros, install the equivalent packages for your package manager.
3. Build

cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)
-DGGML_NATIVE=ON enables CPU-specific optimisations (AVX2, AVX-512, ARM NEON) for your machine. Omit it when cross-compiling.
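Since CUDA is one of the fully supported backends, you may want to enable it at configure time. A sketch, assuming the CUDA toolkit is installed and `nvcc` is on your PATH:

```shell
# Build with the CUDA backend in addition to native CPU optimisations
# (assumes the CUDA toolkit is installed)
cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
```
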
4. Download a model

Download any GGUF model from HuggingFace. IQK quants from bartowski or ubergarm give the best quality/size tradeoff.
# Example: Qwen3 0.6B — ~400 MB, good for testing
# https://huggingface.co/bartowski/Qwen_Qwen3-0.6B-GGUF
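The example repository above can be fetched from the command line with curl. The quant filename below is illustrative, not taken from this guide — verify the exact name in the repository's file listing first:

```shell
# Download one quant file from the example repo
# (filename is a placeholder — check the repo's file list before running)
curl -L -o Qwen3-0.6B-Q4_K_M.gguf \
  "https://huggingface.co/bartowski/Qwen_Qwen3-0.6B-GGUF/resolve/main/Qwen_Qwen3-0.6B-Q4_K_M.gguf"
```
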
5. Start the server

./build/bin/llama-server --model /path/to/model.gguf --ctx-size 4096
Open http://127.0.0.1:8080 in your browser to start chatting.
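Besides the browser UI, llama-server exposes an OpenAI-compatible HTTP API, so you can script against it. A minimal sketch, assuming the server from the previous command is still running on the default port:

```shell
# Send a chat request to the local server's OpenAI-compatible endpoint
# (requires the server started above to be running)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```
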
For the best quantization quality, look for models with IQK quants (IQ4_KS, IQ5_K, IQ3_K) or Trellis quants (IQ2_KT, IQ3_KT) on HuggingFace. These are exclusive to ik_llama.cpp and outperform standard k-quants at the same bit-width.
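If you can only find a high-precision GGUF (e.g. F16) of a model, you can produce one of these quants yourself with the bundled quantize tool. A sketch with placeholder paths, assuming this fork keeps upstream llama.cpp's `llama-quantize` binary name and argument order:

```shell
# Re-quantize an F16 GGUF to the IQ4_KS format
# (input/output paths are placeholders)
./build/bin/llama-quantize model-f16.gguf model-iq4_ks.gguf IQ4_KS
```
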

Next steps

Building from source

Detailed build options for Windows, macOS, ROCm, and more.

GPU offloading

Fine-tune layer and tensor placement for maximum performance.

Quantization types

Understand IQK, Trellis, and how to pick the right quant.

Server reference

All server options, API endpoints, and authentication.
