The fully supported backends are CPU (AVX2 or better, ARM NEON or better) and CUDA. ROCm, Vulkan, and Metal are available but not actively maintained in this fork.
-ngl 999 offloads all layers to VRAM. Reduce the number if the model does not fit entirely in VRAM.

Open http://127.0.0.1:8080 to start chatting.
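As a minimal sketch, a server launch with full GPU offload might look like the following (the binary name and model path are assumptions, not taken from this document; adjust them to your build and model):

```shell
# Offload all layers to VRAM; lower -ngl (e.g. -ngl 30) if the model
# does not fit. Model path is a placeholder.
./llama-server -m ./models/model.gguf -ngl 999 --host 127.0.0.1 --port 8080
```

With the server running, the chat UI is served at the host/port given above.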
FlashMLA (for DeepSeek models) requires an Ampere or newer NVIDIA GPU. For DeepSeek inference, also add -mla 3 -fa to your command.
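Combining the flags above, a DeepSeek launch might be sketched as follows (binary name and model filename are placeholders; -mla 3 -fa are the flags named in this document):

```shell
# DeepSeek inference with MLA mode 3 and flash attention enabled.
# Requires an Ampere or newer NVIDIA GPU for FlashMLA.
./llama-server -m ./models/deepseek-model.gguf -ngl 999 -mla 3 -fa
```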
For the best quantization quality, look for models with IQK quants (IQ4_KS, IQ5_K, IQ3_K) or Trellis quants (IQ2_KT, IQ3_KT) on HuggingFace. These are exclusive to ik_llama.cpp and outperform standard k-quants at the same bit-width.
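To fetch such a quant, one option is the Hugging Face CLI; the repository and filename below are purely illustrative placeholders, not real model names:

```shell
# Download a single GGUF file from a HuggingFace repo into the current
# directory. Replace the repo id and filename with a real IQK/Trellis quant.
huggingface-cli download some-user/some-model-GGUF some-model-IQ4_KS.gguf --local-dir .
```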