Build ik_llama.cpp for CPU, CUDA, Metal, ROCm, and other backends
ik_llama.cpp has a minimal set of dependencies: cmake, a C++17-capable compiler, and — for GPU builds — the CUDA toolkit. All are available from the system package manager on Linux.
The only fully supported and performant backends are CPU (AVX2/ARM NEON) and CUDA. Metal, ROCm/hipBLAS, and Vulkan are inherited from the llama.cpp upstream but are not actively maintained in this fork. Issues with those backends will only be resolved if contributors step up to fix them.
Use CUDA_VISIBLE_DEVICES to select which GPU(s) to use at runtime. Set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 to allow swapping to system RAM when VRAM is exhausted (Linux only; on Windows use the NVIDIA Control Panel setting “System Memory Fallback”).
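For example, to pin the server to the first two GPUs with unified-memory fallback enabled (the binary and model paths below are placeholders):

```shell
# Use GPUs 0 and 1 and allow CUDA to spill into system RAM when VRAM runs out (Linux only).
# ./build/bin/llama-server and model.gguf are illustrative paths.
CUDA_VISIBLE_DEVICES=0,1 \
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
./build/bin/llama-server -m model.gguf
```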
Metal is enabled by default on macOS. No extra flags are needed:
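A plain default build picks up Metal automatically:

```shell
# Standard cmake workflow; Metal is detected and enabled on macOS by default.
cmake -B build
cmake --build build --config Release -j
```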
The Metal backend is not actively maintained in ik_llama.cpp. For best performance on Apple Silicon, the CPU backend with ARM NEON kernels is well-optimised and recommended.
ik_llama.cpp builds and runs on Android via Termux; for step-by-step instructions, see the Android deployment guide. On Windows ARM64 (WoA), build with:
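One plausible invocation, assuming the `arm64-windows-llvm` cmake preset inherited from upstream llama.cpp is still present in this fork:

```shell
# Configure with the LLVM toolchain preset for Windows on ARM,
# then build; OpenMP is disabled as in the upstream WoA instructions.
cmake --preset arm64-windows-llvm-release -D GGML_OPENMP=OFF
cmake --build build-arm64-windows-llvm-release
```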
-DGGML_NATIVE=ON
Detects your CPU’s feature set at compile time (AVX2, AVX-512, ARM NEON, etc.) and generates optimised code for it. Highly recommended for local builds. Omit this flag if you need a portable binary that runs on older CPUs.
-DGGML_CUDA=ON
Enables the CUDA backend for NVIDIA GPU acceleration. Requires the CUDA Toolkit to be installed.
-DCMAKE_CUDA_ARCHITECTURES=<arch>
Limits CUDA compilation to specific GPU compute capabilities, dramatically reducing build time. Common values:
| GPU generation | Compute capability |
| --- | --- |
| Turing (RTX 20xx) | 75 |
| Ampere (RTX 30xx / A100) | 80, 86 |
| Ada Lovelace (RTX 40xx) | 89 |
| Hopper (H100) | 90 |
Example: -DCMAKE_CUDA_ARCHITECTURES=86
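Putting these flags together, a CUDA build targeting an RTX 30xx card looks like:

```shell
# Configure for CUDA, compiling kernels only for compute capability 8.6.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j
```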
-DGGML_IQK_FA_ALL_QUANTS=ON
Compiles support for all KV cache quantization type combinations in the Flash Attention CUDA kernels. Enables more fine-grained control over KV cache size (e.g. using IQ4_K for K-cache and IQ3_K for V-cache together with Flash Attention). Significantly increases compilation time.
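With that option compiled in, K and V cache types can be mixed at runtime. A sketch, assuming the usual `-ctk`/`-ctv` cache-type flags accept the IQ*_K type names and using a placeholder model path:

```shell
# Flash Attention with a quantized KV cache: IQ4_K keys, IQ3_K values.
./build/bin/llama-server -m model.gguf -fa -ctk iq4_k -ctv iq3_k
```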
-DGGML_RPC=ON
Enables the RPC backend, which allows offloading compute to a remote machine. Useful in distributed or heterogeneous multi-machine setups.
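A minimal two-machine sketch, assuming the `rpc-server` tool and `--rpc` flag inherited from upstream llama.cpp (host address and port are placeholders):

```shell
# On the remote machine: expose its backend over RPC.
./build/bin/rpc-server -p 50052

# On the local machine: offload layers to the remote host.
./build/bin/llama-cli -m model.gguf --rpc 192.168.1.10:50052 -ngl 99
```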
-DGGML_NCCL=OFF
Disables NCCL (NVIDIA Collective Communications Library) support. NCCL is off by default; set this explicitly if your environment has NCCL installed and you want to avoid linking it.
-DLLAMA_SERVER_SQLITE3=ON
Enables SQLite3 support in llama-server, required for the mikupad alternative web UI. Make sure libsqlite3-dev (or equivalent) is installed before configuring.
Debug builds
For single-config generators (default on Linux/macOS):
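For example:

```shell
# Single-config generators (Makefiles, Ninja) take the build type at configure time.
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
```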
The following is a step-by-step walkthrough for a successful Windows CUDA build using clang-cl via Visual Studio Build Tools.
1. Install CUDA Toolkit and Visual Studio Build Tools
Download CUDA 12.6 from NVIDIA. During installation, select custom setup and uncheck Driver components and PhysX (not needed in a VM).
Download Visual Studio Build Tools 2022. During setup, go to the Individual components tab, search for clang, and add the clang-related tools (they are not selected by default).
Building with BLAS support can improve prompt processing throughput for large batch sizes (above 32 tokens). It does not affect token generation speed on CPU-only builds.
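A sketch of an OpenBLAS build, assuming the `GGML_BLAS`/`GGML_BLAS_VENDOR` options carried over from upstream llama.cpp (libopenblas-dev or equivalent must be installed):

```shell
# Link against OpenBLAS for faster large-batch prompt processing.
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j
```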
gfx1100 corresponds to the Radeon RX 7900 XTX/XT/GRE (RDNA3). Adjust for your GPU.
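A ROCm/hipBLAS configure step consistent with this note, assuming the upstream-style `GGML_HIPBLAS` and `AMDGPU_TARGETS` options; adjust the gfx target for your GPU:

```shell
# Build the hipBLAS backend for an RDNA3 card (gfx1100).
cmake -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100
cmake --build build --config Release -j
```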
To enable Unified Memory Architecture (UMA) for APUs or integrated GPUs (hurts performance on discrete GPUs):
-DGGML_HIP_UMA=ON
The environment variable HIP_VISIBLE_DEVICES selects which GPU(s) to use at runtime. If your GPU is not officially supported, try setting HSA_OVERRIDE_GFX_VERSION to a similar architecture (e.g. 10.3.0 for RDNA2, 11.0.0 for RDNA3).
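For example, to run on the first GPU while spoofing an RDNA3 architecture (binary and model paths are placeholders):

```shell
# Restrict to GPU 0 and report gfx11.0.0 to the HSA runtime for an
# otherwise-unsupported card.
HIP_VISIBLE_DEVICES=0 \
HSA_OVERRIDE_GFX_VERSION=11.0.0 \
./build/bin/llama-cli -m model.gguf -ngl 99
```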