ik_llama.cpp has a minimal set of dependencies: cmake, a C++17-capable compiler, and — for GPU builds — the CUDA toolkit. All are available from the system package manager on Linux.
The only fully supported and performant backends are CPU (AVX2/ARM NEON) and CUDA. Metal, ROCm/hipBLAS, and Vulkan are inherited from the llama.cpp upstream but are not actively maintained in this fork. Issues with those backends will only be resolved if contributors step up to fix them.

Prerequisites

On Debian/Ubuntu:
apt-get update && apt-get install \
  build-essential git cmake \
  libcurl4-openssl-dev curl libgomp1
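Before building, it can help to confirm the toolchain is recent enough. A small sketch for pulling "major.minor" out of `cmake --version` (the parsing is an assumption about the standard output format; no exact minimum version is pinned here):

```shell
# Extract "major.minor" from the first line of `cmake --version` output.
cmake_major_minor() {
  set -- $1                  # word-split "cmake version 3.28.3" (unquoted on purpose)
  printf '%s\n' "${3%.*}"    # strip the patch component: 3.28.3 -> 3.28
}

cmake_major_minor "cmake version 3.28.3"   # -> 3.28
# On your machine:
#   cmake_major_minor "$(cmake --version | head -n1)"
```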
Get the source:
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp

CMake build

CMake is the recommended build method on all platforms.
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)
On macOS, Metal is enabled by default. To disable it:
cmake -B build -DGGML_NATIVE=ON -DGGML_METAL=OFF
cmake --build build --config Release -j$(nproc)

Important CMake flags

-DGGML_NATIVE=ON
Detects your CPU’s feature set at compile time (AVX2, AVX-512, ARM NEON, etc.) and generates optimised code for it. Highly recommended for local builds. Omit this flag if you need a portable binary that runs on older CPUs.

-DGGML_CUDA=ON
Enables the CUDA backend for Nvidia GPU acceleration. Requires the CUDA Toolkit to be installed.

-DCMAKE_CUDA_ARCHITECTURES=<list>
Limits CUDA compilation to specific GPU compute capabilities, dramatically reducing build time. Common values:

GPU generation               Compute capability
Turing (RTX 20xx)            75
Ampere (RTX 30xx / A100)     80, 86
Ada Lovelace (RTX 40xx)      89
Hopper (H100)                90

Example: -DCMAKE_CUDA_ARCHITECTURES=86

-DGGML_CUDA_FA_ALL_QUANTS=ON
Compiles support for all KV cache quantization type combinations in the Flash Attention CUDA kernels. Enables more fine-grained control over KV cache size (e.g. using IQ4_K for the K-cache and IQ3_K for the V-cache together with Flash Attention). Significantly increases compilation time.

-DGGML_RPC=ON
Enables the RPC backend, which allows offloading compute to a remote machine. Useful in distributed or heterogeneous multi-machine setups.

NCCL
Disables NCCL (NVIDIA Collective Communications Library) support. NCCL is off by default; set the disable explicitly if your environment has NCCL installed and you want to avoid linking it.

SQLite3
Enables SQLite3 support in llama-server, required for the mikupad alternative web UI. Make sure libsqlite3-dev (or equivalent) is installed before configuring.
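To find the right CMAKE_CUDA_ARCHITECTURES value for your card, recent Nvidia drivers can report the compute capability directly; CMake wants the same number without the dot. A sketch (the compute_cap query field is an assumption about your nvidia-smi version):

```shell
# Convert a compute capability such as "8.6" into the "86" form CMake expects.
cc_to_arch() {
  printf '%s\n' "$1" | tr -d '.'
}

cc_to_arch 8.6   # -> 86
# On a machine with an Nvidia GPU and a recent driver:
#   cc_to_arch "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader)"
```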

Debug build

For single-config generators (default on Linux/macOS):
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
For multi-config generators (Visual Studio, Xcode):
cmake -B build -G "Xcode"
cmake --build build --config Debug

Windows build

The following is a step-by-step walkthrough for a successful Windows CUDA build using clang-cl via Visual Studio Build Tools.
1. Install CUDA Toolkit and Visual Studio Build Tools

  • Download CUDA 12.6 from Nvidia. During installation, select custom setup and uncheck Driver components and PhysX (not needed in a VM).
  • Download Visual Studio Build Tools 2022. During setup, go to the Individual components tab, search for clang, and add the clang-related tools (they are not selected by default).
2. Clone the repository

Download Portable Git and clone:
git.exe clone https://github.com/ikawrakow/ik_llama.cpp "C:\Downloads\ik_llama.cpp"
cd "C:\Downloads\ik_llama.cpp"
3. Set up environment variables

set VS_DIR=c:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools
call "%VS_DIR%\VC\Auxiliary\Build\vcvarsall.bat" x64
set LLVM_DIR=c:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/Llvm/x64
set CUDA_DIR=C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.6
set "PATH=%LLVM_DIR%/bin;%CUDA_DIR%/bin;%PATH%"
4. Configure with CMake

Adjust -DCMAKE_CUDA_ARCHITECTURES to match your GPU and /clang:-march= to match your CPU:
"c:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" ^
    -G Ninja ^
    -S "C:/Downloads/ik_llama.cpp" ^
    -B "C:/Downloads/output" ^
    -DCMAKE_C_COMPILER="%LLVM_DIR%/bin/clang-cl.exe" ^
    -DCMAKE_CXX_COMPILER="%LLVM_DIR%/bin/clang-cl.exe" ^
    -DCMAKE_CUDA_COMPILER="%CUDA_DIR%/bin/nvcc.exe" ^
    -DCUDAToolkit_ROOT="%CUDA_DIR%" ^
    -DCMAKE_CUDA_ARCHITECTURES="89-real" ^
    -DCMAKE_BUILD_TYPE=Release ^
    -DGGML_CUDA=ON ^
    -DLLAMA_CURL=OFF ^
    -DCMAKE_C_FLAGS="/clang:-march=znver4 /clang:-fvectorize /clang:-ffp-model=fast /clang:-fno-finite-math-only" ^
    -DCMAKE_CXX_FLAGS="/EHsc /clang:-march=znver4 /clang:-fvectorize /clang:-ffp-model=fast /clang:-fno-finite-math-only" ^
    -DCMAKE_CUDA_STANDARD=17 ^
    -DGGML_AVX512=ON ^
    -DGGML_AVX512_VNNI=ON ^
    -DGGML_AVX512_VBMI=ON ^
    -DGGML_CUDA_USE_GRAPHS=ON ^
    -DGGML_OPENMP=ON
Use forward slashes (/) in all cmake paths on Windows. Backslashes can be misinterpreted by CMake as escape characters.
5. Build

"...\CMake\bin\cmake.exe" --build "C:/Downloads/output" --config Release
6. Copy CUDA runtime DLLs

Copy the following DLLs from C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin to C:\Downloads\output\bin:
  • cublas64_12.dll
  • cublasLt64_12.dll
  • cudart64_12.dll
Also copy libomp140.x86_64.dll from C:\Windows\System32\ to the same output bin directory.

BLAS acceleration

Building with BLAS support can improve prompt processing throughput for large batch sizes (above 32 tokens). It does not affect token generation speed on CPU-only builds.
OpenBLAS

Make sure libopenblas-dev (or equivalent) is installed first, then:
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release

Intel oneMKL

Source the oneAPI environment and then build:
source /opt/intel/oneapi/setvars.sh
cmake -B build \
  -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=Intel10_64lp \
  -DCMAKE_C_COMPILER=icx \
  -DCMAKE_CXX_COMPILER=icpx \
  -DGGML_NATIVE=ON
cmake --build build --config Release
If you are using the oneAPI-basekit Docker image, you can skip the setvars.sh step.

Accelerate (macOS)

Accelerate is enabled by default on macOS; no extra flags are needed. Use the standard CMake build instructions.

hipBLAS / ROCm

The ROCm/hipBLAS backend is not actively maintained in ik_llama.cpp. Use it at your own risk.
Install ROCm first. Then build, specifying your AMD GPU target:
# Example for gfx1030 (RX 6000 series / RDNA2)
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build \
      -DGGML_HIPBLAS=ON \
      -DAMDGPU_TARGETS=gfx1030 \
      -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j$(nproc)
Find your GPU target with:
rocminfo | grep gfx | head -1 | awk '{print $2}'
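For reference, this is what that pipeline extracts when run over a captured rocminfo line (the exact spacing and field layout is an assumption; rocminfo output varies across ROCm versions):

```shell
# A sample agent line as rocminfo might print it.
sample='  Name:                    gfx1030'
printf '%s\n' "$sample" | grep gfx | head -1 | awk '{print $2}'   # -> gfx1030
```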
To enable Unified Memory Architecture (UMA) for APUs or integrated GPUs (hurts performance on discrete GPUs):
-DGGML_HIP_UMA=ON
The environment variable HIP_VISIBLE_DEVICES selects which GPU(s) to use at runtime. If your GPU is not officially supported, try setting HSA_OVERRIDE_GFX_VERSION to a similar architecture (e.g. 10.3.0 for RDNA2, 11.0.0 for RDNA3).
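As a sketch, a run on an unsupported RDNA2 card might combine the two variables like this (the model path is a placeholder; binary name assumed from the build output layout):

```shell
# Select GPU 0 and report a supported RDNA2 architecture (gfx1030 -> 10.3.0).
HIP_VISIBLE_DEVICES=0 HSA_OVERRIDE_GFX_VERSION=10.3.0 \
    ./build/bin/llama-server -m /path/to/model.gguf
```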

Vulkan

The Vulkan backend is not actively maintained in ik_llama.cpp. Use it at your own risk.
Install the Vulkan SDK (the LunarG repository below targets Ubuntu 22.04 "jammy"; adjust the release name for other versions):
wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add -
wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list \
    https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
apt-get update && apt-get install -y vulkan-sdk
Then build:
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
