Build ik_llama.cpp for CPU, CUDA, Metal, ROCm, and other backends
ik_llama.cpp has a minimal set of dependencies: cmake, a C++17-capable compiler, and — for GPU builds — the CUDA toolkit. All are available from the system package manager on Linux.
The only fully supported and performant backends are CPU (AVX2/ARM NEON) and CUDA. Metal, ROCm/hipBLAS, and Vulkan are inherited from the llama.cpp upstream but are not actively maintained in this fork. Issues with those backends will only be resolved if contributors step up to fix them.
Use CUDA_VISIBLE_DEVICES to select which GPU(s) to use at runtime. Set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 to allow swapping to system RAM when VRAM is exhausted (Linux only; on Windows use the NVIDIA Control Panel setting “System Memory Fallback”).
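For example, to pin the server to the first two GPUs with unified-memory fallback enabled (the binary and model paths below are placeholders):

```shell
# Use GPUs 0 and 1 and allow CUDA to spill into system RAM when VRAM runs out (Linux only).
# ./build/bin/llama-server and model.gguf are illustrative paths.
CUDA_VISIBLE_DEVICES=0,1 \
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
./build/bin/llama-server -m model.gguf
```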
Metal is enabled by default on macOS. No extra flags are needed:
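A plain default build picks up Metal automatically:

```shell
# Standard cmake workflow; Metal is detected and enabled on macOS by default.
cmake -B build
cmake --build build --config Release -j
```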
The Metal backend is not actively maintained in ik_llama.cpp. For best performance on Apple Silicon, the CPU backend with ARM NEON kernels is well-optimised and recommended.
ik_llama.cpp builds and runs on Android via Termux; for step-by-step instructions, see the Android deployment guide. On Windows ARM64 (WoA), build with:
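One plausible invocation, assuming the `arm64-windows-llvm` cmake preset inherited from upstream llama.cpp is still present in this fork:

```shell
# Configure with the LLVM toolchain preset for Windows on ARM,
# then build; OpenMP is disabled as in the upstream WoA instructions.
cmake --preset arm64-windows-llvm-release -D GGML_OPENMP=OFF
cmake --build build-arm64-windows-llvm-release
```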
-DGGML_NATIVE=ON
Detects your CPU’s feature set at compile time (AVX2, AVX-512, ARM NEON, etc.) and generates optimised code for it. Highly recommended for local builds. Omit this flag if you need a portable binary that runs on older CPUs.
-DGGML_CUDA=ON
Enables the CUDA backend for NVIDIA GPU acceleration. Requires the CUDA Toolkit to be installed.
-DCMAKE_CUDA_ARCHITECTURES=<arch>
Limits CUDA compilation to specific GPU compute capabilities, dramatically reducing build time. Common values:
| GPU generation | Compute capability |
| --- | --- |
| Turing (RTX 20xx) | 75 |
| Ampere (RTX 30xx / A100) | 80, 86 |
| Ada Lovelace (RTX 40xx) | 89 |
| Hopper (H100) | 90 |
Example: -DCMAKE_CUDA_ARCHITECTURES=86
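Putting these flags together, a CUDA build targeting an RTX 30xx card looks like:

```shell
# Configure for CUDA, compiling kernels only for compute capability 8.6.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j
```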
-DGGML_IQK_FA_ALL_QUANTS=ON
Compiles support for all KV cache quantization type combinations in the Flash Attention CUDA kernels. Enables more fine-grained control over KV cache size (e.g. using IQ4_K for K-cache and IQ3_K for V-cache together with Flash Attention). Significantly increases compilation time.
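With that option compiled in, K and V cache types can be mixed at runtime. A sketch, assuming the usual `-ctk`/`-ctv` cache-type flags accept the IQ*_K type names and using a placeholder model path:

```shell
# Flash Attention with a quantized KV cache: IQ4_K keys, IQ3_K values.
./build/bin/llama-server -m model.gguf -fa -ctk iq4_k -ctv iq3_k
```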
-DGGML_RPC=ON
Enables the RPC backend, which allows offloading compute to a remote machine. Useful in distributed or heterogeneous multi-machine setups.
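A minimal two-machine sketch, assuming the `rpc-server` tool and `--rpc` flag inherited from upstream llama.cpp (host address and port are placeholders):

```shell
# On the remote machine: expose its backend over RPC.
./build/bin/rpc-server -p 50052

# On the local machine: offload layers to the remote host.
./build/bin/llama-cli -m model.gguf --rpc 192.168.1.10:50052 -ngl 99
```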
-DGGML_NCCL=OFF
Disables NCCL (NVIDIA Collective Communications Library) support. NCCL is off by default; set this explicitly if your environment has NCCL installed and you want to avoid linking it.
-DLLAMA_SERVER_SQLITE3=ON
Enables SQLite3 support in llama-server, required for the mikupad alternative web UI. Make sure libsqlite3-dev (or equivalent) is installed before configuring.
Debug builds
For single-config generators (default on Linux/macOS):
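For example:

```shell
# Single-config generators (Makefiles, Ninja) take the build type at configure time.
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
```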
The following is a step-by-step walkthrough for a successful Windows CUDA build using clang-cl via Visual Studio Build Tools.
1. Install CUDA Toolkit and Visual Studio Build Tools
Download CUDA 12.6 from NVIDIA. During installation, select custom setup and uncheck Driver components and PhysX (not needed in a VM).
Download Visual Studio Build Tools 2022. During setup, go to the Individual components tab, search for clang, and add the clang-related tools (they are not selected by default).
Building with BLAS support can improve prompt processing throughput for large batch sizes (above 32 tokens). It does not affect token generation speed on CPU-only builds.
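A sketch of an OpenBLAS build, assuming the `GGML_BLAS`/`GGML_BLAS_VENDOR` options carried over from upstream llama.cpp (libopenblas-dev or equivalent must be installed):

```shell
# Link against OpenBLAS for faster large-batch prompt processing.
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j
```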
gfx1100 corresponds to the Radeon RX 7900 XTX/XT/GRE (RDNA3). Adjust for your GPU.
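A ROCm/hipBLAS configure step consistent with this note, assuming the upstream-style `GGML_HIPBLAS` and `AMDGPU_TARGETS` options; adjust the gfx target for your GPU:

```shell
# Build the hipBLAS backend for an RDNA3 card (gfx1100).
cmake -B build -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100
cmake --build build --config Release -j
```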
To enable Unified Memory Architecture (UMA) for APUs or integrated GPUs (hurts performance on discrete GPUs):
-DGGML_HIP_UMA=ON
The environment variable HIP_VISIBLE_DEVICES selects which GPU(s) to use at runtime. If your GPU is not officially supported, try setting HSA_OVERRIDE_GFX_VERSION to a similar architecture (e.g. 10.3.0 for RDNA2, 11.0.0 for RDNA3).
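For example, to run on the first GPU while spoofing an RDNA3 architecture (binary and model paths are placeholders):

```shell
# Restrict to GPU 0 and report gfx11.0.0 to the HSA runtime for an
# otherwise-unsupported card.
HIP_VISIBLE_DEVICES=0 \
HSA_OVERRIDE_GFX_VERSION=11.0.0 \
./build/bin/llama-cli -m model.gguf -ngl 99
```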