
Prerequisites

On Debian/Ubuntu:
apt-get update && apt-get install build-essential git libcurl4-openssl-dev curl libgomp1 cmake
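Before configuring, you can confirm the toolchain is actually on `PATH`. A minimal sketch (not part of the project; the tool list is an assumption based on the packages above, with `gcc`/`g++` coming from `build-essential`):

```shell
# Check that each required build tool resolves on PATH
for tool in gcc g++ git cmake curl; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: ok"
  else
    echo "$tool: MISSING"
  fi
done
```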

CMake flags

Pass flags to the initial cmake -B build invocation.
| Flag | Default | Description |
|------|---------|-------------|
| `GGML_NATIVE` | `OFF` | Optimize for the host CPU (`-march=native`). Turn off when cross-compiling. |
| `GGML_CUDA` | `OFF` | Build with CUDA support. Requires the NVIDIA CUDA Toolkit. Defaults to native CUDA architecture detection. |
| `CMAKE_CUDA_ARCHITECTURES` | `auto` | Target a specific GPU compute capability, e.g. `86` for RTX 30-series. |
| `GGML_RPC` | `OFF` | Build the RPC backend for distributed inference across machines. |
| `GGML_IQK_FA_ALL_QUANTS` | `OFF` | Enable all KV cache quantization types for Flash Attention (beyond the default `f16`, `q8_0`, `q6_0`, and `bf16`). |
| `GGML_NCCL` | `ON` | Enable NCCL for multi-GPU communication. Set to `OFF` to disable. |
| `LLAMA_SERVER_SQLITE3` | `OFF` | Build SQLite3 support into `llama-server` (required for the mikupad web UI). |

CPU build example
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)
CUDA build example
cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j$(nproc)
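To choose a `CMAKE_CUDA_ARCHITECTURES` value, you can ask the driver for the GPU's compute capability and drop the dot. A sketch, assuming a driver recent enough that `nvidia-smi` supports the `compute_cap` query field (the value is hard-coded below for illustration):

```shell
# On a real system:
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader)
# "8.6" is the compute capability of an RTX 30-series (Ampere) GPU.
cap="8.6"
arch=$(printf '%s' "$cap" | tr -d '.')
echo "-DCMAKE_CUDA_ARCHITECTURES=$arch"   # prints -DCMAKE_CUDA_ARCHITECTURES=86
```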

Environment variables

Set these in the shell before invoking llama-server or any other tool.
| Variable | Description |
|----------|-------------|
| `CUDA_VISIBLE_DEVICES` | Restrict which GPUs are visible. Example: `CUDA_VISIBLE_DEVICES=0,2` uses the first and third GPU only. |
| `GGML_CUDA_ENABLE_UNIFIED_MEMORY` | Set to `1` to enable CUDA Unified Memory, allowing the GPU to access host RAM when VRAM is exhausted. Useful for large models on systems with limited VRAM. |

For example:
CUDA_VISIBLE_DEVICES=0,2 llama-server --model /models/model.gguf -ngl 999
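Setting a variable on the command line like this scopes it to that single invocation; the surrounding shell is unaffected. A self-contained sketch of the scoping, using `sh -c 'echo …'` as a stand-in for the real binary:

```shell
# The child process sees the variable...
env GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 sh -c 'echo "child: $GGML_CUDA_ENABLE_UNIFIED_MEMORY"'
# ...but the parent shell does not (prints "unset" unless you exported it earlier)
echo "parent: ${GGML_CUDA_ENABLE_UNIFIED_MEMORY:-unset}"
```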
The only fully supported compute backends are CPU (AVX2 or better, ARM NEON or better) and CUDA. ROCm, Vulkan, and Metal are available but not actively maintained.
