Configure GPU offloading to maximize inference performance with CUDA
ik_llama.cpp uses the CPU as its base compute device. “Offloading” means sending specific tensors and operations to the GPU for processing. Because GPUs have much higher memory bandwidth and far more parallel compute than a CPU with system RAM, the goal is to offload as much as possible to maximize tokens/second.
For MoE models (DeepSeek, Qwen3-MoE, etc.), always pass a number larger than the model’s actual layer count with -ngl. Use -ngl 999 as a safe catch-all — the runtime caps it at the actual layer count automatically.
Override where individual tensors are stored using regular expressions. This is the most powerful offload control available, particularly useful for MoE models where you want experts in RAM and everything else in VRAM.
```
# Put all expert tensors (ffn_*_exps) back on CPU
-ngl 999 -ot "\.ffn_.*_exps\.=CPU"

# Put experts for layers 0-87 on CPU (example for a 94-layer model)
-ngl 999 -ot "blk\.(?:[0-9]|[1-7][0-9]|8[0-7])\.ffn_.*_exps\.=CPU"
```
The pattern before = is a regex matched against tensor names. The value after = is the target device (CPU, CUDA0, CUDA1, etc.).
Tensor names follow the pattern blk.N.tensor_name. Run gguf_dump.py on your model to list all tensor names and identify the right regex pattern.
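Before launching the server, you can sanity-check an -ot regex against sample tensor names in Python. The names below are illustrative examples of the blk.N.tensor_name convention, not an exhaustive dump:

```python
import re

# Illustrative tensor names following the blk.N.tensor_name convention
names = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.87.ffn_up_exps.weight",
    "blk.88.ffn_down_exps.weight",
]

# Same pattern as the -ot override above: experts in layers 0-87 go to CPU
pat = re.compile(r"blk\.(?:[0-9]|[1-7][0-9]|8[0-7])\.ffn_.*_exps\.")

for n in names:
    target = "CPU" if pat.search(n) else "GPU"
    print(f"{n} -> {target}")
```

Note that blk.88 does not match: the alternation covers 0-9, 10-79, and 80-87 only, so experts in layers 88 and above stay on the GPU.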
For a single GPU, use -ngl 999 to fully offload, or a lower number for partial offload:
```
# Full offload to primary GPU
llama-server -m /models/model.gguf \
  -ngl 999 \
  -fa

# Partial offload with KV cache in VRAM
llama-server -m /models/model.gguf \
  -ngl 40 \
  -fa \
  -ctk q8_0 -ctv q8_0
```
Use -mg to select which GPU to use when multiple are present but you only want one:
```
-mg 1   # Use second GPU (index 1)
```
ik_llama.cpp adds the graph split mode, which is highly effective for both dense and MoE models across multiple GPUs — including mixed GPU types with different VRAM sizes.
```
# Graph split across all available GPUs
llama-server -m /models/model.gguf \
  -ngl 999 \
  -sm graph \
  -fa

# Control the fraction each GPU receives
llama-server -m /models/model.gguf \
  -ngl 999 \
  -sm graph \
  -ts 3,1   # 75% GPU 0, 25% GPU 1
```
Split modes:
| Mode | Description |
| --- | --- |
| none | Single GPU only (default) |
| layer | Distribute layers across GPUs |
| graph | Distribute the computation graph across GPUs; best for mixed GPU setups |
Inter-GPU transfer type (-grt): controls the data type used when transferring activations between GPUs. Lower precision reduces bandwidth at some quality cost:
```
-grt q8_0   # Smallest transfer, minimal quality loss
-grt bf16   # Good balance
-grt f16    # Default-equivalent
-grt f32    # Full precision
```
If you observe incoherent responses with split mode graph and partial offload, add -cuda graphs=0 to your command line.
Cap the number of GPUs with --max-gpu N; this helps when splitting across more than two GPUs actually hurts performance:

```
--max-gpu 2
```
Select specific GPUs with -dev or the environment variable:
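For example (the -dev device-name syntax is assumed from llama.cpp conventions; CUDA_VISIBLE_DEVICES is the standard CUDA runtime variable, and the model path is a placeholder):

```shell
# Use only GPUs 0 and 2, by backend device name
-dev CUDA0,CUDA2

# Equivalent restriction via the CUDA runtime
CUDA_VISIBLE_DEVICES=0,2 llama-server -m /models/model.gguf -ngl 999
```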
-cuda / --cuda-params accepts a comma-separated list of CUDA-specific tuning options, including fusion control, GPU offload threshold, and MMQ-ID threshold:
```
-cuda graphs=0   # Disable CUDA graphs (workaround for graph-split + hybrid issues)
```
The fa-offset option sets the FP16 precision offset used by Flash Attention at long contexts:

```
-cuda fa-offset=1.0   # Fix FP16 overflow in FA for very long contexts
```
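Because -cuda takes a comma-separated list, the options above can be combined in a single flag. A sketch combining only the options already shown (model path and flag mix are placeholders):

```shell
llama-server -m /models/model.gguf \
  -ngl 999 \
  -sm graph \
  -cuda graphs=0,fa-offset=1.0
```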