ik_llama.cpp uses the CPU as its base compute device. “Offloading” means sending specific tensors and operations to the GPU for processing. Because GPUs have much higher memory bandwidth and far more parallel compute than a CPU with system RAM, the goal is to offload as much as possible to maximize tokens/second.
For MoE models (DeepSeek, Qwen3-MoE, etc.), always pass a number larger than the model’s actual layer count with -ngl. Use -ngl 999 as a safe catch-all — the runtime caps it at the actual layer count automatically.

Core offload parameters

-ngl / --gpu-layers

Offload the first N transformer layers to VRAM. Pass 999 to offload everything:
# Offload all layers
llama-server -m /models/model.gguf -ngl 999

# Partial offload: first 40 of 80 layers
llama-server -m /models/model.gguf -ngl 40
To find the exact layer count, open the GGUF file on HuggingFace and scroll to the Tensors table, or run:
python3 gguf-py/scripts/gguf_dump.py /models/model.gguf

-ot / --override-tensor

Override where individual tensors are stored using regular expressions. This is the most powerful offload control available, particularly useful for MoE models where you want experts in RAM and everything else in VRAM.
# Put all expert tensors (ffn_*_exps) back on CPU
-ngl 999 -ot "\.ffn_.*_exps\.=CPU"

# Put experts for layers 0-87 on CPU (example for a 94-layer model)
-ngl 999 -ot "blk\.(?:[0-9]|[1-7][0-9]|8[0-7])\.ffn_.*_exps\.=CPU"
The pattern before = is a regex matched against tensor names. The value after = is the target device (CPU, CUDA0, CUDA1, etc.).
Tensor names follow the pattern blk.N.tensor_name. Run gguf_dump.py on your model to list all tensor names and identify the right regex pattern.
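Before loading a multi-hundred-gigabyte model, it is worth sanity-checking an -ot pattern against a few tensor names. The snippet below tests the layer-range regex from the example above in plain Python (the tensor names are representative of the blk.N convention; check your model's actual names with gguf_dump.py):

```python
import re

# Regex from the -ot example above: experts of layers 0-87 go to CPU.
pattern = re.compile(r"blk\.(?:[0-9]|[1-7][0-9]|8[0-7])\.ffn_.*_exps\.")

# Representative tensor names following the blk.N.tensor_name convention
names = [
    "blk.0.ffn_gate_exps.weight",   # layer 0 expert  -> should match
    "blk.87.ffn_down_exps.weight",  # layer 87 expert -> should match
    "blk.88.ffn_up_exps.weight",    # layer 88 expert -> should NOT match
    "blk.5.attn_q.weight",          # attention tensor -> should NOT match
]

for name in names:
    target = "CPU" if pattern.search(name) else "default device"
    print(f"{name} -> {target}")
```

If a name you expect to stay on GPU matches the pattern, tighten the regex before paying the cost of a full model load.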

--fit / --fit-margin

Automatically load as many tensors as available VRAM permits, without specifying an explicit layer count.
# Auto-fit with default 1024 MiB safety margin
llama-server -m /models/model.gguf --fit

# Larger margin to avoid OOM (e.g. for large KV cache)
llama-server -m /models/model.gguf --fit --fit-margin 2048
Parameter        Default     Notes
--fit            off         Automatically fills VRAM. Cannot be combined with --cpu-moe, --n-cpu-moe, or -ot.
--fit-margin N   1024 MiB    Increase if you get CUDA OOM during model load. Decrease if too much VRAM is left unused.
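As a rough mental model of the fit logic (the even per-layer split and the sizes below are simplifying assumptions, not ik_llama.cpp internals), the number of layers that fit is the free VRAM minus the margin, divided by the per-layer size:

```python
def layers_that_fit(vram_mib, margin_mib, per_layer_mib, kv_cache_mib=0):
    """Rough estimate of how many layers fit in VRAM after the safety margin.

    Assumes every layer is the same size, which real models only approximate.
    """
    usable = vram_mib - margin_mib - kv_cache_mib
    return max(0, usable // per_layer_mib)

# 24 GiB card, default 1024 MiB margin, ~550 MiB per layer, 2 GiB KV cache
print(layers_that_fit(24 * 1024, 1024, 550, kv_cache_mib=2 * 1024))  # → 39
```

This also shows why raising --fit-margin trades a few layers of offload for headroom against OOM during load.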

Multi-GPU configuration

For a single GPU, use -ngl 999 to fully offload, or a lower number for partial offload:
# Full offload to primary GPU
llama-server -m /models/model.gguf \
  -ngl 999 \
  -fa

# Partial offload with KV cache in VRAM
llama-server -m /models/model.gguf \
  -ngl 40 \
  -fa \
  -ctk q8_0 -ctv q8_0
Use -mg to select which GPU to use when multiple are present but you only want one:
-mg 1   # Use second GPU (index 1)
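When splitting expert layers across GPUs with -ot, hand-writing layer-range regexes is error-prone. A small helper (hypothetical, not shipped with ik_llama.cpp) can enumerate the layer numbers explicitly, which is easier to verify than character-class tricks:

```python
import re

def layer_range_ot(first, last, device):
    """Build an -ot argument pinning expert tensors of layers first..last
    to a device, by enumerating layer numbers as regex alternatives."""
    nums = "|".join(str(n) for n in range(first, last + 1))
    return rf"blk\.({nums})\.ffn_.*_exps\.={device}"

# Split experts of an 80-layer model across two GPUs
print(layer_range_ot(0, 39, "CUDA0"))
print(layer_range_ot(40, 79, "CUDA1"))
```

Pass each generated string as its own -ot argument; the pattern before = stays a plain regex over tensor names, exactly as in the single-GPU examples.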

MoE-specific offload options

For Mixture-of-Experts models, ik_llama.cpp provides dedicated parameters to control where expert weights live:
Parameter                               Description
--cpu-moe                               Keep all MoE expert weights in RAM. Simple one-flag hybrid setup.
--n-cpu-moe N                           Keep MoE expert weights of the first N layers in RAM. Useful when some VRAM is available.
-ooae / --offload-only-active-experts   When expert weights are in RAM, copy only the activated experts to VRAM for computation (reduces RAM→VRAM transfer). Default: ON.
-no-ooae                                Disable active-expert-only offload. May help when nearly all experts are activated (large batches).
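To pick N for --n-cpu-moe, a back-of-envelope split helps. The per-layer sizes below are illustrative assumptions; read the real tensor sizes from gguf_dump.py:

```python
def vram_needed_mib(n_layers, n_cpu_moe, expert_mib_per_layer, dense_mib_per_layer):
    """Estimate VRAM use when the first n_cpu_moe layers keep experts in RAM.

    With -ngl 999, dense (attention + shared) weights of all layers sit in
    VRAM; only expert weights of the remaining layers are offloaded there.
    """
    gpu_expert_layers = n_layers - n_cpu_moe
    return n_layers * dense_mib_per_layer + gpu_expert_layers * expert_mib_per_layer

# 94-layer MoE, ~1500 MiB of experts and ~120 MiB of dense weights per layer,
# experts of the first 80 layers kept in RAM
print(vram_needed_mib(94, 80, 1500, 120))  # → 32280 MiB, ~31.5 GiB
```

Lower N until the estimate (plus KV cache and a safety margin) fits your card, then confirm with a real load.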

Per-operation offload control

-op / --offload-policy gives fine-grained control over which GGML operations run on GPU:
# Disable all GPU offload
-op -1,0

# Disable matrix multiplication offload only
-op 26,0

# Disable indirect matmul (MoE experts) offload
-op 27,0

# Multiple operations
-op 26,0,27,0

CUDA fine-tuning

-cuda / --cuda-params accepts a comma-separated list of CUDA-specific tuning options, including fusion control, GPU offload threshold, and MMQ-ID threshold:
-cuda graphs=0          # Disable CUDA graphs (workaround for graph-split + hybrid issues)
The fa-offset option adjusts FP16 precision in Flash Attention to avoid overflow at long contexts:
-cuda fa-offset=1.0     # Fix FP16 overflow in FA for very long contexts

Practical examples

Fully offload a dense model that fits in VRAM, with flash attention enabled:
llama-server \
  -m /models/Qwen3-8B-Q6_K.gguf \
  -ngl 999 \
  -fa \
  --ctx-size 8192
