The concept
In a MoE model such as DeepSeek-V3 or Qwen3-30B-A3B, the total weight file is large, but each token only activates a small fraction of the expert tensors. The strategy is:
- VRAM: attention layers, embedding, normalization, shared experts, and any experts that fit
- RAM: the sparse ffn_*_exps expert tensors that are rarely activated
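A minimal sketch of this split, assuming llama.cpp-style flags and a placeholder model filename:

```shell
# Hypothetical example: -ngl 999 offloads every layer to the GPU;
# --cpu-moe then overrides the routed expert tensors so they stay in system RAM.
./llama-server -m Qwen3-30B-A3B-IQ4_KS.gguf -ngl 999 --cpu-moe
```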
Step-by-step workflow
Find the model size
Check the total GGUF file size on disk. This is the minimum RAM + VRAM needed.
Identify the number of layers and tensor names:
You can also browse the model on HuggingFace — click any .gguf file and scroll to the Tensors table.
Check available VRAM
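On NVIDIA GPUs, for example, free VRAM can be queried with nvidia-smi:

```shell
# Report total and free VRAM per GPU (NVIDIA only).
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
```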
Decide your strategy
| Situation | Recommended approach |
|---|---|
| Model fits entirely in VRAM | -ngl 999 (full GPU) |
| MoE model, experts don’t fit | --cpu-moe or -ot "\.ffn_.*_exps\.=CPU" |
| Some MoE layers fit, others don’t | --n-cpu-moe N or fine-grained -ot regex |
| Dense model, partial fit | -ngl N for N layers |
| Very limited VRAM | -ngl 0 --no-kv-offload (CPU only) |
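For the dense partial-fit row above, a sketch (model path and layer count are placeholders; pick N so the offloaded layers plus KV cache fit your VRAM):

```shell
# Hypothetical: offload the first 24 layers of a dense model to the GPU,
# leaving the remaining layers (and their compute) on the CPU.
./llama-server -m dense-model-Q6_0.gguf -ngl 24 -c 8192
```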
MoE offload options
Simple: --cpu-moe
Keeps all MoE expert weights in RAM with a single flag. The easiest starting point:
Partial: --n-cpu-moe N
Keeps MoE weights for the first N layers in RAM, allowing the rest to live in VRAM. Use this when you have enough VRAM for some but not all expert layers:
Fine-grained: -ot regex
-ot / --override-tensor matches tensor names by regex and assigns them to a device. This is the most powerful option, letting you target specific layers and tensor types.
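The three options above might look like this in practice (model paths and the layer count are illustrative placeholders):

```shell
# All routed experts in RAM, everything else in VRAM:
./llama-server -m model.gguf -ngl 999 --cpu-moe

# Experts of the first 20 layers in RAM, remaining experts in VRAM:
./llama-server -m model.gguf -ngl 999 --n-cpu-moe 20

# An -ot override matching all expert tensors (similar effect to --cpu-moe):
./llama-server -m model.gguf -ngl 999 -ot "\.ffn_.*_exps\.=CPU"
```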
Pattern explanation:
| Part | Meaning |
|---|---|
| blk. | Matches block (layer) prefix |
| (?:[0-9]\|[1-7][0-9]\|[8][0-7]) | Layers 0–87 (single digit, 10–79, 80–87) |
| \.ffn_.*_exps\. | Any tensor with ffn and _exps in the name (up/gate/down experts) |
| =CPU | Place these tensors on CPU |
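To sanity-check a layer-range pattern before loading a model, you can run it against sample tensor names with grep -E (the names below are illustrative; a plain group replaces the non-capturing (?:...) because POSIX ERE does not support it):

```shell
# Matches expert FFN tensors in layers 0-87; blk.88 and attention tensors fall through.
printf '%s\n' \
  'blk.0.ffn_up_exps.weight' \
  'blk.87.ffn_gate_exps.weight' \
  'blk.88.ffn_up_exps.weight' \
  'blk.3.attn_q.weight' \
  | grep -E 'blk\.([0-9]|[1-7][0-9]|8[0-7])\.ffn_.*_exps\.'
```

Only the first two names are printed; blk.88 exceeds the 0–87 range and the attention tensor never matches ffn_.*_exps.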
KV cache strategies for limited VRAM
The KV cache size scales with context length. At long contexts it can consume several GiB of VRAM.
Quantize the KV cache
Reduce KV cache VRAM usage by quantizing K and V from f16 (compile with -DGGML_IQK_FA_ALL_QUANTS=ON for more supported types):
| Type | Notes |
|---|---|
| f16 | Default. Full quality. |
| q8_0 | Half the size, minimal quality loss. Best starting point. |
| q8_KV | Fast 8-bit KV type specific to ik_llama.cpp. |
| q6_0 | Slightly smaller, still good quality. |
| bf16 | Available on CPUs with native BF16 support. |
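As a rough, hypothetical illustration of the savings: for a 48-layer model with 8 KV heads of dimension 128, the f16 KV cache at a 32768-token context is 2 × 48 × 32768 × 8 × 128 × 2 bytes ≈ 6 GiB, and q8_0 roughly halves that. A sketch of the flags (placeholder model path):

```shell
# Quantize both K and V caches to 8-bit; -fa (flash attention) is required
# for a quantized V cache in mainline llama.cpp.
./llama-server -m model.gguf -c 32768 -ngl 999 -fa -ctk q8_0 -ctv q8_0
```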
K-cache is more sensitive than V-cache. If you use different quantization levels, use a higher quality for K, e.g. -ctk q8_0 -ctv q6_0. Use --k-cache-hadamard with heavily quantized KV caches (below Q6_0) to improve output quality:
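A sketch of both recommendations (flag names as given above; model path is a placeholder):

```shell
# Higher-quality K cache, smaller V cache:
./llama-server -m model.gguf -fa -ctk q8_0 -ctv q6_0

# Below q6_0, add the Hadamard transform to recover quality (ik_llama.cpp):
./llama-server -m model.gguf -fa -ctk q4_0 -ctv q4_0 --k-cache-hadamard
```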
Keep KV cache in RAM
If VRAM is very tight, you can keep the entire KV cache in system RAM:
Smart Expert Reduction (SER)
SER reduces the number of active experts below the model default, trading output quality for speed:
Quantization choices
Smaller quantizations reduce total model size, making more of the model fit in VRAM:
| Quant | Notes |
|---|---|
| BF16 | Too large for most setups. Use only for reference. |
| Q8_0 | Near-BF16 quality at half the size. |
| Q6_0 / IQ6_K | Minimal quality loss vs Q8_0. |
| IQ5_K | Close to Q8_0 quality, significantly smaller. |
| IQ4_XS, IQ4_NL | Minimal loss. Good for large models. |
| IQ4_KS, IQ4_KSS | ik_llama.cpp exclusive, excellent quality/size ratio. |
| IQ3_K | Usable quality; imatrix strongly recommended. |
| IQ2_K, IQ2_KS | Very small; noticeable quality loss without imatrix. |
Check the GGUF metadata's quantize.imatrix.* fields to see if the file was already quantized with one.
Practical example: Qwen3-30B-A3B on Zen4 CPU + single GPU
This is a real-world configuration for running Qwen3-30B-A3B (30B total, 3B active) with experts in RAM and attention layers in VRAM:
- -ngl 999: load all non-expert layers to VRAM
- -ot "\.ffn_.*_exps\.=CPU": keep all expert weight tensors in RAM
- -fa: Flash Attention (reduces VRAM usage and speeds up PP)
- -ctk q8_0 -ctv q8_0: quantize KV cache to save VRAM
- -t 8: 8 CPU threads for generation (match physical core count)
- -tb 2: fewer threads for batch processing when the GPU handles most of it
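Assembled into a single command line (the model filename is a placeholder; treat this as a template rather than a drop-in):

```shell
./llama-server -m Qwen3-30B-A3B-IQ4_KS.gguf \
  -ngl 999 \
  -ot "\.ffn_.*_exps\.=CPU" \
  -fa \
  -ctk q8_0 -ctv q8_0 \
  -t 8 -tb 2
```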
Related pages
- GPU offloading — Full GPU offload parameter reference
- Parameters reference — Complete CLI reference