Hybrid inference lets you run models that are too large to fit entirely in GPU VRAM by keeping some weights in system RAM and the rest in VRAM. The key insight for MoE (Mixture-of-Experts) models is that expert weights are activated only 2–5% of the time — making them ideal candidates to stay in RAM while attention and non-expert layers live in the faster VRAM.

The concept

In a MoE model such as DeepSeek-V3 or Qwen3-30B-A3B, the total weight file is large, but each token only activates a small fraction of the expert tensors. The strategy is:
  • VRAM: attention layers, embedding, normalization, shared experts, and any experts that fit
  • RAM: the sparse ffn_*_exps expert tensors that are rarely activated
ik_llama.cpp is specifically engineered for this pattern with tensor-level override support and dedicated MoE offload flags.

Step-by-step workflow

Step 1: Find the model size

Check the total GGUF file size on disk. This is the minimum RAM + VRAM needed.
ls -lh /models/model.gguf
# or for split models:
ls -lh /models/model-*.gguf
Identify the number of layers and tensor names:
python3 gguf-py/scripts/gguf_dump.py /models/model.gguf | head -80
You can also browse the model on HuggingFace — click any .gguf file and scroll to the Tensors table.
Step 2: Check available VRAM

nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
Subtract ~500–1000 MiB for OS/driver overhead. The remainder is usable VRAM for model weights and KV cache.
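The arithmetic is a quick budget check: weights plus KV cache must fit in free VRAM minus overhead. A minimal sketch, with illustrative sizes and an assumed 1 GiB overhead:

```python
# Back-of-envelope VRAM budget check. The 1 GiB driver/OS overhead and
# the example sizes are illustrative assumptions, not measured values.
def usable_vram_gib(free_gib: float, overhead_gib: float = 1.0) -> float:
    """VRAM left for model weights + KV cache after driver/OS overhead."""
    return max(0.0, free_gib - overhead_gib)

def fits_in_vram(model_gib: float, kv_gib: float, free_gib: float) -> bool:
    """True if weights and KV cache fit in the usable VRAM."""
    return model_gib + kv_gib <= usable_vram_gib(free_gib)

# A 24 GiB card, a 17 GiB quant, and a 2 GiB KV cache:
print(fits_in_vram(17.0, 2.0, 24.0))  # True: 19 <= 23
```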
Step 3: Decide your strategy

| Situation | Recommended approach |
| --- | --- |
| Model fits entirely in VRAM | -ngl 999 (full GPU) |
| MoE model, experts don’t fit | --cpu-moe or -ot "\.ffn_.*_exps\.=CPU" |
| Some MoE layers fit, others don’t | --n-cpu-moe N or fine-grained -ot regex |
| Dense model, partial fit | -ngl N for N layers |
| Very limited VRAM | -ngl 0 --no-kv-offload (CPU only) |
Also consider quantizing the KV cache to reclaim VRAM:
-ctk q8_0 -ctv q8_0
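The decision table can also be read as a small piece of logic. The helper below is purely illustrative; its name and thresholds are assumptions, not ik_llama.cpp API:

```python
def pick_flags(model_gib: float, vram_gib: float, is_moe: bool) -> str:
    """Map the decision table to server flags (illustrative thresholds)."""
    if model_gib <= vram_gib - 1.0:   # fits with ~1 GiB headroom
        return "-ngl 999"
    if vram_gib < 2.0:                # almost no VRAM: stay on CPU
        return "-ngl 0 --no-kv-offload"
    if is_moe:
        # Start with all experts on CPU; tune --n-cpu-moe from there.
        return "-ngl 999 --cpu-moe"
    return "-ngl <N>"                 # dense model: offload N layers

print(pick_flags(17.0, 24.0, True))  # -ngl 999
```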
Step 4: Run the server

Choose the configuration that matches your hardware and model. See the examples below.

MoE offload options

Simple: --cpu-moe

Keeps all MoE expert weights in RAM with a single flag. The easiest starting point:
llama-server \
  -m /models/DeepSeek-V3-IQ4_NL.gguf \
  -ngl 999 \
  --cpu-moe \
  -fa \
  --ctx-size 4096

Partial: --n-cpu-moe N

Keeps MoE weights for the first N layers in RAM, allowing the rest to live in VRAM. Use this when you have enough VRAM for some but not all expert layers:
llama-server \
  -m /models/model.gguf \
  -ngl 999 \
  --n-cpu-moe 60 \
  -fa

Fine-grained: -ot regex

-ot / --override-tensor matches tensor names by regex and assigns them to a device. This is the most powerful option, letting you target specific layers and tensor types. Pattern explanation:
-ngl 999 -ot "blk\.(?:[0-9]|[1-7][0-9]|8[0-7])\.ffn_.*_exps\.=CPU"
  • blk\. matches the block (layer) prefix
  • (?:[0-9]|[1-7][0-9]|8[0-7]) matches layers 0–87 (single digit, 10–79, 80–87)
  • \.ffn_.*_exps\. matches any tensor with ffn and _exps in the name (up/gate/down experts)
  • =CPU places these tensors on CPU
This leaves experts from layers 88 onward in VRAM (for models with ~94 layers, those later-layer experts can fit). Simpler pattern for any MoE model:
-ngl 999 -ot "\.ffn_.*_exps\.=CPU"
This moves all expert tensors to CPU regardless of layer number.
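Before launching, you can sanity-check an -ot regex against the tensor names printed by gguf_dump.py. A short sketch with made-up tensor names (only the part before = is the regex; the device assignment is stripped):

```python
import re

# The catch-all expert pattern from above, without the "=CPU" assignment.
pattern = re.compile(r"\.ffn_.*_exps\.")

tensors = [                       # illustrative tensor names
    "blk.3.ffn_gate_exps.weight",
    "blk.3.ffn_up_exps.weight",
    "blk.3.ffn_down_exps.weight",
    "blk.3.attn_q.weight",
    "blk.0.ffn_up.weight",        # dense FFN, no _exps: stays on GPU
]
for name in tensors:
    dest = "CPU" if pattern.search(name) else "GPU"
    print(f"{name} -> {dest}")
```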
For models with shared experts (e.g., shexp tensors in GPT-OSS or GLM-5), put those in VRAM — they are always active and benefit from GPU speed. Check tensor names with gguf_dump.py to identify them, then exclude them from your CPU pattern.
For GLM-5 and similar models, the first few layers (blk.0, blk.1, blk.2) have dense FFN (no _exps), while layers from blk.3 onward have MoE experts. Dense layers should always stay in VRAM.
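For layer-range patterns like the one above, the bracketed alternation is easy to get wrong by hand. A hypothetical helper (not part of ik_llama.cpp) that builds an equivalent pattern by plain enumeration:

```python
import re

def cpu_expert_override(n_cpu_layers: int) -> str:
    """Build an -ot value putting experts of layers 0..n_cpu_layers-1 on CPU.

    Plain enumeration (0|1|...|N-1); the trailing \\. anchor keeps
    e.g. layer 8 from also matching layer 88.
    """
    alts = "|".join(str(i) for i in range(n_cpu_layers))
    return rf"blk\.(?:{alts})\.ffn_.*_exps\.=CPU"

override = cpu_expert_override(88)
regex = re.compile(override.split("=")[0])
print(bool(regex.search("blk.87.ffn_up_exps.weight")))  # True  (goes to CPU)
print(bool(regex.search("blk.88.ffn_up_exps.weight")))  # False (stays in VRAM)
```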

KV cache strategies for limited VRAM

The KV cache size scales with context length. At long contexts it can consume several GiB of VRAM.

Quantize the KV cache

Reduce KV cache VRAM usage by quantizing K and V from f16:
-ctk q8_0 -ctv q8_0
Example impact on a small model at 1024 context:
# Default f16:
llama_kv_cache_init: CPU KV buffer size = 3584.00 MiB

# With q8_0:
llama_kv_cache_init: CPU KV buffer size =   59.50 MiB
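As a cross-check, KV cache size is roughly 2 (K and V) × layers × context × KV heads × head dimension × bytes per element. A sketch with made-up model dimensions; the q8_0 figure of ~1.0625 bytes/element (32 int8 values plus an fp16 scale per block) is approximate:

```python
def kv_cache_mib(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elt):
    """Rough KV cache size in MiB: K and V, all layers, full context."""
    total_bytes = 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elt
    return total_bytes / (1024 ** 2)

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 8192 context
print(kv_cache_mib(32, 8, 128, 8192, 2.0))     # f16  -> 1024.0 MiB
print(kv_cache_mib(32, 8, 128, 8192, 1.0625))  # q8_0 -> 544.0 MiB
```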
Available KV quantization types (build with -DGGML_IQK_FA_ALL_QUANTS=ON for more):
| Type | Notes |
| --- | --- |
| f16 | Default. Full quality. |
| q8_0 | Half the size, minimal quality loss. Best starting point. |
| q8_KV | Fast 8-bit KV type specific to ik_llama.cpp. |
| q6_0 | Slightly smaller, still good quality. |
| bf16 | Available on CPUs with native BF16 support. |
K-cache is more sensitive than V-cache. If you use different quantization levels, use a higher quality for K: -ctk q8_0 -ctv q6_0.
Use --k-cache-hadamard with heavily quantized KV caches (below Q6_0) to improve output quality:
-ctk q4_0 -ctv q4_0 --k-cache-hadamard

Keep KV cache in RAM

If VRAM is very tight, you can keep the entire KV cache in system RAM:
--no-kv-offload
This reduces prompt processing speed but frees VRAM for model weights.

Smart Expert Reduction (SER)

SER reduces the number of active experts below the model default, trading output quality for speed:
# Use 6 experts instead of the model default (e.g. 8)
-ser 6,1

# Dynamic reduction: keep experts with confidence above threshold t
-ser Kmin,t
This works like expert pruning (e.g. REAP) applied at runtime from the command line. Useful when you need more speed and can accept a slight quality tradeoff.

Quantization choices

Smaller quantizations reduce total model size, making more of the model fit in VRAM:
| Quant | Notes |
| --- | --- |
| BF16 | Too large for most setups. Use only for reference. |
| Q8_0 | Near-BF16 quality at half the size. |
| Q6_0 / IQ6_K | Minimal quality loss vs Q8_0. |
| IQ5_K | Close to Q8_0 quality, significantly smaller. |
| IQ4_XS, IQ4_NL | Minimal loss. Good for large models. |
| IQ4_KS, IQ4_KSS | ik_llama.cpp exclusive, excellent quality/size ratio. |
| IQ3_K | Usable quality; imatrix strongly recommended. |
| IQ2_K, IQ2_KS | Very small; noticeable quality loss without imatrix. |
For any quant below Q6_0, use an imatrix for best results. Check the model metadata for quantize.imatrix.* fields to see if the file was already quantized with one.
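File size is roughly parameters × effective bits-per-weight / 8. A sketch; the bits-per-weight figures below are approximate, since block scales add overhead (e.g. Q8_0 stores 34 bytes per 32 weights, i.e. 8.5 bpw):

```python
def model_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size from parameter count and effective bpw."""
    return n_params * bits_per_weight / 8 / 2**30

# A 30B-parameter model at a few approximate effective bpw values
# (8.5 and 4.5 follow from the block layouts; 2.4 is a rough figure):
for name, bpw in [("Q8_0", 8.5), ("IQ4_NL", 4.5), ("IQ2_K", 2.4)]:
    print(f"{name}: {model_size_gib(30e9, bpw):.1f} GiB")
```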

Practical example: Qwen3-30B-A3B on Zen4 CPU + single GPU

This is a real-world configuration for running Qwen3-30B-A3B (30B total, 3B active) with experts in RAM and attention layers in VRAM:
llama-server \
  -m /models/Qwen3-30B-A3B-IQ4_NL.gguf \
  -ngl 999 \
  -ot "\.ffn_.*_exps\.=CPU" \
  -fa \
  -ctk q8_0 \
  -ctv q8_0 \
  --ctx-size 8192 \
  -t 8 \
  -tb 2
What each flag does:
  • -ngl 999: load all non-expert layers to VRAM
  • -ot "\.ffn_.*_exps\.=CPU": keep all expert weight tensors in RAM
  • -fa: Flash Attention (reduces VRAM usage and speeds up PP)
  • -ctk q8_0 -ctv q8_0: quantize KV cache to save VRAM
  • -t 8: 8 CPU threads for generation (match physical core count)
  • -tb 2: fewer threads for batch processing when GPU handles most of it
