## Verify the GPU is being used
When you pass `-ngl N` (or `--n-gpu-layers N`), ik_llama.cpp prints diagnostic lines before inference starts. Look for the CUDA/cuBLAS offloading lines that report how many layers were placed on the GPU. If they are missing, verify that the project was built with `-DGGML_CUDA=ON` and that the NVIDIA drivers and CUDA toolkit are installed correctly.
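A typical CUDA build, assuming the standard CMake workflow used by llama.cpp and its forks:

```shell
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```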
To offload as many layers as possible, pass a large number:
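For example, assuming `llama-server` and a placeholder model path:

```shell
llama-server -m model.gguf -ngl 999
```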
ik_llama.cpp will offload as many layers as fit in VRAM even if you request more than the model has.
## Set the thread count correctly
The `-t` / `--threads` parameter controls how many CPU threads are used for token generation. Setting it too high is one of the most common causes of slow inference.
If you are unsure, start at 1 and double the value until you stop seeing speed improvements:
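One way to run that sweep in a single command, assuming `llama-bench` accepts a comma-separated thread list (as upstream llama.cpp's does) and a placeholder model path:

```shell
llama-bench -m model.gguf -t 1,2,4,8
```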
| `-t` value | Guidance |
|---|---|
| Too high (above physical core count) | Hyperthreads compete for shared execution units; speed falls |
| Physical core count | Usually optimal for token generation |
| 1 | A useful baseline; if this is dramatically faster than a higher value, your CPU is being oversaturated |
The batch thread count (`-tb` / `--threads-batch`) can be set lower than the generation thread count. A value of 2 works well.
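For example, combining a tuned generation thread count with a low batch thread count (placeholder model path):

```shell
llama-server -m model.gguf -t 4 -tb 2
```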
Avoid odd thread counts (1, 3, 5, …) — they can cause uneven work distribution on some CPU architectures.
## Fit the model entirely in VRAM
Token generation speed scales with how much of the model is in VRAM. A model that fits entirely in VRAM (`-ngl 999`) is significantly faster than one partially kept in RAM. If the model does not quite fit, consider:
- Using a lower quantisation (e.g. `IQ4_XS` instead of `Q8_0`)
- Quantising the KV cache (see below)
- Using tensor overrides (`-ot`) to keep only the MoE expert tensors in RAM while everything else stays in VRAM
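A commonly used override keeps the expert tensors on CPU while the rest of the model is offloaded. The exact regex depends on your model's tensor names, so treat this pattern as an assumption to verify rather than a universal recipe:

```shell
llama-server -m model.gguf -ngl 999 -ot "exps=CPU"
```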
## Flash Attention
Flash Attention (-fa) is enabled by default in ik_llama.cpp. It reduces KV cache memory usage and improves both prompt processing and token generation speed. Do not disable it unless you are debugging a specific problem.
## Reduce context size

The KV cache grows with context size. Use the smallest context (`-c` / `--ctx-size`) that meets your needs.

## Tune the batch sizes
`-ub` / `--ubatch-size` controls the physical batch size used during prompt processing (PP). The default is 512. Higher values improve GPU PP throughput at the cost of more VRAM.
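For example, doubling the microbatch for faster prompt processing (placeholder model path):

```shell
llama-server -m model.gguf -ngl 999 -ub 1024
```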
`-b` / `--batch-size` (default 2048) is safe to leave at its default.
## KV cache quantisation
Quantising the KV cache reduces VRAM consumption with minimal quality loss, freeing space to offload more model layers to the GPU. `q8_0` is a conservative choice with negligible quality impact on most models. The K-cache may benefit from a slightly higher quant than the V-cache if you notice quality degradation.
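To see why this frees meaningful VRAM, here is a back-of-the-envelope calculation for a hypothetical model (32 layers, 8 KV heads, head dimension 128, 8192-token context); the layer and head counts are illustrative, not taken from any specific model:

```shell
# K and V caches together hold 2 * n_layer * n_ctx * n_kv_head * head_dim elements.
n_layer=32; n_ctx=8192; n_kv_head=8; head_dim=128
elts=$((2 * n_layer * n_ctx * n_kv_head * head_dim))

# f16 stores 2 bytes per element.
f16_mib=$((elts * 2 / 1024 / 1024))

# q8_0 stores each block of 32 elements in 34 bytes (32 one-byte quants + a 2-byte scale).
q8_mib=$((elts * 34 / 32 / 1024 / 1024))

echo "f16 KV cache:  ${f16_mib} MiB"   # 1024 MiB
echo "q8_0 KV cache: ${q8_mib} MiB"   # 544 MiB
```

On the command line the switch is made with the cache-type flags, e.g. `-ctk q8_0 -ctv q8_0`.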
KV quantisation requires Flash Attention (`-fa`), which is on by default. To access additional quantisation types beyond `f16`, `q8_0`, and `q6_0`, build with `-DGGML_IQK_FA_ALL_QUANTS=ON`.

## MoE models: fused MoE and MLA
For Mixture-of-Experts models, the `--fused-moe` (`-fmoe`) flag merges the `ffn_up` and `ffn_gate` projections into a single fused operation. It is enabled by default; if you have disabled it, re-enable it. For models with multi-head latent attention (MLA), combine `-mla 3` with Flash Attention for best performance.
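A launch line combining these flags might look like the following (placeholder model path; note that `-mla` applies only to MLA-capable architectures such as DeepSeek-style models, an assumption to check for your model):

```shell
llama-server -m model.gguf -ngl 999 -fa -fmoe -mla 3
```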
## Benchmarking
Use the bundled benchmark tools to measure the effect of parameter changes before committing to a configuration.

`llama-sweep-bench` runs a series of PP batches followed by TG, reporting tokens/second at different KV cache fill levels. It accepts the same parameters as `llama-server`.
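For example, a sweep over a 16K context with full GPU offload (placeholder model path):

```shell
llama-sweep-bench -m model.gguf -c 16384 -ngl 999 -fa
```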
`llama-bench` is a targeted microbenchmark for PP and TG.
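A typical invocation measuring 512-token prompt processing and 128-token generation, assuming upstream llama-bench's `-p`/`-n` parameters (placeholder model path):

```shell
llama-bench -m model.gguf -p 512 -n 128 -ngl 999
```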
### Example: effect of flags on a 30B model
The following runs were measured on a machine with an A6000 GPU (48 GB VRAM), 7 physical CPU cores, and 32 GB RAM. The model was `Wizard-Vicuna-30B-Uncensored.q4_0.gguf`.
| Flags | Tokens/second |
|---|---|
| `-ngl 999` (no thread flag) | < 0.1 |
| `-t 7` (CPU only) | 1.7 |
| `-t 1 -ngl 999` | 5.5 |
| `-t 7 -ngl 999` | 8.7 |
| `-t 4 -ngl 999` | 9.1 |
- GPU offload (`-ngl 999`) alone does almost nothing without setting `-t`; the saturated CPU becomes the bottleneck.
- Matching `-t` to the physical core count (7) is roughly equivalent to using 4, though 4 slightly wins here; experiment for your hardware.
- Combining GPU offload with the right thread count gives a 5× speed improvement over CPU-only inference.