Complete reference for llama-server, llama-cli, llama-sweep-bench, llama-bench, llama-imatrix, and llama-quantize parameters. All parameters supported by llama-server can also be used with the other tools where applicable.
Common terms used throughout this documentation and in model descriptions.
| Term | Meaning |
|---|---|
| LLM / model | Large Language Model trained on vast amounts of text using machine learning. |
| Tensors | The foundational building block of a model: a multi-dimensional array of numbers (scalar, vector, matrix, or higher-dimensional). |
| Layers | Modular units stacked to form the network, each transforming the input tensors in some way. |
| Weights | Numerical values associated with connections between tensors in each layer. |
| Activations | Output of a layer after it has performed its computations. |
| FA | Flash Attention: an efficient transformer attention algorithm. |
| VRAM | Dedicated memory on the GPU. |
| Inference | Running a model to generate responses. |
| GGUF | The file format used by ik_llama.cpp and llama.cpp. |
| Quants | Compressed model formats that reduce precision to save space and improve speed. |
| BPW | Bits per weight: measures the compression ratio of a quant. |
| imatrix | Importance matrix generated from calibration text; improves quantization quality. |
| Model splits | A GGUF file split into multiple parts for easier upload/download. Specify only the first part when loading. |
| PP | Prompt processing: encoding the input tokens. |
| TG | Token generation: producing the output tokens one by one. |
| t/s | Tokens per second: measures PP and TG speed. |
| Full GPU | All tensors and computation offloaded to the GPU. |
| Hybrid CPU/GPU | Partial offload: some tensors in VRAM, others in RAM. |
Core parameters for loading and running any model.
| Parameter | Description | Default | Notes |
|---|---|---|---|
| `-h, --help, --usage` | Print usage and exit | | |
| `--fit` | Automatically fit to available VRAM | off | Loads as many tensors to GPU as VRAM permits. Cannot be used with `--cpu-moe`, `--n-cpu-moe`, or tensor overrides. |
| `--fit-margin N` | Safety VRAM margin in MiB when using `--fit` | 1024 | Increase if you get CUDA OOM during model load. |
| `-t, --threads N` | Threads for token generation | 4 | Match the number of physical CPU cores. Avoid odd numbers. |
| `-tb, --threads-batch N` | Threads for batch/prompt processing | Same as `--threads` | For full GPU offload, use a lower number (e.g. 2). |
| `-c, --ctx-size N` | Context size (prompt + generation) | 0 (from model) | Determines KV cache size. With parallel slots, this is split across all slots. |
| `-n, --predict N` | Max tokens to generate | -1 (infinity) | -2 = fill context. Safe to leave at default. |
| `-b, --batch-size N` | Logical maximum batch size | 2048 | Higher values may improve t/s on GPU at the cost of memory. |
| `-ub, --ubatch-size N` | Physical maximum batch size | 512 | Similar effect to `--batch-size`. |
| `--keep N` | Tokens to keep from initial prompt | 0 | -1 = keep all. |
| `--chunks N` | Max chunks to process | -1 (all) | |
| `-fa, --flash-attn` | Enable Flash Attention | on | Improves t/s and reduces memory usage. Use auto/on/off. |
| `--no-fa, --no-flash-attn` | Disable Flash Attention | | Alternative to `-fa off`. |
| `-mla, --mla-use` | Enable MLA | 3 | 0/1/2/3. For DeepSeek and other MLA models. |
| `-amb, --attention-max-batch` | Max batch size for attention | 0 | Specifies the maximum K*Q size in MB to tolerate. |
| `-fmoe, --fused-moe` | Fuse ffn_up and ffn_gate in MoE | | Speedup for MoE models. |
| `--no-fmoe, --no-fused-moe` | Disable fused MoE | Enabled | See `--fused-moe`. |
| `-ger, --grouped-expert-routing` | Enable grouped expert routing | Disabled | For the BailingMoeV2 architecture (Ling/Ring models). |
| `--no-fug, --no-fused-up-gate` | Disable fused up-gate | Enabled | Turns off the up-gate speedup for dense models. |
| `--no-mmad, --no-fused-mul-multiadd` | Disable fused mul-multi_add | Enabled | |
| `-gr, --graph-reuse` | Enable graph reuse | Enabled | For models with fast TG (100+ t/s). |
| `--no-gr, --no-graph-reuse` | Disable graph reuse | Disabled | |
| `-ser, --smart-expert-reduction` | Expert reduction K_min,t | -1, 0 | Use fewer active experts. `-ser 1,6` uses exactly 6 experts. |
| `-mqkv, --merge-qkv` | Merge Q, K, V projections | 0 | Downside: mmap cannot be used. |
| `-muge, --merge-up-gate-experts` | Merge ffn_up/gate_exps | 0 | Speedup on some models. |
| `-khad, --k-cache-hadamard` | Hadamard transform for K-cache | 0 | May improve quality at low KV quantization levels. |
| `-sas, --scheduler_async` | Async evaluation of compute graphs | 0 | |
| `-vq, --validate-quants` | Validate quantized data on load | 0 | Reports NaN tensors in the loaded model. |
| `-sp, --special` | Enable special token output | false | |
| `--no-warmup` | Skip empty warmup run | | |
| `--mlock` | Keep model in RAM (no swap) | | |
| `--no-mmap` | Disable memory-mapped model loading | | Slower load but may reduce pageouts. |
| `-rtr, --run-time-repack` | Repack tensors to interleaved format | | ik_llama.cpp exclusive. May improve performance. |
| `--ctx-checkpoints N` | Checkpoints per slot | | For recurrent models (Qwen3-Next, Qwen3.5-MoE). |
| `--ctx-checkpoints-interval N` | Min tokens between checkpoints | | Smaller values = more frequent checkpoints during PP. |
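A typical invocation combining several of the core parameters above might look like the following sketch; the model path and numbers are placeholders, not tuned recommendations:

```shell
# Hypothetical launch: 32k context, 8 threads for PP and TG,
# Flash Attention on (the default), fused MoE for MoE models.
llama-server \
  -m /models/model.gguf \
  -c 32768 \
  -t 8 -tb 8 \
  -fa on \
  -fmoe
```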
Speculative decoding accelerates generation by using a fast draft model to predict multiple tokens ahead, which the main model then verifies in a single forward pass.
| Parameter | Description | Default | Notes |
|---|---|---|---|
| `-td, --threads-draft N` | Threads for draft model generation | Same as `--threads` | |
| `-tbd, --threads-batch-draft N` | Threads for draft model batch processing | Same as `--threads-draft` | |
| `-ps, --p-split N` | Speculative decoding split probability | 0.1 | |
| `-cd, --ctx-size-draft N` | Context size for draft model | 0 (from model) | Similar to `--ctx-size` but for the draft model. |
| `-ctkd, --cache-type-k-draft TYPE` | KV cache K type for draft model | | See `-ctk`. |
| `-ctvd, --cache-type-v-draft TYPE` | KV cache V type for draft model | | See `-ctv`. |
| `-draft, --draft-params` | Comma-separated draft model parameters | | |
| `--spec-ngram-size-n N` | ngram lookup size N | 12 | For ngram-simple/ngram-map speculative decoding. |
| `--spec-ngram-size-m N` | ngram draft size M | 48 | For ngram-simple/ngram-map speculative decoding. |
| `--spec-ngram-min-hits N` | Min hits for ngram-map | 1 | |
| `--spec-type NAME` | Speculative decoding type | | none, ngram-cache, ngram-simple, ngram-map-k, ngram-map-k4v, ngram-mod. |
| `-mtp, --multi-token-prediction` | Enable MTP decoding | | For GLM-4.x MoE models. |
| `-no-mtp, --no-multi-token-prediction` | Disable MTP decoding | | |
| `--draft-max` | Max draft tokens | | For MTP decoding. |
| `--draft-p-min` | Min draft probability | | For MTP decoding. |
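As a sketch, a large main model can be paired with a small draft model of the same tokenizer family; both paths below are hypothetical:

```shell
# Main model in RAM/VRAM, small draft model fully offloaded.
llama-server \
  -m /models/model-large.gguf \
  -md /models/model-small-draft.gguf \
  --draft-max 16 --draft-p-min 0.8 \
  -ngld 999
```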
When a conversation ends, its KV cache is saved to RAM and can be restored when the same or similar prompt is seen again. This greatly reduces prompt processing time when switching between conversations.
If available RAM is very limited, disable this with -cram 0 to avoid memory swapping.
| Parameter | Description | Default | Notes |
|---|---|---|---|
| `-cram, --cache-ram N` | Maximum cache size in MiB | 8192 | -1 = no limit. 0 = disable. Especially useful for coding agents that re-send similar prompts. |
| `-crs, --cache-ram-similarity N` | Similarity threshold to trigger cache reuse | 0.50 | |
| `-cram-n-min, --cache-ram-n-min N` | Min cached tokens to trigger cache reuse | 0 | |
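For example, to cap the prompt cache at 4 GiB and loosen the similarity threshold so near-identical prompts reuse cached KV state (values are illustrative, not recommendations):

```shell
# 4096 MiB cache; reuse triggers at 30% prompt similarity.
llama-server -m /models/model.gguf -cram 4096 -crs 0.3
```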
Sampling controls how tokens are selected during generation. The default sampler pipeline provides a good balance for most use cases. For a detailed overview of sampling techniques, see the llm_samplers_explained guide.
| Parameter | Description | Default | Notes |
|---|---|---|---|
| `--samplers SAMPLERS` | Ordered sampler pipeline (semicolon-separated) | `dry;top_k;tfs_z;typical_p;top_p;min_p;xtc;top_n_sigma;temperature;adaptive_p` | Example: `--samplers min_p;temperature` |
| `--sampling-seq SEQUENCE` | Shorthand sampler sequence | `dkfypmxntw` | Same as `--samplers` in abbreviated form. |
| `--banned-string-file` | File containing banned output strings (one per line) | | |
| `--banned-n` | Number of tokens banned during rewind | -1 | -1 = all tokens. |
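A reduced pipeline is selected by listing only the samplers you want. This sketch also assumes the usual `--temp` and `--min-p` value flags from llama.cpp, which are not listed in the table above:

```shell
# Only min-p filtering followed by temperature sampling.
llama-server -m /models/model.gguf \
  --samplers "min_p;temperature" \
  --temp 0.7 --min-p 0.05
```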
The prompt template controls how chat messages are formatted before being sent to the model. An incorrect template can significantly degrade output quality.
| Parameter | Description | Default | Notes |
|---|---|---|---|
| `--jinja` | Use Jinja template from model metadata | Template from model | Required for function/tool calling. |
| `--chat-template JINJA_TEMPLATE` | Override chat template inline | Disabled | Use `--chat-template chatml` as a fallback when no official tool_use template exists. |
| `--chat-template-file FILE` | Load chat template from file | | Useful when the GGUF metadata contains a buggy template: download only the fixed .jinja file instead of re-downloading the full model. |
| `--reasoning-format FORMAT` | Control reasoning/think tag handling | none | `none`: leave thoughts in message.content. `deepseek`: move thoughts to message.reasoning_content. `deepseek-legacy`: keep tags in content AND populate reasoning_content. |
| `--chat-template-kwargs JSON` | Additional params for the Jinja template parser | | Example: `--chat-template-kwargs '{"reasoning_effort": "medium"}'` |
| `--reasoning-budget N` | Max thinking tokens allowed | -1 (unrestricted) | 0 = disable thinking. |
| `--reasoning-tokens FORMAT` | Exclude reasoning tokens for slot selection | auto | |
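Putting these together, a hedged example (the .jinja path is hypothetical):

```shell
# Enable the Jinja template engine (needed for tool calling), but
# load a fixed template file instead of the model's buggy one, and
# route reasoning into message.reasoning_content.
llama-server -m /models/model.gguf \
  --jinja \
  --chat-template-file /models/fixed-template.jinja \
  --reasoning-format deepseek
```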
The KV cache stores past attention computations to avoid reprocessing tokens. These parameters control where the cache lives and how it is quantized. The KV cache is stored on the same device as the associated attention tensors, and quantizing it can significantly reduce VRAM usage.
| Parameter | Description | Default | Notes |
|---|---|---|---|
| `-dkvc, --dump-kv-cache` | Verbose KV cache debug output | | |
| `-nkvo, --no-kv-offload` | Keep KV cache on CPU | | Frees VRAM but reduces prompt processing speed. |
| `-ctk, --cache-type-k TYPE` | KV cache data type for K | f16 | Reduces K size; may slightly affect quality. Requires Flash Attention. |
| `-ctv, --cache-type-v TYPE` | KV cache data type for V | f16 | See `-ctk`. K-cache usually needs higher quality than V-cache. |
| `--no-context-shift` | Disable context shift | | |
| `--context-shift` | Configure context shift | on | auto/on/off/0/1. Slides the KV window when context is full. |
KV cache types (build with -DGGML_IQK_FA_ALL_QUANTS=ON for the full list):
| Type | Notes |
|---|---|
| f16 | Default. Full precision. |
| q8_0 | Half the size, minimal quality loss. |
| q8_KV | Fast ik_llama.cpp-specific 8-bit KV type. |
| q6_0 | Good quality/size balance. |
| bf16 | Available on CPUs with native BF16 support. |
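For example, quantizing both K and V to q8_0 roughly halves KV cache memory; Flash Attention is required for quantized caches:

```shell
# q8_0 KV cache: about half the size of f16, minimal quality loss.
llama-server -m /models/model.gguf -fa on -ctk q8_0 -ctv q8_0
```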
Serve multiple users or frontends simultaneously. The WebUI uses parallel slots to allow starting a new chat while another is still generating.
| Parameter | Description | Default | Notes |
|---|---|---|---|
| `-np, --parallel N` | Number of parallel decode slots | 1 | The total `--ctx-size` is divided across all slots. |
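For example, with four slots each slot receives a quarter of the total context (values illustrative):

```shell
# 32768 / 4 = 8192 tokens of context per slot.
llama-server -m /models/model.gguf -np 4 -c 32768
```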
ik_llama.cpp provides extensive control over what runs on the GPU. For a full guide, see GPU offloading and Hybrid CPU/GPU inference.
| Parameter | Description | Default | Notes |
|---|---|---|---|
| `-ngl, --gpu-layers N` | Layers to store in VRAM | | Use 999 to offload everything. For MoE, use more than the model layer count. |
| `-ngld, --gpu-layers-draft N` | Layers for draft model in VRAM | | See `-ngl`. |
| `--cpu-moe` | Keep all MoE expert weights in RAM | | Simple one-flag hybrid mode for MoE. |
| `--n-cpu-moe N` | Keep first N layers' MoE weights in RAM | | Useful when some VRAM is available for experts. |
| `-sm, --split-mode MODE` | Multi-GPU split strategy | none | `none`: single GPU. `layer`: split by layer. `graph`: split computation graph (best for mixed GPU setups). |
| `-ts, --tensor-split SPLIT` | VRAM fraction per GPU (comma-separated) | | Example: `-ts 3,1` gives 75% to GPU 0, 25% to GPU 1. |
| `-dev, --device LIST` | Specific GPU devices to use | | Example: `-dev CUDA0,CUDA1`. |
| `-devd, --device-draft LIST` | GPU devices for draft model | | |
| `-mg, --main-gpu i` | GPU index for single-GPU mode | | Used with `-sm none`. |
| `-ot, --override-tensor REGEX=DEVICE` | Place tensors by regex | | Example: `\.ffn_.*_exps\.=CPU`. Can be specified multiple times. |
| `-op, --offload-policy a,b` | Per-operation offload control | | a = GGML op enum value, b = 0 (CPU) or 1 (GPU). `-op -1,0` disables all GPU offload. |
| `-ooae, --offload-only-active-experts` | Offload only activated MoE experts | ON | Reduces RAM→VRAM transfer for sparse models. |
| `-no-ooae` | Disable active-expert-only offload | | May help when large batches activate most experts. |
| `--fit` | Auto-fit tensors to available VRAM | off | Cannot be combined with `--cpu-moe`, `--n-cpu-moe`, or `-ot`. |
| `--fit-margin N` | VRAM safety margin for `--fit` (MiB) | 1024 | Increase if CUDA OOM occurs during load. |
| `-grt, --graph-reduce-type TYPE` | Data type for inter-GPU transfers | f32 | q8_0/bf16/f16/f32. Lower precision = less bandwidth used. |
| `--max-gpu N` | Max GPUs per layer with graph split | | Useful when using all GPUs hurts performance. |
| `-cuda, --cuda-params LIST` | CUDA-specific tuning parameters | | Controls fusion, offload threshold, MMQ-ID threshold. Example: `-cuda graphs=0`. |
| `-cuda fa-offset=VALUE` | FP16 precision offset for FA | 0 | Fixes FP16 overflow in FA at very long contexts. Value in [0..3]. |
| `-smgs, --split-mode-graph-scheduling` | Force graph scheduling in split mode | 0 | |
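A common hybrid CPU/GPU sketch for MoE models offloads all layers but pins the routed expert tensors to RAM via a tensor override (paths illustrative):

```shell
# Attention and dense tensors in VRAM, routed experts in RAM.
llama-server -m /models/model.gguf \
  -ngl 999 \
  -ot "\.ffn_.*_exps\.=CPU"

# Alternatively, let --fit decide (incompatible with -ot/--cpu-moe):
# llama-server -m /models/model.gguf --fit --fit-margin 2048
```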
Parameters for configuring how the model is loaded and how draft models work.
| Parameter | Description | Default | Notes |
|---|---|---|---|
| `-m, --model FNAME` | Path to model GGUF file | models/$filename | Required. For split models, specify only the first part. |
| `-md, --model-draft FNAME` | Draft model for speculative decoding | unused | |
| `--draft-max, --draft, --draft-n N` | Max draft tokens for speculative decoding | 16 | |
| `--draft-min, --draft-n-min N` | Min draft tokens | | |
| `--draft-p-min P` | Min speculative decoding probability | 0.8 | |
| `--check-tensors` | Validate tensor data on load | false | |
| `--override-kv KEY=TYPE:VALUE` | Override model metadata | | Types: int, float, bool, str. Example: `--override-kv tokenizer.ggml.add_bos_token=bool:false`. |
Parameters specific to llama-server.
| Parameter | Description | Default | Notes |
|---|---|---|---|
| `--host HOST` | IP address to listen on | 127.0.0.1 | Use 0.0.0.0 for network access. Never expose to the internet without authentication. |
| `--port PORT` | Port to listen on | 8080 | |
| `--webui NAME` | Which WebUI to serve | auto | `none`: disabled. `auto`: default. `llamacpp`: classic llama.cpp UI. |
| `--api-key KEY` | API authentication key | none | Clients must supply this via `Authorization: Bearer`. |
| `-a, --alias NAME` | Model name alias for the API | none | Useful when clients expect a specific model name. |
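A sketch of serving on the local network with an API key, then querying the OpenAI-compatible chat endpoint; the key and alias are placeholders:

```shell
# Listen on all interfaces, require a bearer token, alias the model.
llama-server -m /models/model.gguf \
  --host 0.0.0.0 --port 8080 \
  --api-key secret -a my-model &

# Query the OpenAI-compatible endpoint with the same token.
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer secret" \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hi"}]}'
```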

## llama-sweep-bench

Benchmarks prompt processing and token generation across a sweep of batch sizes. The KV cache is not cleared between runs, so the N_KV column shows how many tokens were in cache.
```shell
llama-sweep-bench \
  -m /models/model.gguf \
  -c 12288 -ub 512 \
  -rtr -fa \
  -ctk q8_0 -ctv q8_0
```
| Parameter | Description | Default |
|---|---|---|
| `-nrep N, --n-repetitions N` | Number of repetitions at zero context | |
| `-n N` | Number of TG tokens | ubatch/4 |

## llama-bench

Standard benchmark utility.
```shell
llama-bench -tgb 4,16 -p 512 -n 128 [other_args]
```
| Parameter | Description | Default |
|---|---|---|
| `-tgb, --threads-gen-batch` | Different thread count for generation vs batch processing | |

## llama-imatrix

Generate an importance matrix from calibration text. The imatrix improves quantization quality across all quant types.
```shell
llama-imatrix \
  -m /models/model-bf16.gguf \
  -f /data/calibration_data_v5_rc.txt \
  -o /models/model.imatrix
```
| Parameter | Description | Default |
|---|---|---|
| `--layer-similarity, -lsim` | Collect activation change statistics using cosine similarity | |
| `--hide-imatrix` | Anonymize the imatrix data file | |
Notes:

- Use convert_imatrix_gguf_to_dat.py to convert GGUF imatrix files to the format used internally.
- imatrix calculation supports models with merged ffn_up/gate_exps tensors.

## llama-quantize

Quantize a BF16 or F16 model to a compressed format.
```shell
llama-quantize \
  --imatrix /models/model.imatrix \
  /models/model-bf16.gguf \
  /models/model-IQ4_NL.gguf \
  IQ4_NL
```
To split the output for easier distribution:
```shell
llama-gguf-split \
  --split --split-max-size 1G \
  --no-tensor-first-split \
  /models/model-IQ4_NL.gguf \
  /models/parts/model-IQ4_NL.gguf
```
| Parameter | Description | Default |
|---|---|---|
| `--custom-q "regex1=type1,regex2=type2..."` | Custom per-tensor quantization rules using regex | |
| `--dry-run` | Print tensor types and sizes without running quantization | |
| `--partial-requant` | Only quantize missing split files in the destination directory | |
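A sketch of per-tensor rules following the `regex=type` syntax above; the regexes and quant types here are illustrative, not a recommended recipe:

```shell
# ffn_down tensors at q6_K, attention tensors at q8_0,
# everything else at the base IQ4_NL type. Add --dry-run
# first to preview the resulting tensor types and sizes.
llama-quantize \
  --imatrix /models/model.imatrix \
  --custom-q "ffn_down=q6_K,attn=q8_0" \
  /models/model-bf16.gguf /models/model-custom.gguf IQ4_NL
```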
CMake build configuration flags.
```shell
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)
```
| Argument | Notes |
|---|---|
| `-DGGML_NATIVE=ON` | Optimize for the host CPU. Turn off when cross-compiling. |
| `-DGGML_CUDA=ON` | Build with CUDA support. |
| `-DCMAKE_CUDA_ARCHITECTURES=86` | Target a specific CUDA compute capability (e.g. 86 for RTX 3x00). |
| `-DGGML_ARCH_FLAGS="-march=armv8.2-a+dotprod+fp16"` | Pass architecture flags directly. |
| `-DGGML_RPC=ON` | Build the RPC backend. |
| `-DGGML_IQK_FA_ALL_QUANTS=ON` | Enable all KV cache quantization types. |
| `-DLLAMA_SERVER_SQLITE3=ON` | Enable SQLite3 support (for mikupad). |
| `-DCMAKE_TOOLCHAIN_FILE=[...]` | Specify a CMake toolchain file (e.g. for Windows + SQLite3). |
| `-DGGML_NCCL=OFF` | Disable NCCL. |
Environment variables that influence runtime behavior.
```shell
CUDA_VISIBLE_DEVICES=0,2 llama-server -m /models/model-bf16.gguf
```
| Variable | Notes |
|---|---|
| `CUDA_VISIBLE_DEVICES` | Restrict which GPUs are visible. Example: `0,2` uses the first and third GPU. |
