`llama-server` is the primary inference server. It exposes an OpenAI-compatible HTTP API and an integrated web UI. Example launch command:
```bash
./build/bin/llama-server \
  --model /models/Qwen_Qwen3-30B-A3B-IQ4_NL.gguf \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 8192 \
  -ngl 999 --flash-attn \
  --parallel 2 --api-key mysecret
```
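Once the server is up, any OpenAI-style client can talk to it. A minimal standard-library Python sketch (the base URL and `mysecret` key mirror the launch command above; the `model` field is illustrative, since llama-server serves whichever model it loaded):

```python
import json
import urllib.request

def chat_request(prompt, api_key="mysecret", base="http://localhost:8080"):
    """Build an OpenAI-style chat completion request for llama-server."""
    body = json.dumps({
        "model": "local",  # placeholder; the server uses its loaded model
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.8,
    }).encode()
    return urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

req = chat_request("Hello!")
print(req.full_url)
# With the server running, send it like so:
# resp = urllib.request.urlopen(req)
# print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

The `Authorization` header is only required when the server was started with `--api-key`.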
| Flag | Default | Description |
| --- | --- | --- |
| `-m, --model FNAME` | required | Path to the `.gguf` model file. |
| `--host HOST` | `127.0.0.1` | IP address to listen on. Use `0.0.0.0` for LAN/external access. |
| `--port PORT` | `8080` | TCP port to listen on. |
| `-c, --ctx-size N` | from model | Context window size in tokens. Shared across parallel slots, so increase it along with `--parallel`. |
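The context split is simple arithmetic, sketched here with the values from the launch command above (8192-token context, 2 slots):

```python
def per_slot_ctx(ctx_size: int, parallel: int) -> int:
    """--ctx-size is shared: each of the --parallel slots gets an equal share."""
    return ctx_size // parallel

print(per_slot_ctx(8192, 2))  # each slot sees 4096 tokens
```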
| Flag | Default | Description |
| --- | --- | --- |
| `-fa, --flash-attn` | on | Enable Flash Attention, which improves throughput and reduces KV cache memory. |
| `-ngl, --gpu-layers N` | `0` | Number of model layers to offload to VRAM. Use `999` to offload everything. |
| `-mla, --mla-use N` | `3` | MLA mode for DeepSeek and other MLA-based models. `3` = FlashMLA (fastest). |
| `--fused-moe` | enabled | Fuse the `ffn_up` and `ffn_gate` ops for faster MoE inference. Disable with `--no-fused-moe`. |
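The `999` idiom for `-ngl` works because the requested layer count is capped at what the model actually has; a hedged sketch of that clamping (the function name and the 48-layer figure are illustrative, not taken from the source):

```python
def layers_to_offload(n_model_layers: int, ngl: int) -> int:
    """-ngl is effectively clamped to the model's layer count,
    so an oversized value like 999 means 'offload everything'."""
    return min(n_model_layers, ngl)

print(layers_to_offload(48, 999))  # a 48-layer model offloads all 48
```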
| Flag | Default | Description |
| --- | --- | --- |
| `--webui NAME` | `auto` | Web UI to serve. Options: `auto`, `llamacpp`, `none`. |
| `--api-key KEY` | none | Require this key in the `Authorization` header for all requests. |
| `-np, --parallel N` | `1` | Number of parallel decode slots. `--ctx-size` is divided across all slots. |
| Flag | Default | Description |
| --- | --- | --- |
| `--temp N` | `0.8` | Sampling temperature. Lower values are more deterministic. |
| `--top-k N` | `40` | Keep only the K most likely tokens before sampling. |
| `--top-p N` | `0.95` | Nucleus (top-p) sampling threshold. |
| `--min-p N` | `0.05` | Minimum probability relative to the top token; a useful alternative to top-p. |
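These four samplers compose into a filtering chain. A toy, standard-library sketch of one plausible order (temperature, then top-k, min-p, and top-p; llama-server's actual sampler pipeline and ordering are configurable and may differ):

```python
import math
import random

def filter_and_sample(logits, temp=0.8, top_k=40, top_p=0.95, min_p=0.05, rng=None):
    """Toy sampling chain over raw logits; returns a token index."""
    rng = rng or random.Random(0)  # fixed seed for a reproducible demo
    # temperature: scale logits, then softmax (lower temp sharpens the distribution)
    m = max(l / temp for l in logits)
    exps = [math.exp(l / temp - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # top-k: keep only the K most likely token indices
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # min-p: drop tokens below min_p times the top token's probability
    cutoff = min_p * probs[order[0]]
    order = [i for i in order if probs[i] >= cutoff]
    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # draw from the surviving tokens, renormalized
    r = rng.random() * sum(probs[i] for i in kept)
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

print("sampled token index:", filter_and_sample([2.0, 1.0, 0.2, -1.0]))
```

With the defaults above, min-p prunes the rarest token and top-p trims the long tail before the draw.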
