llama-server is the primary inference server. It exposes an OpenAI-compatible HTTP API and an integrated web UI.
Example launch command
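A minimal invocation might look like this (the model path is a placeholder; every flag shown is documented in the tables below):

```shell
# Serve a local GGUF model on the default port with full GPU offload.
# Point -m at your own .gguf file; -ngl 999 offloads all layers to VRAM.
llama-server -m ./models/model.gguf \
  --host 127.0.0.1 --port 8080 \
  -ngl 999 -c 8192
```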
Basic
| Flag | Default | Description |
|---|---|---|
| -m, --model FNAME | required | Path to the .gguf model file. |
| --host HOST | 127.0.0.1 | IP address to listen on. Use 0.0.0.0 for LAN/external access. |
| --port PORT | 8080 | TCP port to listen on. |
| -c, --ctx-size N | from model | Total context size in tokens, shared across all parallel slots; raise it when increasing --parallel. |
Performance
| Flag | Default | Description |
|---|---|---|
| -fa, --flash-attn | on | Enable Flash Attention. Improves throughput and reduces KV cache memory use. |
| -ngl, --gpu-layers N | 0 | Number of model layers to offload to VRAM. Use 999 to offload all layers. |
| -mla, --mla-use N | 3 | MLA mode for DeepSeek and other MLA-based models. 3 = FlashMLA (fastest). |
| --fused-moe | enabled | Fuse the ffn_up and ffn_gate ops for faster MoE inference. Disable with --no-fused-moe. |
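Putting the performance flags together, a launch for a large MLA-based model could look like this (model path and context size are illustrative):

```shell
# Offload every layer, keep flash attention on, and use FlashMLA (mode 3).
llama-server -m ./DeepSeek-V3-Q4_K_M.gguf -ngl 999 -fa -mla 3 -c 16384
```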
Server
| Flag | Default | Description |
|---|---|---|
| --webui NAME | auto | Web UI to serve. Options: auto, llamacpp, none. |
| --api-key KEY | none | Require this key as a bearer token in the Authorization header on all requests. |
| -np, --parallel N | 1 | Number of parallel decode slots. --ctx-size is divided evenly across slots. |
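Because the API is OpenAI-compatible, any OpenAI client library can talk to the server. A dependency-free sketch using only the standard library (the endpoint URL, key, and model name here are illustrative assumptions):

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request. llama-server largely ignores
# the "model" field, since it serves the model passed with -m.
payload = {
    "model": "local",
    "messages": [{"role": "user", "content": "Say hello."}],
    "temperature": 0.8,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        # Only needed when the server was started with --api-key.
        "Authorization": "Bearer YOUR_KEY",
    },
)
# With a running server: response = urllib.request.urlopen(req)
```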
Sampling
| Flag | Default | Description |
|---|---|---|
| --temp N | 0.8 | Sampling temperature. Lower values are more deterministic. |
| --top-k N | 40 | Keep only the K most likely tokens before sampling. |
| --top-p N | 0.95 | Nucleus (top-p) sampling: keep the smallest token set whose cumulative probability reaches N. |
| --min-p N | 0.05 | Discard tokens whose probability is below N times the top token's probability. A useful alternative to top-p. |
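The min-p rule can be illustrated with a tiny filter over a probability list. This is a sketch of the idea only, not llama.cpp's implementation:

```python
def min_p_filter(probs, min_p=0.05):
    """Zero out tokens whose probability falls below min_p * p(top token)."""
    threshold = min_p * max(probs)
    return [p if p >= threshold else 0.0 for p in probs]

# A confident top token raises the cutoff and prunes the tail:
# min_p_filter([0.9, 0.05, 0.03, 0.02], 0.05) -> [0.9, 0.05, 0.0, 0.0]
```

Unlike top-p, the cutoff adapts to the model's confidence: a flat distribution keeps every candidate, while a peaked one keeps only a few.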