llama-server is a fast, lightweight HTTP server that provides an OpenAI-compatible REST API for LLM inference. It is built on cpp-httplib and exposes a WebUI, parallel decoding, function calling, speculative decoding, and embeddings — all from a single binary.

What llama-server provides

OpenAI-compatible API

Drop-in replacement for the OpenAI REST API. Point any OpenAI client at your local server without code changes.

Built-in WebUI

Interact with the model directly in your browser at http://127.0.0.1:8080.

Parallel decoding

Serve multiple users simultaneously with continuous batching and configurable parallel slots.

Function calling

Tool use for virtually any model via Jinja template support. See the function calling docs.

Speculative decoding

Accelerate token generation using a draft model or ngram-based speculation.

Embeddings

Generate text embeddings via the /v1/embeddings endpoint for retrieval-augmented workflows.

Basic usage

CPU inference

./build/bin/llama-server \
  --model /path/to/model.gguf \
  --ctx-size 4096

GPU inference

Add -ngl 999 to offload all layers to VRAM:
./build/bin/llama-server \
  --model /path/to/model.gguf \
  --ctx-size 4096 \
  -ngl 999
Once the server starts, open http://127.0.0.1:8080 in your browser to access the WebUI.
Never expose the server directly to the internet without authentication. By default the server binds to 127.0.0.1 (localhost only). If you need network access, use --host 0.0.0.0 together with --api-key to require authentication.

Server options

--host

Default: 127.0.0.1. IP address the server listens on. Change to 0.0.0.0 to accept connections from other machines on your network.

--host 0.0.0.0

--port

Default: 8080. Port the server listens on.

--port 9000

--webui

Default: auto. Controls which WebUI to serve. Options:

| Value | Behaviour |
| --- | --- |
| none | Disable the WebUI entirely |
| auto | Serve the default WebUI |
| llamacpp | Serve the classic llama.cpp WebUI |

--webui llamacpp

--api-key

Default: none (no authentication). Require clients to supply an API key via the Authorization: Bearer <key> header.

--api-key mysecretkey

--alias

Default: none. Set the model name alias returned by the API. Useful when a client hard-codes a specific model name.

--alias my-model

--parallel

Default: 1. Number of parallel decode slots. Enables serving multiple users simultaneously. The total context (--ctx-size) is shared across all slots.

--parallel 4 --ctx-size 16384
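These options compose on one command line. As a sketch, a server for several users on a local network might combine them like this (the model path and API key are placeholders — substitute your own):

```shell
# Serve up to 4 concurrent requests on the LAN; the 16384-token
# context budget is shared across the 4 slots. Placeholders:
# /path/to/model.gguf and mysecretkey are not real values.
./build/bin/llama-server \
  --model /path/to/model.gguf \
  --ctx-size 16384 \
  --parallel 4 \
  --host 0.0.0.0 \
  --api-key mysecretkey
```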

API endpoints

Chat & completions

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /v1/chat/completions | OpenAI-compatible chat completions |
| POST | /v1/completions | Raw text completions |
| POST | /v1/embeddings | Text embeddings |
| POST | /v1/responses | OpenAI Responses API |

Monitoring

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Server health check |
| GET | /props | Server and model properties |
| GET | /metrics | Prometheus-compatible metrics |
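A quick way to probe a running server from the command line. This sketch assumes the default host and port from the examples above; note that, in current llama.cpp builds, /metrics is typically only served when the server was started with the --metrics flag:

```shell
# Liveness check -- typically returns a small JSON status once the model is loaded.
curl http://127.0.0.1:8080/health

# Inspect server and model properties (context size, chat template, etc.).
curl http://127.0.0.1:8080/props

# Prometheus-format counters (usually requires launching with --metrics).
curl http://127.0.0.1:8080/metrics
```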

Example: chat completions request

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'
With an API key:
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer mysecretkey" \
  -d '{
    "model": "my-model",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
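The /v1/embeddings endpoint accepts the same OpenAI-style request shape. A minimal sketch, assuming the server was started in a mode that enables embeddings (in many llama.cpp builds this means launching with the --embeddings flag — check your build's --help):

```shell
# Request an embedding vector for a single input string.
# "my-model" is the alias configured at startup, as in the examples above.
curl http://127.0.0.1:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "input": "What is the capital of France?"
  }'
```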

Full startup example

./build/bin/llama-server \
  --model /models/Qwen_Qwen3-30B-A3B-IQ4_NL.gguf \
  --ctx-size 8192 \
  --host 0.0.0.0 \
  --port 8080 \
  --api-key mysecretkey \
  --alias qwen3-30b \
  --parallel 2 \
  -ngl 999 \
  -fa \
  --jinja
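With this configuration the server listens on all interfaces, so you can verify it from another machine. A hedged sketch (replace <server-ip> with the host's actual address; the key and alias match the startup flags above, and /health typically does not require the API key while the chat endpoints do):

```shell
# Unauthenticated liveness check from a remote machine.
curl http://<server-ip>:8080/health

# Authenticated chat request using the alias set with --alias.
curl http://<server-ip>:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer mysecretkey" \
  -d '{"model": "qwen3-30b", "messages": [{"role": "user", "content": "Hello!"}]}'
```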
