llama-server is a fast, lightweight HTTP server that provides an OpenAI-compatible REST API for LLM inference. It is built on httplib and exposes a WebUI, parallel decoding, function calling, speculative decoding, and embeddings — all from a single binary.
What llama-server provides
OpenAI-compatible API
Drop-in replacement for the OpenAI REST API. Point any OpenAI client at your local server without code changes.
Built-in WebUI
Interact with the model directly in your browser at http://127.0.0.1:8080.

Parallel decoding
Serve multiple users simultaneously with continuous batching and configurable parallel slots.
Function calling
Tool use for virtually any model via Jinja template support. See the function calling docs.
Speculative decoding
Accelerate token generation using a draft model or ngram-based speculation.
Embeddings
Generate text embeddings via the /v1/embeddings endpoint for retrieval-augmented workflows.

Basic usage
CPU inference
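A minimal invocation is enough for CPU inference — point the server at a local GGUF file (the model path below is illustrative):

```shell
# Serve a local GGUF model on the default host/port (127.0.0.1:8080)
llama-server -m ./models/model.gguf
```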
GPU inference
Add -ngl 999 to offload all layers to VRAM:
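For example (model path is illustrative):

```shell
# Offload all model layers to the GPU
llama-server -m ./models/model.gguf -ngl 999
```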
Never expose the server directly to the internet without authentication. By default the server binds to 127.0.0.1 (localhost only). If you need network access, use --host 0.0.0.0 together with --api-key to require authentication.

Server options
--host

Default: 127.0.0.1

IP address the server listens on. Change to 0.0.0.0 to accept connections from other machines on your network.
--port
Default: 8080

Port the server listens on.
--webui
Default: auto

Controls which WebUI to serve. Options:

| Value | Behaviour |
|---|---|
| none | Disable the WebUI entirely |
| auto | Serve the default WebUI |
| llamacpp | Serve the classic llama.cpp WebUI |
--api-key
Default: none (no authentication)

Require clients to supply an API key via the Authorization: Bearer <key> header.
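For example, to require a key at startup (the key value below is illustrative):

```shell
# Reject any request that lacks the matching Bearer token
llama-server -m ./models/model.gguf --api-key my-secret-key
```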
--alias / -a
Default: none

Set the model name alias returned by the API. Useful when a client hard-codes a specific model name.
--parallel / -np
Default: 1

Number of parallel decode slots. Enables serving multiple users simultaneously. The total context (--ctx-size) is shared across all slots.

API endpoints
Chat & completions
| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/chat/completions | OpenAI-compatible chat completions |
| POST | /v1/completions | Raw text completions |
| POST | /v1/embeddings | Text embeddings |
| POST | /v1/responses | OpenAI responses API |
Monitoring
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Server health check |
| GET | /props | Server and model properties |
| GET | /metrics | Prometheus-compatible metrics |
Example: chat completions request
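With the server running on the default host and port, a request can be sent with curl. The model field is accepted for OpenAI compatibility (the value below is illustrative; use --alias to control what name the server reports):

```shell
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

If the server was started with --api-key, add an Authorization: Bearer <key> header to the request.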
Full startup example
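A sketch combining the options above — network access with authentication, four parallel slots, and full GPU offload (model path, key, and context size are illustrative):

```shell
llama-server -m ./models/model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --api-key my-secret-key \
  --parallel 4 \
  --ctx-size 16384 \
  -ngl 999 \
  --alias my-model
```

Note that the 16384-token context is shared across the 4 slots, giving each concurrent request up to 4096 tokens.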
Related pages
- GPU offloading — Maximize performance with CUDA
- Hybrid CPU/GPU inference — Run models larger than VRAM
- Parameters reference — Full CLI parameter reference