llama-server is a fast, lightweight HTTP server that provides an OpenAI-compatible REST API for LLM inference. It is built on httplib and exposes a WebUI, parallel decoding, function calling, speculative decoding, and embeddings — all from a single binary.
What llama-server provides
OpenAI-compatible API
Drop-in replacement for the OpenAI REST API. Point any OpenAI client at your local server without code changes.
Built-in WebUI
Interact with the model directly in your browser at http://127.0.0.1:8080.

Parallel decoding
Serve multiple users simultaneously with continuous batching and configurable parallel slots.
Function calling
Tool use for virtually any model via Jinja template support. See the function calling docs.
Speculative decoding
Accelerate token generation using a draft model or ngram-based speculation.
Embeddings
Generate text embeddings via the /v1/embeddings endpoint for retrieval-augmented workflows.

Basic usage
CPU inference
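A minimal invocation is enough for CPU inference — point the server at a local GGUF file (the model path below is illustrative):

```shell
# Serve a local GGUF model on the default host/port (127.0.0.1:8080)
llama-server -m ./models/model.gguf
```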
GPU inference
Add -ngl 999 to offload all layers to VRAM:
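For example (model path is illustrative):

```shell
# Offload all model layers to the GPU
llama-server -m ./models/model.gguf -ngl 999
```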
Never expose the server directly to the internet without authentication. By default the server binds to 127.0.0.1 (localhost only). If you need network access, use --host 0.0.0.0 together with --api-key to require authentication.

Server options
--host

Default: 127.0.0.1

IP address the server listens on. Change to 0.0.0.0 to accept connections from other machines on your network.
--port
Default: 8080

Port the server listens on.
--webui
Default: auto

Controls which WebUI to serve. Options:

| Value | Behaviour |
|---|---|
| none | Disable the WebUI entirely |
| auto | Serve the default WebUI |
| llamacpp | Serve the classic llama.cpp WebUI |
--api-key
Default: none (no authentication)

Require clients to supply an API key via the Authorization: Bearer <key> header.
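For example, to require a key at startup (the key value below is illustrative):

```shell
# Reject any request that lacks the matching Bearer token
llama-server -m ./models/model.gguf --api-key my-secret-key
```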
--alias / -a
Default: none

Set the model name alias returned by the API. Useful when a client hard-codes a specific model name.
--parallel / -np
Default: 1

Number of parallel decode slots. Enables serving multiple users simultaneously. The total context (--ctx-size) is shared across all slots.

API endpoints
Chat & completions
| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/chat/completions | OpenAI-compatible chat completions |
| POST | /v1/completions | Raw text completions |
| POST | /v1/embeddings | Text embeddings |
| POST | /v1/responses | OpenAI responses API |
Monitoring
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Server health check |
| GET | /props | Server and model properties |
| GET | /metrics | Prometheus-compatible metrics |
Example: chat completions request
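With the server running on the default host and port, a request can be sent with curl. The model field is accepted for OpenAI compatibility (the value below is illustrative; use --alias to control what name the server reports):

```shell
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

If the server was started with --api-key, add an Authorization: Bearer <key> header to the request.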
Full startup example
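A sketch combining the options above — network access with authentication, four parallel slots, and full GPU offload (model path, key, and context size are illustrative):

```shell
llama-server -m ./models/model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --api-key my-secret-key \
  --parallel 4 \
  --ctx-size 16384 \
  -ngl 999 \
  --alias my-model
```

Note that the 16384-token context is shared across the 4 slots, giving each concurrent request up to 4096 tokens.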
Related pages
- GPU offloading — Maximize performance with CUDA
- Hybrid CPU/GPU inference — Run models larger than VRAM
- Parameters reference — Full CLI parameter reference