The Docker images listed in this repository (ghcr.io/ggerganov/llama.cpp:*) are inherited from upstream llama.cpp and have never been updated for ik_llama.cpp. They are outdated and do not include ik_llama.cpp-specific features, quantisation types, or performance improvements. Build your own image from the Dockerfiles in .devops/ to get ik_llama.cpp functionality.
The recommended path for containerised deployment is to use the community-maintained Containerfiles bundled in the docker/ directory of this repository, which include llama-swap for model management and support both CPU and CUDA targets.

Community build: llama-swap + Podman/Docker

The docker/ directory contains ready-to-use Containerfiles and llama-swap config files for CPU and CUDA:
docker/
├── ik_llama-cpu.Containerfile
├── ik_llama-cpu-swap.config.yaml
├── ik_llama-cuda.Containerfile
└── ik_llama-cuda-swap.config.yaml
Download those four files to a local directory (for example ~/ik_llama/), then follow the steps below.

Building

# Podman
podman image build --format docker \
  --file ik_llama-cpu.Containerfile --target full --tag ik_llama-cpu:full .
podman image build --format docker \
  --file ik_llama-cpu.Containerfile --target swap --tag ik_llama-cpu:swap .

# Docker
docker image build --file ik_llama-cpu.Containerfile --target full --tag ik_llama-cpu:full .
docker image build --file ik_llama-cpu.Containerfile --target swap --tag ik_llama-cpu:swap .
The build produces two image tags:
  • swap — includes llama-swap and llama-server only (recommended for serving)
  • full — additionally includes llama-quantize, llama-sweep-bench, llama-perplexity, and other utilities
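The CUDA image builds the same way from ik_llama-cuda.Containerfile, with the same two targets (a sketch; Docker shown, use podman image build --format docker for Podman):

```shell
# Build the CUDA variants of the swap and full images.
docker image build --file ik_llama-cuda.Containerfile --target swap --tag ik_llama-cuda:swap .
docker image build --file ik_llama-cuda.Containerfile --target full --tag ik_llama-cuda:full .
```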

Running

Map your model directory to /models inside the container. The web UI is available at http://localhost:9292 and the OpenAI-compatible API at http://localhost:9292/v1.
# Podman
podman run -it --name ik_llama --rm \
  -p 9292:8080 \
  -v /my_local_files/gguf:/models:ro \
  localhost/ik_llama-cpu:swap

# Docker
docker run -it --name ik_llama --rm \
  -p 9292:8080 \
  -v /my_local_files/gguf:/models:ro \
  ik_llama-cpu:swap
To run in the background, replace -it with -d. Stop the container with podman stop ik_llama or docker stop ik_llama.
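Once the container is up, you can sanity-check the API from the host; /v1/models is the standard OpenAI-compatible model-listing path, assumed here to be served by llama-swap:

```shell
# Should return a JSON list of the models llama-swap knows about.
curl -s http://localhost:9292/v1/models
```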

Building your own image from .devops/

If you need more control, you can build directly from the upstream-inherited Dockerfiles in .devops/:
# Server with CUDA support
docker build -t local/llama.cpp:server-cuda -f .devops/llama-server-cuda.Dockerfile .

# Full image with utilities (CUDA)
docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .

# CLI only (CUDA)
docker build -t local/llama.cpp:light-cuda -f .devops/llama-cli-cuda.Dockerfile .
The default build args are CUDA_VERSION=11.7.1 and CUDA_DOCKER_ARCH=all. Override them to match your environment:
docker build \
  --build-arg CUDA_VERSION=12.4.0 \
  --build-arg CUDA_DOCKER_ARCH=89 \
  -t local/llama.cpp:server-cuda \
  -f .devops/llama-server-cuda.Dockerfile .

Running with GPU passthrough

Requires the NVIDIA Container Toolkit (nvidia-container-toolkit) to be installed on the host.
docker run --gpus all \
  -v /path/to/models:/models \
  -p 8000:8000 \
  local/llama.cpp:server-cuda \
  -m /models/model.gguf \
  --port 8000 --host 0.0.0.0 \
  -n 512 \
  --n-gpu-layers 999
Pass --n-gpu-layers 999 (or -ngl 999) to offload the entire model to VRAM. See the performance tips page for guidance on choosing the right value.
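Before starting the server, it can be worth confirming that GPU passthrough works at all; a quick sketch (the CUDA base image tag is an example — any CUDA image with nvidia-smi will do):

```shell
# Should print the same GPU table as running nvidia-smi on the host.
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```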

GPU selection

Set the CUDA_VISIBLE_DEVICES environment variable inside the container (via -e; setting it on the host shell does not propagate into the container) to restrict which GPUs are used:
# Use only the first and third GPU
docker run --gpus all \
  -e CUDA_VISIBLE_DEVICES=0,2 \
  -v /path/to/models:/models \
  local/llama.cpp:server-cuda \
  -m /models/model.gguf -ngl 999
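With Podman, GPU access goes through CDI rather than --gpus; a sketch assuming the NVIDIA CDI spec has already been generated on the host (nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml):

```shell
# Expose only the first GPU to the container via its CDI device name.
podman run --device nvidia.com/gpu=0 \
  -v /path/to/models:/models \
  local/llama.cpp:server-cuda \
  -m /models/model.gguf -ngl 999
```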

Pinning to a specific commit

To build a community image from a specific ik_llama.cpp commit, pass CUSTOM_COMMIT:
docker image build \
  --file ik_llama-cuda.Containerfile \
  --target full \
  --build-arg CUSTOM_COMMIT="1ec12b8" \
  --tag ik_llama-cuda-1ec12b8:full .

Troubleshooting

  • Make sure the NVIDIA Container Toolkit (Docker) or CDI (Podman) is installed and the host drivers match the CUDA version baked into the image. If CUDA is unavailable, fall back to the ik_llama-cpu image.
  • Verify that the host path in your -v flag points to the directory that actually contains your .gguf files. Inside the container the path must be /models.
  • The Containerfiles default to -DGGML_NATIVE=ON, which optimises for the build machine’s CPU. If you build on one machine and run the container on another, change that flag to -DGGML_NATIVE=OFF in the Containerfile before building.
  • List images with docker image ls and remove unused base images such as docker.io/nvidia/cuda:12.4.0-devel-ubuntu22.04 with docker image rm.
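The GGML_NATIVE change mentioned above is a one-line edit. A sketch demonstrated on a stand-in file, since the real cmake line in the Containerfile may look different — in practice, run the same sed command against ik_llama-cpu.Containerfile itself:

```shell
# Stand-in for the real Containerfile: a single cmake line with the flag on.
printf 'RUN cmake -B build -DGGML_NATIVE=ON ..\n' > /tmp/demo.Containerfile
# Flip the flag so the resulting binaries are portable across CPUs.
sed -i 's/-DGGML_NATIVE=ON/-DGGML_NATIVE=OFF/' /tmp/demo.Containerfile
grep GGML_NATIVE /tmp/demo.Containerfile
```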
