docker/ directory of this repository, which include llama-swap for model management and support both CPU and CUDA targets.
## Community build: llama-swap + Podman/Docker
The `docker/` directory contains ready-to-use Containerfiles and llama-swap config files for CPU and CUDA. Clone the repository to a working directory (e.g. `~/ik_llama/`), then follow the steps below.
## Building
- CPU
- CUDA
Two build variants are available:

- `swap` — includes `llama-swap` and `llama-server` only (recommended for serving)
- `full` — additionally includes `llama-quantize`, `llama-sweep-bench`, `llama-perplexity`, and other utilities
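As a sketch of the build step, the commands below assume the Containerfile names, the image tags, and that `swap`/`full` are multi-stage build targets — check `docker/` for the actual file and target names (`ik_llama-cpu` matches the tag used in the troubleshooting section):

```shell
# CPU image (Containerfile name and tag are assumptions; see docker/)
podman build -f docker/Containerfile.cpu -t ik_llama-cpu .

# CUDA image, selecting the "swap" variant (recommended for serving)
podman build -f docker/Containerfile.cuda --target swap -t ik_llama-cuda .
```

Substitute `docker build` for `podman build` if you use Docker.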
## Running
Map your model directory to `/models` inside the container. The web UI is available at http://localhost:9292 and the OpenAI-compatible API at http://localhost:9292/v1.
- CPU
- CUDA
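A run invocation consistent with the paths and ports above might look like the following (the image tag is an assumption; the container name `ik_llama` matches the stop commands below):

```shell
# Serve models from ~/models on port 9292; adjust the host path to
# wherever your .gguf files live. The image tag is an assumption.
podman run -it --name ik_llama \
  -v ~/models:/models \
  -p 9292:9292 \
  ik_llama-cpu
```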
To run in the background, replace `-it` with `-d`. Stop the container with `podman stop ik_llama` or `docker stop ik_llama`.
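Once the container is up, you can smoke-test the OpenAI-compatible endpoint; the model name here is a placeholder for whatever your llama-swap config defines:

```shell
# Query the OpenAI-compatible API ("my-model" is a placeholder model name)
curl http://localhost:9292/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'
```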
## Building your own image from .devops/
If you need more control, you can build directly from the upstream-inherited Dockerfiles in `.devops/`:
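For example (the Dockerfile name here is an assumption — list `.devops/` to see which Dockerfiles the repository actually ships):

```shell
# Build from an upstream-inherited Dockerfile (file name is an assumption)
docker build -f .devops/full-cuda.Dockerfile -t ik_llama-devops-cuda .
```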
The CUDA Dockerfiles default to `CUDA_VERSION=11.7.1` and `CUDA_DOCKER_ARCH=all`. Override them to match your environment:
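A sketch of overriding those build args (the Dockerfile name, CUDA version, and architecture value are examples — pick the ones matching your driver and GPU):

```shell
# Override the CUDA defaults; 12.4.0 and the arch value are examples
docker build -f .devops/full-cuda.Dockerfile \
  --build-arg CUDA_VERSION=12.4.0 \
  --build-arg CUDA_DOCKER_ARCH=86 \
  -t ik_llama-devops-cuda .
```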
## Running with GPU passthrough
Requires `nvidia-container-toolkit` installed on the host. Pass `--n-gpu-layers 999` (or `-ngl 999`) to offload the entire model to VRAM. See the performance tips page for guidance on choosing the right value.
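Sketches of GPU passthrough for both runtimes (image tag and model path are assumptions carried over from the examples above):

```shell
# Docker: expose all GPUs via the NVIDIA Container Toolkit
docker run -d --name ik_llama --gpus all \
  -v ~/models:/models -p 9292:9292 ik_llama-cuda

# Podman: use CDI device injection instead
podman run -d --name ik_llama --device nvidia.com/gpu=all \
  -v ~/models:/models -p 9292:9292 ik_llama-cuda
```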
### GPU selection
Use the `CUDA_VISIBLE_DEVICES` environment variable to restrict which GPUs the container uses:
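For example, to pin the container to the first GPU (device index and image tag are examples):

```shell
# Only GPU 0 is visible to the processes inside the container
docker run -d --name ik_llama --gpus all \
  -e CUDA_VISIBLE_DEVICES=0 \
  -v ~/models:/models -p 9292:9292 ik_llama-cuda
```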
## Pinning to a specific commit
To build a community image from a specific `ik_llama.cpp` commit, pass `CUSTOM_COMMIT`:
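A sketch, assuming `CUSTOM_COMMIT` is a build arg of the community Containerfiles (the Containerfile name is an assumption, and the commit sha is a placeholder):

```shell
# Pin the build to a specific commit; replace <commit-sha> with a real sha
podman build -f docker/Containerfile.cuda \
  --build-arg CUSTOM_COMMIT=<commit-sha> \
  -t ik_llama-cuda:pinned .
```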
## Troubleshooting
### CUDA is not detected inside the container
Make sure the NVIDIA Container Toolkit (Docker) or CDI (Podman) is installed and the host drivers match the CUDA version baked into the image. If CUDA is unavailable, fall back to the `ik_llama-cpu` image.

### Models are not found
Verify the host path in your `-v` flag points to the directory that actually contains your `.gguf` files. Inside the container the path must be `/models`.

### Build fails on a machine with a different CPU
The Containerfiles default to `-DGGML_NATIVE=ON`, which optimises for the build machine’s CPU. If you are building on a different machine from where you will run the container, change that flag to `-DGGML_NATIVE=OFF` in the Containerfile before building.

### Old CUDA images accumulate and use disk space
List images with `docker image ls` and remove unused base images such as `docker.io/nvidia/cuda:12.4.0-devel-ubuntu22.04` with `docker image rm`.