ik_llama.cpp supports vision-language models through the libmtmd library. You can interact with multimodal models using either the llama-mtmd-cli command-line tool or the llama-server HTTP API.
Multimodal support is under active development and breaking changes are expected. The server integration is currently marked as a work in progress.
## How it works
Multimodal support works by encoding images into embeddings using a separate model component, then feeding those embeddings into the language model alongside the text prompt. This requires two GGUF files:

- The main language model (`.gguf`)
- A multimodal projector (`mmproj`) file, which handles image encoding and projection into the model's embedding space
## Supported models
ik_llama.cpp adds support for several vision models on top of the base llama.cpp multimodal stack. Notable additions include:
- Qwen3-VL — added in PR 883
- Qwen 2 VL / Qwen 2.5 VL
- Gemma 3 (vision variants)
- SmolVLM / SmolVLM2
- Pixtral 12B
- Mistral Small 3.1 24B
- InternVL 2.5 / InternVL 3
- LLaVA (legacy)
- MobileVLM (legacy)
- MiniCPM-V 2.5 / 2.6
## Obtaining the mmproj file
For supported models, you can generate the projector file from the original HuggingFace checkpoint using convert_hf_to_gguf.py with the --mmproj flag:
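A sketch of the conversion, assuming a local checkpoint directory (the path and output filenames here are illustrative):

```shell
# Convert the text model to GGUF (checkpoint path is a placeholder)
python convert_hf_to_gguf.py /path/to/checkpoint --outfile model.gguf

# Export the vision encoder/projector as a separate mmproj file
python convert_hf_to_gguf.py /path/to/checkpoint --mmproj --outfile mmproj.gguf
```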
For the legacy models (LLaVA, MobileVLM), use the standalone conversion scripts in tools/mtmd/legacy-models/.
## Using llama-mtmd-cli
llama-mtmd-cli is the unified command-line interface for multimodal inference. It replaces the older model-specific binaries (qwen2vl-cli, gemma3-cli, etc.).
### Basic image query
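A minimal one-shot invocation might look like this (model, projector, and image paths are placeholders):

```shell
llama-mtmd-cli -m model.gguf --mmproj mmproj.gguf \
    --image photo.jpg -p "Describe this image in detail."
```

The model answers the prompt once and exits.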
### Interactive mode
Omitting -p and --image starts a chat session. Within the session, load an image with the /image command before asking about it.
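An interactive session sketch, with placeholder paths:

```shell
llama-mtmd-cli -m model.gguf --mmproj mmproj.gguf
# At the chat prompt:
#   /image photo.jpg
#   What is shown in this picture?
```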
### With GPU offload
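Layer offloading works the same as for text-only inference; a sketch using -ngl to offload all layers (paths are placeholders):

```shell
llama-mtmd-cli -m model.gguf --mmproj mmproj.gguf \
    --image photo.jpg -p "Describe this image." -ngl 99
```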
## Using llama-server
llama-server exposes multimodal capabilities through the standard OpenAI-compatible API. Pass both the model and projector files at startup:
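For example (file names and port are placeholders):

```shell
llama-server -m model.gguf --mmproj mmproj.gguf --port 8080
```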
Once the server is running, images are passed via the image_url field of a chat message.
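A request sketch using curl against the OpenAI-compatible endpoint (host, port, and image URL are placeholders):

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
        ]
      }
    ]
  }'
```

Base64-encoded data URIs (`data:image/jpeg;base64,...`) can be used in place of a remote URL.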
## More examples
The examples/mtmd/ directory contains model-specific documentation, test images, and scripts. See tests.sh for sample invocations covering a range of models and input types.