Documentation Index
Fetch the complete documentation index at: https://mintlify.com/jundot/omlx/llms.txt
Use this file to discover all available pages before exploring further.

oMLX runs vision-language models through the same VLMBatchedEngine that powers text LLMs, giving VLMs access to the full continuous batching stack and both tiers of the KV cache without any special-casing at the API layer. Images are processed by mlx-vlm's vision encoder, merged with the text embedding sequence, and prefilled into the KV cache as a standard token sequence. After prefill, decoding uses regular token IDs; the visual context lives in cached KV blocks just like any other prompt prefix.
Supported Models
| Category | Models |
|---|---|
| General VLM | Qwen3.5 Series, GLM-4V, Pixtral, and other mlx-vlm compatible models |
| OCR | DeepSeek-OCR, DOTS-OCR, GLM-OCR |
OCR models are identified by their config model_type values, such as deepseekocr_2, dots_ocr, and glm_ocr.
OCR Auto-Detection
When an OCR model receives a chat completion request containing an image without an explicit OCR instruction, oMLX injects a model-specific default prompt automatically:

| OCR Model | Default Injected Prompt |
|---|---|
| DeepSeek-OCR (v1 & v2) | Convert the document to markdown. |
| DOTS-OCR | Convert this page to clean Markdown while preserving reading order. |
| GLM-OCR | Text Recognition: |
Generation also stops on chat-template special tokens (<|user|>, <|im_start|>, etc.) to prevent degenerate output after the transcription ends, a common issue with models that lack proper EOS handling.
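The prompt-injection behavior above can be sketched as follows. This is a minimal illustration, not oMLX's internal API: the function name, the prompt table keys (taken from the model_type values mentioned earlier), and the request shape are assumptions for the sketch.

```python
# Hypothetical sketch of OCR default-prompt injection; names and request
# shape are illustrative, not oMLX internals.
DEFAULT_OCR_PROMPTS = {
    "deepseekocr_2": "Convert the document to markdown.",
    "dots_ocr": "Convert this page to clean Markdown while preserving reading order.",
    "glm_ocr": "Text Recognition:",
}

def inject_default_prompt(model_type, content):
    """Prepend the model-specific default OCR prompt when the message
    carries an image but no explicit text instruction."""
    prompt = DEFAULT_OCR_PROMPTS.get(model_type)
    if prompt is None:
        return content  # not a recognized OCR model
    has_image = any(p.get("type") == "image_url" for p in content)
    has_text = any(p.get("type") == "text" and p.get("text", "").strip()
                   for p in content)
    if has_image and not has_text:
        return [{"type": "text", "text": prompt}] + content
    return content
```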
Image Input Formats
All three OpenAI-compatible image URL formats are supported:

- Base64
- HTTP URL
- Local File
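For illustration, the three formats can be built as OpenAI-style image_url content parts like this. The exact local-file form (shown here as a file:// URI) and the URLs are assumptions, not taken from oMLX's docs:

```python
import base64

def image_part(url):
    """Build an OpenAI-style image_url content part."""
    return {"type": "image_url", "image_url": {"url": url}}

# Base64: embed raw image bytes as a data URL (placeholder bytes shown).
raw = b"\x89PNG\r\n\x1a\n"  # real image bytes in practice
data_url = "data:image/png;base64," + base64.b64encode(raw).decode()

parts = [
    image_part(data_url),                        # base64 data URL
    image_part("https://example.com/page.png"),  # HTTP URL
    image_part("file:///tmp/page.png"),          # local file (file:// form assumed)
]
```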
Multi-Image Chat
A single message can include multiple images. oMLX extracts all image references from the content array, processes them through the vision encoder in order, and merges the resulting embeddings with the surrounding text tokens before prefill.

Full Example: Chat Completion with an Image
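A request body for such a multi-image message might look like the following; the model name and image URLs are illustrative placeholders:

```python
# Illustrative chat completion payload; model name and URLs are placeholders.
payload = {
    "model": "some-vlm",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two receipts."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/receipt-1.png"}},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/receipt-2.png"}},
            ],
        }
    ],
}

# The images are processed by the vision encoder in content-array order.
image_urls = [part["image_url"]["url"]
              for part in payload["messages"][0]["content"]
              if part["type"] == "image_url"]
```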
Tool Calling with Vision Context
VLMs support the same tool calling formats as text LLMs. The vision context (image embeddings) is present in the KV cache throughout the conversation, so tool calls can reference visual content in their arguments.

KV Cache and Continuous Batching
VLM requests use the same tiered KV cache (hot RAM + cold SSD) and continuous batching scheduler as text LLMs. Vision features computed during prefill are stored in KV cache blocks under content-addressed hashes that incorporate the image hash as an extra_key. This means:
- The same image appearing in subsequent messages reuses cached vision features without rerunning the vision encoder.
- Multiple concurrent VLM requests are batched together at the token level by BatchGenerator.
- Long VLM conversations with large images can spill to the SSD cold tier and survive server restarts.
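The extra_key idea can be sketched as below. This is not oMLX's actual hashing scheme; it only illustrates why folding the image hash into the block hash keeps identical token prefixes with different images from aliasing in the cache:

```python
import hashlib

def block_hash(token_ids, extra_key=b""):
    """Content-addressed KV block hash sketch: hash the extra key (e.g. the
    image hash) together with the token sequence, so the same tokens with
    different images yield different cache entries."""
    h = hashlib.sha256()
    h.update(extra_key)
    for t in token_ids:
        h.update(t.to_bytes(4, "little"))
    return h.hexdigest()

image_hash = hashlib.sha256(b"<raw image bytes>").digest()
tokens = [101, 2023, 2003]

with_image = block_hash(tokens, extra_key=image_hash)
without_image = block_hash(tokens)
```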
Model type is auto-detected. If a VLM fails to load via mlx-vlm (e.g., unsupported architecture), oMLX automatically falls back to loading it as a text LLM via mlx-lm and logs a warning.
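The fallback order can be sketched as follows; the function and the injected loader arguments are hypothetical stand-ins for the mlx-vlm and mlx-lm loaders, used here only to show the try-then-fall-back shape:

```python
import logging

def load_model(path, vlm_loader, llm_loader):
    """Sketch of the loading order: try the VLM loader (mlx-vlm) first;
    on any failure, log a warning and fall back to the text-LLM loader
    (mlx-lm). Loaders are passed in purely for illustration."""
    try:
        return vlm_loader(path)
    except Exception as exc:
        logging.warning("VLM load failed (%s); falling back to text LLM", exc)
        return llm_loader(path)
```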