MLX-VLM registers four CLI entry points when you install the package. Each one maps directly to a Python module's `main()` function.

| Command | Module | Purpose |
|---|---|---|
| `mlx_vlm.generate` | `mlx_vlm.generate` | One-shot text generation from text, images, or audio |
| `mlx_vlm.chat_ui` | `mlx_vlm.chat_ui` | Launch a Gradio chat interface |
| `mlx_vlm.convert` | `mlx_vlm.convert` | Convert and quantize Hugging Face checkpoints |
| `mlx_vlm.server` | `mlx_vlm.server` | Start an OpenAI-compatible HTTP server |

mlx_vlm.generate
The primary inference command. It supports text-only, image, audio, and multimodal inputs.

Examples
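
As a rough illustration of basic usage, the sketch below assembles two `mlx_vlm.generate` invocations from flags documented on this page and prints them as shell commands. The image path and the prompts are placeholders, and building the argument list in Python is only one convenient way to get shell quoting right; MLX-VLM does not require it.

```python
import shlex

MODEL = "mlx-community/nanoLLaVA-1.5-8bit"  # the documented default for --model

# Image + text prompt. The image path is a placeholder, not a file shipped with MLX-VLM.
image_cmd = [
    "mlx_vlm.generate",
    "--model", MODEL,
    "--image", "path/to/image.jpg",
    "--prompt", "Describe this image.",
    "--max-tokens", "100",
    "--temperature", "0.0",
]

# Text-only prompt: the same command without --image or --audio.
text_cmd = [
    "mlx_vlm.generate",
    "--model", MODEL,
    "--prompt", "Summarize what a vision-language model does.",
    "--max-tokens", "100",
]

for cmd in (image_cmd, text_cmd):
    # shlex.join quotes each argument so the printed line can be pasted into a shell.
    print(shlex.join(cmd))

# To launch from Python instead of a shell:
#   import subprocess; subprocess.run(image_cmd, check=True)
```

Because `--image` and `--audio` accept one or more values, additional paths or URLs can be appended directly after the flag.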
Thinking budget
For reasoning models such as Qwen3.5, you can cap the number of tokens spent inside the thinking block with `--thinking-budget`. When the budget is exceeded, the model is forced to emit `\n</think>` and transition to its answer. If `--enable-thinking` is set but the model's chat template does not support it, the budget is applied only when the model generates the start token on its own.
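
A thinking-budget invocation might look like the sketch below. The model identifier is a placeholder for whichever reasoning model you use, and the values 512 and 1024 are arbitrary. Setting `--max-tokens` above the budget assumes that thinking tokens count toward the overall generation limit, which this page does not state explicitly.

```python
import shlex

cmd = [
    "mlx_vlm.generate",
    "--model", "path/to/reasoning-model",  # placeholder: substitute a real repo ID or local path
    "--prompt", "How many prime numbers are there below 50?",
    "--enable-thinking",                   # turn on thinking mode in the chat template
    "--thinking-budget", "512",            # cap on tokens inside the thinking block
    "--max-tokens", "1024",                # assumed to include thinking tokens, so keep it above the budget
]

# Print a shell-ready command line.
print(shlex.join(cmd))
```

If a model uses delimiters other than `<think>` and `</think>`, `--thinking-start-token` and `--thinking-end-token` can be added to the same argument list.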
Activation quantization (CUDA)
Models quantized with `mxfp8` or `nvfp4` require activation quantization on NVIDIA GPUs. Use the `-qa` shorthand or the full `--quantize-activations` flag.
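
The two spellings of the flag can be sketched as follows. The model and image paths are placeholders; only the flag names come from this page.

```python
import shlex

# Shared arguments. The model path stands in for any mxfp8- or nvfp4-quantized checkpoint.
base = [
    "mlx_vlm.generate",
    "--model", "path/to/mxfp8-model",
    "--image", "path/to/image.jpg",
    "--prompt", "Describe this image.",
]

short_form = base + ["-qa"]                    # shorthand
long_form = base + ["--quantize-activations"]  # full flag, same effect

print(shlex.join(short_form))
print(shlex.join(long_form))
```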
On Apple Silicon (Metal), `mxfp8` and `nvfp4` models work without the `-qa` flag.

Flag reference

| Flag | Type | Default | Description |
|---|---|---|---|
| `--model` | string | `mlx-community/nanoLLaVA-1.5-8bit` | Hugging Face repo ID or local model path |
| `--adapter-path` | string | `None` | Path to LoRA adapter weights |
| `--image` | string (one or more) | `None` | URL(s) or local path(s) of images to process |
| `--audio` | string (one or more) | `None` | URL(s) or local path(s) of audio files to process |
| `--resize-shape` | int (one or two values) | `None` | Resize images to this shape before processing |
| `--prompt` | string | `"What are these?"` | Text prompt sent to the model |
| `--system` | string | `None` | System message prepended to the conversation |
| `--max-tokens` | int | `256` | Maximum number of tokens to generate |
| `--temperature` | float | `0.0` | Sampling temperature; `0` uses argmax (greedy) |
| `--max-kv-size` | int | `None` | Maximum KV cache size for long-context prompts |
| `--kv-bits` | int | `None` | Quantize the KV cache to this many bits |
| `--kv-group-size` | int | `64` | Group size used when quantizing the KV cache |
| `--quantized-kv-start` | int | `5000` | Token index at which KV cache quantization begins |
| `--prefill-step-size` | int | `2048` | Tokens processed per prefill chunk; lower values reduce peak memory |
| `--enable-thinking` | flag | `False` | Activate thinking mode in the chat template |
| `--thinking-budget` | int | `None` | Maximum tokens allowed inside a thinking block |
| `--thinking-start-token` | string | `<think>` | Token that opens a thinking block |
| `--thinking-end-token` | string | `</think>` | Token that closes a thinking block |
| `--quantize-activations` / `-qa` | flag | `False` | Enable activation quantization for `mxfp8`/`nvfp4` models |
| `--processor-kwargs` | JSON string | `{}` | Extra kwargs forwarded to the processor (e.g. `'{"cropping": false}'`) |
| `--eos-tokens` | string (one or more) | `None` | Additional end-of-sequence tokens |
| `--skip-special-tokens` | flag | `False` | Omit special tokens from the decoded output |
| `--chat` | flag | `False` | Enter multi-turn chat mode |
| `--verbose` | flag | `True` | Print tokens and timing statistics as they are generated |
| `--trust-remote-code` | flag | `False` | Allow execution of remote code when loading the model |
| `--revision` | string | `"main"` | Model branch, tag, or commit to use |
| `--force-download` | flag | `False` | Re-download the model even if it is already cached |

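
As a hedged sketch of how several of these flags combine, the example below builds a long-context invocation with a quantized KV cache and a JSON-encoded `--processor-kwargs` value. The specific numbers (4-bit cache, 1024-token prefill chunks) and the image path are illustrative assumptions, not recommendations from MLX-VLM. The `cropping` key is the example given in the table, and whether a particular processor accepts it depends on the model.

```python
import json
import shlex

# --processor-kwargs expects a JSON string; json.dumps avoids hand-written quoting mistakes.
processor_kwargs = json.dumps({"cropping": False})

cmd = [
    "mlx_vlm.generate",
    "--model", "mlx-community/nanoLLaVA-1.5-8bit",  # documented default
    "--image", "path/to/image.jpg",                 # placeholder path
    "--prompt", "Describe this image in detail.",
    "--kv-bits", "4",                  # quantize the KV cache to 4 bits (illustrative value)
    "--kv-group-size", "64",           # documented default group size
    "--quantized-kv-start", "5000",    # documented default start index
    "--prefill-step-size", "1024",     # smaller prefill chunks to lower peak memory
    "--processor-kwargs", processor_kwargs,
]

# shlex.join wraps the JSON in single quotes so the shell passes it through unchanged.
print(shlex.join(cmd))
```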
mlx_vlm.chat_ui
Launches an interactive Gradio chat interface in your browser.

The `gradio` package is an optional dependency. Install it with `pip install 'mlx-vlm[ui]'` before using this command.

| Flag | Type | Default | Description |
|---|---|---|---|
| `--model` | string | `qnguyen3/nanoLLaVA` | Hugging Face repo ID or local path of the model to load at startup |