The core workflow is three steps: load the model, format the prompt, then generate output.
Step-by-step walkthrough
Load the model
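A minimal sketch of the loading step, assuming mlx-vlm is installed and that load() accepts quantize_activations as a keyword argument, as this section describes. The helper name load_model is hypothetical and not part of the library.

```python
def load_model(model_path: str, quantize_activations: bool = False):
    """Return (model, processor) for a Hugging Face repo ID or local path.

    Hypothetical convenience wrapper around mlx_vlm.load().
    """
    # Imported lazily so this helper can be defined without mlx-vlm installed.
    from mlx_vlm import load

    kwargs = {}
    if quantize_activations:
        # Only forwarded when requested (mxfp8 / nvfp4 models on NVIDIA GPUs).
        kwargs["quantize_activations"] = True
    return load(model_path, **kwargs)
```

A call such as `model, processor = load_model("mlx-community/Qwen2-VL-2B-Instruct-4bit")` would then return both objects; the model ID is only an example, so substitute the one you intend to run.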
load() downloads the model from Hugging Face (or reads it from the local cache) and returns the model and processor objects. Pass quantize_activations=True when working with mxfp8 or nvfp4 quantized models on NVIDIA GPUs.
Format the prompt
apply_chat_template() wraps your prompt in the correct chat template for the loaded model. Pass num_images and/or num_audios to insert the right number of media tokens.
Usage examples
- Image inference
- Audio inference
- Multi-modal (image + audio)
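The three modes listed above differ only in which media lists are passed. The sketch below is one way to cover all of them, under these assumptions: load() and generate() are importable from mlx_vlm, apply_chat_template() is importable from mlx_vlm.prompt_utils and takes (processor, config, prompt, num_images=..., num_audios=...), and the loaded model exposes its configuration as model.config. The helper run_inference is hypothetical.

```python
def run_inference(model_path, prompt, images=None, audios=None, max_tokens=100):
    """Load a model, format the prompt for the given media, and generate."""
    # Lazy imports: mlx-vlm is only needed when the function is called.
    from mlx_vlm import generate, load
    from mlx_vlm.prompt_utils import apply_chat_template

    images = list(images or [])
    audios = list(audios or [])

    # Step 1: load the model and processor.
    model, processor = load(model_path)

    # Step 2: format the prompt with one media token per image / audio clip.
    formatted_prompt = apply_chat_template(
        processor,
        model.config,  # assumption: the config is available on the model object
        prompt,
        num_images=len(images),
        num_audios=len(audios),
    )

    # Step 3: generate. Empty media lists are passed as None.
    return generate(
        model,
        processor,
        formatted_prompt,
        image=images or None,
        audio=audios or None,
        max_tokens=max_tokens,
        verbose=False,
    )
```

Image inference would pass only `images=["photo.jpg"]`, audio inference only `audios=["clip.wav"]`, and multi-modal inference both; the file names here are placeholders.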
You can also pass PIL.Image.Image objects directly instead of file paths or URLs:
    from PIL import Image
    image = [Image.open("photo.jpg")]
Streaming with stream_generate
Use stream_generate() when you want to process tokens as they are produced rather than waiting for the full response. It yields GenerationResult objects, each containing the latest text segment and running statistics.
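As a rough illustration of the streaming loop, the sketch below prints each segment as it arrives and also collects the full response. It assumes stream_generate is importable from mlx_vlm and accepts the same leading arguments as generate(), with image and max_tokens as keyword arguments. The helper stream_response is hypothetical.

```python
def stream_response(model, processor, formatted_prompt, image=None, max_tokens=100):
    """Print text segments as they arrive and return the full response."""
    from mlx_vlm import stream_generate  # lazy import; requires mlx-vlm

    segments = []
    for result in stream_generate(
        model, processor, formatted_prompt, image=image, max_tokens=max_tokens
    ):
        # Each result carries only the text produced since the previous yield.
        print(result.text, end="", flush=True)
        segments.append(result.text)
    print()  # finish the line once the stream is exhausted
    return "".join(segments)
```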
GenerationResult fields available on each yielded object:
| Field | Type | Description |
|---|---|---|
| text | str | Text segment generated since the last yield |
| token | int | Most recently generated token ID |
| prompt_tokens | int | Number of tokens in the prompt |
| generation_tokens | int | Number of tokens generated so far |
| prompt_tps | float | Prompt processing speed (tokens/sec) |
| generation_tps | float | Generation speed (tokens/sec) |
| peak_memory | float | Peak memory usage in GB |
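The statistics fields in the table can be turned into a one-line summary, for example from the last object yielded by a stream. This is a small sketch that relies only on the field names listed above; format_stats is a hypothetical helper and works on any object exposing those attributes.

```python
def format_stats(result) -> str:
    """Summarize the running statistics carried by a GenerationResult-like object."""
    return (
        f"prompt: {result.prompt_tokens} tokens at {result.prompt_tps:.1f} tok/s; "
        f"generation: {result.generation_tokens} tokens at "
        f"{result.generation_tps:.1f} tok/s; "
        f"peak memory: {result.peak_memory:.2f} GB"
    )
```

Because the counters are running totals, calling this on the final yielded object reports figures for the whole response.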
generate() parameters
- model: The loaded model returned by load().
- processor: The processor returned by load().
- prompt: The formatted prompt string. Use apply_chat_template() to produce this value.
- image: One or more images passed as file paths, URLs, or PIL.Image.Image objects.
- audio: One or more audio files passed as local paths or URLs.
- verbose: When True, prints each token and a timing summary to stdout as generation proceeds.
- max_tokens: Maximum number of tokens to generate.
- temperature: Sampling temperature. Set to 0 for deterministic greedy decoding.
- top_p: Nucleus sampling threshold. The model samples from the smallest set of tokens whose cumulative probability exceeds this value.
- top_k: Restrict sampling to the top-k most probable tokens. 0 disables top-k filtering.
- min_p: Minimum probability threshold relative to the top token. Tokens below this threshold are discarded.
- repetition_penalty: Penalty applied to tokens that have already appeared in the context. Values above 1.0 discourage repetition.
- repetition_context_size: Number of preceding tokens to consider when applying the repetition penalty.
- max_kv_size: Cap the KV cache to this number of tokens. Useful for very long prompts.
- kv_bits: Quantize the KV cache to this many bits to reduce memory usage.
- prefill_step_size: Number of tokens processed per prefill chunk. Lower values reduce peak memory at the cost of slower prefill.
- resize_shape: Resize images to this shape before processing. Pass a single integer for a square crop or a (height, width) tuple.
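One way to keep these options manageable is to group them into small presets and unpack them into the call. The sketch below assumes the keyword names temperature, top_p, top_k, repetition_penalty, repetition_context_size, max_kv_size, kv_bits, and prefill_step_size match the generate() keyword arguments of your installed mlx-vlm version. The helper names and all numeric values are illustrative, not recommended defaults.

```python
def greedy_options(max_tokens: int = 256) -> dict:
    """Deterministic decoding: temperature 0 selects the most likely token."""
    return {"max_tokens": max_tokens, "temperature": 0.0}


def sampling_options(max_tokens: int = 256) -> dict:
    """Stochastic decoding with nucleus / top-k filtering and a mild repetition penalty."""
    return {
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.9,
        "top_k": 40,
        "repetition_penalty": 1.1,  # values above 1.0 discourage repetition
        "repetition_context_size": 20,
    }


def low_memory_options() -> dict:
    """Settings aimed at reducing peak memory for long prompts."""
    return {
        "max_kv_size": 4096,       # cap the KV cache length in tokens
        "kv_bits": 4,              # quantize the KV cache
        "prefill_step_size": 512,  # smaller prefill chunks, slower prefill
    }
```

The presets share no keys, so they can be combined in one call, for example `generate(model, processor, prompt, image=image, **sampling_options(), **low_memory_options())`.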