LoRA (Low-Rank Adaptation) injects small trainable matrices into the model’s attention layers while keeping the original weights frozen. Because only the adapter parameters are updated, training requires far less memory and time than full fine-tuning, and the resulting adapter file is a fraction of the full model size. QLoRA extends this by loading the base model in a quantized format (e.g., 4-bit), keeping it frozen in low precision while training the LoRA adapters in full precision, trading a small amount of accuracy for a large reduction in memory usage.
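To make the size difference concrete, the sketch below compares the trainable parameter count of one fully fine-tuned linear layer with the count for a LoRA adapter on the same layer. It assumes the standard LoRA decomposition into two matrices of shapes `rank x d_in` and `d_out x rank`; the hidden size of 4096 is a hypothetical value chosen for illustration, not a property of any particular model.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


hidden = 4096  # hypothetical layer width, for illustration only
rank = 8       # matches the documented --lora-rank default

full = full_param_count(hidden, hidden)
lora = lora_param_count(hidden, hidden, rank)
ratio = lora / full

print(f"full: {full:,}  lora: {lora:,}  ratio: {ratio:.4%}")
```

Under these assumptions the adapter trains well under 1% of the parameters of the layer it wraps, which is where the memory and file-size savings come from.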
Basic usage
Prepare your dataset
Create or identify a Hugging Face dataset with `images` and `messages` columns in the format your target model expects. See Dataset preparation for details.
CLI reference
Model arguments
| Argument | Default | Description |
|---|---|---|
| `--model-path` | `mlx-community/Qwen2-VL-2B-Instruct-bf16` | Path or Hub ID of the base model to fine-tune. |
| `--full-finetune` | — | Update all model weights instead of using LoRA adapters. |
| `--train-vision` | — | Unfreeze and train the vision encoder alongside the language model. |
Dataset arguments
| Argument | Default | Description |
|---|---|---|
| `--dataset` | (required) | Local path or Hugging Face dataset identifier. |
| `--split` | `train` | Dataset split to use. |
| `--dataset-config` | — | Dataset configuration name (for datasets with multiple configs). |
| `--image-resize-shape` | — | Resize all images to a fixed shape, e.g. `768 768`. |
| `--custom-prompt-format` | — | JSON template for datasets with question/answer columns instead of `messages`. |
Training arguments
| Argument | Default | Description |
|---|---|---|
| `--learning-rate` | `2e-5` | Optimizer learning rate. |
| `--batch-size` | `4` | Number of samples per training step. |
| `--iters` | `1000` | Total training iterations. Ignored if `--epochs` is set. |
| `--epochs` | — | Number of full passes over the dataset. Overrides `--iters`. |
| `--steps-per-report` | `10` | Log loss and throughput every N steps. |
| `--steps-per-eval` | `200` | Run validation every N steps. |
| `--steps-per-save` | `100` | Save a checkpoint every N steps. |
| `--val-batches` | `25` | Number of batches used for each validation run. |
| `--max-seq-length` | `2048` | Maximum token sequence length; longer sequences are truncated. |
| `--grad-checkpoint` | — | Enable gradient checkpointing to reduce peak memory (slightly slower). |
| `--grad-clip` | — | Clip gradients to this maximum norm. |
| `--train-on-completions` | — | Compute loss only on assistant responses, not on the prompt. |
| `--gradient-accumulation-steps` | `1` | Accumulate gradients over N batches before updating weights. |
| `--assistant-id` | `77091` | Token ID used to identify the start of assistant turns (for completion masking). |
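The interaction between `--train-on-completions` and `--assistant-id` can be pictured with a small sketch. This is an illustration of completion masking in general, not the library's actual implementation: it assumes a single assistant turn per sequence, and the helper names `completion_mask` and `masked_mean_loss` are hypothetical.

```python
def completion_mask(token_ids, assistant_id):
    """Return 1 for tokens after the first assistant marker, 0 for everything else."""
    mask = []
    in_completion = False
    for tok in token_ids:
        # The marker itself belongs to the prompt, so it is masked out too.
        mask.append(1 if in_completion else 0)
        if tok == assistant_id:
            in_completion = True
    return mask


def masked_mean_loss(per_token_loss, mask):
    """Average the per-token loss over unmasked (assistant) positions only."""
    count = sum(mask)
    if count == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / count


ASSISTANT_ID = 77091                      # documented default for --assistant-id
tokens = [1, 2, ASSISTANT_ID, 5, 6]       # made-up token IDs
losses = [4.0, 4.0, 4.0, 1.0, 2.0]        # made-up per-token losses

mask = completion_mask(tokens, ASSISTANT_ID)
loss = masked_mean_loss(losses, mask)
print(mask, loss)
```

Only the two tokens after the marker contribute to the loss here, so the high prompt losses do not influence the gradient.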
LoRA arguments
| Argument | Default | Description |
|---|---|---|
| `--lora-rank` | `8` | Rank of the LoRA decomposition matrices. Higher values increase adapter expressiveness. |
| `--lora-alpha` | `16` | Scaling factor applied to the LoRA updates. The effective learning rate scales with `lora-alpha / lora-rank`. |
| `--lora-dropout` | `0.0` | Dropout probability applied to LoRA layers during training. |
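How rank and alpha combine is easiest to see in a toy forward pass. The sketch below follows the standard LoRA formulation, `y = W x + (alpha / rank) * B (A x)`, using plain Python lists; the library's internal layer may differ in detail, and the tiny matrices are made up for illustration.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Frozen base output plus the scaled low-rank update."""
    scale = alpha / rank
    base = matvec(W, x)               # frozen weights, never updated
    update = matvec(B, matvec(A, x))  # trainable path: project down, then up
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 base weight
A = [[1.0, 1.0]]              # rank 1 x d_in 2
B = [[0.5], [0.0]]            # d_out 2 x rank 1
x = [2.0, 3.0]

y = lora_forward(W, A, B, x, alpha=2, rank=1)
y_untrained = lora_forward(W, A, [[0.0], [0.0]], x, alpha=2, rank=1)
print(y, y_untrained)
```

With `B` set to zeros, as is conventional at the start of LoRA training, the adapted layer reproduces the base model exactly; the `alpha / rank` factor then controls how strongly the learned update is applied.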
Output arguments
| Argument | Default | Description |
|---|---|---|
| `--output-path` | `adapters.safetensors` | File path where the trained adapter is saved. |
| `--adapter-path` | — | Path to an existing adapter to resume training from. |
Training examples
Python API
You can drive training programmatically by constructing an `argparse.Namespace` and calling `main`.
- Basic LoRA
- QLoRA
- Full fine-tuning
- Resume training
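As a rough sketch of the four scenarios listed above, the snippet below builds one `argparse.Namespace` per scenario from the documented defaults. Several things here are assumptions rather than confirmed API: the attribute names follow the usual argparse convention of turning `--model-path` into `model_path`, unset flags are represented as `False` or `None`, the dataset ID and the 4-bit model ID are placeholders, and the `mlx_vlm.lora` import path for `main` is assumed and left commented out.

```python
import argparse

# Defaults copied from the argument tables in this page.
DEFAULTS = dict(
    model_path="mlx-community/Qwen2-VL-2B-Instruct-bf16",
    full_finetune=False, train_vision=False,
    dataset=None,  # required, must be overridden
    split="train", dataset_config=None,
    image_resize_shape=None, custom_prompt_format=None,
    learning_rate=2e-5, batch_size=4, iters=1000, epochs=None,
    steps_per_report=10, steps_per_eval=200, steps_per_save=100,
    val_batches=25, max_seq_length=2048,
    grad_checkpoint=False, grad_clip=None, train_on_completions=False,
    gradient_accumulation_steps=1, assistant_id=77091,
    lora_rank=8, lora_alpha=16, lora_dropout=0.0,
    output_path="adapters.safetensors", adapter_path=None,
)


def build_args(**overrides):
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown arguments: {sorted(unknown)}")
    return argparse.Namespace(**{**DEFAULTS, **overrides})


DATASET = "your-username/your-dataset"  # placeholder, not a real dataset

basic_lora = build_args(dataset=DATASET)
qlora = build_args(
    dataset=DATASET,
    model_path="mlx-community/Qwen2-VL-2B-Instruct-4bit",  # assumed 4-bit checkpoint ID
    learning_rate=2e-4,
)
full_finetune = build_args(dataset=DATASET, full_finetune=True)
resume = build_args(dataset=DATASET, adapter_path="adapters.safetensors")

# Assumed entry point; uncomment on a machine with mlx-vlm installed.
# from mlx_vlm.lora import main
# main(basic_lora)
```

Building every namespace from one defaults dictionary keeps the scenarios comparable: each differs from basic LoRA only in the fields passed as overrides.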
Training output
The script logs progress at the interval set by `--steps-per-report`. Each report includes:
- Current step and total steps
- Loss at the current step
- Running average loss
- Throughput in tokens/sec
- Estimated time remaining
The trained adapter is saved to the path given by `--output-path`.
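The reported quantities can be derived from a few counters. The following sketch shows one plausible way to compute them; the function name `make_report` and its inputs are hypothetical, and the script's own bookkeeping may differ.

```python
def make_report(step, total_steps, recent_losses, tokens_processed, elapsed_seconds):
    """Derive the progress figures from raw counters collected during training."""
    seconds_per_step = elapsed_seconds / step
    return {
        "step": step,
        "total_steps": total_steps,
        "loss": recent_losses[-1],                               # loss at the current step
        "avg_loss": sum(recent_losses) / len(recent_losses),     # running average
        "tokens_per_sec": tokens_processed / elapsed_seconds,    # throughput
        "eta_seconds": seconds_per_step * (total_steps - step),  # time remaining
    }


report = make_report(
    step=10,
    total_steps=1000,
    recent_losses=[2.0, 1.0],
    tokens_processed=20480,
    elapsed_seconds=20.0,
)
print(report)
```

The time estimate assumes that future steps take as long as past ones, so it tends to drift when sequence lengths vary or validation runs interleave with training.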
Training tips
Memory optimization
- Enable `--grad-checkpoint` to reduce peak memory at the cost of slightly longer training time.
- Reduce `--batch-size` to `1` or `2` if you run out of memory.
- Use `--gradient-accumulation-steps` to maintain an equivalent effective batch size without holding more activations in memory (e.g., `--batch-size 1 --gradient-accumulation-steps 8` approximates a batch size of 8).
- Use QLoRA with a 4-bit model checkpoint for the lowest memory footprint.
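The equivalence behind gradient accumulation can be checked on a toy problem. The sketch below uses a one-parameter least-squares model with made-up data and compares the gradient of one batch of 8 with the average of 8 micro-batch gradients of size 1. It illustrates the principle only and says nothing about how the trainer schedules its updates.

```python
import math


def grad_mse(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Average the gradients of equally sized micro-batches."""
    micro_batches = [
        batch[i:i + micro_batch_size]
        for i in range(0, len(batch), micro_batch_size)
    ]
    return sum(grad_mse(w, mb) for mb in micro_batches) / len(micro_batches)


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples of y = 3x
w = 1.0

full_batch = grad_mse(w, data)                             # batch size 8, 1 step
accumulated = accumulated_grad(w, data, micro_batch_size=1)  # batch size 1, 8 steps
print(full_batch, accumulated)
```

The two values agree here because every micro-batch has the same size. With variable-length sequences and per-token loss averaging the match is only approximate, which is why the tip above says "approximates".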
Convergence and quality
- Start with learning rates in the range `1e-5` to `2e-5` for LoRA. QLoRA often benefits from higher rates (e.g., `2e-4`) because the base model is already compressed.
- Increase `--lora-rank` to `16` or `32` for more expressive adapters on complex tasks. Higher rank increases parameter count and memory use.
- Use `--train-on-completions` to mask the prompt tokens from the loss. This focuses training on the model’s output quality and often improves convergence on instruction-following tasks.
- Add `--train-vision` only when the task requires the model to understand visual features it wasn’t trained on (e.g., domain-specific imagery like medical scans or satellite data).
- Monitor the validation loss, reported every `--steps-per-eval` steps, to detect overfitting early.
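One simple way to act on the validation loss is a patience rule: flag the run once the loss has failed to improve for a few consecutive evaluations. The sketch below is a hypothetical helper for post-processing logged values, not a feature of the training script, and it assumes evaluations happen at steps `steps_per_eval`, `2 * steps_per_eval`, and so on.

```python
def first_overfit_step(val_losses, steps_per_eval=200, patience=2):
    """Return the training step at which validation loss has not improved
    for `patience` consecutive evaluations, or None if that never happens."""
    best = float("inf")
    stale = 0
    for index, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return (index + 1) * steps_per_eval
    return None


val_losses = [1.9, 1.5, 1.3, 1.35, 1.4, 1.6]  # made-up validation losses
step = first_overfit_step(val_losses)
print(step)
```

In this made-up series the loss bottoms out at the third evaluation, so the checkpoint saved closest to step 600 would be the one to keep.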
Hardware-specific guidance
- On Apple Silicon, MLX uses the unified memory architecture and runs operations on the GPU automatically. No additional configuration is needed.
- For models larger than 11B parameters, always enable `--grad-checkpoint`.
- Use `--image-resize-shape` to cap image resolution and reduce the sequence length fed into the vision encoder, which directly reduces memory and speeds up training.
- Larger batch sizes improve GPU utilization up to a point; if you have memory headroom, increasing `--batch-size` is often more effective than increasing `--gradient-accumulation-steps`.
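To see why capping resolution helps, consider how a patch-based vision encoder turns pixels into tokens. The sketch below estimates the patch count for a given image size; the effective patch size of 28 pixels is a hypothetical value for illustration, and real models differ in patch size, merging, and padding rules.

```python
import math


def patch_count(height, width, patch_size=28):
    """Estimate visual tokens as the number of patches covering the image."""
    return math.ceil(height / patch_size) * math.ceil(width / patch_size)


resized = patch_count(768, 768)     # e.g. with --image-resize-shape 768 768
original = patch_count(1536, 1536)  # a larger, unresized image
print(resized, original, original / resized)
```

Token count grows roughly with image area under this assumption, so halving each side cuts the visual sequence length to about a quarter.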