What you’ll learn
By the end of this tutorial, you’ll know how to:
- Prepare vision-language datasets with images and text
- Fine-tune multimodal models with Unsloth
- Handle image preprocessing and tokenization
- Test your model on vision tasks
- Export vision models for deployment
Prerequisites
- GPU: This tutorial requires a GPU with at least 16 GB of memory. You can run it for free on Google Colab using an NVIDIA T4 GPU
- Python: Python 3.8 or higher
- Basic knowledge: Familiarity with vision tasks and multimodal models
What is a vision language model?
Vision Language Models (VLMs) combine vision and language understanding:
- Take images and text as input
- Generate text descriptions or answers
- Understand visual content in context
- Perform tasks like OCR, VQA, image captioning
When to fine-tune VLMs
Fine-tune vision language models when you need:
Specialized visual understanding
- Medical image analysis
- Mathematical formula recognition
- Document understanding
- Technical diagram interpretation
Domain-specific OCR
- Handwriting recognition
- Specialized fonts or notation
- Low-quality or noisy images
- Multi-language text extraction
Visual question answering
- Product information extraction
- Chart and graph interpretation
- Scene understanding for specific domains
Image-based generation
- Image captioning with specific style
- Visual code generation
- Diagram-to-text conversion
Tutorial overview
The tutorial covers the following steps:
- Installation: Set up Unsloth and vision model dependencies
- Model loading: Load LFM2.5-VL-1.6B with vision capabilities
- Data preparation: Format image-text pairs for training
- Image preprocessing: Configure vision encoder settings
- Training: Fine-tune on OCR task
- Inference: Test on mathematical formula images
- Export: Save your model for deployment
Key concepts
Vision-language architecture
LFM2.5-VL-1.6B consists of:
- Vision encoder: Processes images into embeddings
- Projection layer: Maps vision features to language space
- Language model: Generates text based on vision+text input
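The three components above can be sketched as a toy pipeline. The functions below are illustrative stand-ins (patch-mean "embeddings", a scalar projection), not the real LFM2.5-VL internals:

```python
# Toy sketch of the three-stage VLM pipeline: vision encoder -> projection
# layer -> language model. Everything here is a stand-in for illustration.

def vision_encoder(image_pixels):
    """Stand-in: turn raw pixels into a short sequence of vision embeddings.
    Real encoders (e.g. a ViT) produce one embedding per image patch."""
    patch_size = 4
    patches = [image_pixels[i:i + patch_size]
               for i in range(0, len(image_pixels), patch_size)]
    return [sum(p) / len(p) for p in patches]  # one "embedding" per patch

def projection_layer(vision_embeddings, scale=0.5):
    """Stand-in: map vision features into the language model's embedding space."""
    return [scale * v for v in vision_embeddings]

def language_model(vision_tokens, text_tokens):
    """Stand-in: the LM consumes projected vision tokens alongside text tokens."""
    return vision_tokens + text_tokens  # combined input sequence

pixels = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # fake 8-pixel "image"
text = ["<user>", "What", "is", "this", "?"]

vision = projection_layer(vision_encoder(pixels))
sequence = language_model(vision, text)
print(len(vision), len(sequence))  # 2 vision tokens, 7 tokens total
```

The key point the sketch captures is that, after projection, image content enters the language model as ordinary embedding-space tokens mixed with the text.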
Data format for VLMs
Vision-language datasets need images that are:
- Resized to the model’s expected resolution
- Normalized with the proper mean and standard deviation
- Converted to the model’s input format
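A minimal sketch of the resize and normalize steps, using plain Python lists instead of a real image library; the mean/std values (0.5/0.25) and target size are placeholders, so substitute the values from your model's processor:

```python
# Illustrative preprocessing: resize a row of pixel values, then normalize
# with a mean/std. Toy values only; real pipelines use the model's processor.

def resize_nearest(row, target_width):
    """Nearest-neighbour resize of a 1-D row of pixel values."""
    scale = len(row) / target_width
    return [row[int(i * scale)] for i in range(target_width)]

def normalize(pixels, mean=0.5, std=0.25):
    """Normalize pixels; 0.5/0.25 are placeholders, not LFM2.5-VL's values."""
    return [(p - mean) / std for p in pixels]

row = [0.0, 0.25, 0.5, 0.75, 1.0, 0.75, 0.5, 0.25]
resized = resize_nearest(row, 4)   # resized to the expected resolution
normalized = normalize(resized)    # normalized with mean/std
print(resized)
print(normalized)
```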
The OCR task
Mathematical formula OCR is challenging because of:
- Complex notation and symbols
- Spatial relationships matter (superscripts, fractions)
- Variable fonts and handwriting styles
- Mix of text and mathematical operators
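For a formula-OCR dataset, each training example pairs an image with an instruction and its ground-truth LaTeX. One common way to structure such an example as a chat-style record is sketched below; the field names follow widespread conventions but are an assumption, not a fixed schema:

```python
# Hedged sketch: a single OCR training example in chat-message format.
# "messages"/"role"/"content" follow common VLM dataset conventions.

def make_ocr_example(image_path, latex_ground_truth):
    """Pair an image with an instruction and its expected LaTeX transcription."""
    return {
        "messages": [
            {"role": "user",
             "content": [
                 {"type": "image", "image": image_path},
                 {"type": "text", "text": "Transcribe this formula into LaTeX."},
             ]},
            {"role": "assistant",
             "content": [{"type": "text", "text": latex_ground_truth}]},
        ]
    }

example = make_ocr_example("formulas/0001.png", r"\frac{a^2 + b^2}{c}")
print(example["messages"][1]["content"][0]["text"])
```

Keeping the ground truth as a plain assistant turn lets the same record format serve other tasks (VQA, captioning) by swapping the instruction and answer.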
Training configuration
LoRA for vision models
The tutorial uses LoRA for efficient training:
- Apply LoRA to both text and vision components
- Reduce memory requirements
- Maintain vision-language alignment
- Enable training on consumer GPUs
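The low-rank idea behind LoRA can be shown with toy numbers: a frozen weight W is adapted by a scaled rank-r product, W + (alpha / r) · BA, and only the small matrices A and B are trained. A minimal numeric sketch with illustrative values:

```python
# Toy LoRA arithmetic: W stays frozen; the trainable update is (alpha/r) * B @ A.
# All numbers are illustrative, not real model weights.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight
B = [[1.0], [0.5]]             # trainable d x r (up-projection), r = 1
A = [[0.1, 0.2]]               # trainable r x d (down-projection)
alpha, r = 2.0, 1
scale = alpha / r

delta = matmul(B, A)           # rank-1 update, still 2x2
W_adapted = [[W[i][j] + scale * delta[i][j] for j in range(2)]
             for i in range(2)]
print(W_adapted)
```

Because only A and B (2·d·r values) receive gradients instead of the full d×d matrix, memory drops sharply, which is what makes training on consumer GPUs feasible.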
Important considerations
Vision encoder freezing:
- Can freeze the vision encoder and train only the language model
- Or train both for better adaptation
- The right choice depends on dataset size and domain shift
Memory usage:
- Images require more memory than text
- Reduce batch size compared to text-only training
- Use gradient accumulation for effective larger batches
Context length:
- Vision tokens take up context window
- Adjust max_seq_length accordingly
- Balance between image detail and text length
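The batch-size and context-budget points above reduce to simple arithmetic. The numbers below (vision tokens per image, max_seq_length) are illustrative assumptions; the actual vision-token count depends on the image resolution and the encoder:

```python
# Back-of-envelope math for memory and context planning. All constants
# are illustrative, not LFM2.5-VL specifics.

per_device_batch = 2            # smaller than typical text-only training
grad_accum_steps = 8
effective_batch = per_device_batch * grad_accum_steps
print(effective_batch)          # accumulation restores a larger effective batch

max_seq_length = 2048
vision_tokens_per_image = 256   # assumption; varies with resolution/encoder
images_per_example = 1
text_budget = max_seq_length - images_per_example * vision_tokens_per_image
print(text_budget)              # tokens left for the prompt and the answer
```

Higher-resolution images raise `vision_tokens_per_image`, shrinking the text budget; that is the image-detail vs. text-length trade-off in concrete terms.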
Special requirements
Note the different transformers version:
- Vision model support is in development
- A specific commit includes the VLM fixes
- These fixes will land in a stable release soon
Deployment options
After fine-tuning, you can deploy your model to:
- Mobile: Android and iOS apps using the LEAP SDK (vision support coming soon)
- Desktop: Mac (MLX), Windows/Linux (custom inference)
- Cloud: vLLM, Modal, Baseten, Fal for production deployments
- Edge: On-device inference for vision applications
Run the tutorial
You can run this tutorial in two ways:
- Google Colab (recommended): Click the “Open in Colab” badge at the top
- Local environment: Clone the LFM Cookbook repository and run the notebook locally
Access the notebook
The complete notebook is available at:
- GitHub: sft_for_vision_language_model.ipynb
- Colab: Click the badge above to open directly in Google Colab
Adapting to your vision task
To apply this tutorial to your own vision task:
- Collect image-text pairs: Gather (image, query, response) tuples
- Format dataset: Create dataset with image and text columns
- Adjust preprocessing: Modify image size if needed
- Configure training: Adjust batch size for your GPU
- Evaluate thoroughly: Test on diverse visual examples
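Steps 1–2 amount to mapping raw (image, query, response) tuples into records with image and text columns, plus holding out examples for evaluation. A minimal sketch (paths and field names are hypothetical):

```python
# Sketch of dataset assembly: raw tuples -> column records -> train/eval split.
# File paths and field names are hypothetical examples.

raw_pairs = [
    ("scans/invoice_01.png", "What is the total amount?", "$1,240.00"),
    ("scans/invoice_02.png", "Who is the vendor?", "Acme Corp."),
    ("scans/invoice_03.png", "What is the due date?", "2024-07-01"),
]

def to_record(image, query, response):
    """One row with an image column and text columns."""
    return {"image": image, "query": query, "response": response}

dataset = [to_record(*pair) for pair in raw_pairs]

# Hold out examples so evaluation covers images never seen in training.
eval_set, train_set = dataset[:1], dataset[1:]
print(len(train_set), len(eval_set))
```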
Example use cases
What you can build with fine-tuned VLMs:
Document understanding
- Extract information from forms
- Parse invoices and receipts
- Analyze document layouts
Medical imaging
- Generate radiology reports
- Identify abnormalities
- Extract measurements from scans
Technical diagrams
- Convert flowcharts to code
- Extract data from plots
- Interpret circuit diagrams
Visual search
- Generate searchable descriptions
- Extract product attributes
- Classify visual content
Expected results
After fine-tuning on OCR tasks, expect:
- Improved accuracy on target domain (typically 15-30% improvement)
- Better symbol recognition for specialized notation
- More consistent outputs following expected format
- Reduced hallucinations on visual details
Next steps
After completing this tutorial, you can:
- Apply to other vision tasks (VQA, captioning, etc.)
- Experiment with different vision model sizes
- Try multi-task training on various vision tasks
- Deploy your model using the inference guides