
Overview

Cactus supports vision-language models (VLMs) that can process both images and text in the same conversation. These models understand visual content and can answer questions, describe images, and perform visual reasoning.

Supported Models

| Model | Size | Features | NPU |
| --- | --- | --- | --- |
| LFM2-VL-450M | 450M | Vision + text, embeddings | ✓ |
| LFM2.5-VL-1.6B | 1.6B | High-quality VLM | ✓ |
Vision models include Apple NPU support for 5-11x faster image encoding on iOS/macOS.

Basic Image Understanding

Python Example

from cactus import cactus_init, cactus_complete, cactus_destroy
import json

model = cactus_init("weights/lfm2-vl-450m", None, False)

messages = json.dumps([{
    "role": "user",
    "content": "Describe this image in detail",
    "images": ["photo.jpg"]
}])

result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])

cactus_destroy(model)

Swift Example

import Foundation

let model = try cactusInit("/path/to/weights/lfm2-vl-1.6b", nil, false)
defer { cactusDestroy(model) }

let messages = #"""
[
    {
        "role": "user",
        "content": "What objects are in this image?",
        "images": ["/path/to/image.jpg"]
    }
]
"""#

let resultJson = try cactusComplete(model, messages, nil, nil, nil)
if let data = resultJson.data(using: .utf8),
   let result = try? JSONSerialization.jsonObject(with: data) as? [String: Any] {
    print(result["response"] as? String ?? "")
}

Multiple Images

Process multiple images in a single message:
messages = json.dumps([{
    "role": "user",
    "content": "Compare these two images. What are the differences?",
    "images": ["image1.jpg", "image2.jpg"]
}])

result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])

Multi-Turn Visual Conversations

Maintain context across multiple turns:
conversation = []

# First turn: Show image
conversation.append({
    "role": "user",
    "content": "What's in this image?",
    "images": ["scene.jpg"]
})

messages = json.dumps(conversation)
result = json.loads(cactus_complete(model, messages, None, None, None))
conversation.append({"role": "assistant", "content": result["response"]})

# Second turn: Follow-up question (no new image)
conversation.append({
    "role": "user",
    "content": "What color is the car?"
})

messages = json.dumps(conversation)
result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])

Image Formats

Supported image formats:
  • JPEG (.jpg, .jpeg)
  • PNG (.png)
  • BMP (.bmp)
  • GIF (.gif)
  • WebP (.webp)
Images are automatically resized and preprocessed. The maximum resolution depends on the model:
  • LFM2-VL-450M: 512x512 tiles with smart cropping
  • LFM2.5-VL-1.6B: 2048x2048 global with 512x512 tiles
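Since unsupported files only fail once they reach the model, a cheap extension check before building the request can catch them early. `is_supported_image` below is a hypothetical helper for illustration, not part of the Cactus API, and it checks the file extension only, not the actual file contents:

```python
from pathlib import Path

# Extensions for the supported formats listed above, lowercase-normalized.
SUPPORTED_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".webp"}

def is_supported_image(path: str) -> bool:
    """Return True if the file extension matches a supported image format."""
    return Path(path).suffix.lower() in SUPPORTED_EXTS

print(is_supported_image("photo.JPG"))   # True: case-insensitive match
print(is_supported_image("scan.tiff"))   # False: TIFF is not listed
```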

Visual Question Answering

tasks = [
    "How many people are in this photo?",
    "What is the dominant color?",
    "Is this photo taken indoors or outdoors?",
    "What time of day is it?"
]

for question in tasks:
    messages = json.dumps([{
        "role": "user",
        "content": question,
        "images": ["photo.jpg"]
    }])
    
    result = json.loads(cactus_complete(model, messages, None, None, None))
    print(f"Q: {question}")
    print(f"A: {result['response']}\n")

Image Captioning

def caption_image(image_path, detail_level="medium"):
    prompts = {
        "brief": "Caption this image in one short sentence.",
        "medium": "Describe this image.",
        "detailed": "Provide a detailed description of everything visible in this image."
    }
    
    messages = json.dumps([{
        "role": "user",
        "content": prompts[detail_level],
        "images": [image_path]
    }])
    
    result = json.loads(cactus_complete(model, messages, None, None, None))
    return result["response"]

brief = caption_image("photo.jpg", "brief")
detailed = caption_image("photo.jpg", "detailed")
print(f"Brief: {brief}")
print(f"Detailed: {detailed}")

Object Detection and Counting

messages = json.dumps([{
    "role": "user",
    "content": "List all objects visible in this image with their approximate locations.",
    "images": ["scene.jpg"]
}])

result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])

OCR and Text Extraction

messages = json.dumps([{
    "role": "user",
    "content": "Extract all text visible in this image.",
    "images": ["document.jpg"]
}])

result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])

Image Embeddings

Generate vector embeddings for images:
from cactus import cactus_image_embed
import numpy as np

# Generate image embedding
embedding = cactus_image_embed(model, "image.jpg")
print(f"Embedding dimension: {len(embedding)}")

# Compute cosine similarity between images (normalize in case the
# embeddings are not already unit-length)
img1_emb = np.asarray(cactus_image_embed(model, "img1.jpg"))
img2_emb = np.asarray(cactus_image_embed(model, "img2.jpg"))

similarity = np.dot(img1_emb, img2_emb) / (
    np.linalg.norm(img1_emb) * np.linalg.norm(img2_emb)
)
print(f"Cosine similarity: {similarity:.3f}")

Performance Benchmarks

Vision encoding latency on real devices (256px image):

| Device | LFM2-VL-450M | LFM2.5-VL-1.6B |
| --- | --- | --- |
| iPhone 17 Pro | 0.3s @ 48 tps | 0.3s @ 48 tps |
| Mac M4 Pro | 0.2s @ 98 tps | 0.2s @ 98 tps |
| iPad M3 | 0.3s @ 69 tps | 0.3s @ 69 tps |
NPU acceleration is automatically enabled for image encoding on Apple devices with Neural Engine.

Memory Usage

| Model | INT4 | INT8 | FP16 |
| --- | --- | --- | --- |
| LFM2-VL-450M | 108 MB | 180 MB | 360 MB |
| LFM2.5-VL-1.6B | 390 MB | 640 MB | 1.3 GB |

Best Practices

For best results:
  • Use high-resolution images (at least 512x512)
  • Ensure good lighting and clarity
  • Crop to focus on relevant content
  • Ask specific, clear questions
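One way to act on the cropping advice is to compute a centered crop box and pass it to whatever image library you use (for example, Pillow's `Image.crop` takes exactly this tuple). This is plain arithmetic and assumes nothing about the Cactus API:

```python
def center_crop_box(width: int, height: int, size: int = 512) -> tuple[int, int, int, int]:
    """Return a (left, top, right, bottom) box for a centered size x size crop.

    If the image is smaller than `size` on either side, the box is clamped
    to the image bounds rather than padded.
    """
    left = max((width - size) // 2, 0)
    top = max((height - size) // 2, 0)
    return (left, top, min(left + size, width), min(top + size, height))

print(center_crop_box(1024, 768))  # (256, 128, 768, 640)
```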

Limitations

Vision models may struggle with:
  • Very small text or fine details
  • Complex spatial reasoning
  • Precise counting of many objects
  • Understanding context requiring world knowledge

Error Handling

try:
    result = json.loads(cactus_complete(model, messages, None, None, None))
    if not result["success"]:
        print(f"Error: {result['error']}")
except RuntimeError as e:
    print(f"Failed to process image: {e}")

Next Steps

Chat Completion

Learn about text-only conversations

Embeddings

Generate multimodal embeddings

Supported Models

Browse all vision-language models

API Reference

Complete API documentation
