
Overview

Cactus supports vision-language models (VLMs) that can process both images and text in the same conversation. These models understand visual content and can answer questions, describe images, and perform visual reasoning.

Supported Models

| Model | Size | Features | NPU |
| --- | --- | --- | --- |
| LFM2-VL-450M | 450M | Vision + text, embeddings | ✓ |
| LFM2.5-VL-1.6B | 1.6B | High-quality VLM | ✓ |
Vision models include Apple NPU support for 5-11x faster image encoding on iOS/macOS.

Basic Image Understanding

Python Example

from cactus import cactus_init, cactus_complete, cactus_destroy
import json

model = cactus_init("weights/lfm2-vl-450m", None, False)

messages = json.dumps([{
    "role": "user",
    "content": "Describe this image in detail",
    "images": ["photo.jpg"]
}])

result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])

cactus_destroy(model)

Swift Example

import Foundation

let model = try cactusInit("/path/to/weights/lfm2-vl-1.6b", nil, false)
defer { cactusDestroy(model) }

let messages = #"""
[
    {
        "role": "user",
        "content": "What objects are in this image?",
        "images": ["/path/to/image.jpg"]
    }
]
"""#

let resultJson = try cactusComplete(model, messages, nil, nil, nil)
if let data = resultJson.data(using: .utf8),
   let result = try? JSONSerialization.jsonObject(with: data) as? [String: Any] {
    print(result["response"] as? String ?? "")
}

Multiple Images

Process multiple images in a single message:
messages = json.dumps([{
    "role": "user",
    "content": "Compare these two images. What are the differences?",
    "images": ["image1.jpg", "image2.jpg"]
}])

result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])

Multi-Turn Visual Conversations

Maintain context across multiple turns:
conversation = []

# First turn: Show image
conversation.append({
    "role": "user",
    "content": "What's in this image?",
    "images": ["scene.jpg"]
})

messages = json.dumps(conversation)
result = json.loads(cactus_complete(model, messages, None, None, None))
conversation.append({"role": "assistant", "content": result["response"]})

# Second turn: Follow-up question (no new image)
conversation.append({
    "role": "user",
    "content": "What color is the car?"
})

messages = json.dumps(conversation)
result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])

Image Formats

Supported image formats:
  • JPEG (.jpg, .jpeg)
  • PNG (.png)
  • BMP (.bmp)
  • GIF (.gif)
  • WebP (.webp)
Images are automatically resized and preprocessed. The maximum resolution depends on the model:
  • LFM2-VL-450M: 512x512 tiles with smart cropping
  • LFM2.5-VL-1.6B: 2048x2048 global with 512x512 tiles
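Since unsupported files only fail once they reach the model, a cheap extension check before building the request can catch them early. `is_supported_image` below is a hypothetical helper for illustration, not part of the Cactus API, and it checks the file extension only, not the actual file contents:

```python
from pathlib import Path

# Extensions for the supported formats listed above, lowercase-normalized.
SUPPORTED_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".webp"}

def is_supported_image(path: str) -> bool:
    """Return True if the file extension matches a supported image format."""
    return Path(path).suffix.lower() in SUPPORTED_EXTS

print(is_supported_image("photo.JPG"))   # True: case-insensitive match
print(is_supported_image("scan.tiff"))   # False: TIFF is not listed
```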

Visual Question Answering

tasks = [
    "How many people are in this photo?",
    "What is the dominant color?",
    "Is this photo taken indoors or outdoors?",
    "What time of day is it?"
]

for question in tasks:
    messages = json.dumps([{
        "role": "user",
        "content": question,
        "images": ["photo.jpg"]
    }])
    
    result = json.loads(cactus_complete(model, messages, None, None, None))
    print(f"Q: {question}")
    print(f"A: {result['response']}\n")

Image Captioning

def caption_image(image_path, detail_level="medium"):
    prompts = {
        "brief": "Caption this image in one short sentence.",
        "medium": "Describe this image.",
        "detailed": "Provide a detailed description of everything visible in this image."
    }
    
    messages = json.dumps([{
        "role": "user",
        "content": prompts[detail_level],
        "images": [image_path]
    }])
    
    result = json.loads(cactus_complete(model, messages, None, None, None))
    return result["response"]

brief = caption_image("photo.jpg", "brief")
detailed = caption_image("photo.jpg", "detailed")
print(f"Brief: {brief}")
print(f"Detailed: {detailed}")

Object Detection and Counting

messages = json.dumps([{
    "role": "user",
    "content": "List all objects visible in this image with their approximate locations.",
    "images": ["scene.jpg"]
}])

result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])

OCR and Text Extraction

messages = json.dumps([{
    "role": "user",
    "content": "Extract all text visible in this image.",
    "images": ["document.jpg"]
}])

result = json.loads(cactus_complete(model, messages, None, None, None))
print(result["response"])

Image Embeddings

Generate vector embeddings for images:
from cactus import cactus_image_embed
import numpy as np

# Generate image embedding
embedding = cactus_image_embed(model, "image.jpg")
print(f"Embedding dimension: {len(embedding)}")

# Compute cosine similarity between images (normalize in case the
# embeddings are not already unit-length)
img1_emb = np.asarray(cactus_image_embed(model, "img1.jpg"))
img2_emb = np.asarray(cactus_image_embed(model, "img2.jpg"))

similarity = np.dot(img1_emb, img2_emb) / (
    np.linalg.norm(img1_emb) * np.linalg.norm(img2_emb)
)
print(f"Cosine similarity: {similarity:.3f}")

Performance Benchmarks

Vision encoding latency on real devices (256px image):

| Device | LFM2-VL-450M | LFM2.5-VL-1.6B |
| --- | --- | --- |
| iPhone 17 Pro | 0.3s @ 48 tps | 0.3s @ 48 tps |
| Mac M4 Pro | 0.2s @ 98 tps | 0.2s @ 98 tps |
| iPad M3 | 0.3s @ 69 tps | 0.3s @ 69 tps |
NPU acceleration is automatically enabled for image encoding on Apple devices with Neural Engine.

Memory Usage

| Model | INT4 | INT8 | FP16 |
| --- | --- | --- | --- |
| LFM2-VL-450M | 108 MB | 180 MB | 360 MB |
| LFM2.5-VL-1.6B | 390 MB | 640 MB | 1.3 GB |

Best Practices

For best results:
  • Use high-resolution images (at least 512x512)
  • Ensure good lighting and clarity
  • Crop to focus on relevant content
  • Ask specific, clear questions
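One way to act on the cropping advice is to compute a centered crop box and pass it to whatever image library you use (for example, Pillow's `Image.crop` takes exactly this tuple). This is plain arithmetic and assumes nothing about the Cactus API:

```python
def center_crop_box(width: int, height: int, size: int = 512) -> tuple[int, int, int, int]:
    """Return a (left, top, right, bottom) box for a centered size x size crop.

    If the image is smaller than `size` on either side, the box is clamped
    to the image bounds rather than padded.
    """
    left = max((width - size) // 2, 0)
    top = max((height - size) // 2, 0)
    return (left, top, min(left + size, width), min(top + size, height))

print(center_crop_box(1024, 768))  # (256, 128, 768, 640)
```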

Limitations

Vision models may struggle with:
  • Very small text or fine details
  • Complex spatial reasoning
  • Precise counting of many objects
  • Understanding context requiring world knowledge

Error Handling

try:
    result = json.loads(cactus_complete(model, messages, None, None, None))
    if not result["success"]:
        print(f"Error: {result['error']}")
except RuntimeError as e:
    print(f"Failed to process image: {e}")

Next Steps

Chat Completion

Learn about text-only conversations

Embeddings

Generate multimodal embeddings

Supported Models

Browse all vision-language models

API Reference

Complete API documentation
