Supported Models

Cactus supports a growing list of state-of-the-art models optimized for mobile and edge devices. All models support INT4, INT8, and FP16 quantization.

Language Models

Text generation models for chat, completion, and tool calling.
| Model | Size | Features | RAM (INT4) |
|---|---|---|---|
| google/gemma-3-270m-it | 270M | completion | ~200MB |
| google/functiongemma-270m-it | 270M | completion, tools | ~200MB |
| google/gemma-3-1b-it | 1B | completion | ~800MB |
Architecture: Gemma decoder-only transformer
Context Length: 8K tokens
Best For: General chat, instruction following
Download & Run:
cactus download google/gemma-3-270m-it --precision INT4
cactus run google/gemma-3-270m-it
Tool Calling Example:
import cactus

model = cactus.load("google/functiongemma-270m-it")
tools = [
    {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            }
        }
    }
]

response = model.complete(
    messages=[{"role": "user", "content": "What's the weather in SF?"}],
    tools=tools
)
print(response.function_calls)  # [{"name": "get_weather", ...}]
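Executing the returned calls is up to the app. A minimal dispatcher sketch, assuming each entry in `function_calls` carries a `name` and an `arguments` dict (the `arguments` key and the handler registry are assumptions for illustration, not part of the documented response shape):

```python
def dispatch(function_calls, handlers):
    # Route each model-requested call to the matching Python function.
    results = []
    for call in function_calls:
        handler = handlers[call["name"]]
        results.append(handler(**call.get("arguments", {})))
    return results

# Hypothetical handler registry for the get_weather tool above
handlers = {"get_weather": lambda location: f"Sunny in {location}"}
dispatch([{"name": "get_weather", "arguments": {"location": "SF"}}], handlers)
```

The results can then be appended to `messages` as tool responses for a follow-up completion.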
| Model | Size | Features | RAM (INT4) |
|---|---|---|---|
| LiquidAI/LFM2-350M | 350M | completion, tools, embed | ~250MB |
| LiquidAI/LFM2-700M | 700M | completion, tools, embed | ~500MB |
| LiquidAI/LFM2.5-1.2B-Thinking | 1.2B | completion, tools, embed | ~700MB |
| LiquidAI/LFM2.5-1.2B-Instruct | 1.2B | completion, tools, embed | ~700MB |
| LiquidAI/LFM2-2.6B | 2.6B | completion, tools, embed | ~1.8GB |
| LiquidAI/LFM2-8B-A1B | 8B (1B active) | completion, tools, embed | ~6GB |
Architecture: Liquid Foundation Model (LFM) - MoE with liquid time-constant networks
Context Length: 32K tokens
Best For: Long context, reasoning, embeddings
Benchmarks (LFM2.5-1.2B-Instruct):
| Device | Prefill | Decode | RAM |
|---|---|---|---|
| Mac M4 Pro | 582 t/s | 100 t/s | 76MB |
| iPhone 17 Pro | 327 t/s | 48 t/s | 108MB |
| Galaxy S25 Ultra | 255 t/s | 37 t/s | 1.5GB |
Download & Run:
cactus download LiquidAI/LFM2.5-1.2B-Instruct --precision INT4
cactus run LiquidAI/LFM2.5-1.2B-Instruct
Embeddings Example:
import cactus

model = cactus.load("LiquidAI/LFM2-1.2B")
embeddings = model.embed(
    texts=["Hello world", "Cactus is fast"],
    normalize=True
)
print(embeddings.shape)  # (2, 1024)
| Model | Size | Features | RAM (INT4) |
|---|---|---|---|
| Qwen/Qwen3-0.6B | 600M | completion, tools, embed | ~400MB |
| Qwen/Qwen3-1.7B | 1.7B | completion, tools, embed | ~1.2GB |
| Qwen/Qwen3-Embedding-0.6B | 600M | embed | ~400MB |
Architecture: Qwen decoder-only transformer
Context Length: 32K tokens
Best For: Multilingual, Chinese language tasks
Download & Run:
cactus download Qwen/Qwen3-0.6B --precision INT4
cactus run Qwen/Qwen3-0.6B

Vision Models

Multi-modal models that understand both text and images.
| Model | Size | Features | RAM (INT4) |
|---|---|---|---|
| LiquidAI/LFM2-VL-450M | 450M | vision, txt & img embed | ~300MB |
| LiquidAI/LFM2.5-VL-1.6B | 1.6B | vision, txt & img embed | ~1.1GB |
Architecture: LFM2 + SigLIP-2 vision encoder
Image Resolution: Up to 2048px with dynamic tiling
NPU Support: Apple NPU (iPhone, iPad, Mac)
Benchmarks (LFM2.5-VL-1.6B):
| Device | First Token | Decode |
|---|---|---|
| Mac M4 Pro | 0.2s | 98 t/s |
| iPad M3 | 0.3s | 69 t/s |
| iPhone 17 Pro | 0.3s | 48 t/s |
| Galaxy S25 Ultra | - | 34 t/s |
Download & Run:
cactus download LiquidAI/LFM2-VL-450M --precision INT4
cactus run LiquidAI/LFM2-VL-450M
Usage Example:
import cactus

model = cactus.load("LiquidAI/LFM2-VL-450M")
response = model.complete(
    messages=[
        {
            "role": "user",
            "content": "Describe this image in detail",
            "images": ["photo.jpg"]
        }
    ]
)
print(response.text)
Image Embeddings:
import numpy as np

# Get image and text embeddings for similarity search
img_emb = model.embed_image("photo.jpg")
txt_emb = model.embed("a photo of a cat")

# Cosine similarity between the two embedding vectors
similarity = np.dot(img_emb, txt_emb) / (np.linalg.norm(img_emb) * np.linalg.norm(txt_emb))

Transcription Models

Speech-to-text models for audio transcription.
| Model | Size | Features | RAM (INT4) | NPU |
|---|---|---|---|---|
| openai/whisper-tiny | 39M | transcription, embed | ~100MB | ✓ |
| openai/whisper-base | 74M | transcription, embed | ~150MB | ✓ |
| openai/whisper-small | 244M | transcription, embed | ~200MB | ✓ |
| openai/whisper-medium | 769M | transcription, embed | ~600MB | ✓ |
Languages: 99 languages (multilingual)
Best For: Multilingual transcription, high accuracy
NPU Support: Apple NPU on all models
Download & Run:
cactus download openai/whisper-small --precision INT4
cactus transcribe openai/whisper-small --file audio.wav
Live Transcription:
# Transcribe from microphone
cactus transcribe openai/whisper-small
Python API:
import cactus

model = cactus.load("openai/whisper-small")
result = model.transcribe("audio.wav")
print(result.text)
print(result.language)  # Detected language

Specialized Models

| Model | Size | Features | RAM |
|---|---|---|---|
| snakers4/silero-vad | 1.5M | vad | ~10MB |
Best For: Detecting speech in audio streams
Use Case: Pre-processing before transcription
import cactus

vad = cactus.load("snakers4/silero-vad")
is_speech = vad.detect(audio_chunk)
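Since the typical use is gating audio before transcription, consecutive speech chunks can be grouped into segments and only those sent to Whisper. A sketch of that grouping (the chunking scheme and `is_speech` predicate are assumptions; `vad.detect` above would supply the per-chunk decision):

```python
def speech_segments(chunks, is_speech):
    # Group consecutive speech chunks into segments for transcription.
    segments, current = [], []
    for chunk in chunks:
        if is_speech(chunk):
            current.append(chunk)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments
```

Each returned segment can then be concatenated and passed to a transcription model, skipping silence entirely.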
| Model | Size | Features | RAM (INT4) |
|---|---|---|---|
| nomic-ai/nomic-embed-text-v2-moe | 137M | embed | ~100MB |
| Qwen/Qwen3-Embedding-0.6B | 600M | embed | ~400MB |
Best For: Semantic search, RAG, similarity
import cactus

model = cactus.load("nomic-ai/nomic-embed-text-v2-moe")
embeddings = model.embed(
    texts=["query", "document 1", "document 2"],
    normalize=True
)
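With `normalize=True`, cosine similarity reduces to a dot product, so ranking documents against a query is a single matrix-vector product. A minimal ranking sketch with NumPy (the embeddings here stand in for `model.embed` output):

```python
import numpy as np

def top_k(query_emb, doc_embs, k=2):
    # For normalized vectors, dot product equals cosine similarity.
    scores = doc_embs @ query_emb
    order = np.argsort(scores)[::-1][:k]
    return order.tolist(), scores[order].tolist()
```

Feeding the query row and document rows of the normalized embedding matrix into `top_k` yields the indices of the closest documents, which is the core retrieval step in a RAG pipeline.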

Model Download & Conversion

Downloading Models

Use the cactus download command to fetch and convert models:
# Download with default precision (INT4)
cactus download LiquidAI/LFM2-1.2B

# Specify precision
cactus download openai/whisper-small --precision INT8
cactus download Qwen/Qwen3-0.6B --precision FP16

# For gated models (requires HuggingFace token)
cactus download meta-llama/Llama-3.2-1B --token YOUR_HF_TOKEN

# Force reconversion from source
cactus download google/gemma-3-1b-it --reconvert

Converting Custom Models

Convert your own fine-tuned models:
# Convert from HuggingFace format
cactus convert ./my-model --precision INT4

# Convert with LoRA merge
cactus convert ./base-model --lora ./lora-adapter --precision INT4

# Convert from local directory
cactus convert /path/to/safetensors/dir ./output --precision INT8
Supported Architectures:
  • Gemma (1, 2, 3)
  • Qwen (2, 3)
  • LFM2 / LFM2.5
  • Whisper
  • Parakeet (CTC, TDT)
  • SigLIP-2 (vision encoders)

Model Storage

Models are stored in the weights/ directory:
weights/
├── google-gemma-3-270m-it/
│   ├── config.json
│   ├── tokenizer.json
│   ├── layer.0.weight
│   ├── layer.1.weight
│   └── ...
├── LiquidAI-LFM2-1.2B/
└── openai-whisper-small/
Each model directory contains:
  • config.json: Model configuration
  • tokenizer.json: BPE tokenizer
  • *.weight: Memory-mapped weight files (one per layer)
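Per-layer weight files let the runtime memory-map each layer and have the OS page data in on first access rather than reading everything eagerly. The idea, sketched with Python's mmap (the file layout here is illustrative, not the actual Cactus weight format):

```python
import mmap
import os
import tempfile

def map_weights(path):
    # Map the file read-only; pages are loaded lazily on first access.
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Demo with a stand-in weight file
with tempfile.NamedTemporaryFile(delete=False, suffix=".weight") as f:
    f.write(b"\x00" * 1024)
mm = map_weights(f.name)
first_bytes = mm[:4]  # only now are those pages actually touched
mm.close()
os.remove(f.name)
```

Because each layer is a separate mapping, unused layers cost address space but no physical RAM until they are read.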

RAM Usage & Performance

Memory Requirements by Precision

| Precision | Memory per Param | 1B Model | 2.6B Model |
|---|---|---|---|
| INT4 | 0.5 bytes | ~500MB | ~1.3GB |
| INT8 | 1 byte | ~1GB | ~2.6GB |
| FP16 | 2 bytes | ~2GB | ~5.2GB |
Recommendation: Use INT4 for best mobile experience. Quality loss is minimal (less than 1% on most benchmarks).
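The figures above follow directly from parameter count times bytes per parameter (KV cache and activations add runtime overhead on top, not counted here). A quick estimate:

```python
# Bytes per parameter at each supported precision
BYTES_PER_PARAM = {"INT4": 0.5, "INT8": 1.0, "FP16": 2.0}

def estimate_weight_ram_gb(num_params, precision):
    # Weights only; KV cache and activations add overhead on top.
    return num_params * BYTES_PER_PARAM[precision] / 1e9

estimate_weight_ram_gb(2.6e9, "INT4")  # ~1.3 GB, matching the table
```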

Device Recommendations

iPhone 15 Pro+, Galaxy S24 Ultra, Pixel 9 Pro
  • LFM2.5-1.2B (INT4) - Excellent
  • Gemma-3-1B (INT4) - Excellent
  • LFM2-VL-1.6B (INT4) - Good
  • Whisper-Small (INT4) - Excellent

Next Steps

  • Architecture: How Cactus’s three-layer design works
  • Quantization: INT4/INT8/FP16 precision guide
  • Engine API: Using models in your app
  • Fine-Tuning: Train custom models
