Supported Models

Cactus supports a growing list of state-of-the-art models optimized for mobile and edge devices. All models support INT4, INT8, and FP16 quantization.

Language Models

Text generation models for chat, completion, and tool calling.
| Model | Size | Features | RAM (INT4) |
|---|---|---|---|
| google/gemma-3-270m-it | 270M | completion | ~200MB |
| google/functiongemma-270m-it | 270M | completion, tools | ~200MB |
| google/gemma-3-1b-it | 1B | completion | ~800MB |
Architecture: Gemma decoder-only transformer
Context Length: 8K tokens
Best For: General chat, instruction following
Download & Run:
cactus download google/gemma-3-270m-it --precision INT4
cactus run google/gemma-3-270m-it
Tool Calling Example:
import cactus

model = cactus.load("google/functiongemma-270m-it")
tools = [
    {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            }
        }
    }
]

response = model.complete(
    messages=[{"role": "user", "content": "What's the weather in SF?"}],
    tools=tools
)
print(response.function_calls)  # [{"name": "get_weather", ...}]
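Executing the returned calls is up to the app. A minimal dispatcher sketch, assuming each entry in `function_calls` carries a `name` and an `arguments` dict (the `arguments` key and the handler registry are assumptions for illustration, not part of the documented response shape):

```python
def dispatch(function_calls, handlers):
    # Route each model-requested call to the matching Python function.
    results = []
    for call in function_calls:
        handler = handlers[call["name"]]
        results.append(handler(**call.get("arguments", {})))
    return results

# Hypothetical handler registry for the get_weather tool above
handlers = {"get_weather": lambda location: f"Sunny in {location}"}
dispatch([{"name": "get_weather", "arguments": {"location": "SF"}}], handlers)
```

The results can then be appended to `messages` as tool responses for a follow-up completion.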
| Model | Size | Features | RAM (INT4) |
|---|---|---|---|
| LiquidAI/LFM2-350M | 350M | completion, tools, embed | ~250MB |
| LiquidAI/LFM2-700M | 700M | completion, tools, embed | ~500MB |
| LiquidAI/LFM2.5-1.2B-Thinking | 1.2B | completion, tools, embed | ~700MB |
| LiquidAI/LFM2.5-1.2B-Instruct | 1.2B | completion, tools, embed | ~700MB |
| LiquidAI/LFM2-2.6B | 2.6B | completion, tools, embed | ~1.8GB |
| LiquidAI/LFM2-8B-A1B | 8B (1B active) | completion, tools, embed | ~6GB |
Architecture: Liquid Foundation Model (LFM) - MoE with liquid time-constant networks
Context Length: 32K tokens
Best For: Long context, reasoning, embeddings
Benchmarks (LFM2.5-1.2B-Instruct):
| Device | Prefill | Decode | RAM |
|---|---|---|---|
| Mac M4 Pro | 582 t/s | 100 t/s | 76MB |
| iPhone 17 Pro | 327 t/s | 48 t/s | 108MB |
| Galaxy S25 Ultra | 255 t/s | 37 t/s | 1.5GB |
Download & Run:
cactus download LiquidAI/LFM2.5-1.2B-Instruct --precision INT4
cactus run LiquidAI/LFM2.5-1.2B-Instruct
Embeddings Example:
import cactus

model = cactus.load("LiquidAI/LFM2-1.2B")
embeddings = model.embed(
    texts=["Hello world", "Cactus is fast"],
    normalize=True
)
print(embeddings.shape)  # (2, 1024)
| Model | Size | Features | RAM (INT4) |
|---|---|---|---|
| Qwen/Qwen3-0.6B | 600M | completion, tools, embed | ~400MB |
| Qwen/Qwen3-1.7B | 1.7B | completion, tools, embed | ~1.2GB |
| Qwen/Qwen3-Embedding-0.6B | 600M | embed | ~400MB |
Architecture: Qwen decoder-only transformer
Context Length: 32K tokens
Best For: Multilingual, Chinese language tasks
Download & Run:
cactus download Qwen/Qwen3-0.6B --precision INT4
cactus run Qwen/Qwen3-0.6B

Vision Models

Multi-modal models that understand both text and images.
| Model | Size | Features | RAM (INT4) |
|---|---|---|---|
| LiquidAI/LFM2-VL-450M | 450M | vision, txt & img embed | ~300MB |
| LiquidAI/LFM2.5-VL-1.6B | 1.6B | vision, txt & img embed | ~1.1GB |
Architecture: LFM2 + SigLIP-2 vision encoder
Image Resolution: Up to 2048px with dynamic tiling
NPU Support: Apple NPU (iPhone, iPad, Mac)
Benchmarks (LFM2.5-VL-1.6B):
| Device | First Token | Decode |
|---|---|---|
| Mac M4 Pro | 0.2s | 98 t/s |
| iPad M3 | 0.3s | 69 t/s |
| iPhone 17 Pro | 0.3s | 48 t/s |
| Galaxy S25 Ultra | - | 34 t/s |
Download & Run:
cactus download LiquidAI/LFM2-VL-450M --precision INT4
cactus run LiquidAI/LFM2-VL-450M
Usage Example:
import cactus

model = cactus.load("LiquidAI/LFM2-VL-450M")
response = model.complete(
    messages=[
        {
            "role": "user",
            "content": "Describe this image in detail",
            "images": ["photo.jpg"]
        }
    ]
)
print(response.text)
Image Embeddings:
import numpy as np

# Get image and text embeddings for similarity search
img_emb = model.embed_image("photo.jpg")
txt_emb = model.embed("a photo of a cat")

# Cosine similarity between the two embedding vectors
similarity = np.dot(img_emb, txt_emb) / (np.linalg.norm(img_emb) * np.linalg.norm(txt_emb))

Transcription Models

Speech-to-text models for audio transcription.
| Model | Size | Features | RAM (INT4) | NPU |
|---|---|---|---|---|
| openai/whisper-tiny | 39M | transcription, embed | ~100MB | ✓ |
| openai/whisper-base | 74M | transcription, embed | ~150MB | ✓ |
| openai/whisper-small | 244M | transcription, embed | ~200MB | ✓ |
| openai/whisper-medium | 769M | transcription, embed | ~600MB | ✓ |
Languages: 99 languages (multilingual)
Best For: Multilingual transcription, high accuracy
NPU Support: Apple NPU on all models
Download & Run:
cactus download openai/whisper-small --precision INT4
cactus transcribe openai/whisper-small --file audio.wav
Live Transcription:
# Transcribe from microphone
cactus transcribe openai/whisper-small
Python API:
import cactus

model = cactus.load("openai/whisper-small")
result = model.transcribe("audio.wav")
print(result.text)
print(result.language)  # Detected language

Specialized Models

| Model | Size | Features | RAM |
|---|---|---|---|
| snakers4/silero-vad | 1.5M | vad | ~10MB |
Best For: Detecting speech in audio streams
Use Case: Pre-processing before transcription
import cactus

vad = cactus.load("snakers4/silero-vad")
is_speech = vad.detect(audio_chunk)
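Since the typical use is gating audio before transcription, consecutive speech chunks can be grouped into segments and only those sent to Whisper. A sketch of that grouping (the chunking scheme and `is_speech` predicate are assumptions; `vad.detect` above would supply the per-chunk decision):

```python
def speech_segments(chunks, is_speech):
    # Group consecutive speech chunks into segments for transcription.
    segments, current = [], []
    for chunk in chunks:
        if is_speech(chunk):
            current.append(chunk)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments
```

Each returned segment can then be concatenated and passed to a transcription model, skipping silence entirely.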
| Model | Size | Features | RAM (INT4) |
|---|---|---|---|
| nomic-ai/nomic-embed-text-v2-moe | 137M | embed | ~100MB |
| Qwen/Qwen3-Embedding-0.6B | 600M | embed | ~400MB |
Best For: Semantic search, RAG, similarity
import cactus

model = cactus.load("nomic-ai/nomic-embed-text-v2-moe")
embeddings = model.embed(
    texts=["query", "document 1", "document 2"],
    normalize=True
)
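With `normalize=True`, cosine similarity reduces to a dot product, so ranking documents against a query is a single matrix-vector product. A minimal ranking sketch with NumPy (the embeddings here stand in for `model.embed` output):

```python
import numpy as np

def top_k(query_emb, doc_embs, k=2):
    # For normalized vectors, dot product equals cosine similarity.
    scores = doc_embs @ query_emb
    order = np.argsort(scores)[::-1][:k]
    return order.tolist(), scores[order].tolist()
```

Feeding the query row and document rows of the normalized embedding matrix into `top_k` yields the indices of the closest documents, which is the core retrieval step in a RAG pipeline.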

Model Download & Conversion

Downloading Models

Use the cactus download command to fetch and convert models:
# Download with default precision (INT4)
cactus download LiquidAI/LFM2-1.2B

# Specify precision
cactus download openai/whisper-small --precision INT8
cactus download Qwen/Qwen3-0.6B --precision FP16

# For gated models (requires HuggingFace token)
cactus download meta-llama/Llama-3.2-1B --token YOUR_HF_TOKEN

# Force reconversion from source
cactus download google/gemma-3-1b-it --reconvert

Converting Custom Models

Convert your own fine-tuned models:
# Convert from HuggingFace format
cactus convert ./my-model --precision INT4

# Convert with LoRA merge
cactus convert ./base-model --lora ./lora-adapter --precision INT4

# Convert from local directory
cactus convert /path/to/safetensors/dir ./output --precision INT8
Supported Architectures:
  • Gemma (1, 2, 3)
  • Qwen (2, 3)
  • LFM2 / LFM2.5
  • Whisper
  • Parakeet (CTC, TDT)
  • SigLIP-2 (vision encoders)

Model Storage

Models are stored in the weights/ directory:
weights/
├── google-gemma-3-270m-it/
│   ├── config.json
│   ├── tokenizer.json
│   ├── layer.0.weight
│   ├── layer.1.weight
│   └── ...
├── LiquidAI-LFM2-1.2B/
└── openai-whisper-small/
Each model directory contains:
  • config.json: Model configuration
  • tokenizer.json: BPE tokenizer
  • *.weight: Memory-mapped weight files (one per layer)
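Per-layer weight files let the runtime memory-map each layer and have the OS page data in on first access rather than reading everything eagerly. The idea, sketched with Python's mmap (the file layout here is illustrative, not the actual Cactus weight format):

```python
import mmap
import os
import tempfile

def map_weights(path):
    # Map the file read-only; pages are loaded lazily on first access.
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Demo with a stand-in weight file
with tempfile.NamedTemporaryFile(delete=False, suffix=".weight") as f:
    f.write(b"\x00" * 1024)
mm = map_weights(f.name)
first_bytes = mm[:4]  # only now are those pages actually touched
mm.close()
os.remove(f.name)
```

Because each layer is a separate mapping, unused layers cost address space but no physical RAM until they are read.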

RAM Usage & Performance

Memory Requirements by Precision

| Precision | Memory per Param | 1B Model | 2.6B Model |
|---|---|---|---|
| INT4 | 0.5 bytes | ~500MB | ~1.3GB |
| INT8 | 1 byte | ~1GB | ~2.6GB |
| FP16 | 2 bytes | ~2GB | ~5.2GB |
Recommendation: Use INT4 for best mobile experience. Quality loss is minimal (less than 1% on most benchmarks).
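The figures above follow directly from parameter count times bytes per parameter (KV cache and activations add runtime overhead on top, not counted here). A quick estimate:

```python
# Bytes per parameter at each supported precision
BYTES_PER_PARAM = {"INT4": 0.5, "INT8": 1.0, "FP16": 2.0}

def estimate_weight_ram_gb(num_params, precision):
    # Weights only; KV cache and activations add overhead on top.
    return num_params * BYTES_PER_PARAM[precision] / 1e9

estimate_weight_ram_gb(2.6e9, "INT4")  # ~1.3 GB, matching the table
```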

Device Recommendations

iPhone 15 Pro+, Galaxy S24 Ultra, Pixel 9 Pro
  • LFM2.5-1.2B (INT4) - Excellent
  • Gemma-3-1B (INT4) - Excellent
  • LFM2-VL-1.6B (INT4) - Good
  • Whisper-Small (INT4) - Excellent

Next Steps

  • Architecture: How Cactus’s three-layer design works
  • Quantization: INT4/INT8/FP16 precision guide
  • Engine API: Using models in your app
  • Fine-Tuning: Train custom models
