Supported Models
Cactus supports a growing list of state-of-the-art models optimized for mobile and edge devices. Most models support INT4, INT8, and FP16 quantization; exceptions are noted per model.
Language Models
Text generation models for chat, completion, and tool calling.
Gemma
| Model | Size | Features | RAM (INT4) |
| --- | --- | --- | --- |
| google/gemma-3-270m-it | 270M | completion | ~200MB |
| google/functiongemma-270m-it | 270M | completion, tools | ~200MB |
| google/gemma-3-1b-it | 1B | completion | ~800MB |
Architecture: Gemma decoder-only transformer
Context Length: 8K tokens
Best For: General chat, instruction following
Download & Run:
cactus download google/gemma-3-270m-it --precision INT4
cactus run google/gemma-3-270m-it
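A conversation that grows past the context window (8K tokens for the Gemma models above) has to be trimmed before each call. Below is a minimal sketch. It assumes messages use the same role/content dicts as the examples on this page. `trim_history` is a hypothetical helper, and `approx_tokens` is a crude characters-per-token heuristic that stands in for the model's real tokenizer.

```python
from typing import Callable


def trim_history(
    messages: list[dict],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> list[dict]:
    """Keep the newest messages whose combined token count fits max_tokens.

    A leading system message, if present, is always kept.
    """
    system = messages[:1] if messages and messages[0]["role"] == "system" else []
    budget = max_tokens - sum(count_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    # Walk backwards from the newest message and stop once the budget runs out
    for msg in reversed(messages[len(system):]):
        cost = count_tokens(msg["content"])
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))


def approx_tokens(text: str) -> int:
    # Hypothetical heuristic: roughly 4 characters per token for English text
    return max(1, len(text) // 4)
```

For example, `trim_history(history, max_tokens=8192 - 512, count_tokens=approx_tokens)` reserves about 512 tokens for the reply. This assumes "8K" means 8192 tokens.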
Tool Calling Example:
import cactus

model = cactus.load("google/functiongemma-270m-it")

# Each tool declares its arguments as a JSON Schema object
tools = [
    {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"],
        },
    }
]

response = model.complete(
    messages=[{"role": "user", "content": "What's the weather in SF?"}],
    tools=tools,
)
print(response.function_calls)  # [{"name": "get_weather", ...}]
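After `model.complete` returns `function_calls`, your app runs the requested tools itself. Below is a minimal dispatch sketch. It assumes each call is a dict with a `"name"` key, as the comment above shows, plus an `"arguments"` key holding either a dict or a JSON string. The `"arguments"` key and the `get_weather` implementation are hypothetical.

```python
import json
from typing import Any, Callable


def get_weather(location: str) -> str:
    # Hypothetical local implementation; replace with a real weather lookup
    return f"Sunny in {location}"


# Map tool names (as declared in the `tools` schema) to Python callables
TOOL_REGISTRY: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def dispatch(function_calls: list[dict]) -> list[dict]:
    """Run each requested tool and collect its result (or an error)."""
    results = []
    for call in function_calls:
        fn = TOOL_REGISTRY.get(call["name"])
        if fn is None:
            results.append({"name": call["name"], "error": "unknown tool"})
            continue
        # Assumed shape: {"name": ..., "arguments": {...} or "<json string>"}
        args = call.get("arguments", {})
        if isinstance(args, str):
            args = json.loads(args) if args else {}
        results.append({"name": call["name"], "result": fn(**args)})
    return results
```

The results can then be sent back to the model as a follow-up message. The exact message format Cactus expects for tool results is not covered on this page.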
LFM (Liquid AI)
| Model | Size | Features | RAM (INT4) |
| --- | --- | --- | --- |
| LiquidAI/LFM2-350M | 350M | completion, tools, embed | ~250MB |
| LiquidAI/LFM2-700M | 700M | completion, tools, embed | ~500MB |
| LiquidAI/LFM2.5-1.2B-Thinking | 1.2B | completion, tools, embed | ~700MB |
| LiquidAI/LFM2.5-1.2B-Instruct | 1.2B | completion, tools, embed | ~700MB |
| LiquidAI/LFM2-2.6B | 2.6B | completion, tools, embed | ~1.8GB |
| LiquidAI/LFM2-8B-A1B | 8B (1B active) | completion, tools, embed | ~6GB |
Architecture: Liquid Foundation Model (LFM2), a hybrid of gated short convolutions and grouped-query attention; LFM2-8B-A1B is a mixture-of-experts (MoE) variant
Context Length: 32K tokens
Best For: Long context, reasoning, embeddings
Benchmarks (LFM2.5-1.2B-Instruct):
| Device | Prefill | Decode | RAM |
| --- | --- | --- | --- |
| Mac M4 Pro | 582 t/s | 100 t/s | 76MB |
| iPhone 17 Pro | 327 t/s | 48 t/s | 108MB |
| Galaxy S25 Ultra | 255 t/s | 37 t/s | 1.5GB |
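One rough way to read the benchmark numbers above: end-to-end response time is approximately prompt tokens divided by prefill speed, plus output tokens divided by decode speed. The sketch below is plain Python and does not call Cactus. It assumes constant throughput and ignores model loading and tokenization. The 500-token prompt and 200-token reply are arbitrary example sizes.

```python
def estimate_latency_s(prompt_tokens: int, output_tokens: int,
                       prefill_tps: float, decode_tps: float) -> float:
    """Rough generation time: prefill the whole prompt, then decode the reply."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Prefill/decode throughput (tokens/s) for LFM2.5-1.2B-Instruct, from the benchmarks above
BENCHMARKS = {
    "Mac M4 Pro": (582, 100),
    "iPhone 17 Pro": (327, 48),
    "Galaxy S25 Ultra": (255, 37),
}

for device, (prefill, decode) in BENCHMARKS.items():
    t = estimate_latency_s(500, 200, prefill, decode)
    print(f"{device}: ~{t:.1f}s for a 500-token prompt and a 200-token reply")
```

Decode speed dominates for long replies, while prefill speed matters more for long prompts such as RAG context.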
Download & Run:
cactus download LiquidAI/LFM2.5-1.2B-Instruct --precision INT4
cactus run LiquidAI/LFM2.5-1.2B-Instruct
Embeddings Example:
import cactus

model = cactus.load("LiquidAI/LFM2-1.2B")
embeddings = model.embed(
    texts=["Hello world", "Cactus is fast"],
    normalize=True,
)
print(embeddings.shape)  # (2, 1024)
Qwen
| Model | Size | Features | RAM (INT4) |
| --- | --- | --- | --- |
| Qwen/Qwen3-0.6B | 600M | completion, tools, embed | ~400MB |
| Qwen/Qwen3-1.7B | 1.7B | completion, tools, embed | ~1.2GB |
| Qwen/Qwen3-Embedding-0.6B | 600M | embed | ~400MB |
Architecture: Qwen decoder-only transformer
Context Length: 32K tokens
Best For: Multilingual, Chinese language tasks
Download & Run:
cactus download Qwen/Qwen3-0.6B --precision INT4
cactus run Qwen/Qwen3-0.6B
Vision Models
Multi-modal models that understand both text and images.
| Model | Size | Features | RAM (INT4) |
| --- | --- | --- | --- |
| LiquidAI/LFM2-VL-450M | 450M | vision, text & image embed | ~300MB |
| LiquidAI/LFM2.5-VL-1.6B | 1.6B | vision, text & image embed | ~1.1GB |
Architecture: LFM2 + SigLIP-2 vision encoder
Image Resolution: Up to 2048px with dynamic tiling
NPU Support: Apple NPU (iPhone, iPad, Mac)
Benchmarks (LFM2.5-VL-1.6B):
| Device | First Token | Decode |
| --- | --- | --- |
| Mac M4 Pro | 0.2s | 98 t/s |
| iPad M3 | 0.3s | 69 t/s |
| iPhone 17 Pro | 0.3s | 48 t/s |
| Galaxy S25 Ultra | - | 34 t/s |
Download & Run:
cactus download LiquidAI/LFM2-VL-450M --precision INT4
cactus run LiquidAI/LFM2-VL-450M
Usage Example:
import cactus

model = cactus.load("LiquidAI/LFM2-VL-450M")
response = model.complete(
    messages=[
        {
            "role": "user",
            "content": "Describe this image in detail",
            "images": ["photo.jpg"],
        }
    ]
)
print(response.text)
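Image Embeddings:
import numpy as np

# Get image and text embeddings for similarity search
img_emb = np.asarray(model.embed_image("photo.jpg")).ravel()
txt_emb = np.asarray(model.embed("a photo of a cat")).ravel()
# Cosine similarity: dot product divided by the product of the L2 norms
similarity = float(img_emb @ txt_emb / (np.linalg.norm(img_emb) * np.linalg.norm(txt_emb)))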
Transcription Models
Speech-to-text models for audio transcription.
Whisper
| Model | Size | Features | RAM (INT4) | NPU |
| --- | --- | --- | --- | --- |
| openai/whisper-tiny | 39M | transcription, embed | ~100MB | ✅ |
| openai/whisper-base | 74M | transcription, embed | ~150MB | ✅ |
| openai/whisper-small | 244M | transcription, embed | ~200MB | ✅ |
| openai/whisper-medium | 769M | transcription, embed | ~600MB | ✅ |
Languages: 99 languages (multilingual)
Best For: Multilingual transcription, high accuracy
NPU Support: Apple NPU on all models
Download & Run:
cactus download openai/whisper-small --precision INT4
cactus transcribe openai/whisper-small --file audio.wav
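Whisper models were trained on 16 kHz mono audio. This page does not say whether `cactus transcribe` resamples other formats, so a quick pre-check with Python's standard `wave` module can help explain a poor result. `check_wav` is a hypothetical helper, and it only reads uncompressed PCM WAV files.

```python
import wave


def check_wav(path: str, expected_rate: int = 16_000) -> dict:
    """Report basic properties of a PCM WAV file and flag likely mismatches."""
    with wave.open(path, "rb") as wf:
        info = {
            "channels": wf.getnchannels(),
            "sample_rate": wf.getframerate(),
            "sample_width_bytes": wf.getsampwidth(),
            "duration_s": wf.getnframes() / wf.getframerate(),
        }
    warnings = []
    if info["channels"] != 1:
        warnings.append("not mono")
    if info["sample_rate"] != expected_rate:
        warnings.append(f"sample rate is not {expected_rate} Hz")
    info["warnings"] = warnings
    return info
```

An empty `check_wav("audio.wav")["warnings"]` list means the file is already 16 kHz mono.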
Live Transcription:
# Transcribe from microphone
cactus transcribe openai/whisper-small
Python API:
import cactus

model = cactus.load("openai/whisper-small")
result = model.transcribe("audio.wav")
print(result.text)
print(result.language)  # Detected language
Parakeet
| Model | Size | Features | RAM (INT4) | NPU |
| --- | --- | --- | --- | --- |
| nvidia/parakeet-ctc-0.6b | 600M | transcription, embed | ~400MB | ✅ |
| nvidia/parakeet-ctc-1.1b | 1.1B | transcription, embed | ~700MB | ✅ |
| nvidia/parakeet-tdt-0.6b-v3 | 600M | transcription, embed | ~400MB | ✅ |
Languages: English only (parakeet-tdt-0.6b-v3 additionally covers 25 European languages)
Best For: Ultra-fast English transcription, lowest latency
NPU Support: Apple NPU
Benchmarks (Parakeet 1.1B, 30s audio):
| Device | Latency | Decode Speed |
| --- | --- | --- |
| Mac M4 Pro | 0.1s | 900k+ t/s |
| iPad M3 | 0.3s | 800k+ t/s |
| iPhone 17 Pro | 0.3s | 300k+ t/s |
| Raspberry Pi 5 | 4.5s | 180k+ t/s |
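The Parakeet benchmark transcribes 30 seconds of audio, so the latency column can be converted into a real-time speedup: audio duration divided by processing time. The sketch below is plain-Python arithmetic on the figures above, not a Cactus API.

```python
def realtime_speedup(audio_seconds: float, latency_seconds: float) -> float:
    """How many times faster than real time the audio was transcribed."""
    return audio_seconds / latency_seconds


# Latency for 30 s of audio with Parakeet 1.1B, from the benchmarks above
LATENCIES = {"Mac M4 Pro": 0.1, "iPad M3": 0.3, "iPhone 17 Pro": 0.3, "Raspberry Pi 5": 4.5}

for device, latency in LATENCIES.items():
    print(f"{device}: ~{realtime_speedup(30, latency):.0f}x real time")
```

For example, 4.5 s for 30 s of audio on a Raspberry Pi 5 is roughly 6.7x faster than real time. That is comfortably fast enough for live captioning.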
Download & Run:
cactus download nvidia/parakeet-ctc-1.1b --precision INT4
cactus transcribe nvidia/parakeet-ctc-1.1b
Moonshine
| Model | Size | Features | RAM | Precision |
| --- | --- | --- | --- | --- |
| UsefulSensors/moonshine-base | 61M | transcription, embed | ~150MB | FP16 |
Languages: English only
Best For: Smallest model, edge devices
Note: Requires FP16 precision (no INT4/INT8 support)
Download & Run:
cactus download UsefulSensors/moonshine-base --precision FP16
cactus transcribe UsefulSensors/moonshine-base --precision FP16
Specialized Models
Voice Activity Detection (VAD)
| Model | Size | Features | RAM |
| --- | --- | --- | --- |
| snakers4/silero-vad | 1.5M | vad | ~10MB |
Best For: Detecting speech in audio streams
Use Case: Pre-processing before transcription
import cactus

vad = cactus.load("snakers4/silero-vad")
# audio_chunk: placeholder for a short window of audio from your input stream
is_speech = vad.detect(audio_chunk)
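When VAD pre-processes audio for transcription, the per-chunk decisions are usually merged into speech segments, and only those spans are sent to the transcriber. Below is a minimal sketch. It assumes `vad.detect` takes one chunk and returns a bool, as the snippet above suggests. So that the example runs without Cactus, it substitutes a hypothetical energy-threshold detector, which is far cruder than Silero VAD.

```python
from typing import Callable, Optional, Sequence


def speech_segments(
    chunks: Sequence[Sequence[float]],
    is_speech: Callable[[Sequence[float]], bool],
    chunk_seconds: float,
) -> list[tuple[float, float]]:
    """Merge consecutive speech chunks into (start_s, end_s) segments."""
    segments: list[tuple[float, float]] = []
    start: Optional[float] = None
    for i, chunk in enumerate(chunks):
        if is_speech(chunk):
            if start is None:
                start = i * chunk_seconds
        elif start is not None:
            segments.append((start, i * chunk_seconds))
            start = None
    # Close a segment that runs to the end of the audio
    if start is not None:
        segments.append((start, len(chunks) * chunk_seconds))
    return segments


def energy_detector(chunk: Sequence[float], threshold: float = 0.01) -> bool:
    # Hypothetical stand-in for vad.detect: mean squared amplitude above a threshold
    return sum(x * x for x in chunk) / max(len(chunk), 1) > threshold
```

With Cactus, the call would look like `speech_segments(chunks, vad.detect, chunk_seconds=...)`, with `chunk_seconds` set to your chunk length.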
Embedding Models
| Model | Size | Features | RAM (INT4) |
| --- | --- | --- | --- |
| nomic-ai/nomic-embed-text-v2-moe | 137M | embed | ~100MB |
| Qwen/Qwen3-Embedding-0.6B | 600M | embed | ~400MB |
Best For: Semantic search, RAG, similarity
import cactus

model = cactus.load("nomic-ai/nomic-embed-text-v2-moe")
embeddings = model.embed(
    texts=["query", "document 1", "document 2"],
    normalize=True,
)
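For semantic search, you embed the query and the documents, then rank the documents by cosine similarity to the query. Below is a minimal pure-Python sketch of the ranking step. With `normalize=True` the embeddings should already be unit length (assuming L2 normalization), in which case the extra normalization here is harmless. If the rows are NumPy arrays, convert them with `.tolist()` first.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def top_k(query: list[float], docs: list[list[float]], k: int = 3) -> list[tuple[int, float]]:
    """Return (doc_index, score) pairs for the k documents most similar to the query.

    Both sides are L2-normalized, so the dot product equals cosine similarity.
    """
    q = normalize(query)
    scores = [
        (i, sum(a * b for a, b in zip(q, normalize(d))))
        for i, d in enumerate(docs)
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]
```

In a RAG pipeline, the top-ranked documents would then be inserted into the prompt passed to a completion model.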
Model Download & Conversion
Downloading Models
Use the cactus download command to fetch and convert models:
# Download with default precision (INT4)
cactus download LiquidAI/LFM2-1.2B
# Specify precision
cactus download openai/whisper-small --precision INT8
cactus download Qwen/Qwen3-0.6B --precision FP16
# For gated models (requires HuggingFace token)
cactus download meta-llama/Llama-3.2-1B --token YOUR_HF_TOKEN
# Force reconversion from source
cactus download google/gemma-3-1b-it --reconvert
Converting Custom Models
Convert your own fine-tuned models:
# Convert from HuggingFace format
cactus convert ./my-model --precision INT4
# Convert with LoRA merge
cactus convert ./base-model --lora ./lora-adapter --precision INT4
# Convert from local directory
cactus convert /path/to/safetensors/dir ./output --precision INT8
Supported Architectures:
Gemma (1, 2, 3)
Qwen (2, 3)
LFM2 / LFM2.5
Whisper
Parakeet (CTC, TDT)
SigLIP-2 (vision encoders)
Model Storage
Models are stored in the weights/ directory:
weights/
├── google-gemma-3-270m-it/
│   ├── config.json
│   ├── tokenizer.json
│   ├── layer.0.weight
│   ├── layer.1.weight
│   └── ...
├── LiquidAI-LFM2-1.2B/
└── openai-whisper-small/
Each model directory contains:
config.json: Model configuration
tokenizer.json: BPE tokenizer
*.weight: Memory-mapped weight files (one per layer)
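You can walk the `weights/` directory to see which models are downloaded and how much disk space each one uses. This sketch relies on one assumption: the directory name is the Hugging Face model ID with `/` replaced by `-`, inferred from the tree above. `model_dir_name` and `model_disk_usage` are hypothetical helpers, not part of the Cactus CLI.

```python
from pathlib import Path


def model_dir_name(model_id: str) -> str:
    # Inferred naming: "google/gemma-3-270m-it" -> "google-gemma-3-270m-it"
    return model_id.replace("/", "-")


def model_disk_usage(weights_root: str, model_id: str) -> int:
    """Total size in bytes of all files in a model's directory (0 if not downloaded)."""
    model_dir = Path(weights_root) / model_dir_name(model_id)
    if not model_dir.is_dir():
        return 0
    return sum(p.stat().st_size for p in model_dir.rglob("*") if p.is_file())
```

For example, `model_disk_usage("weights", "openai/whisper-small") / 1e6` gives the size in megabytes.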
Memory Requirements by Precision
| Precision | Memory per Param | 1B Model | 2.6B Model |
| --- | --- | --- | --- |
| INT4 | 0.5 bytes | ~500MB | ~1.3GB |
| INT8 | 1 byte | ~1GB | ~2.6GB |
| FP16 | 2 bytes | ~2GB | ~5.2GB |
Recommendation: Use INT4 for the best mobile experience. Quality loss is minimal (less than 1% on most benchmarks).
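The memory-per-precision figures above follow directly from parameter count times bytes per parameter. Below is a small sketch of that arithmetic, using decimal units and counting weights only.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"INT4": 0.5, "INT8": 1.0, "FP16": 2.0}


def weight_memory_bytes(num_params: float, precision: str) -> float:
    """Approximate weight storage: parameter count x bytes per parameter.

    Excludes activations, the KV cache, and runtime overhead.
    """
    return num_params * BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    gb = weight_memory_bytes(2.6e9, precision) / 1e9
    print(f"2.6B params @ {precision}: ~{gb:.1f} GB")
```

For 2.6B parameters, this reproduces the ~1.3GB, ~2.6GB, and ~5.2GB figures. Actual memory use is higher once activations, the KV cache, and runtime buffers are included.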
Device Recommendations
High-End Phones
Devices: iPhone 15 Pro+, Galaxy S24 Ultra, Pixel 9 Pro
LFM2.5-1.2B (INT4) - Excellent
Gemma-3-1B (INT4) - Excellent
LFM2.5-VL-1.6B (INT4) - Good
Whisper-Small (INT4) - Excellent
Mid-Range Phones
Devices: iPhone 13, Galaxy A55, Pixel 6a
Gemma-3-270M (INT4) - Excellent
LFM2-700M (INT4) - Good
LFM2-VL-450M (INT4) - Good
Whisper-Tiny/Base (INT4) - Excellent
Budget Phones
Devices: iPhone SE, Galaxy A17, CMF Phone
Gemma-3-270M (INT4) - Good
LFM2-350M (INT4) - Good
Moonshine-Base (FP16) - Excellent
Whisper-Tiny (INT4) - Good
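As a rough complement to the recommendations above, the sketch below filters language models by a device's free-memory budget. It uses the approximate INT4 RAM figures from this page's model tables. The 25% headroom default and the function name are hypothetical choices, not Cactus settings, and RAM alone says nothing about speed or quality.

```python
# Approximate INT4 RAM figures (MB), taken from the model tables on this page
INT4_RAM_MB = {
    "google/gemma-3-270m-it": 200,
    "LiquidAI/LFM2-350M": 250,
    "LiquidAI/LFM2-700M": 500,
    "LiquidAI/LFM2.5-1.2B-Instruct": 700,
    "google/gemma-3-1b-it": 800,
    "LiquidAI/LFM2-2.6B": 1800,
}


def models_that_fit(budget_mb: float, headroom: float = 0.25) -> list[str]:
    """Models whose INT4 RAM estimate fits the budget, largest first.

    headroom is a hypothetical fraction of the budget reserved for the OS and the rest of the app.
    """
    usable = budget_mb * (1 - headroom)
    fitting = [(ram, name) for name, ram in INT4_RAM_MB.items() if ram <= usable]
    return [name for _, name in sorted(fitting, reverse=True)]
```

For example, `models_that_fit(1000)` leaves 750MB usable. It returns LFM2.5-1.2B-Instruct first, followed by the smaller models.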
Related Pages
Architecture: How Cactus's three-layer design works
Quantization: INT4/INT8/FP16 precision guide
Engine API: Using models in your app
Fine-Tuning: Train custom models