Custom Models
Cactus supports converting custom models and fine-tuned adapters for deployment to mobile devices. This guide covers model conversion, quantization options, and testing your custom models.
Converting Models with LoRA
The `cactus convert` command merges LoRA adapters with base models and converts them to Cactus format.
Basic Conversion
Command Options
| Flag | Description | Default |
|---|---|---|
| `--precision INT4\|INT8\|FP16` | Weight quantization level | INT4 |
| `--lora <path>` | Path to LoRA adapter (local or HF Hub) | None |
| `--token <token>` | HuggingFace API token for gated models | None |
| `--reconvert` | Force reconversion from source | False |
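As an illustration of how the flags in the table combine, here is a minimal Python sketch that assembles an argument list for `cactus convert`. The helper name `build_convert_args` is hypothetical, and passing the base model as a positional argument directly after `convert` is an assumption not confirmed by this page; only the four flags come from the table above.

```python
from typing import List, Optional

# Precision values listed in the Command Options table.
VALID_PRECISIONS = ("INT4", "INT8", "FP16")


def build_convert_args(
    model: str,
    precision: str = "INT4",
    lora: Optional[str] = None,
    token: Optional[str] = None,
    reconvert: bool = False,
) -> List[str]:
    """Assemble an argument list for `cactus convert` from the documented flags.

    Assumption: the base model is a positional argument placed right after
    `convert`. Confirm this against the CLI's own usage output.
    """
    if precision not in VALID_PRECISIONS:
        raise ValueError(f"precision must be one of {VALID_PRECISIONS}, got {precision!r}")
    args = ["cactus", "convert", model, "--precision", precision]
    if lora is not None:
        args += ["--lora", lora]
    if token is not None:
        args += ["--token", token]
    if reconvert:
        args.append("--reconvert")
    return args


if __name__ == "__main__":
    # "./my-adapter" is a hypothetical local adapter path, for illustration only.
    print(" ".join(build_convert_args("Qwen/Qwen3-0.6B", precision="INT8", lora="./my-adapter")))
```

The returned list can be handed to `subprocess.run` once the positional syntax has been checked against your installed version.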
Model Format Requirements
Supported Base Models
Cactus supports the following model architectures:
- Gemma 3: `google/gemma-3-270m-it`, `google/gemma-3-1b-it`
- Qwen 3: `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-1.7B`
- LFM 2/2.5: `LiquidAI/LFM2-350M`, `LiquidAI/LFM2.5-1.2B-Instruct`, `LiquidAI/LFM2-8B-A1B`
- SmolLM 2: Coming soon
LoRA Adapter Format
Your LoRA adapter must:
- Be trained with Unsloth, PEFT, or compatible LoRA libraries
- Target standard transformer modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- Use a rank (r) between 8 and 64 (recommended: 16-32 for mobile)
- Include a valid `adapter_config.json` file
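The requirements above can be checked before conversion with a short script. The sketch below assumes the adapter was saved in PEFT's format, where `adapter_config.json` stores the rank under `r` and the targeted modules under `target_modules`. The function names are hypothetical, and this is not the validation Cactus itself performs.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

# Module names listed in the adapter requirements.
STANDARD_MODULES = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
}


def check_adapter_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed adapter_config.json."""
    problems: List[str] = []

    rank = config.get("r")
    if isinstance(rank, bool) or not isinstance(rank, int):
        problems.append("missing or non-integer rank 'r'")
    elif not 8 <= rank <= 64:
        problems.append(f"rank r={rank} is outside the 8-64 range")

    targets = config.get("target_modules")
    if isinstance(targets, str):
        # PEFT also accepts a pattern string, which cannot be compared by name.
        problems.append("target_modules is a pattern string; check it by hand")
    elif not targets:
        problems.append("missing target_modules")
    else:
        unexpected = sorted(set(targets) - STANDARD_MODULES)
        if unexpected:
            problems.append(f"non-standard target modules: {unexpected}")
    return problems


def check_adapter_dir(adapter_dir: str) -> List[str]:
    """Load adapter_config.json from a local adapter directory and check it."""
    path = Path(adapter_dir) / "adapter_config.json"
    if not path.is_file():
        return ["adapter_config.json not found"]
    try:
        with path.open() as f:
            config = json.load(f)
    except json.JSONDecodeError:
        return ["adapter_config.json is not valid JSON"]
    return check_adapter_config(config)


if __name__ == "__main__":
    example = {"r": 16, "target_modules": ["q_proj", "v_proj"]}
    print(check_adapter_config(example) or "no problems found")
```

An empty list means the adapter meets the rank and module requirements listed here; it says nothing about training quality.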
Weight Quantization Options
INT4 (Default)
- Benefits: ~50% memory reduction vs INT8, fastest inference
- Trade-offs: Minimal quality loss with task-specific fine-tunes
- Best for: Production deployment on budget devices
INT8
- Benefits: Better quality retention than INT4
- Trade-offs: 2x memory usage vs INT4
- Best for: Quality-critical applications, mid-range devices
- Performance: 60-70 tok/s on iPhone 17 Pro, 13-18 tok/s on Pixel 6a (Qwen3-0.6B fine-tune; see Performance Expectations)
FP16
- Benefits: No weight quantization, so no quantization-related quality loss
- Trade-offs: 4x memory usage vs INT4, slower inference
- Best for: Development, benchmarking, high-end devices only
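The memory ratios quoted for the three precisions follow from the number of bits stored per weight. As a rough sketch, weight storage can be estimated as parameters × bits ÷ 8. The function name is hypothetical, and the estimate ignores quantization scales, the KV cache, activations, and runtime overhead, so measured RAM usage on a device can differ from it.

```python
# Bits stored per weight at each precision level.
BITS_PER_WEIGHT = {"INT4": 4, "INT8": 8, "FP16": 16}


def weight_megabytes(num_params: int, precision: str) -> float:
    """Estimate raw weight storage in megabytes (1 MB = 10**6 bytes)."""
    bits = BITS_PER_WEIGHT[precision]
    return num_params * bits / 8 / 1e6


if __name__ == "__main__":
    params = 1_000_000_000  # illustrative 1B-parameter model, not a specific checkpoint
    int4 = weight_megabytes(params, "INT4")
    for precision in ("INT4", "INT8", "FP16"):
        size = weight_megabytes(params, precision)
        print(f"{precision}: {size:.0f} MB ({size / int4:.0f}x INT4)")
```

This is why INT8 is listed at 2x and FP16 at 4x the memory of INT4: the bit widths are 8 and 16 against 4.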
Testing Converted Models
Interactive Testing (Mac/Linux)
Benchmark Mode
Benchmark mode reports:
- Prefill tokens per second (TPS)
- Decode tokens per second
- Time to first token
- RAM usage
- Model confidence scores
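For readers comparing these numbers with their own measurements, the throughput metrics are usually derived from three timestamps. The sketch below uses common definitions (prefill measured up to the first token, decode measured after it); the class and field names are hypothetical, and Cactus may compute its figures differently.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    prompt_tokens: int      # tokens in the input prompt
    generated_tokens: int   # tokens produced, including the first one
    start: float            # time the request was submitted (seconds)
    first_token_at: float   # time the first generated token arrived
    end: float              # time the last generated token arrived

    @property
    def time_to_first_token(self) -> float:
        return self.first_token_at - self.start

    @property
    def prefill_tps(self) -> float:
        # Approximation: treats all time before the first token as prefill.
        return self.prompt_tokens / self.time_to_first_token

    @property
    def decode_tps(self) -> float:
        # The first token is attributed to prefill, so it is excluded here.
        return (self.generated_tokens - 1) / (self.end - self.first_token_at)


if __name__ == "__main__":
    run = RunTiming(prompt_tokens=100, generated_tokens=51,
                    start=0.0, first_token_at=0.5, end=2.5)
    print(f"TTFT: {run.time_to_first_token:.2f} s")
    print(f"Prefill: {run.prefill_tps:.1f} tok/s")
    print(f"Decode: {run.decode_tps:.1f} tok/s")
```

Separating the two rates matters because prefill processes the whole prompt in one batch, while decode produces one token at a time and is usually the slower figure.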
Testing on iOS Device
Requirements:
- Xcode installed
- iPhone connected via USB
- Device unlocked and trusted
Testing on Android Device
Requirements:
- Android SDK/NDK installed
- USB debugging enabled
- Device connected via ADB
Example: End-to-End Custom Model
1. Train LoRA Adapter (Colab/GPU)
2. Convert for Cactus
3. Test Locally
4. Deploy to iOS
Performance Expectations
INT8 Qwen3-0.6B (Custom Fine-Tune)
| Device | Decode TPS | RAM Usage |
|---|---|---|
| iPhone 17 Pro | 60-70 | ~200MB |
| iPhone 13 Mini | 25-35 | ~400MB |
| Galaxy S25 Ultra | 30-40 | ~500MB |
| Pixel 6a | 13-18 | ~450MB |
| Raspberry Pi 5 | 10-15 | ~350MB |
INT8 Gemma3-270m (Task-Specific)
| Device | Decode TPS | RAM Usage |
|---|---|---|
| iPhone 17 Pro | 150+ | ~120MB |
| Raspberry Pi 5 | 23 | ~200MB |
Performance varies based on model complexity, context length, and device thermal state. Use the `--benchmark` flag for accurate measurements on your target device.
Troubleshooting
Conversion Fails
Verify that the LoRA adapter includes a valid `adapter_config.json` file.
Poor Quality After Conversion
- Try INT8 instead of INT4: `--precision INT8`
- Verify adapter trained properly (check validation loss)
- Test with different prompts and temperatures
Model Too Large for Device
- Use INT4 quantization: `--precision INT4`
- Try a smaller base model (e.g., Gemma3-270m instead of Qwen3-1.7B)
- Reduce context window at runtime (see Performance Tuning)
See Also
- Fine-Tuning Guide — Training LoRA adapters with Unsloth
- Performance Tuning — Optimize runtime performance
- Compatibility — Weight versioning and breaking changes