Custom Models

Cactus supports converting custom models and fine-tuned adapters for deployment to mobile devices. This guide covers model conversion, quantization options, and testing your custom models.

Converting Models with LoRA

The cactus convert command merges LoRA adapters with base models and converts them to Cactus format.

Basic Conversion

# Convert from local LoRA adapter
cactus convert Qwen/Qwen3-0.6B ./my-qwen3-0.6b --lora ./my-lora-adapter

# Convert from HuggingFace Hub
cactus convert google/gemma-3-270m-it ./my-gemma3 --lora username/my-lora-adapter

# With specific quantization
cactus convert LiquidAI/LFM2.5-1.2B-Instruct ./my-lfm --lora ./adapters/my-lora --precision INT8

Command Options

Flag                         Description                              Default
--precision INT4|INT8|FP16   Weight quantization level                INT4
--lora <path>                Path to LoRA adapter (local or HF Hub)   None
--token <token>              HuggingFace API token for gated models   None
--reconvert                  Force reconversion from source           False
Base Model Match: Always convert with the exact base model your LoRA adapter was trained on. Mismatched base models will produce incorrect outputs.

Model Format Requirements

Supported Base Models

Cactus supports the following model architectures:
  • Gemma 3: google/gemma-3-270m-it, google/gemma-3-1b-it
  • Qwen 3: Qwen/Qwen3-0.6B, Qwen/Qwen3-1.7B
  • LFM 2/2.5: LiquidAI/LFM2-350M, LiquidAI/LFM2.5-1.2B-Instruct, LiquidAI/LFM2-8B-A1B
  • SmolLM 2: Coming soon
For the complete list, see the Supported Models section in the README.

LoRA Adapter Format

Your LoRA adapter must:
  • Be trained with Unsloth, PEFT, or compatible LoRA libraries
  • Target standard transformer modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Use a rank (r) between 8 and 64 (16-32 recommended for mobile)
  • Include a valid adapter_config.json file
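The requirements above can be verified before conversion. Here is a minimal sketch that checks a standard PEFT adapter_config.json (the keys base_model_name_or_path, r, and target_modules are part of the PEFT schema; the helper itself is illustrative, not part of the cactus CLI):

```python
import json
from pathlib import Path

# Modules the adapter should target, per the requirements above.
REQUIRED_MODULES = {"q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"}

def check_adapter(config_path, expected_base):
    """Return a list of problems found in a PEFT adapter_config.json."""
    cfg = json.loads(Path(config_path).read_text())
    problems = []
    base = cfg.get("base_model_name_or_path")
    if base != expected_base:
        problems.append(f"base model is {base!r}, expected {expected_base!r}")
    r = cfg.get("r", 0)
    if not 8 <= r <= 64:
        problems.append(f"rank r={r} is outside the supported 8-64 range")
    missing = REQUIRED_MODULES - set(cfg.get("target_modules", []))
    if missing:
        problems.append(f"missing target modules: {sorted(missing)}")
    return problems
```

An empty result means the adapter passes these basic checks; anything else explains what to fix before running cactus convert.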

Weight Quantization Options

INT4 (Default)

cactus convert Qwen/Qwen3-0.6B ./my-model --lora ./adapter --precision INT4
  • Benefits: ~50% memory reduction vs INT8, fastest inference
  • Trade-offs: Minimal quality loss with task-specific fine-tunes
  • Best for: Production deployment on budget devices

INT8

cactus convert Qwen/Qwen3-0.6B ./my-model --lora ./adapter --precision INT8
  • Benefits: Better quality retention than INT4
  • Trade-offs: 2x memory usage vs INT4
  • Best for: Quality-critical applications, mid-range devices
  • Performance: 60-70 tok/s on iPhone 17 Pro, 13-18 tok/s on Pixel 6a

FP16

cactus convert Qwen/Qwen3-0.6B ./my-model --lora ./adapter --precision FP16
  • Benefits: Full precision, no quality loss
  • Trade-offs: 4x memory usage vs INT4, slower inference
  • Best for: Development, benchmarking, high-end devices only
Quantization is Lossless: Cactus v1.15+ uses hybrid inference with lossless quantization techniques, providing 1.5x performance improvement while maintaining quality.
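To make the memory trade-offs above concrete, here is a generic symmetric group-wise quantizer. This is a standard textbook technique, not Cactus's internal scheme: it shows why INT4 halves weight memory versus INT8 (two weights per byte versus one) at the cost of coarser rounding.

```python
# Generic symmetric quantization: map each float weight to a signed
# `bits`-bit integer via a per-group scale factor.

def quantize_group(weights, bits):
    """Quantize one group of float weights to signed `bits`-bit ints."""
    qmax = 2 ** (bits - 1) - 1          # 7 for INT4, 127 for INT8
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize_group(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.8, 0.33, 0.05]
q4, s4 = quantize_group(weights, 4)     # each value fits in 4 bits
restored = dequantize_group(q4, s4)
# Rounding error per weight is bounded by scale / 2; packed INT4 uses
# 2 weights per byte vs 2 bytes per weight for FP16 (a 4x reduction,
# plus a small per-group scale).
```

Smaller groups tighten the per-group scale and reduce error at the cost of slightly more scale-factor overhead, which is one reason task-specific fine-tunes tolerate INT4 well.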

Testing Converted Models

Interactive Testing (Mac/Linux)

# Test your converted model locally
cactus run ./my-qwen3-0.6b
This opens an interactive playground where you can test completions, tool calls, and streaming.

Benchmark Mode

# Run performance benchmarks
cactus test --model ./my-qwen3-0.6b --benchmark
Outputs:
  • Prefill tokens per second (TPS)
  • Decode tokens per second
  • Time to first token
  • RAM usage
  • Model confidence scores
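For reference, the timing metrics above are conventionally derived from raw token timestamps roughly as follows. The function and field names here are illustrative, not the cactus CLI's actual output schema:

```python
def benchmark_metrics(prompt_tokens, gen_timestamps, start_time):
    """Compute prefill TPS, decode TPS, and time-to-first-token (TTFT).

    gen_timestamps holds the wall-clock arrival time of each generated
    token (at least two), with gen_timestamps[0] marking the first decoded
    token, so start_time -> gen_timestamps[0] spans the prefill phase.
    """
    ttft = gen_timestamps[0] - start_time
    prefill_tps = prompt_tokens / ttft
    decode_time = gen_timestamps[-1] - gen_timestamps[0]
    decode_tps = (len(gen_timestamps) - 1) / decode_time
    return {"ttft_s": ttft, "prefill_tps": prefill_tps, "decode_tps": decode_tps}
```

For example, a 100-token prompt whose first token arrives 0.5 s after start gives 200 prefill TPS and a 0.5 s TTFT; 20 further tokens over the next second give 20 decode TPS.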

Testing on iOS Device

# Build and test on connected iPhone
cactus build --apple
cactus test --model ./my-model --ios
Requires:
  • Xcode installed
  • iPhone connected via USB
  • Device unlocked and trusted

Testing on Android Device

# Build and test on connected Android phone
cactus build --android
cactus test --model ./my-model --android
Requires:
  • Android SDK/NDK installed
  • USB debugging enabled
  • Device connected via ADB

Example: End-to-End Custom Model

1. Train LoRA Adapter (Colab/GPU)

from unsloth import FastLanguageModel
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-0.6B",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    use_gradient_checkpointing="unsloth",
)

# `dataset` is your prepared SFT dataset (e.g. loaded with datasets.load_dataset)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=2048,
)

trainer.train()
model.save_pretrained("./task-specific-adapter")

2. Convert for Cactus

cactus convert Qwen/Qwen3-0.6B ./qwen3-task-specific \
  --lora ./task-specific-adapter \
  --precision INT8

3. Test Locally

cactus run ./qwen3-task-specific

4. Deploy to iOS

cactus build --apple
# Copy model folder to Xcode project
# Link cactus-ios.xcframework
let modelPath = Bundle.main.path(forResource: "qwen3-task-specific", ofType: nil)!
let model = cactus_init(modelPath, nil, false)

let messages = "[{\"role\":\"user\",\"content\":\"Your query\"}]"
var response = [CChar](repeating: 0, count: 4096)
cactus_complete(model, messages, &response, response.count, nil, nil, nil, nil)

print(String(cString: response))
cactus_destroy(model)

Performance Expectations

INT8 Qwen3-0.6B (Custom Fine-Tune)

Device             Decode TPS   RAM Usage
iPhone 17 Pro      60-70        ~200MB
iPhone 13 Mini     25-35        ~400MB
Galaxy S25 Ultra   30-40        ~500MB
Pixel 6a           13-18        ~450MB
Raspberry Pi 5     10-15        ~350MB

INT8 Gemma3-270m (Task-Specific)

Device             Decode TPS   RAM Usage
iPhone 17 Pro      150+         ~120MB
Raspberry Pi 5     23           ~200MB

Performance varies based on model complexity, context length, and device thermal state. Use the --benchmark flag for accurate measurements on your target device.

Troubleshooting

Conversion Fails

Error: Base model architecture mismatch
Solution: Verify your LoRA adapter was trained on the exact base model you’re converting. Check the adapter’s adapter_config.json file.

Poor Quality After Conversion

  1. Try INT8 instead of INT4: --precision INT8
  2. Verify adapter trained properly (check validation loss)
  3. Test with different prompts and temperatures

Model Too Large for Device

  1. Use INT4 quantization: --precision INT4
  2. Try a smaller base model (e.g., Gemma3-270m instead of Qwen3-1.7B)
  3. Reduce context window at runtime (see Performance Tuning)

See Also
