
Fine-Tuning

Cactus enables you to train custom LoRA adapters on GPU and deploy them to mobile devices with minimal quality loss. This guide covers training with Unsloth, merging adapters, and deploying to phones.

Overview

The fine-tuning workflow:
  1. Train on GPU — Use Unsloth on Google Colab or local GPU to train LoRA adapters
  2. Merge & Convert — Use cactus convert to merge adapter with base model and quantize
  3. Deploy to Mobile — Package converted model with your iOS/Android app
  4. Run On-Device — Inference runs entirely on-device with Cactus engine

Training LoRA Adapters

Prerequisites

  • Google Colab with GPU (free tier works)
  • OR local machine with CUDA GPU
  • Unsloth library installed
  • Training dataset in instruction format

Basic Training Script

import torch  # used below to pick fp16 vs bf16
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-270m-it",  # or Qwen3-0.6B, LFM2-350M
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=None,
)

# Configure LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                      # Rank: 16-32 recommended for mobile
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_alpha=16,
    lora_dropout=0,            # 0 is optimal for inference
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Prepare dataset (an instruction dataset with "input"/"output" fields)
from datasets import load_dataset
dataset = load_dataset("username/my-dataset", split="train")  # replace with your dataset
dataset = dataset.map(lambda x: {
    "text": tokenizer.apply_chat_template(
        [{"role": "user", "content": x["input"]},
         {"role": "assistant", "content": x["output"]}],
        tokenize=False
    )
})

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=2048,
    dataset_text_field="text",
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
    ),
)

trainer.train()

# Save adapter
model.save_pretrained("my-lora-adapter")
tokenizer.save_pretrained("my-lora-adapter")

# Optional: Push to Hub
model.push_to_hub("username/my-lora-adapter", token="...")

For Mobile Deployment

| Parameter | Recommended Value | Notes |
| --- | --- | --- |
| r (rank) | 16-32 | Lower = smaller adapter, faster inference |
| lora_alpha | Same as rank | Typically set equal to rank |
| lora_dropout | 0 | Dropout hurts mobile inference |
| max_seq_length | 2048 | Balance memory and context |
| learning_rate | 2e-4 to 5e-4 | Higher for small datasets |
| num_train_epochs | 3-5 | More epochs for small datasets |

Target Modules

Always include these projection layers:
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj"]
Smaller Models for Mobile: Use Gemma3-270m, Qwen3-0.6B, or LFM2-350M as base models. Larger models (>1B params) may not run smoothly on budget devices.
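
To see why low rank keeps adapters mobile-friendly, you can estimate adapter size from the rank and the layer dimensions. The dimensions below are illustrative placeholders, not the exact config of any model listed above:

```python
# Each adapted weight W (d_out x d_in) gains two low-rank factors:
# A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) extra parameters.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

# Illustrative dimensions (placeholders, not a real model config)
hidden, ffn, layers, rank = 1024, 3072, 28, 16

per_layer = (
    4 * lora_params(hidden, hidden, rank)  # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(hidden, ffn, rank)   # gate_proj, up_proj
    + lora_params(ffn, hidden, rank)       # down_proj
)
total = per_layer * layers
print(f"~{total / 1e6:.1f}M adapter params, ~{total * 2 / 1e6:.0f} MB at FP16")
```

With these placeholder dimensions the estimate comes out to roughly 9M parameters (about 18 MB at FP16); since the count is linear in r, doubling the rank doubles the adapter.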

Merging Adapters with Base Models

Setup Cactus

git clone https://github.com/cactus-compute/cactus
cd cactus
source ./setup

Merge and Convert

The cactus convert command merges your LoRA adapter with the base model and converts to Cactus format:
# From local adapter
cactus convert Qwen/Qwen3-0.6B ./my-qwen3-0.6b --lora ./my-lora-adapter

# From HuggingFace Hub
cactus convert Qwen/Qwen3-0.6B ./my-qwen3-0.6b --lora username/my-lora-adapter

# With INT8 quantization (better quality)
cactus convert Qwen/Qwen3-0.6B ./my-qwen3-0.6b \
  --lora ./my-lora-adapter \
  --precision INT8

# With HuggingFace token (for gated models)
cactus convert google/gemma-3-1b-it ./my-gemma3 \
  --lora ./my-lora-adapter \
  --token hf_...

Quantization During Merge

| Precision | Memory | Quality | Best For |
| --- | --- | --- | --- |
| INT4 | Lowest (1x) | Good | Production, budget devices |
| INT8 | Medium (2x) | Better | Mid-range devices, quality-critical |
| FP16 | Highest (4x) | Best | Development, high-end only |
Recommendation: Start with INT8 for testing, switch to INT4 for production if quality is acceptable.
Cactus v1.15+ uses lossless quantization techniques, providing 1.5x performance improvement while maintaining quality.
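
The 1x/2x/4x memory ratios in the table follow directly from bits per weight (4-bit, 8-bit, 16-bit). A quick sketch:

```python
# Weight-only memory scales with bits per weight; the table's 1x/2x/4x
# ratios are just 4-bit : 8-bit : 16-bit relative to INT4.
bits_per_weight = {"INT4": 4, "INT8": 8, "FP16": 16}

def relative_memory(precision: str, baseline: str = "INT4") -> float:
    return bits_per_weight[precision] / bits_per_weight[baseline]

for p in bits_per_weight:
    print(f"{p}: {relative_memory(p):.0f}x")
```

Keep in mind this covers weights only; activations and the KV cache add a context-length-dependent overhead on top.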

Deployment to Mobile

iOS/macOS Deployment

1. Build Native Library

cactus build --apple
Output:
Build complete!
Static libraries:
  Device: /path/to/cactus/apple/libcactus-device.a
  Simulator: /path/to/cactus/apple/libcactus-simulator.a
XCFrameworks:
  iOS: /path/to/cactus/apple/cactus-ios.xcframework
  macOS: /path/to/cactus/apple/cactus-macos.xcframework

2. Add to Xcode Project

  1. Copy my-qwen3-0.6b/ folder to your Xcode project
  2. Link cactus-ios.xcframework in Frameworks, Libraries, and Embedded Content
  3. Set framework to Embed & Sign

3. Use in Swift

import Foundation

class CactusModel {
    private var model: OpaquePointer?
    
    init(modelName: String) {
        let modelPath = Bundle.main.path(forResource: modelName, ofType: nil)!
        model = cactus_init(modelPath, nil, false)
    }
    
    func complete(messages: [[String: String]]) -> String {
        let jsonData = try! JSONSerialization.data(withJSONObject: messages)
        let messagesJson = String(data: jsonData, encoding: .utf8)!
        
        var response = [CChar](repeating: 0, count: 4096)
        cactus_complete(model, messagesJson, &response, response.count, 
                        nil, nil, nil, nil)
        
        return String(cString: response)
    }
    
    deinit {
        if let model = model {
            cactus_destroy(model)
        }
    }
}

// Usage
let model = CactusModel(modelName: "my-qwen3-0.6b")
let result = model.complete(messages: [
    ["role": "user", "content": "Hello!"]
])
print(result)

Android Deployment

1. Build Native Library

cactus build --android
Output:
Build complete!
Shared library: /path/to/cactus/android/libcactus.so
Static library: /path/to/cactus/android/libcactus.a

2. Add to Android Project

  1. Copy libcactus.so to app/src/main/jniLibs/arm64-v8a/
  2. Copy my-qwen3-0.6b/ folder to app/src/main/assets/

3. Use in Kotlin

class CactusWrapper {
    companion object {
        init {
            // Load libcactus.so once per process, not per instance
            System.loadLibrary("cactus")
        }
    }
    
    external fun init(modelPath: String, contextSize: Long, corpusDir: String?): Long
    external fun complete(model: Long, messagesJson: String, bufferSize: Int): String
    external fun destroy(model: Long)
}

class CactusModel(context: Context, modelName: String) {
    private val cactus = CactusWrapper()
    private val model: Long
    
    init {
        // Copy model from assets to cache
        val modelDir = File(context.cacheDir, modelName)
        copyAssetFolder(context, modelName, modelDir.absolutePath)
        
        model = cactus.init(modelDir.absolutePath, 2048, null)
    }
    
    fun complete(messages: List<Map<String, String>>): String {
        val messagesJson = JSONArray(messages).toString()
        return cactus.complete(model, messagesJson, 4096)
    }
    
    fun close() {
        cactus.destroy(model)
    }
}

// Usage
val model = CactusModel(context, "my-qwen3-0.6b")
val result = model.complete(listOf(
    mapOf("role" to "user", "content" to "Hello!")
))
println(result)
model.close()

Performance Benchmarks

INT8 Qwen3-0.6B Fine-Tune

| Device | Decode TPS | RAM Usage |
| --- | --- | --- |
| iPhone 17 Pro | 60-70 tok/s | ~200MB |
| iPhone 13 Mini | 25-35 tok/s | ~400MB |
| Galaxy S25 Ultra | 30-40 tok/s | ~500MB |
| Pixel 6a | 13-18 tok/s | ~450MB |
| Raspberry Pi 5 | 10-15 tok/s | ~350MB |
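
Decode TPS maps directly to perceived latency: reply length divided by throughput. Using rough midpoints of the ranges above:

```python
# Seconds to stream a reply of `tokens` length at a given decode rate.
def reply_seconds(tokens: int, tps: float) -> float:
    return tokens / tps

# Rough midpoints of the TPS ranges in the table above
for device, tps in [("iPhone 17 Pro", 65), ("Galaxy S25 Ultra", 35), ("Pixel 6a", 15)]:
    print(f"{device}: ~{reply_seconds(150, tps):.1f} s for a 150-token reply")
```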

INT8 Gemma3-270m Task-Specific

| Device | Decode TPS | RAM Usage |
| --- | --- | --- |
| iPhone 17 Pro | 150+ tok/s | ~120MB |
| iPhone 13 Mini | 80+ tok/s | ~200MB |
| Raspberry Pi 5 | 23 tok/s | ~200MB |

Testing Your Fine-Tune

Local Testing (Mac/Linux)

# Interactive playground
cactus run ./my-qwen3-0.6b

# Benchmark mode
cactus test --model ./my-qwen3-0.6b --benchmark

On-Device Testing

# Test on connected iPhone
cactus test --model ./my-qwen3-0.6b --ios

# Test on connected Android phone
cactus test --model ./my-qwen3-0.6b --android
Device must be connected via USB, unlocked, and trusted. For iOS, Xcode must be installed. For Android, USB debugging must be enabled.

Best Practices

Training

  1. Start small — Use Gemma3-270m or Qwen3-0.6B for mobile
  2. Low rank — Use r=16 or r=32 to minimize adapter size
  3. No dropout — Set lora_dropout=0 for inference
  4. Validate quality — Test on holdout set before deployment

Deployment

  1. Test quantization — Compare INT4 vs INT8 quality on your task
  2. Measure on-device — Use cactus test --ios/--android for accurate benchmarks
  3. Monitor memory — Check RAM usage under different context lengths
  4. Thermal management — Long inference sessions may throttle on phones
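
On the memory point: the KV cache is the main context-length-dependent cost, and you can estimate it before profiling. The dimensions below are illustrative placeholders, not a specific model's config:

```python
# KV cache bytes = 2 (keys and values) x layers x kv_heads x head_dim
#                  x seq_len x bytes per element
def kv_cache_bytes(seq_len: int, layers: int = 28, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# How the cache grows with context length (placeholder dims, FP16 cache)
for ctx in (512, 1024, 2048):
    print(f"{ctx} tokens: ~{kv_cache_bytes(ctx) / 1e6:.0f} MB")
```

The footprint is linear in context length, which is why reducing the KV cache window is an effective lever on memory-constrained phones.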

Troubleshooting

Training Issues

Out of memory during training
# Reduce batch size
per_device_train_batch_size=1
gradient_accumulation_steps=8

# Use gradient checkpointing
use_gradient_checkpointing="unsloth"
Poor validation loss
  • Increase training epochs
  • Try higher learning rate (3e-4 to 5e-4)
  • Add more training data
  • Reduce rank if overfitting

Deployment Issues

Model too slow on device
  • Use INT4 quantization
  • Switch to smaller base model
  • Reduce KV cache window (see Performance Tuning)
Quality degraded after quantization
  • Use INT8 instead of INT4
  • Verify training quality first
  • Check adapter was properly merged
