Fine-Tuning
Cactus enables you to train custom LoRA adapters on GPU and deploy them to mobile devices with minimal quality loss. This guide covers training with Unsloth, merging adapters, and deploying to phones.
Overview
The fine-tuning workflow:
- Train on GPU — Use Unsloth on Google Colab or a local CUDA GPU to train LoRA adapters
- Merge & Convert — Use `cactus convert` to merge the adapter with the base model and quantize it
- Deploy to Mobile — Package the converted model with your iOS/Android app
- Run On-Device — Inference runs entirely on-device with the Cactus engine
Training LoRA Adapters
Prerequisites
- Google Colab with GPU (free tier works)
- OR local machine with CUDA GPU
- Unsloth library installed
- Training dataset in instruction format
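The training script below expects each example to carry an `"input"` (user prompt) and an `"output"` (target reply). As a minimal sketch of the instruction format — the records and the `train.jsonl` filename are illustrative, not part of Cactus or Unsloth:

```python
import json

# Hypothetical instruction-format records: one "input" (user prompt)
# and one "output" (target assistant reply) per example.
records = [
    {"input": "Summarize: Cactus runs LLMs on phones.",
     "output": "Cactus is an on-device LLM inference engine."},
    {"input": "Translate 'hello' to French.",
     "output": "Bonjour."},
]

# Write one JSON object per line (JSONL), a common format that
# datasets.load_dataset("json", ...) can read directly.
with open("train.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```

Any source works as long as the resulting dataset exposes those two columns for the `dataset.map(...)` step below.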
Basic Training Script
```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
import torch

# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-270m-it",  # or Qwen3-0.6B, LFM2-350M
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=None,
)

# Configure LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Rank: 16-32 recommended for mobile
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  # 0 is optimal for inference
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Prepare dataset (assumes `dataset` is already loaded with
# "input" and "output" columns in instruction format)
dataset = dataset.map(lambda x: {
    "text": tokenizer.apply_chat_template(
        [{"role": "user", "content": x["input"]},
         {"role": "assistant", "content": x["output"]}],
        tokenize=False,
    )
})

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=2048,
    dataset_text_field="text",
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
    ),
)
trainer.train()

# Save adapter
model.save_pretrained("my-lora-adapter")
tokenizer.save_pretrained("my-lora-adapter")

# Optional: Push to Hub
model.push_to_hub("username/my-lora-adapter", token="...")
```
Recommended Hyperparameters
For Mobile Deployment
| Parameter | Recommended Value | Notes |
|---|---|---|
| `r` (rank) | 16-32 | Lower = smaller adapter, faster inference |
| `lora_alpha` | Same as rank | Typically set equal to rank |
| `lora_dropout` | 0 | Dropout hurts mobile inference |
| `max_seq_length` | 2048 | Balance memory and context |
| `learning_rate` | 2e-4 to 5e-4 | Higher for small datasets |
| `num_train_epochs` | 3-5 | More epochs for small datasets |
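To see why low rank keeps adapters small, a rough parameter count for r=16: each adapted weight matrix of shape (d_out, d_in) gains r × (d_in + d_out) LoRA parameters. The layer dimensions below are assumed, roughly sized for a ~0.6B model, and simplified (attention projections treated as square), so treat the result as an order-of-magnitude sketch:

```python
def lora_params(n_layers, shapes, r):
    # A is (r, d_in), B is (d_out, r): r * (d_in + d_out) params per matrix
    return n_layers * sum(r * (d_in + d_out) for (d_out, d_in) in shapes)

hidden, inter, n_layers = 1024, 3072, 28  # assumed dims, not from a model card
shapes = [
    (hidden, hidden), (hidden, hidden), (hidden, hidden), (hidden, hidden),  # q,k,v,o (simplified)
    (inter, hidden), (inter, hidden), (hidden, inter),                       # gate, up, down
]
params = lora_params(n_layers, shapes, r=16)
print(f"~{params / 1e6:.1f}M LoRA params, ~{params * 2 / 2**20:.0f} MB at FP16")
```

Roughly 9M adapter parameters, i.e. a few tens of MB before merging — doubling the rank doubles this, which is why r=16-32 is the sweet spot for mobile.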
Target Modules
Always include these projection layers:
```python
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj"]
```
Smaller Models for Mobile: Use Gemma3-270m, Qwen3-0.6B, or LFM2-350M as base models. Larger models (>1B params) may not run smoothly on budget devices.
Merging Adapters with Base Models
Setup Cactus
```bash
git clone https://github.com/cactus-compute/cactus
cd cactus
source ./setup
```
Merge and Convert
The `cactus convert` command merges your LoRA adapter with the base model and converts the result to Cactus format:
```bash
# From local adapter
cactus convert Qwen/Qwen3-0.6B ./my-qwen3-0.6b --lora ./my-lora-adapter

# From HuggingFace Hub
cactus convert Qwen/Qwen3-0.6B ./my-qwen3-0.6b --lora username/my-lora-adapter

# With INT8 quantization (better quality)
cactus convert Qwen/Qwen3-0.6B ./my-qwen3-0.6b \
    --lora ./my-lora-adapter \
    --precision INT8

# With HuggingFace token (for gated models)
cactus convert google/gemma-3-1b-it ./my-gemma3 \
    --lora ./my-lora-adapter \
    --token hf_...
```
Quantization During Merge
| Precision | Memory | Quality | Best For |
|---|---|---|---|
| INT4 | Lowest (1x) | Good | Production, budget devices |
| INT8 | Medium (2x) | Better | Mid-range devices, quality-critical |
| FP16 | Highest (4x) | Best | Development, high-end only |
Recommendation: Start with INT8 for testing, switch to INT4 for production if quality is acceptable.
Cactus v1.15+ uses lossless quantization techniques, providing a 1.5x performance improvement while maintaining quality.
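The 1x/2x/4x memory ratios in the table follow directly from bits per weight. A quick sketch of weight-storage size (the 0.6B parameter count is illustrative; actual files also contain embeddings, metadata, and per-block scales):

```python
def model_bytes(n_params: float, bits: int) -> float:
    """Weight storage only; runtime RAM adds KV cache and activations."""
    return n_params * bits / 8

n = 600e6  # ~0.6B parameters, illustrative
int4 = model_bytes(n, 4)
for name, bits in [("INT4", 4), ("INT8", 8), ("FP16", 16)]:
    size = model_bytes(n, bits)
    print(f"{name}: {size / 1e6:.0f} MB ({size / int4:.0f}x)")
```

So for a ~0.6B model, INT4 weights land around 300 MB on disk versus roughly 1.2 GB at FP16 — the difference between fitting comfortably on a budget phone and not.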
Deployment to Mobile
iOS/macOS Deployment
1. Build Native Library
Output:
```
Build complete!
Static libraries:
  Device:    /path/to/cactus/apple/libcactus-device.a
  Simulator: /path/to/cactus/apple/libcactus-simulator.a
XCFrameworks:
  iOS:   /path/to/cactus/apple/cactus-ios.xcframework
  macOS: /path/to/cactus/apple/cactus-macos.xcframework
```
2. Add to Xcode Project
- Copy the `my-qwen3-0.6b/` folder to your Xcode project
- Link `cactus-ios.xcframework` under Frameworks, Libraries, and Embedded Content
- Set the framework to Embed & Sign
3. Use in Swift
```swift
import Foundation

class CactusModel {
    private var model: OpaquePointer?

    init(modelName: String) {
        let modelPath = Bundle.main.path(forResource: modelName, ofType: nil)!
        model = cactus_init(modelPath, nil, false)
    }

    func complete(messages: [[String: String]]) -> String {
        let jsonData = try! JSONSerialization.data(withJSONObject: messages)
        let messagesJson = String(data: jsonData, encoding: .utf8)!
        var response = [CChar](repeating: 0, count: 4096)
        cactus_complete(model, messagesJson, &response, response.count,
                        nil, nil, nil, nil)
        return String(cString: response)
    }

    deinit {
        if let model = model {
            cactus_destroy(model)
        }
    }
}

// Usage
let model = CactusModel(modelName: "my-qwen3-0.6b")
let result = model.complete(messages: [
    ["role": "user", "content": "Hello!"]
])
print(result)
```
Android Deployment
1. Build Native Library
Output:
```
Build complete!
Shared library: /path/to/cactus/android/libcactus.so
Static library: /path/to/cactus/android/libcactus.a
```
2. Add to Android Project
- Copy `libcactus.so` to `app/src/main/jniLibs/arm64-v8a/`
- Copy the `my-qwen3-0.6b/` folder to `app/src/main/assets/`
3. Use in Kotlin
```kotlin
import android.content.Context
import org.json.JSONArray
import java.io.File

class CactusWrapper {
    init {
        System.loadLibrary("cactus")
    }

    external fun init(modelPath: String, contextSize: Long, corpusDir: String?): Long
    external fun complete(model: Long, messagesJson: String, bufferSize: Int): String
    external fun destroy(model: Long)
}

class CactusModel(context: Context, modelName: String) {
    private val cactus = CactusWrapper()
    private val model: Long

    init {
        // Copy model from assets to cache (copyAssetFolder is a helper
        // you provide; models cannot be memory-mapped from APK assets)
        val modelDir = File(context.cacheDir, modelName)
        copyAssetFolder(context, modelName, modelDir.absolutePath)
        model = cactus.init(modelDir.absolutePath, 2048, null)
    }

    fun complete(messages: List<Map<String, String>>): String {
        val messagesJson = JSONArray(messages).toString()
        return cactus.complete(model, messagesJson, 4096)
    }

    fun close() {
        cactus.destroy(model)
    }
}

// Usage
val model = CactusModel(context, "my-qwen3-0.6b")
val result = model.complete(listOf(
    mapOf("role" to "user", "content" to "Hello!")
))
println(result)
model.close()
```
INT8 Qwen3-0.6B Fine-Tune
| Device | Decode TPS | RAM Usage |
|---|---|---|
| iPhone 17 Pro | 60-70 tok/s | ~200MB |
| iPhone 13 Mini | 25-35 tok/s | ~400MB |
| Galaxy S25 Ultra | 30-40 tok/s | ~500MB |
| Pixel 6a | 13-18 tok/s | ~450MB |
| Raspberry Pi 5 | 10-15 tok/s | ~350MB |
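To translate decode TPS into user-perceived latency, a quick back-of-envelope sketch (the 100-token reply length is illustrative, and prefill time is ignored):

```python
def reply_latency(n_tokens: int, decode_tps: float) -> float:
    """Seconds to stream a reply, ignoring prompt-prefill time."""
    return n_tokens / decode_tps

# A 100-token reply at the low end of each device's range above
for device, tps in [("iPhone 17 Pro", 60.0), ("Pixel 6a", 13.0)]:
    print(f"{device}: {reply_latency(100, tps):.1f} s")
```

Anything above ~10 tok/s streams faster than most people read, so even the slowest devices in the table remain usable for chat-style output.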
INT8 Gemma3-270m Task-Specific
| Device | Decode TPS | RAM Usage |
|---|---|---|
| iPhone 17 Pro | 150+ tok/s | ~120MB |
| iPhone 13 Mini | 80+ tok/s | ~200MB |
| Raspberry Pi 5 | 23 tok/s | ~200MB |
Testing Your Fine-Tune
Local Testing (Mac/Linux)
```bash
# Interactive playground
cactus run ./my-qwen3-0.6b

# Benchmark mode
cactus test --model ./my-qwen3-0.6b --benchmark
```
On-Device Testing
```bash
# Test on connected iPhone
cactus test --model ./my-qwen3-0.6b --ios

# Test on connected Android phone
cactus test --model ./my-qwen3-0.6b --android
```
Device must be connected via USB, unlocked, and trusted. For iOS, Xcode must be installed. For Android, USB debugging must be enabled.
Best Practices
Training
- Start small — Use Gemma3-270m or Qwen3-0.6B for mobile
- Low rank — Use r=16 or r=32 to minimize adapter size
- No dropout — Set `lora_dropout=0` for inference
- Validate quality — Test on holdout set before deployment
Deployment
- Test quantization — Compare INT4 vs INT8 quality on your task
- Measure on-device — Use `cactus test --ios`/`--android` for accurate benchmarks
- Monitor memory — Check RAM usage under different context lengths
- Thermal management — Long inference sessions may throttle on phones
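On the memory point: KV cache grows linearly with context length, so checking a couple of context sizes catches most surprises. A rough estimate, with assumed Qwen3-0.6B-like dimensions (verify against your model's actual config) and an FP16 cache:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # K and V each store ctx_len * n_kv_heads * head_dim values per layer
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem

# Assumed dims: 28 layers, 8 KV heads, head_dim 128 (illustrative)
for ctx in (512, 1024, 2048):
    print(f"ctx={ctx}: ~{kv_cache_bytes(28, 8, 128, ctx) / 2**20:.0f} MB")
```

At the full 2048-token context this hypothetical cache alone approaches the total RAM figures in the benchmark tables, which is why trimming the KV cache window is one of the first levers for constrained devices.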
Troubleshooting
Training Issues
Out of memory during training
```python
# Reduce batch size
per_device_train_batch_size=1
gradient_accumulation_steps=8

# Use gradient checkpointing
use_gradient_checkpointing="unsloth"
```
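Halving the per-device batch size while doubling accumulation steps keeps the effective batch size (and thus the training dynamics) unchanged while roughly halving activation memory:

```python
def effective_batch_size(per_device: int, accum_steps: int, n_gpus: int = 1) -> int:
    # Gradients are accumulated across accum_steps micro-batches per update
    return per_device * accum_steps * n_gpus

print(effective_batch_size(2, 4))  # original recipe above
print(effective_batch_size(1, 8))  # low-memory recipe, same effective batch
```

Both configurations update on 8 examples per optimizer step, so no learning-rate retuning is needed.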
Poor validation loss
- Increase training epochs
- Try higher learning rate (3e-4 to 5e-4)
- Add more training data
- Reduce rank if overfitting
Deployment Issues
Model too slow on device
- Use INT4 quantization
- Switch to smaller base model
- Reduce KV cache window (see Performance Tuning)
Quality degraded after quantization
- Use INT8 instead of INT4
- Verify training quality first
- Check adapter was properly merged
See Also