Fine-Tuning
Cactus enables you to train custom LoRA adapters on GPU and deploy them to mobile devices with minimal quality loss. This guide covers training with Unsloth, merging adapters, and deploying to phones.
Overview
The fine-tuning workflow:
- Train on GPU — Use Unsloth on Google Colab or a local CUDA GPU to train LoRA adapters
- Merge & Convert — Use `cactus convert` to merge the adapter with the base model and quantize it
- Deploy to Mobile — Package the converted model with your iOS/Android app
- Run On-Device — Inference runs entirely on-device with the Cactus engine
Training LoRA Adapters
Prerequisites
- Google Colab with GPU (free tier works)
- OR local machine with CUDA GPU
- Unsloth library installed
- Training dataset in instruction format
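The training script below expects each example to carry an `"input"` (user prompt) and an `"output"` (target reply). As a minimal sketch of the instruction format — the records and the `train.jsonl` filename are illustrative, not part of Cactus or Unsloth:

```python
import json

# Hypothetical instruction-format records: one "input" (user prompt)
# and one "output" (target assistant reply) per example.
records = [
    {"input": "Summarize: Cactus runs LLMs on phones.",
     "output": "Cactus is an on-device LLM inference engine."},
    {"input": "Translate 'hello' to French.",
     "output": "Bonjour."},
]

# Write one JSON object per line (JSONL), a common format that
# datasets.load_dataset("json", ...) can read directly.
with open("train.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```

Any source works as long as the resulting dataset exposes those two columns for the `dataset.map(...)` step below.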
Basic Training Script
```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
import torch

# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-270m-it",  # or Qwen3-0.6B, LFM2-350M
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=None,
)

# Configure LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Rank: 16-32 recommended for mobile
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  # 0 is optimal for inference
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Prepare dataset (assumes `dataset` is already loaded with
# "input" and "output" columns in instruction format)
dataset = dataset.map(lambda x: {
    "text": tokenizer.apply_chat_template(
        [{"role": "user", "content": x["input"]},
         {"role": "assistant", "content": x["output"]}],
        tokenize=False,
    )
})

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=2048,
    dataset_text_field="text",
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
    ),
)
trainer.train()

# Save adapter
model.save_pretrained("my-lora-adapter")
tokenizer.save_pretrained("my-lora-adapter")

# Optional: Push to Hub
model.push_to_hub("username/my-lora-adapter", token="...")
```
Recommended Hyperparameters
For Mobile Deployment
| Parameter | Recommended Value | Notes |
|---|---|---|
| `r` (rank) | 16-32 | Lower = smaller adapter, faster inference |
| `lora_alpha` | Same as rank | Typically set equal to rank |
| `lora_dropout` | 0 | Dropout hurts mobile inference |
| `max_seq_length` | 2048 | Balance memory and context |
| `learning_rate` | 2e-4 to 5e-4 | Higher for small datasets |
| `num_train_epochs` | 3-5 | More epochs for small datasets |
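To see why low rank keeps adapters small, a rough parameter count for r=16: each adapted weight matrix of shape (d_out, d_in) gains r × (d_in + d_out) LoRA parameters. The layer dimensions below are assumed, roughly sized for a ~0.6B model, and simplified (attention projections treated as square), so treat the result as an order-of-magnitude sketch:

```python
def lora_params(n_layers, shapes, r):
    # A is (r, d_in), B is (d_out, r): r * (d_in + d_out) params per matrix
    return n_layers * sum(r * (d_in + d_out) for (d_out, d_in) in shapes)

hidden, inter, n_layers = 1024, 3072, 28  # assumed dims, not from a model card
shapes = [
    (hidden, hidden), (hidden, hidden), (hidden, hidden), (hidden, hidden),  # q,k,v,o (simplified)
    (inter, hidden), (inter, hidden), (hidden, inter),                       # gate, up, down
]
params = lora_params(n_layers, shapes, r=16)
print(f"~{params / 1e6:.1f}M LoRA params, ~{params * 2 / 2**20:.0f} MB at FP16")
```

Roughly 9M adapter parameters, i.e. a few tens of MB before merging — doubling the rank doubles this, which is why r=16-32 is the sweet spot for mobile.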
Target Modules
Always include these projection layers:
```python
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj"]
```
Smaller Models for Mobile: Use Gemma3-270m, Qwen3-0.6B, or LFM2-350M as base models. Larger models (>1B params) may not run smoothly on budget devices.
Merging Adapters with Base Models
Setup Cactus
```bash
git clone https://github.com/cactus-compute/cactus
cd cactus
source ./setup
```
Merge and Convert
The `cactus convert` command merges your LoRA adapter with the base model and converts the result to Cactus format:
```bash
# From local adapter
cactus convert Qwen/Qwen3-0.6B ./my-qwen3-0.6b --lora ./my-lora-adapter

# From HuggingFace Hub
cactus convert Qwen/Qwen3-0.6B ./my-qwen3-0.6b --lora username/my-lora-adapter

# With INT8 quantization (better quality)
cactus convert Qwen/Qwen3-0.6B ./my-qwen3-0.6b \
    --lora ./my-lora-adapter \
    --precision INT8

# With HuggingFace token (for gated models)
cactus convert google/gemma-3-1b-it ./my-gemma3 \
    --lora ./my-lora-adapter \
    --token hf_...
```
Quantization During Merge
| Precision | Memory | Quality | Best For |
|---|---|---|---|
| INT4 | Lowest (1x) | Good | Production, budget devices |
| INT8 | Medium (2x) | Better | Mid-range devices, quality-critical |
| FP16 | Highest (4x) | Best | Development, high-end only |
Recommendation: Start with INT8 for testing, switch to INT4 for production if quality is acceptable.
Cactus v1.15+ uses lossless quantization techniques, providing a 1.5x performance improvement while maintaining quality.
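The 1x/2x/4x memory ratios in the table follow directly from bits per weight. A quick sketch of weight-storage size (the 0.6B parameter count is illustrative; actual files also contain embeddings, metadata, and per-block scales):

```python
def model_bytes(n_params: float, bits: int) -> float:
    """Weight storage only; runtime RAM adds KV cache and activations."""
    return n_params * bits / 8

n = 600e6  # ~0.6B parameters, illustrative
int4 = model_bytes(n, 4)
for name, bits in [("INT4", 4), ("INT8", 8), ("FP16", 16)]:
    size = model_bytes(n, bits)
    print(f"{name}: {size / 1e6:.0f} MB ({size / int4:.0f}x)")
```

So for a ~0.6B model, INT4 weights land around 300 MB on disk versus roughly 1.2 GB at FP16 — the difference between fitting comfortably on a budget phone and not.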
Deployment to Mobile
iOS/macOS Deployment
1. Build Native Library
Output:
```
Build complete!
Static libraries:
  Device:    /path/to/cactus/apple/libcactus-device.a
  Simulator: /path/to/cactus/apple/libcactus-simulator.a
XCFrameworks:
  iOS:   /path/to/cactus/apple/cactus-ios.xcframework
  macOS: /path/to/cactus/apple/cactus-macos.xcframework
```
2. Add to Xcode Project
- Copy the `my-qwen3-0.6b/` folder to your Xcode project
- Link `cactus-ios.xcframework` under Frameworks, Libraries, and Embedded Content
- Set the framework to Embed & Sign
3. Use in Swift
```swift
import Foundation

class CactusModel {
    private var model: OpaquePointer?

    init(modelName: String) {
        let modelPath = Bundle.main.path(forResource: modelName, ofType: nil)!
        model = cactus_init(modelPath, nil, false)
    }

    func complete(messages: [[String: String]]) -> String {
        let jsonData = try! JSONSerialization.data(withJSONObject: messages)
        let messagesJson = String(data: jsonData, encoding: .utf8)!
        var response = [CChar](repeating: 0, count: 4096)
        cactus_complete(model, messagesJson, &response, response.count,
                        nil, nil, nil, nil)
        return String(cString: response)
    }

    deinit {
        if let model = model {
            cactus_destroy(model)
        }
    }
}

// Usage
let model = CactusModel(modelName: "my-qwen3-0.6b")
let result = model.complete(messages: [
    ["role": "user", "content": "Hello!"]
])
print(result)
```
Android Deployment
1. Build Native Library
Output:
```
Build complete!
Shared library: /path/to/cactus/android/libcactus.so
Static library: /path/to/cactus/android/libcactus.a
```
2. Add to Android Project
- Copy `libcactus.so` to `app/src/main/jniLibs/arm64-v8a/`
- Copy the `my-qwen3-0.6b/` folder to `app/src/main/assets/`
3. Use in Kotlin
```kotlin
import android.content.Context
import org.json.JSONArray
import java.io.File

class CactusWrapper {
    init {
        System.loadLibrary("cactus")
    }

    external fun init(modelPath: String, contextSize: Long, corpusDir: String?): Long
    external fun complete(model: Long, messagesJson: String, bufferSize: Int): String
    external fun destroy(model: Long)
}

class CactusModel(context: Context, modelName: String) {
    private val cactus = CactusWrapper()
    private val model: Long

    init {
        // Copy model from assets to cache (copyAssetFolder is a helper
        // you provide; models cannot be memory-mapped from APK assets)
        val modelDir = File(context.cacheDir, modelName)
        copyAssetFolder(context, modelName, modelDir.absolutePath)
        model = cactus.init(modelDir.absolutePath, 2048, null)
    }

    fun complete(messages: List<Map<String, String>>): String {
        val messagesJson = JSONArray(messages).toString()
        return cactus.complete(model, messagesJson, 4096)
    }

    fun close() {
        cactus.destroy(model)
    }
}

// Usage
val model = CactusModel(context, "my-qwen3-0.6b")
val result = model.complete(listOf(
    mapOf("role" to "user", "content" to "Hello!")
))
println(result)
model.close()
```
INT8 Qwen3-0.6B Fine-Tune
| Device | Decode TPS | RAM Usage |
|---|---|---|
| iPhone 17 Pro | 60-70 tok/s | ~200MB |
| iPhone 13 Mini | 25-35 tok/s | ~400MB |
| Galaxy S25 Ultra | 30-40 tok/s | ~500MB |
| Pixel 6a | 13-18 tok/s | ~450MB |
| Raspberry Pi 5 | 10-15 tok/s | ~350MB |
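To translate decode TPS into user-perceived latency, a quick back-of-envelope sketch (the 100-token reply length is illustrative, and prefill time is ignored):

```python
def reply_latency(n_tokens: int, decode_tps: float) -> float:
    """Seconds to stream a reply, ignoring prompt-prefill time."""
    return n_tokens / decode_tps

# A 100-token reply at the low end of each device's range above
for device, tps in [("iPhone 17 Pro", 60.0), ("Pixel 6a", 13.0)]:
    print(f"{device}: {reply_latency(100, tps):.1f} s")
```

Anything above ~10 tok/s streams faster than most people read, so even the slowest devices in the table remain usable for chat-style output.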
INT8 Gemma3-270m Task-Specific
| Device | Decode TPS | RAM Usage |
|---|---|---|
| iPhone 17 Pro | 150+ tok/s | ~120MB |
| iPhone 13 Mini | 80+ tok/s | ~200MB |
| Raspberry Pi 5 | 23 tok/s | ~200MB |
Testing Your Fine-Tune
Local Testing (Mac/Linux)
```bash
# Interactive playground
cactus run ./my-qwen3-0.6b

# Benchmark mode
cactus test --model ./my-qwen3-0.6b --benchmark
```
On-Device Testing
```bash
# Test on connected iPhone
cactus test --model ./my-qwen3-0.6b --ios

# Test on connected Android phone
cactus test --model ./my-qwen3-0.6b --android
```
Device must be connected via USB, unlocked, and trusted. For iOS, Xcode must be installed. For Android, USB debugging must be enabled.
Best Practices
Training
- Start small — Use Gemma3-270m or Qwen3-0.6B for mobile
- Low rank — Use r=16 or r=32 to minimize adapter size
- No dropout — Set `lora_dropout=0` for inference
- Validate quality — Test on holdout set before deployment
Deployment
- Test quantization — Compare INT4 vs INT8 quality on your task
- Measure on-device — Use `cactus test --ios`/`--android` for accurate benchmarks
- Monitor memory — Check RAM usage under different context lengths
- Thermal management — Long inference sessions may throttle on phones
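On the memory point: KV cache grows linearly with context length, so checking a couple of context sizes catches most surprises. A rough estimate, with assumed Qwen3-0.6B-like dimensions (verify against your model's actual config) and an FP16 cache:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # K and V each store ctx_len * n_kv_heads * head_dim values per layer
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem

# Assumed dims: 28 layers, 8 KV heads, head_dim 128 (illustrative)
for ctx in (512, 1024, 2048):
    print(f"ctx={ctx}: ~{kv_cache_bytes(28, 8, 128, ctx) / 2**20:.0f} MB")
```

At the full 2048-token context this hypothetical cache alone approaches the total RAM figures in the benchmark tables, which is why trimming the KV cache window is one of the first levers for constrained devices.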
Troubleshooting
Training Issues
Out of memory during training
```python
# Reduce batch size
per_device_train_batch_size=1
gradient_accumulation_steps=8

# Use gradient checkpointing
use_gradient_checkpointing="unsloth"
```
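Halving the per-device batch size while doubling accumulation steps keeps the effective batch size (and thus the training dynamics) unchanged while roughly halving activation memory:

```python
def effective_batch_size(per_device: int, accum_steps: int, n_gpus: int = 1) -> int:
    # Gradients are accumulated across accum_steps micro-batches per update
    return per_device * accum_steps * n_gpus

print(effective_batch_size(2, 4))  # original recipe above
print(effective_batch_size(1, 8))  # low-memory recipe, same effective batch
```

Both configurations update on 8 examples per optimizer step, so no learning-rate retuning is needed.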
Poor validation loss
- Increase training epochs
- Try higher learning rate (3e-4 to 5e-4)
- Add more training data
- Reduce rank if overfitting
Deployment Issues
Model too slow on device
- Use INT4 quantization
- Switch to smaller base model
- Reduce KV cache window (see Performance Tuning)
Quality degraded after quantization
- Use INT8 instead of INT4
- Verify training quality first
- Check adapter was properly merged
See Also