WhisperKit supports loading custom fine-tuned Whisper models, allowing you to deploy specialized models optimized for your specific use case, domain, or language.
Model Requirements
Custom models must be converted to CoreML format compatible with WhisperKit. Each model consists of three components:

- **Audio Encoder**: mel spectrogram to embeddings
- **Text Decoder**: embeddings to text tokens
- **Tokenizer**: text encoding/decoding
The whisperkittools Python package provides utilities to:

- Convert Hugging Face Whisper models to CoreML
- Fine-tune models on custom datasets
- Optimize models for specific Apple devices
- Deploy models to Hugging Face Hub
Installation
```bash
pip install git+https://github.com/argmaxinc/whisperkittools.git
```
Converting Models
From Hugging Face Hub
Convert any Whisper model from Hugging Face:
```python
from whisperkit.convert import convert_whisper_to_coreml

convert_whisper_to_coreml(
    model_name="openai/whisper-large-v3",
    output_dir="./models",
    compute_units="cpuAndNeuralEngine",
)
```
From Local Checkpoint
Convert a locally fine-tuned model:
```python
from whisperkit.convert import convert_whisper_to_coreml

convert_whisper_to_coreml(
    model_path="./my-finetuned-whisper",
    output_dir="./models",
    model_name="custom-whisper-medical",
    compute_units="cpuAndNeuralEngine",
)
```
Conversion Options
- `model_name` (string): Hugging Face model ID (e.g., `openai/whisper-large-v3`)
- `model_path` (string): path to a local model checkpoint
- `output_dir` (string): directory to save converted models
- `compute_units` (string, default `"cpuAndNeuralEngine"`): target compute units: `cpuOnly`, `cpuAndGPU`, `cpuAndNeuralEngine`, or `all`
- `quantize` (string): quantization mode: `linear`, `palettize`, or `none`
Fine-Tuning Models
Preparing Your Dataset
Your dataset should be in Hugging Face Datasets format, with paired audio and transcription columns:

```python
from datasets import Dataset, Audio

dataset = Dataset.from_dict({
    "audio": ["audio1.wav", "audio2.wav"],
    "text": ["Transcription one.", "Transcription two."],
}).cast_column("audio", Audio(sampling_rate=16000))
```
Training Example
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Load base model
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# Configure training
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    num_train_epochs=3,
    fp16=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=500,
    eval_steps=500,
    logging_steps=100,
)

# Train model
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor.feature_extractor,
)
trainer.train()
```
Deploying to Hugging Face
After converting your model to CoreML, upload to Hugging Face Hub:
```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="./models/custom-whisper-medical",
    repo_id="username/custom-whisper-medical",
    repo_type="model",
)
```
Loading Custom Models
From Hugging Face Hub
Once uploaded, load your custom model in WhisperKit:
```swift
import WhisperKit

let config = WhisperKitConfig(
    model: "custom-whisper-medical",
    modelRepo: "username/custom-whisper-medical"
)
let pipe = try await WhisperKit(config)
let result = try await pipe.transcribe(audioPath: "patient_recording.wav")
```
From Local Path
Load models from local filesystem:
```swift
import WhisperKit

let config = WhisperKitConfig(
    modelFolder: "/path/to/models/custom-whisper-medical"
)
let pipe = try await WhisperKit(config)
```
With Compute Options
```swift
import WhisperKit
import CoreML

let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine,
    prefillCompute: .cpuOnly
)
let config = WhisperKitConfig(
    model: "custom-whisper-medical",
    modelRepo: "username/custom-whisper-medical",
    computeOptions: computeOptions
)
let pipe = try await WhisperKit(config)
```
Model Repository Structure
Your Hugging Face repository should follow this structure:
```text
username/custom-whisper-medical/
├── AudioEncoder.mlmodelc/
│   └── (CoreML compiled model)
├── TextDecoder.mlmodelc/
│   └── (CoreML compiled model)
├── MelSpectrogram.mlmodelc/
│   └── (CoreML compiled model)
├── generation_config.json
├── config.json
├── tokenizer.json
├── merges.txt
├── vocab.json
└── README.md
```
Model Variants
WhisperKit supports glob patterns for model selection:
```swift
// Select any distil large-v3 variant
let config = WhisperKitConfig(
    model: "distil*large-v3",
    modelRepo: "argmaxinc/whisperkit-coreml"
)
```
Common prefixes:

- `openai_whisper-*`: original OpenAI models
- `distil-whisper-*`: distilled models (faster, slightly lower accuracy)
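To see how such a glob behaves, here is a small pure-Python sketch using the standard library's `fnmatch`; the variant names below are hypothetical examples, and this illustrates glob semantics only, not WhisperKit's internal matching code:

```python
from fnmatch import fnmatch

# Hypothetical variant names as they might appear in a model repo
variants = [
    "openai_whisper-large-v3",
    "distil-whisper_distil-large-v3",
    "distil-whisper_distil-large-v3_turbo",
    "openai_whisper-tiny",
]

# "*" matches any run of characters, so one pattern can cover
# several related variants
matches = [name for name in variants if fnmatch(name, "distil*large-v3*")]
print(matches)
```

Here the pattern selects both distil large-v3 variants while skipping the OpenAI originals.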
Optimization Techniques
Quantization
Reduce model size and improve inference speed:
```python
from whisperkit.convert import convert_whisper_to_coreml

convert_whisper_to_coreml(
    model_name="openai/whisper-large-v3",
    output_dir="./models",
    quantize="linear",  # or "palettize" for better compression
)
```
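For intuition about why quantization matters at this scale, here is a back-of-envelope size calculation; the ~1.55B parameter count for whisper-large-v3 is an approximation, and real CoreML packages carry overhead beyond raw weights:

```python
# Back-of-envelope weight storage under different precisions.
# The parameter count below is approximate for whisper-large-v3.
params = 1.55e9

def size_gb(bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

for label, bits in [("float16", 16), ("int8 linear", 8), ("4-bit palettized", 4)]:
    print(f"{label}: ~{size_gb(bits):.2f} GB")
```

Halving the bits per weight roughly halves the download and memory footprint, which is why quantization is recommended below for models over 200 MB.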
Model Pruning
Remove unnecessary weights during fine-tuning:
```python
import torch
from torch.nn.utils import prune
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Apply L1 unstructured pruning: zero out the 30% of weights with the
# smallest absolute magnitude in each linear layer
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
```
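To make the effect concrete, this dependency-free sketch mimics what `prune.l1_unstructured` does to a single weight tensor, using a plain Python list in place of a tensor:

```python
# Pure-Python illustration of L1 unstructured pruning:
# zero out the `amount` fraction of weights with the smallest |value|.
def l1_unstructured_prune(weights, amount):
    k = int(len(weights) * amount)  # number of weights to zero
    smallest = set(
        sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    )
    return [0.0 if i in smallest else w for i, w in enumerate(weights)]

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.03, 0.6, 0.1, -0.5]
pruned = l1_unstructured_prune(weights, amount=0.3)
print(pruned)  # the 3 smallest-magnitude weights become 0.0
```

Unstructured pruning keeps the tensor shape intact, so it shrinks the model only after a sparsity-aware compression step; it does not by itself speed up dense CoreML inference.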
Knowledge Distillation
Create smaller models from larger ones:
```python
from transformers import WhisperForConditionalGeneration
from whisperkit.distillation import distill_model

teacher_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
student_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

distill_model(
    teacher=teacher_model,
    student=student_model,
    train_dataset=train_dataset,
    temperature=2.0,
    alpha=0.5,
)
```
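Conceptually, the `temperature` and `alpha` parameters control a loss that blends a softened teacher/student divergence with the usual cross-entropy on ground-truth labels. A minimal pure-Python sketch of that standard distillation loss (illustrative only, not the library's actual implementation):

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T softens the distribution."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, target_idx, temperature, alpha):
    """alpha-weighted blend of soft-target KL (scaled by T^2) and hard-target CE."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    ce = -math.log(softmax(student_logits)[target_idx])
    return alpha * (temperature ** 2) * kl + (1 - alpha) * ce

loss = distillation_loss(
    student_logits=[2.0, 0.5, 0.1],
    teacher_logits=[3.0, 0.2, 0.0],
    target_idx=0,
    temperature=2.0,
    alpha=0.5,
)
print(f"{loss:.4f}")
```

With `alpha=0.5` the student learns equally from the teacher's softened distribution and from the ground-truth transcription.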
Testing Custom Models
CLI Testing
```bash
swift run whisperkit-cli transcribe \
    --model-path "models/custom-whisper-medical" \
    --audio-path "test_audio.wav" \
    --verbose
```
Programmatic Testing
```swift
import WhisperKit

Task {
    let config = WhisperKitConfig(
        modelFolder: "models/custom-whisper-medical",
        verbose: true
    )
    let pipe = try await WhisperKit(config)

    // Test multiple files
    let testFiles = ["test1.wav", "test2.wav", "test3.wav"]
    for file in testFiles {
        let result = try await pipe.transcribe(audioPath: file)
        print("\(file): \(result?.text ?? "Failed")")
    }
}
```
Benchmarking
Compare your custom model against baselines:
```swift
import Foundation
import WhisperKit

// Note: `getTotalAudioDuration` is assumed to be a helper you provide
// that sums the durations (in seconds) of the test files.
func benchmark(modelPath: String, testFiles: [String]) async throws {
    let config = WhisperKitConfig(modelFolder: modelPath)
    let pipe = try await WhisperKit(config)

    var totalTime: Double = 0
    var totalTokens: Int = 0
    for file in testFiles {
        let start = Date()
        let result = try await pipe.transcribe(audioPath: file)
        let elapsed = Date().timeIntervalSince(start)
        totalTime += elapsed
        totalTokens += result?.segments.reduce(0) { $0 + $1.tokens.count } ?? 0
    }

    print("Tokens per second: \(Double(totalTokens) / totalTime)")
    print("Real-time factor: \(totalTime / getTotalAudioDuration(testFiles))")
}
```
Best Practices
Model Selection
- Start with `openai/whisper-small` for fine-tuning (good balance of speed and accuracy)
- Use `large-v3` for highest accuracy, `tiny` for fastest inference
- Consider distil models for production (2-3x faster)
Fine-Tuning
- Use domain-specific data (medical, legal, technical)
- Include background noise similar to the deployment environment
- Balance the dataset across accents and speakers
- Fine-tune for 2-5 epochs to avoid overfitting
Optimization
- Apply quantization for models larger than 200 MB
- Target `cpuAndNeuralEngine` for macOS 14+ deployment
- Use `cpuAndGPU` for older macOS versions
- Test on target devices before deployment
Validation
- Measure Word Error Rate (WER) on a held-out test set
- Test edge cases (accents, noise, domain terms)
- Compare against baseline OpenAI models
- Profile memory usage and inference time
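As a concrete starting point for the WER measurement above, here is a minimal pure-Python implementation; production evaluations typically use a library such as `jiwer` instead:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the patient shows acute symptoms",
          "the patient showed acute symptom"))  # 2 substitutions / 5 words = 0.4
```

Run this over your held-out transcripts for both the custom model and a baseline to quantify the improvement from fine-tuning.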
Example: Medical Transcription Model
Complete workflow for creating a medical transcription model:
```python
# 1. Prepare dataset
from datasets import load_dataset

dataset = load_dataset("medical-transcriptions", split="train")
train_test = dataset.train_test_split(test_size=0.1)

# 2. Fine-tune model
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainer

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
# ... training code ...
model.save_pretrained("./whisper-medical")

# 3. Convert to CoreML
from whisperkit.convert import convert_whisper_to_coreml

convert_whisper_to_coreml(
    model_path="./whisper-medical",
    output_dir="./models/whisper-medical-coreml",
    compute_units="cpuAndNeuralEngine",
    quantize="linear",
)

# 4. Upload to Hub
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="./models/whisper-medical-coreml",
    repo_id="username/whisper-medical",
    repo_type="model",
)
```

```swift
// 5. Deploy in app
import WhisperKit

let config = WhisperKitConfig(
    model: "whisper-medical",
    modelRepo: "username/whisper-medical"
)
let pipe = try await WhisperKit(config)
let result = try await pipe.transcribe(audioPath: "patient_note.wav")
print(result?.text ?? "")
```
Resources
- **WhisperKit Tools**: Python toolkit for model conversion
- **Model Hub**: pre-converted WhisperKit models
- **Hugging Face Whisper**: browse available Whisper models
- **Fine-tuning Guide**: official Whisper fine-tuning guide
Next Steps
- **Performance Optimization**: optimize custom model performance
- **Memory Management**: manage memory for large models