WhisperKit supports loading custom fine-tuned Whisper models, allowing you to deploy specialized models optimized for your specific use case, domain, or language.

Model Requirements

Custom models must be converted to CoreML format compatible with WhisperKit. Each model consists of three components:

  • Audio Encoder: converts the mel spectrogram to embeddings
  • Text Decoder: converts embeddings to text tokens
  • Tokenizer: handles text encoding/decoding

WhisperKit Tools

The whisperkittools Python package provides utilities to:
  • Convert Hugging Face Whisper models to CoreML
  • Fine-tune models on custom datasets
  • Optimize models for specific Apple devices
  • Deploy models to Hugging Face Hub

Installation

pip install git+https://github.com/argmaxinc/whisperkittools.git

Converting Models

From Hugging Face Hub

Convert any Whisper model from Hugging Face:
from whisperkit.convert import convert_whisper_to_coreml

convert_whisper_to_coreml(
    model_name="openai/whisper-large-v3",
    output_dir="./models",
    compute_units="cpuAndNeuralEngine"
)

From Local Checkpoint

Convert a locally fine-tuned model:
from whisperkit.convert import convert_whisper_to_coreml

convert_whisper_to_coreml(
    model_path="./my-finetuned-whisper",
    output_dir="./models",
    model_name="custom-whisper-medical",
    compute_units="cpuAndNeuralEngine"
)

Conversion Options

  • model_name (string): Hugging Face model ID (e.g., openai/whisper-large-v3)
  • model_path (string): Path to a local model checkpoint
  • output_dir (string): Directory to save the converted models
  • compute_units (string, default "cpuAndNeuralEngine"): Target compute units: cpuOnly, cpuAndGPU, cpuAndNeuralEngine, or all
  • quantize (string): Quantization mode: linear, palettize, or none

Fine-Tuning Models

Preparing Your Dataset

Your dataset should be in Hugging Face Datasets format, with paired audio and transcription columns:
from datasets import Dataset, Audio

dataset = Dataset.from_dict({
    "audio": ["audio1.wav", "audio2.wav"],
    "text": ["Transcription one.", "Transcription two."]
}).cast_column("audio", Audio(sampling_rate=16000))
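Note that Whisper expects 16 kHz mono audio, and every example is padded or trimmed to a fixed 30-second window before mel extraction. A quick back-of-the-envelope check (the frame count assumes Whisper's standard 10 ms hop, i.e. 160 samples):

```python
# Whisper's fixed input window, assuming the standard 16 kHz / 10 ms hop setup
sampling_rate = 16_000      # Hz, matches Audio(sampling_rate=16000) above
window_seconds = 30         # Whisper pads/trims every example to 30 s
hop_length = 160            # samples between mel frames (10 ms)

samples_per_window = sampling_rate * window_seconds   # 480_000 samples
mel_frames = samples_per_window // hop_length         # 3_000 frames
print(samples_per_window, mel_frames)
```

Audio at other sample rates must be resampled first, which `cast_column("audio", Audio(sampling_rate=16000))` handles for you.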

Training Example

from transformers import WhisperForConditionalGeneration, WhisperProcessor
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Load base model
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# Configure training
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    num_train_epochs=3,
    fp16=True,
    evaluation_strategy="steps",  # renamed to eval_strategy in recent transformers releases
    save_strategy="steps",
    save_steps=500,
    eval_steps=500,
    logging_steps=100,
)

# Train model
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor.feature_extractor,
)

trainer.train()
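As a sanity check on the hyperparameters above: gradient accumulation multiplies the per-device batch size, so each optimizer step effectively sees a larger batch. A quick sketch (assuming a single device):

```python
# Effective batch size implied by the Seq2SeqTrainingArguments above
per_device_train_batch_size = 16
gradient_accumulation_steps = 2
num_devices = 1  # assumption: one GPU or one Mac

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)  # 32
```

If you hit out-of-memory errors, lower the per-device batch size and raise `gradient_accumulation_steps` to keep the effective batch size constant.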

Deploying to Hugging Face

After converting your model to CoreML, upload to Hugging Face Hub:
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(repo_id="username/custom-whisper-medical", exist_ok=True)  # no-op if it already exists
api.upload_folder(
    folder_path="./models/custom-whisper-medical",
    repo_id="username/custom-whisper-medical",
    repo_type="model",
)

Loading Custom Models

From Hugging Face Hub

Once uploaded, load your custom model in WhisperKit:
import WhisperKit

let config = WhisperKitConfig(
    model: "custom-whisper-medical",
    modelRepo: "username/custom-whisper-medical"
)

let pipe = try await WhisperKit(config)
let result = try await pipe.transcribe(audioPath: "patient_recording.wav")

From Local Path

Load models from local filesystem:
import WhisperKit

let config = WhisperKitConfig(
    modelFolder: "/path/to/models/custom-whisper-medical"
)

let pipe = try await WhisperKit(config)

With Compute Options

import WhisperKit
import CoreML

let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine,
    prefillCompute: .cpuOnly
)

let config = WhisperKitConfig(
    model: "custom-whisper-medical",
    modelRepo: "username/custom-whisper-medical",
    computeOptions: computeOptions
)

let pipe = try await WhisperKit(config)

Model Repository Structure

Your Hugging Face repository should follow this structure:
username/custom-whisper-medical/
├── AudioEncoder.mlmodelc/
│   └── (CoreML compiled model)
├── TextDecoder.mlmodelc/
│   └── (CoreML compiled model)
├── MelSpectrogram.mlmodelc/
│   └── (CoreML compiled model)
├── generation_config.json
├── config.json
├── tokenizer.json
├── merges.txt
├── vocab.json
└── README.md

Model Variants

WhisperKit supports glob patterns for model selection:
// Select any distil large-v3 variant
let config = WhisperKitConfig(
    model: "distil*large-v3",
    modelRepo: "argmaxinc/whisperkit-coreml"
)
Common prefixes:
  • openai_whisper-* - Original OpenAI models
  • distil-whisper-* - Distilled models (faster, slightly lower accuracy)
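To see how a pattern like distil*large-v3 selects a variant (illustrative only; WhisperKit's actual resolution logic lives in the Swift package), a shell-style glob behaves like Python's fnmatch. The variant names below are hypothetical examples in the style of public WhisperKit repos:

```python
from fnmatch import fnmatch

# Hypothetical variant names, in the style of argmaxinc/whisperkit-coreml
variants = [
    "openai_whisper-large-v3",
    "distil-whisper_distil-large-v3",
    "openai_whisper-tiny",
]

matches = [v for v in variants if fnmatch(v, "distil*large-v3")]
print(matches)  # ['distil-whisper_distil-large-v3']
```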

Optimization Techniques

Quantization

Reduce model size and improve inference speed:
from whisperkit.convert import convert_whisper_to_coreml

convert_whisper_to_coreml(
    model_name="openai/whisper-large-v3",
    output_dir="./models",
    quantize="linear",  # Linear quantization
    # or
    quantize="palettize"  # Palettization (better compression)
)
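For intuition about what linear quantization does (a simplified sketch, not the actual coremltools implementation): weights are mapped onto 8-bit integers using a per-tensor scale and offset, then dequantized at inference time, so the reconstruction error is bounded by half a quantization step:

```python
# Simplified 8-bit linear quantization of a weight tensor (pure-Python sketch)
weights = [-0.52, -0.1, 0.0, 0.31, 0.87]

w_min, w_max = min(weights), max(weights)
scale = (w_max - w_min) / 255          # map [w_min, w_max] onto [0, 255]
quantized = [round((w - w_min) / scale) for w in weights]
dequantized = [q * scale + w_min for q in quantized]

max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
assert max_error <= scale / 2 + 1e-9   # error bounded by half a quantization step
```

Palettization instead clusters weights into a small lookup table of shared values, which usually compresses better at the cost of a less uniform error profile.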

Model Pruning

Remove unnecessary weights during fine-tuning:
from transformers import WhisperForConditionalGeneration
import torch

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Apply structured pruning
from torch.nn.utils import prune

for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the mask into the weights before saving/export
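For intuition, l1_unstructured zeroes the fraction of weights with the smallest absolute values. A pure-Python sketch of the same idea on a toy weight list:

```python
# Zero out the 30% of weights with the smallest |value| (the l1_unstructured criterion)
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.1, 0.6, 0.08]
amount = 0.3

k = int(amount * len(weights))                      # number of weights to prune
threshold = sorted(abs(w) for w in weights)[k - 1]  # k-th smallest magnitude
pruned = [0.0 if abs(w) <= threshold else w for w in weights]

print(sum(1 for w in pruned if w == 0.0))  # 3
```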

Knowledge Distillation

Create smaller models from larger ones:
from whisperkit.distillation import distill_model

teacher_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
student_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

distill_model(
    teacher=teacher_model,
    student=student_model,
    train_dataset=train_dataset,
    temperature=2.0,
    alpha=0.5
)
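The temperature and alpha parameters above control the standard distillation recipe: soften both models' logits with temperature T, then mix the hard cross-entropy loss on ground-truth labels with a KL term against the teacher. A minimal single-token sketch of that arithmetic (hypothetical numbers, not the distill_model internals):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax: higher T flattens the distribution."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [4.0, 1.0, 0.2]
student_logits = [3.0, 1.5, 0.1]
T, alpha = 2.0, 0.5

p_teacher = softmax(teacher_logits, T)   # softened teacher distribution
p_student = softmax(student_logits, T)   # softened student distribution

# KL(teacher || student), scaled by T^2 as in the classic distillation recipe
kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student)) * T * T
hard_loss = -math.log(softmax(student_logits)[0])   # cross-entropy, true label = index 0

loss = alpha * hard_loss + (1 - alpha) * kl
print(loss)
```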

Testing Custom Models

CLI Testing

swift run whisperkit-cli transcribe \
  --model-path "models/custom-whisper-medical" \
  --audio-path "test_audio.wav" \
  --verbose

Programmatic Testing

import WhisperKit

Task {
    let config = WhisperKitConfig(
        modelFolder: "models/custom-whisper-medical",
        verbose: true
    )
    
    let pipe = try await WhisperKit(config)
    
    // Test multiple files
    let testFiles = [
        "test1.wav",
        "test2.wav",
        "test3.wav"
    ]
    
    for file in testFiles {
        let result = try await pipe.transcribe(audioPath: file)
        print("\(file): \(result?.text ?? "Failed")")
    }
}

Benchmarking

Compare your custom model against baselines:
import WhisperKit

func benchmark(modelPath: String, testFiles: [String]) async throws {
    let config = WhisperKitConfig(modelFolder: modelPath)
    let pipe = try await WhisperKit(config)
    
    var totalTime: Double = 0
    var totalTokens: Int = 0
    
    for file in testFiles {
        let start = Date()
        let result = try await pipe.transcribe(audioPath: file)
        let elapsed = Date().timeIntervalSince(start)
        
        totalTime += elapsed
        totalTokens += result?.segments.reduce(0) { $0 + $1.tokens.count } ?? 0
    }
    
    // getTotalAudioDuration is a helper you supply (e.g., sum each file's duration via AVFoundation)
    print("Tokens per second: \(Double(totalTokens) / totalTime)")
    print("Real-time factor: \(totalTime / getTotalAudioDuration(testFiles))")
}
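The two metrics printed above are the standard ones: tokens per second measures decoding throughput, and real-time factor is processing time divided by audio duration (values below 1.0 mean faster than real time). A language-neutral sketch of the arithmetic with illustrative numbers:

```python
# Hypothetical measurements for three test files
processing_seconds = [1.2, 0.8, 2.0]   # wall-clock transcription time per file
audio_seconds = [10.0, 6.0, 14.0]      # duration of each clip
tokens = [180, 95, 260]                # decoded tokens per file

total_time = sum(processing_seconds)                  # 4.0 s of compute
tokens_per_second = sum(tokens) / total_time          # 535 / 4.0 = 133.75
real_time_factor = total_time / sum(audio_seconds)    # 4.0 / 30.0 ≈ 0.133

print(tokens_per_second, real_time_factor)
```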

Best Practices

Model Selection

  • Start with openai/whisper-small for fine-tuning (good balance)
  • Use large-v3 for highest accuracy, tiny for fastest inference
  • Consider distil models for production (2-3x faster)

Fine-Tuning

  • Use domain-specific data (medical, legal, technical)
  • Include background noise similar to deployment environment
  • Balance dataset across accents and speakers
  • Fine-tune for 2-5 epochs to avoid overfitting

Optimization

  • Apply quantization for models > 200MB
  • Target cpuAndNeuralEngine for macOS 14+ deployment
  • Use cpuAndGPU for older macOS versions
  • Test on target devices before deployment

Validation

  • Measure Word Error Rate (WER) on held-out test set
  • Test edge cases (accents, noise, domain terms)
  • Compare against baseline OpenAI models
  • Profile memory usage and inference time
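WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal implementation to get started (assumes whitespace tokenization; production evaluation usually normalizes casing and punctuation first, e.g. with the jiwer package):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the patient was stable", "the patient is stable"))  # 0.25
```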

Example: Medical Transcription Model

Complete workflow for creating a medical transcription model:
# 1. Prepare dataset
from datasets import load_dataset

dataset = load_dataset("medical-transcriptions", split="train")
train_test = dataset.train_test_split(test_size=0.1)

# 2. Fine-tune model
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainer

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
# ... training code ...
model.save_pretrained("./whisper-medical")

# 3. Convert to CoreML
from whisperkit.convert import convert_whisper_to_coreml

convert_whisper_to_coreml(
    model_path="./whisper-medical",
    output_dir="./models/whisper-medical-coreml",
    compute_units="cpuAndNeuralEngine",
    quantize="linear"
)

# 4. Upload to Hub
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="./models/whisper-medical-coreml",
    repo_id="username/whisper-medical",
    repo_type="model"
)

Finally, load the model in your app (Swift):
// 5. Deploy in app
import WhisperKit

let config = WhisperKitConfig(
    model: "whisper-medical",
    modelRepo: "username/whisper-medical"
)

let pipe = try await WhisperKit(config)
let result = try await pipe.transcribe(audioPath: "patient_note.wav")
print(result?.text ?? "")

Resources

WhisperKit Tools

Python toolkit for model conversion

Model Hub

Pre-converted WhisperKit models

Hugging Face Whisper

Browse available Whisper models

Fine-tuning Guide

Official Whisper fine-tuning guide

Next Steps

Performance Optimization

Optimize custom model performance

Memory Management

Manage memory for large models
