WhisperKit supports loading custom fine-tuned Whisper models, allowing you to deploy specialized models optimized for your specific use case, domain, or language.

Model Requirements

Custom models must be converted to CoreML format compatible with WhisperKit. Each model consists of three components:

  • Audio Encoder: converts the mel spectrogram to embeddings
  • Text Decoder: converts embeddings to text tokens
  • Tokenizer: handles text encoding/decoding

WhisperKit Tools

The whisperkittools Python package provides utilities to:
  • Convert Hugging Face Whisper models to CoreML
  • Fine-tune models on custom datasets
  • Optimize models for specific Apple devices
  • Deploy models to Hugging Face Hub

Installation

pip install git+https://github.com/argmaxinc/whisperkittools.git

Converting Models

From Hugging Face Hub

Convert any Whisper model from Hugging Face:
from whisperkit.convert import convert_whisper_to_coreml

convert_whisper_to_coreml(
    model_name="openai/whisper-large-v3",
    output_dir="./models",
    compute_units="cpuAndNeuralEngine"
)

From Local Checkpoint

Convert a locally fine-tuned model:
from whisperkit.convert import convert_whisper_to_coreml

convert_whisper_to_coreml(
    model_path="./my-finetuned-whisper",
    output_dir="./models",
    model_name="custom-whisper-medical",
    compute_units="cpuAndNeuralEngine"
)

Conversion Options

  • model_name (string): Hugging Face model ID (e.g., openai/whisper-large-v3)
  • model_path (string): Path to a local model checkpoint
  • output_dir (string): Directory to save the converted models
  • compute_units (string, default "cpuAndNeuralEngine"): Target compute units: cpuOnly, cpuAndGPU, cpuAndNeuralEngine, or all
  • quantize (string): Quantization mode: linear, palettize, or none

Fine-Tuning Models

Preparing Your Dataset

Your dataset should be in Hugging Face Datasets format, with paired audio and transcription columns:
from datasets import Dataset, Audio

dataset = Dataset.from_dict({
    "audio": ["audio1.wav", "audio2.wav"],
    "text": ["Transcription one.", "Transcription two."]
}).cast_column("audio", Audio(sampling_rate=16000))
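Note that Whisper expects 16 kHz mono audio, and every example is padded or trimmed to a fixed 30-second window before mel extraction. A quick back-of-the-envelope check (the frame count assumes Whisper's standard 10 ms hop, i.e. 160 samples):

```python
# Whisper's fixed input window, assuming the standard 16 kHz / 10 ms hop setup
sampling_rate = 16_000      # Hz, matches Audio(sampling_rate=16000) above
window_seconds = 30         # Whisper pads/trims every example to 30 s
hop_length = 160            # samples between mel frames (10 ms)

samples_per_window = sampling_rate * window_seconds   # 480_000 samples
mel_frames = samples_per_window // hop_length         # 3_000 frames
print(samples_per_window, mel_frames)
```

Audio at other sample rates must be resampled first, which `cast_column("audio", Audio(sampling_rate=16000))` handles for you.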

Training Example

from transformers import WhisperForConditionalGeneration, WhisperProcessor
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Load base model
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# Configure training
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    num_train_epochs=3,
    fp16=True,
    evaluation_strategy="steps",  # renamed to eval_strategy in recent transformers releases
    save_strategy="steps",
    save_steps=500,
    eval_steps=500,
    logging_steps=100,
)

# Train model
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor.feature_extractor,
)

trainer.train()
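As a sanity check on the hyperparameters above: gradient accumulation multiplies the per-device batch size, so each optimizer step effectively sees a larger batch. A quick sketch (assuming a single device):

```python
# Effective batch size implied by the Seq2SeqTrainingArguments above
per_device_train_batch_size = 16
gradient_accumulation_steps = 2
num_devices = 1  # assumption: one GPU or one Mac

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)  # 32
```

If you hit out-of-memory errors, lower the per-device batch size and raise `gradient_accumulation_steps` to keep the effective batch size constant.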

Deploying to Hugging Face

After converting your model to CoreML, upload to Hugging Face Hub:
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(repo_id="username/custom-whisper-medical", exist_ok=True)  # no-op if it already exists
api.upload_folder(
    folder_path="./models/custom-whisper-medical",
    repo_id="username/custom-whisper-medical",
    repo_type="model",
)

Loading Custom Models

From Hugging Face Hub

Once uploaded, load your custom model in WhisperKit:
import WhisperKit

let config = WhisperKitConfig(
    model: "custom-whisper-medical",
    modelRepo: "username/custom-whisper-medical"
)

let pipe = try await WhisperKit(config)
let result = try await pipe.transcribe(audioPath: "patient_recording.wav")

From Local Path

Load models from local filesystem:
import WhisperKit

let config = WhisperKitConfig(
    modelFolder: "/path/to/models/custom-whisper-medical"
)

let pipe = try await WhisperKit(config)

With Compute Options

import WhisperKit
import CoreML

let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine,
    prefillCompute: .cpuOnly
)

let config = WhisperKitConfig(
    model: "custom-whisper-medical",
    modelRepo: "username/custom-whisper-medical",
    computeOptions: computeOptions
)

let pipe = try await WhisperKit(config)

Model Repository Structure

Your Hugging Face repository should follow this structure:
username/custom-whisper-medical/
├── AudioEncoder.mlmodelc/
│   └── (CoreML compiled model)
├── TextDecoder.mlmodelc/
│   └── (CoreML compiled model)
├── MelSpectrogram.mlmodelc/
│   └── (CoreML compiled model)
├── generation_config.json
├── config.json
├── tokenizer.json
├── merges.txt
├── vocab.json
└── README.md

Model Variants

WhisperKit supports glob patterns for model selection:
// Select any distil large-v3 variant
let config = WhisperKitConfig(
    model: "distil*large-v3",
    modelRepo: "argmaxinc/whisperkit-coreml"
)
Common prefixes:
  • openai_whisper-* - Original OpenAI models
  • distil-whisper-* - Distilled models (faster, slightly lower accuracy)
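To see how a pattern like distil*large-v3 selects a variant (illustrative only; WhisperKit's actual resolution logic lives in the Swift package), a shell-style glob behaves like Python's fnmatch. The variant names below are hypothetical examples in the style of public WhisperKit repos:

```python
from fnmatch import fnmatch

# Hypothetical variant names, in the style of argmaxinc/whisperkit-coreml
variants = [
    "openai_whisper-large-v3",
    "distil-whisper_distil-large-v3",
    "openai_whisper-tiny",
]

matches = [v for v in variants if fnmatch(v, "distil*large-v3")]
print(matches)  # ['distil-whisper_distil-large-v3']
```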

Optimization Techniques

Quantization

Reduce model size and improve inference speed:
from whisperkit.convert import convert_whisper_to_coreml

convert_whisper_to_coreml(
    model_name="openai/whisper-large-v3",
    output_dir="./models",
    quantize="linear",  # Linear quantization
    # or
    quantize="palettize"  # Palettization (better compression)
)
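For intuition about what linear quantization does (a simplified sketch, not the actual coremltools implementation): weights are mapped onto 8-bit integers using a per-tensor scale and offset, then dequantized at inference time, so the reconstruction error is bounded by half a quantization step:

```python
# Simplified 8-bit linear quantization of a weight tensor (pure-Python sketch)
weights = [-0.52, -0.1, 0.0, 0.31, 0.87]

w_min, w_max = min(weights), max(weights)
scale = (w_max - w_min) / 255          # map [w_min, w_max] onto [0, 255]
quantized = [round((w - w_min) / scale) for w in weights]
dequantized = [q * scale + w_min for q in quantized]

max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
assert max_error <= scale / 2 + 1e-9   # error bounded by half a quantization step
```

Palettization instead clusters weights into a small lookup table of shared values, which usually compresses better at the cost of a less uniform error profile.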

Model Pruning

Remove unnecessary weights during fine-tuning:
from transformers import WhisperForConditionalGeneration
import torch

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Apply structured pruning
from torch.nn.utils import prune

for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the mask into the weights before saving/export
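For intuition, l1_unstructured zeroes the fraction of weights with the smallest absolute values. A pure-Python sketch of the same idea on a toy weight list:

```python
# Zero out the 30% of weights with the smallest |value| (the l1_unstructured criterion)
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.1, 0.6, 0.08]
amount = 0.3

k = int(amount * len(weights))                      # number of weights to prune
threshold = sorted(abs(w) for w in weights)[k - 1]  # k-th smallest magnitude
pruned = [0.0 if abs(w) <= threshold else w for w in weights]

print(sum(1 for w in pruned if w == 0.0))  # 3
```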

Knowledge Distillation

Create smaller models from larger ones:
from whisperkit.distillation import distill_model

teacher_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
student_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

distill_model(
    teacher=teacher_model,
    student=student_model,
    train_dataset=train_dataset,
    temperature=2.0,
    alpha=0.5
)
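The temperature and alpha parameters above control the standard distillation recipe: soften both models' logits with temperature T, then mix the hard cross-entropy loss on ground-truth labels with a KL term against the teacher. A minimal single-token sketch of that arithmetic (hypothetical numbers, not the distill_model internals):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax: higher T flattens the distribution."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [4.0, 1.0, 0.2]
student_logits = [3.0, 1.5, 0.1]
T, alpha = 2.0, 0.5

p_teacher = softmax(teacher_logits, T)   # softened teacher distribution
p_student = softmax(student_logits, T)   # softened student distribution

# KL(teacher || student), scaled by T^2 as in the classic distillation recipe
kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student)) * T * T
hard_loss = -math.log(softmax(student_logits)[0])   # cross-entropy, true label = index 0

loss = alpha * hard_loss + (1 - alpha) * kl
print(loss)
```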

Testing Custom Models

CLI Testing

swift run whisperkit-cli transcribe \
  --model-path "models/custom-whisper-medical" \
  --audio-path "test_audio.wav" \
  --verbose

Programmatic Testing

import WhisperKit

Task {
    let config = WhisperKitConfig(
        modelFolder: "models/custom-whisper-medical",
        verbose: true
    )
    
    let pipe = try await WhisperKit(config)
    
    // Test multiple files
    let testFiles = [
        "test1.wav",
        "test2.wav",
        "test3.wav"
    ]
    
    for file in testFiles {
        let result = try await pipe.transcribe(audioPath: file)
        print("\(file): \(result?.text ?? "Failed")")
    }
}

Benchmarking

Compare your custom model against baselines:
import WhisperKit

func benchmark(modelPath: String, testFiles: [String]) async throws {
    let config = WhisperKitConfig(modelFolder: modelPath)
    let pipe = try await WhisperKit(config)
    
    var totalTime: Double = 0
    var totalTokens: Int = 0
    
    for file in testFiles {
        let start = Date()
        let result = try await pipe.transcribe(audioPath: file)
        let elapsed = Date().timeIntervalSince(start)
        
        totalTime += elapsed
        totalTokens += result?.segments.reduce(0) { $0 + $1.tokens.count } ?? 0
    }
    
    // getTotalAudioDuration is a helper you supply (e.g., sum each file's duration via AVFoundation)
    print("Tokens per second: \(Double(totalTokens) / totalTime)")
    print("Real-time factor: \(totalTime / getTotalAudioDuration(testFiles))")
}
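The two metrics printed above are the standard ones: tokens per second measures decoding throughput, and real-time factor is processing time divided by audio duration (values below 1.0 mean faster than real time). A language-neutral sketch of the arithmetic with illustrative numbers:

```python
# Hypothetical measurements for three test files
processing_seconds = [1.2, 0.8, 2.0]   # wall-clock transcription time per file
audio_seconds = [10.0, 6.0, 14.0]      # duration of each clip
tokens = [180, 95, 260]                # decoded tokens per file

total_time = sum(processing_seconds)                  # 4.0 s of compute
tokens_per_second = sum(tokens) / total_time          # 535 / 4.0 = 133.75
real_time_factor = total_time / sum(audio_seconds)    # 4.0 / 30.0 ≈ 0.133

print(tokens_per_second, real_time_factor)
```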

Best Practices

Model Selection

  • Start with openai/whisper-small for fine-tuning (good balance)
  • Use large-v3 for highest accuracy, tiny for fastest inference
  • Consider distil models for production (2-3x faster)

Fine-Tuning

  • Use domain-specific data (medical, legal, technical)
  • Include background noise similar to deployment environment
  • Balance dataset across accents and speakers
  • Fine-tune for 2-5 epochs to avoid overfitting

Optimization

  • Apply quantization for models > 200MB
  • Target cpuAndNeuralEngine for macOS 14+ deployment
  • Use cpuAndGPU for older macOS versions
  • Test on target devices before deployment

Validation

  • Measure Word Error Rate (WER) on held-out test set
  • Test edge cases (accents, noise, domain terms)
  • Compare against baseline OpenAI models
  • Profile memory usage and inference time
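WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal implementation to get started (assumes whitespace tokenization; production evaluation usually normalizes casing and punctuation first, e.g. with the jiwer package):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the patient was stable", "the patient is stable"))  # 0.25
```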

Example: Medical Transcription Model

Complete workflow for creating a medical transcription model:
# 1. Prepare dataset
from datasets import load_dataset

dataset = load_dataset("medical-transcriptions", split="train")
train_test = dataset.train_test_split(test_size=0.1)

# 2. Fine-tune model
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainer

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
# ... training code ...
model.save_pretrained("./whisper-medical")

# 3. Convert to CoreML
from whisperkit.convert import convert_whisper_to_coreml

convert_whisper_to_coreml(
    model_path="./whisper-medical",
    output_dir="./models/whisper-medical-coreml",
    compute_units="cpuAndNeuralEngine",
    quantize="linear"
)

# 4. Upload to Hub
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="./models/whisper-medical-coreml",
    repo_id="username/whisper-medical",
    repo_type="model"
)

Finally, load the model in your app (Swift):
// 5. Deploy in app
import WhisperKit

let config = WhisperKitConfig(
    model: "whisper-medical",
    modelRepo: "username/whisper-medical"
)

let pipe = try await WhisperKit(config)
let result = try await pipe.transcribe(audioPath: "patient_note.wav")
print(result?.text ?? "")

Resources

WhisperKit Tools

Python toolkit for model conversion

Model Hub

Pre-converted WhisperKit models

Hugging Face Whisper

Browse available Whisper models

Fine-tuning Guide

Official Whisper fine-tuning guide

Next Steps

Performance Optimization

Optimize custom model performance

Memory Management

Manage memory for large models
