Overview
WhisperKit includes a local server that implements the OpenAI Audio API, so you can use existing OpenAI SDK clients or generate new ones. The server supports transcription and translation, with optional streaming output.
Building the Server
First, build the CLI with server support:
# Clone the repository if you haven't already
git clone https://github.com/argmaxinc/whisperkit.git
cd whisperkit
# Build with server support
make build-local-server
# Or manually with the build flag
BUILD_ALL=1 swift build --product whisperkit-cli
Starting the Server
Default Configuration
# Start server on default port (50060)
BUILD_ALL=1 swift run whisperkit-cli serve
The server will:
Listen on localhost:50060
Use the default tiny model
Download the model if not already available
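Because the first start may spend time downloading the model, a client script may want to confirm that the server is accepting connections before sending audio. The following is a minimal sketch using only the Python standard library; the function name `server_is_listening` is hypothetical and not part of WhisperKit, and the defaults simply mirror the host and port listed above.

```python
import socket


def server_is_listening(host: str = "localhost", port: int = 50060,
                        timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only checks that something accepts connections
    on the port, not that the model has finished loading.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

Calling `server_is_listening()` in a short retry loop is one way to wait for the server before issuing the first request.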
Custom Configuration
# Custom host and port
BUILD_ALL=1 swift run whisperkit-cli serve \
  --host 0.0.0.0 \
  --port 8080
# With specific model
BUILD_ALL=1 swift run whisperkit-cli serve \
  --model base \
  --verbose
# See all options
BUILD_ALL=1 swift run whisperkit-cli serve --help
API Endpoints
The server exposes two main endpoints:
POST /v1/audio/transcriptions - Transcribe audio to text
POST /v1/audio/translations - Translate audio to English
Supported Parameters
| Parameter | Description | Default |
| --- | --- | --- |
| file | Audio file (wav, mp3, m4a, flac) | Required |
| model | Model identifier | Server default |
| language | Source language code (e.g., "en", "es") | Auto-detect |
| prompt | Text to guide transcription | None |
| response_format | Output format: json, verbose_json | verbose_json |
| temperature | Sampling temperature (0.0-1.0) | 0.0 |
| timestamp_granularities[] | Timing detail: word, segment | segment |
| stream | Enable streaming output | false |
| include[] | Include additional data: logprobs | None |
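To illustrate how these parameters fit together in a multipart form request, here is a hedged sketch that validates values against the table and returns form fields. The function `build_form_fields` is hypothetical, and joining list values with commas mirrors the cURL examples in this guide rather than any confirmed server requirement.

```python
ALLOWED_RESPONSE_FORMATS = {"json", "verbose_json"}
ALLOWED_GRANULARITIES = {"word", "segment"}
ALLOWED_INCLUDES = {"logprobs"}


def build_form_fields(model=None, language=None, prompt=None,
                      response_format="verbose_json", temperature=0.0,
                      timestamp_granularities=("segment",), stream=False,
                      include=()):
    """Return (name, value) pairs for the non-file form fields.

    Hypothetical helper; defaults follow the parameter table above.
    """
    if response_format not in ALLOWED_RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")
    if set(timestamp_granularities) - ALLOWED_GRANULARITIES:
        raise ValueError("timestamp_granularities must be word and/or segment")
    if set(include) - ALLOWED_INCLUDES:
        raise ValueError("include only accepts logprobs")

    fields = [
        ("response_format", response_format),
        ("temperature", str(temperature)),
        ("stream", "true" if stream else "false"),
    ]
    # Optional fields are sent only when set, so server defaults apply otherwise.
    for name, value in (("model", model), ("language", language), ("prompt", prompt)):
        if value is not None:
            fields.append((name, value))
    if timestamp_granularities:
        # Assumption: comma-joined value, as in this guide's cURL examples.
        fields.append(("timestamp_granularities[]", ",".join(timestamp_granularities)))
    if include:
        fields.append(("include[]", ",".join(include)))
    return fields
```

The resulting list can be passed as the `data` argument of `requests.post`, alongside `files={"file": audio_file}`.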
Python Client
Use the OpenAI Python SDK to connect to the local server:
Installation
cd Examples/ServeCLIClient/Python
uv sync # or: pip install openai
Quick Example
from openai import OpenAI
# Connect to local server
client = OpenAI(
    base_url="http://localhost:50060/v1",
    api_key="dummy-key"  # Required by the SDK; not used by local server
)
# Transcribe audio file
with open("audio.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        file=audio_file,
        model="tiny"  # Model parameter is required
    )
print(result.text)
Transcription with Options
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:50060/v1",
    api_key="dummy-key"  # Required by the SDK; not used by local server
)
with open("audio.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        file=audio_file,
        model="tiny",
        language="en",
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )
# Access detailed information
print(f"Language: {result.language}")
print(f"Duration: {result.duration}s")
print(f"Text: {result.text}")
# Word-level timestamps
for word in result.words:
    print(f"  {word.start:.2f}s - {word.end:.2f}s: {word.word}")
# Segment-level timestamps
for segment in result.segments:
    print(f"[{segment.start:.2f}s]: {segment.text}")
Translation
# Translate audio to English
with open("spanish_audio.wav", "rb") as audio_file:
    result = client.audio.translations.create(
        file=audio_file,
        model="tiny"
    )
print(f"Translation: {result.text}")
Streaming Transcription
import requests
import json
# Use requests library for streaming
url = "http://localhost:50060/v1/audio/transcriptions"
with open("audio.wav", "rb") as audio_file:
    files = {"file": audio_file}
    data = {
        "model": "tiny",
        "stream": "true",
        "response_format": "verbose_json"
    }
    response = requests.post(
        url,
        files=files,
        data=data,
        headers={"Accept": "text/event-stream"},
        stream=True
    )
    # Process Server-Sent Events
    for line in response.iter_lines():
        if line:
            line_str = line.decode('utf-8')
            if line_str.startswith('data: '):
                data_str = line_str[6:]  # Remove 'data: ' prefix
                try:
                    event = json.loads(data_str)
                    if event.get('type') == 'transcript.text.delta':
                        print(event['delta'], end='', flush=True)
                    elif event.get('type') == 'transcript.text.done':
                        print(f"\nFinal: {event['text']}")
                except json.JSONDecodeError:
                    pass
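The event handling in the streaming example can be factored into small functions that are easier to test without a running server. This is a sketch under the assumption that events arrive as `data:` lines carrying JSON with the `transcript.text.delta` and `transcript.text.done` types used above; the names `parse_sse_line` and `collect_transcript` are hypothetical.

```python
import json


def parse_sse_line(line: bytes):
    """Return the JSON object from an SSE 'data:' line, or None otherwise."""
    text = line.decode("utf-8").strip()
    if not text.startswith("data:"):
        return None  # Blank keep-alives, comments, or other SSE fields.
    payload = text[len("data:"):].strip()
    try:
        event = json.loads(payload)
    except json.JSONDecodeError:
        return None  # Non-JSON payloads are skipped, as in the example above.
    return event if isinstance(event, dict) else None


def collect_transcript(lines) -> str:
    """Join delta events; prefer the final text if a done event arrives."""
    parts = []
    for line in lines:
        event = parse_sse_line(line)
        if event is None:
            continue
        if event.get("type") == "transcript.text.delta":
            parts.append(event.get("delta", ""))
        elif event.get("type") == "transcript.text.done":
            return event.get("text", "".join(parts))
    return "".join(parts)
```

With the `requests` call shown earlier, `collect_transcript(response.iter_lines())` would return the full text once the stream ends.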
Command Line Usage
# Using the provided Python client
cd Examples/ServeCLIClient/Python
# Transcribe
python whisperkit_client.py transcribe \
--file audio.wav \
--language en
# Translate
python whisperkit_client.py translate \
--file audio.wav
# With streaming
python whisperkit_client.py transcribe \
--file audio.wav \
--stream
Swift Client
The Swift client is generated from the OpenAPI specification:
Installation
cd Examples/ServeCLIClient/Swift
swift build
Command Line Usage
# Transcribe
swift run whisperkit-client transcribe audio.wav --language en
# Translate
swift run whisperkit-client translate audio.wav
# With word-level timestamps
swift run whisperkit-client transcribe audio.wav \
--timestamp-granularities word,segment \
--response-format verbose_json
# Streaming
swift run whisperkit-client transcribe audio.wav --stream
Programmatic Usage
import Foundation
import WhisperKitSwiftClient
// Initialize client
let client = WhisperKitClient(
    serverURL: "http://localhost:50060/v1"
)
// Transcribe audio
try await client.transcribeAudio(
    filePath: "audio.wav",
    language: "en",
    model: "tiny",
    responseFormat: "verbose_json",
    timestampGranularities: "word,segment",
    stream: false
)
// Translate audio
try await client.translateAudio(
    filePath: "audio.wav",
    language: "es",
    model: "tiny",
    responseFormat: "verbose_json"
)
cURL Client
Use the provided shell scripts or raw cURL commands:
Using Shell Scripts
cd Examples/ServeCLIClient/Curl
chmod +x *.sh
# Transcribe
./transcribe.sh audio.wav --language en
# Translate
./translate.sh audio.wav --language es
# With all options
./transcribe.sh audio.wav \
--model base \
--language en \
--timestamp-granularities word,segment \
--stream true
# Run comprehensive test suite
./test.sh
Raw cURL Commands
Basic Transcription
curl -X POST http://localhost:50060/v1/audio/transcriptions \
-F file=@audio.wav \
-F model="tiny" \
-F response_format="verbose_json"
With Word Timestamps
curl -X POST http://localhost:50060/v1/audio/transcriptions \
-F file=@audio.wav \
-F model="tiny" \
-F response_format="verbose_json" \
-F "timestamp_granularities[]=word,segment"
Streaming Output
curl -N -X POST http://localhost:50060/v1/audio/transcriptions \
-F file=@audio.wav \
-F model="tiny" \
-F stream="true" \
-H "Accept: text/event-stream"
Translation
curl -X POST http://localhost:50060/v1/audio/translations \
-F file=@audio.wav \
-F model="tiny" \
-F response_format="verbose_json"
With Log Probabilities
curl -X POST http://localhost:50060/v1/audio/transcriptions \
-F file=@audio.wav \
-F model="tiny" \
-F response_format="json" \
-F "include[]=logprobs"
JavaScript/TypeScript Client
Installation
npm install openai
# or
yarn add openai
Usage
import OpenAI from 'openai';
import fs from 'fs';
const client = new OpenAI({
  baseURL: 'http://localhost:50060/v1',
  apiKey: 'dummy-key' // Not used by local server
});
// Transcribe
const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream('audio.wav'),
  model: 'tiny',
  language: 'en',
  response_format: 'verbose_json',
  timestamp_granularities: ['word', 'segment']
});
console.log(transcription.text);
// Access word timestamps
transcription.words?.forEach(word => {
  console.log(`${word.start}s: ${word.word}`);
});
// Translate
const translation = await client.audio.translations.create({
  file: fs.createReadStream('audio.wav'),
  model: 'tiny'
});
console.log(translation.text);
Generating Custom Clients
You can generate clients for any language using the OpenAPI specification:
Get the OpenAPI Spec
# Generate the latest spec
make generate-server
# The spec is located at:
# scripts/specs/localserver_openapi.yaml
Generate Clients
Python Client
npx @openapitools/openapi-generator-cli generate \
  -i scripts/specs/localserver_openapi.yaml \
  -g python \
  -o python-client
TypeScript Client
npx @openapitools/openapi-generator-cli generate \
-i scripts/specs/localserver_openapi.yaml \
-g typescript-fetch \
-o typescript-client
Go Client
openapi-generator-cli generate \
-i scripts/specs/localserver_openapi.yaml \
-g go \
-o go-client
API Limitations
Compared to the official OpenAI API:
Response formats: Only json and verbose_json are supported (no plain text, SRT, or VTT)
Model selection: The server must be launched with the desired model via the --model flag
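Since SRT output is not available from the server, subtitles can be produced on the client side from a verbose_json response. The sketch below assumes each segment is a dict with `start`, `end`, and `text` keys (the `end` field follows the OpenAI verbose_json shape and is not shown elsewhere in this guide); both function names are hypothetical.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Convert segment dicts (start, end, text) into SRT subtitle text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        # Each cue: sequence number, time range, then the caption text.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

When using the OpenAI Python SDK, segment objects expose attributes rather than keys, so they would need converting to dicts first or the lookups adapted accordingly.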
Fully Supported Features
The local server fully supports:
Log probabilities: include[]=logprobs parameter for token-level confidence
Streaming responses: Server-Sent Events (SSE) for real-time transcription
Timestamp granularities: Both word and segment level timing
Language detection: Automatic language detection or manual specification
Temperature control: Sampling temperature for transcription randomness
Prompt text: Text guidance for transcription style and context
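As an illustration of using log probabilities for token-level confidence, a log probability converts to a probability by exponentiation. The sketch below assumes the response carries a list of entries with `token` and `logprob` keys, modeled on the OpenAI API's format; the exact shape returned by the local server is not confirmed here, and both function names are hypothetical.

```python
import math


def token_confidences(logprobs):
    """Return (token, probability) pairs from logprob entries."""
    return [(entry["token"], math.exp(entry["logprob"])) for entry in logprobs]


def mean_confidence(logprobs):
    """Return the average token probability, or None for an empty list."""
    if not logprobs:
        return None
    return sum(math.exp(entry["logprob"]) for entry in logprobs) / len(logprobs)
```

A low mean value could be used to flag transcripts for manual review, though any threshold would need tuning on real data.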
Server Configuration
Environment Variables
# Set custom model cache directory
export WHISPERKIT_CACHE_DIR="/path/to/models"
# Enable debug logging
BUILD_ALL=1 swift run whisperkit-cli serve --verbose
Model Management
# Download a model before starting server
make download-model MODEL=base
# Start server with downloaded model
BUILD_ALL=1 swift run whisperkit-cli serve \
  --model-path "Models/whisperkit-coreml/openai_whisper-base"
Docker Deployment
Create a Dockerfile:
FROM swift:5.9
# Install dependencies
RUN apt-get update && apt-get install -y \
git \
git-lfs
# Clone and build WhisperKit
WORKDIR /app
RUN git clone https://github.com/argmaxinc/whisperkit.git
WORKDIR /app/whisperkit
RUN make setup
RUN make download-model MODEL=tiny
# Set the build flag for both the build step and the runtime command
ENV BUILD_ALL=1
RUN swift build --product whisperkit-cli -c release
EXPOSE 50060
CMD ["swift", "run", "-c", "release", "whisperkit-cli", "serve", "--host", "0.0.0.0"]
Build and run:
# Build image
docker build -t whisperkit-server .
# Run container
docker run -p 50060:50060 whisperkit-server
Next Steps
Basic Transcription: Learn the basics of file transcription
Real-Time Streaming: Transcribe audio in real-time