Overview
WhisperKit includes a local server that implements the OpenAI Audio API, allowing you to use existing OpenAI SDK clients or generate new ones. The server supports transcription and translation with output streaming capabilities.

For full-duplex real-time streaming, check out the WhisperKit Pro Local Server, which provides live audio streaming.
Building the Server
First, build the CLI with server support:

Starting the Server
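A typical invocation, sketched below, starts the server with its defaults; the executable name and `serve` subcommand are assumptions, so check the CLI's `--help` for the exact form:

```shell
# Start the local server on localhost:50060 with the default tiny model.
# NOTE: binary and subcommand names are illustrative assumptions.
./whisperkit-cli serve
```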
Default Configuration
- Listen on `localhost:50060`
- Use the default `tiny` model
- Download the model if not already available
Custom Configuration
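As a sketch, overriding the defaults might look like this; the `--model` flag selects the model at launch, while the host and port flag names here are assumptions to verify against the CLI's help output:

```shell
# Serve a different model on a custom interface and port.
# NOTE: --host/--port names are illustrative assumptions.
./whisperkit-cli serve --model base --host 0.0.0.0 --port 8080
```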
API Endpoints
The server exposes two main endpoints:

- POST `/v1/audio/transcriptions` - Transcribe audio to text
- POST `/v1/audio/translations` - Translate audio to English
Supported Parameters
| Parameter | Description | Default |
|---|---|---|
| `file` | Audio file (wav, mp3, m4a, flac) | Required |
| `model` | Model identifier | Server default |
| `language` | Source language code (e.g., `en`, `es`) | Auto-detect |
| `prompt` | Text to guide transcription | None |
| `response_format` | Output format: `json`, `verbose_json` | `verbose_json` |
| `temperature` | Sampling temperature (0.0-1.0) | 0.0 |
| `timestamp_granularities[]` | Timing detail: `word`, `segment` | `segment` |
| `stream` | Enable streaming output | `false` |
| `include[]` | Include additional data: `logprobs` | None |
Python Client
Use the OpenAI Python SDK to connect to the local server:

Installation
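The SDK installs from PyPI:

```shell
pip install openai
```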
Quick Example
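A minimal sketch, assuming the server is running on the default port; the local server ignores the API key, but the SDK requires a non-empty value:

```python
def transcribe(path: str, base_url: str = "http://localhost:50060/v1") -> str:
    """Send an audio file to the local server and return the transcript text."""
    from openai import OpenAI  # pip install openai

    # The key is a placeholder; the local server does not validate it.
    client = OpenAI(base_url=base_url, api_key="not-needed")
    with open(path, "rb") as f:
        result = client.audio.transcriptions.create(model="tiny", file=f)
    return result.text
```

For example, `transcribe("audio.wav")` returns the transcript of `audio.wav` as a string.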
Transcription with Options
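The options below mirror the supported-parameter table above; note that the Python SDK drops the `[]` suffix from the form-field name, so `timestamp_granularities[]` becomes `timestamp_granularities`. A sketch with placeholder values:

```python
# Options mirroring the supported-parameter table; values are placeholders.
TRANSCRIBE_OPTIONS = {
    "model": "tiny",
    "language": "en",                     # skip auto-detection
    "prompt": "Spell it as WhisperKit.",  # guide style and vocabulary
    "temperature": 0.0,                   # deterministic decoding
    "response_format": "verbose_json",    # needed for timestamps
    "timestamp_granularities": ["word"],  # word-level timing
}


def transcribe_with_options(path: str):
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:50060/v1", api_key="not-needed")
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(file=f, **TRANSCRIBE_OPTIONS)
```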
Translation
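Translation uses the same client against the `/v1/audio/translations` endpoint; a sketch:

```python
def translate_to_english(path: str) -> str:
    """Translate speech in the source language to English text."""
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:50060/v1", api_key="not-needed")
    with open(path, "rb") as f:
        result = client.audio.translations.create(model="tiny", file=f)
    return result.text
```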
Streaming Transcription
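Recent versions of the Python SDK accept `stream=True` and yield delta events as the server emits SSE chunks; the event shape assumed below (a `delta` attribute on text events) is an assumption to verify against your SDK version:

```python
def stream_transcription(path: str) -> str:
    """Print transcript text incrementally and return the full text."""
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:50060/v1", api_key="not-needed")
    parts = []
    with open(path, "rb") as f:
        events = client.audio.transcriptions.create(
            model="tiny", file=f, stream=True
        )
        for event in events:
            delta = getattr(event, "delta", None)  # text delta events only
            if delta:
                print(delta, end="", flush=True)
                parts.append(delta)
    return "".join(parts)
```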
Command Line Usage
Swift Client
The Swift client is generated from the OpenAPI specification:

Installation
Command Line Usage
Programmatic Usage
cURL Client
Use the provided shell scripts or raw cURL commands:

Using Shell Scripts
Raw cURL Commands
Basic Transcription
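A minimal multipart request; the file path is a placeholder:

```shell
curl http://localhost:50060/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=tiny
```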
With Word Timestamps
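Word-level timing requires `verbose_json` output:

```shell
curl http://localhost:50060/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=tiny \
  -F response_format=verbose_json \
  -F "timestamp_granularities[]=word"
```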
Streaming Output
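Passing `stream=true` switches the response to Server-Sent Events; curl's `-N` flag disables output buffering so chunks print as they arrive:

```shell
curl -N http://localhost:50060/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=tiny \
  -F stream=true
```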
Translation
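The translation endpoint takes the same form fields:

```shell
curl http://localhost:50060/v1/audio/translations \
  -F file=@audio.wav \
  -F model=tiny
```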
With Log Probabilities
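Token-level confidence comes back when `logprobs` is requested via `include[]`:

```shell
curl http://localhost:50060/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=tiny \
  -F "include[]=logprobs"
```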
JavaScript/TypeScript Client
Installation
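The official `openai` package on npm also works against the local server:

```shell
npm install openai
```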
Usage
Generating Custom Clients
You can generate clients for any language using the OpenAPI specification:

Get the OpenAPI Spec
Generate Clients
Python Client
TypeScript Client
Go Client
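One way to generate all three is OpenAPI Generator; the spec filename below is a placeholder for wherever you saved it:

```shell
# Generator targets (-g) are standard openapi-generator names.
openapi-generator-cli generate -i openapi.yaml -g python -o clients/python
openapi-generator-cli generate -i openapi.yaml -g typescript-fetch -o clients/typescript
openapi-generator-cli generate -i openapi.yaml -g go -o clients/go
```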
API Limitations
Compared to the official OpenAI API:

- Response formats: Only `json` and `verbose_json` are supported (no plain text, SRT, or VTT)
- Model selection: The server must be launched with the desired model via the `--model` flag
Fully Supported Features
The local server fully supports:

- Log probabilities: `include[]=logprobs` parameter for token-level confidence
- Streaming responses: Server-Sent Events (SSE) for real-time transcription
- Timestamp granularities: Both `word` and `segment` level timing
- Language detection: Automatic language detection or manual specification
- Temperature control: Sampling temperature for transcription randomness
- Prompt text: Text guidance for transcription style and context
Server Configuration
Environment Variables
Model Management
Docker Deployment
Create a `Dockerfile`:
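A minimal sketch, assuming a Linux-compatible build of the CLI; the base images, build product, and `serve` subcommand are all assumptions to adapt to your setup:

```dockerfile
# Illustrative only: adjust the base image and binary name to your build.
FROM swift:5.10 AS build
WORKDIR /app
COPY . .
RUN swift build -c release

FROM swift:5.10-slim
COPY --from=build /app/.build/release/whisperkit-cli /usr/local/bin/
EXPOSE 50060
CMD ["whisperkit-cli", "serve"]
```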
Next Steps
Basic Transcription
Learn the basics of file transcription
Real-Time Streaming
Transcribe audio in real-time