Skip to main content

POST /v1/audio/transcriptions

Transcribes an audio file into text. The request body must be sent as multipart/form-data with the audio file included as a form field. This endpoint maps to the createTranscription operation and is compatible with the OpenAI Whisper API.

Request headers

x-portkey-provider
string
The provider to route the request to (e.g. openai). Required when not using a config.
x-portkey-api-key
string
Your provider API key.
x-portkey-config
string
A JSON config object or config ID that defines routing, fallbacks, retries, and more.
x-portkey-virtual-key
string
A virtual key ID from Portkey Cloud.

Request body

This endpoint accepts multipart/form-data. The audio file must be uploaded as a file field named file.
file
file
required
The audio file to transcribe. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm. Maximum file size is 25 MB.
model
string
required
The speech-to-text model to use (e.g. whisper-1, gpt-4o-transcribe).
language
string
The language of the audio, in ISO-639-1 format (e.g. en, fr, de). Providing the language improves accuracy and latency.
prompt
string
Optional text to guide the model’s style or provide context. The prompt should match the language of the audio.
response_format
string
default:"json"
The format of the transcription output. One of json, text, srt, verbose_json, or vtt.
temperature
number
default:"0"
Sampling temperature between 0 and 1. Higher values produce more varied output. Set to 0 for deterministic transcription.
timestamp_granularities
string[]
The granularity of timestamps to include. Requires response_format to be verbose_json. Accepts word and/or segment.

Response

The response format depends on the response_format parameter. json (default)
text
string
The transcribed text.
verbose_json
text
string
The full transcribed text.
language
string
The detected language of the audio.
duration
number
The duration of the audio file in seconds.
segments
object[]
Segment-level transcription data, present when timestamp_granularities includes segment.
words
object[]
Word-level transcription data, present when timestamp_granularities includes word.
text, srt, vtt — Plain-text or subtitle format strings.

Code examples

curl http://localhost:8787/v1/audio/transcriptions \
  -H "x-portkey-provider: openai" \
  -H "x-portkey-api-key: $OPENAI_API_KEY" \
  -F "model=whisper-1" \
  -F "[email protected]" \
  -F "language=en" \
  -F "response_format=json"

Build docs developers (and LLMs) love