Parakeet MLX supports multiple output formats for different use cases, from plain text to detailed JSON with word-level timestamps.
- TXT: Plain text transcription without timestamps
- SRT: SubRip subtitle format with timestamps
- VTT: WebVTT subtitle format for web videos
- JSON: Structured data with full timing and confidence
Quick Reference
CLI
# Single format
parakeet-mlx audio.mp3 --output-format txt
parakeet-mlx audio.mp3 --output-format srt
parakeet-mlx audio.mp3 --output-format vtt
parakeet-mlx audio.mp3 --output-format json
# All formats at once
parakeet-mlx audio.mp3 --output-format all
Python API
from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_txt, to_srt, to_vtt, to_json
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")
# Format the result
text = to_txt(result)
srt = to_srt(result)
vtt = to_vtt(result)
json_str = to_json(result)
TXT Format

Plain text format containing just the transcribed text.
Example Output
Hello world. This is a test transcription. It contains multiple sentences.
CLI Usage
parakeet-mlx audio.mp3 --output-format txt
# Creates: audio.txt
Python Usage
from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_txt
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Convert to plain text
text = to_txt(result)
print(text)
# Output: "Hello world. This is a test transcription."

# Save to file
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write(text)
Implementation
From the source code (parakeet_mlx/cli.py:47-49):
def to_txt(result: AlignedResult) -> str:
    """Format transcription result as plain text."""
    return result.text.strip()
TXT format strips leading/trailing whitespace but preserves internal formatting.
SRT Format

SubRip (SRT) format for video subtitles with timestamps.
Example Output
Sentence-level
Word-level
1
00:00:00,000 --> 00:00:02,150
Hello world.
2
00:00:02,150 --> 00:00:05,320
This is a test transcription.
3
00:00:05,320 --> 00:00:08,100
It contains multiple sentences.
1
00:00:00,000 --> 00:00:00,420
<u>Hello</u> world.
2
00:00:00,420 --> 00:00:00,950
Hello <u>world</u>.
3
00:00:02,150 --> 00:00:02,450
<u>This</u> is a test transcription.
4
00:00:02,450 --> 00:00:02,680
This <u>is</u> a test transcription.
CLI Usage
# Sentence-level timestamps (default)
parakeet-mlx audio.mp3 --output-format srt
# Word-level timestamps
parakeet-mlx audio.mp3 --output-format srt --highlight-words
Python Usage
from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_srt
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Sentence-level SRT
srt_content = to_srt(result, highlight_words=False)

# Word-level SRT
srt_word_level = to_srt(result, highlight_words=True)

# Save to file
with open("subtitles.srt", "w", encoding="utf-8") as f:
    f.write(srt_content)
Each SRT entry has three parts:

- Index: sequential number starting from 1
- Timestamp: HH:MM:SS,mmm --> HH:MM:SS,mmm (comma as decimal separator)
- Text: the subtitle text; in word-level mode, the currently spoken word is wrapped in <u></u> tags
Implementation Details
From the source code (parakeet_mlx/cli.py:52-97):
def to_srt(result: AlignedResult, highlight_words: bool = False) -> str:
    """Format transcription result as an SRT file."""
    srt_content = []
    entry_index = 1
    if highlight_words:
        # Word-level: each word gets its own entry with highlighting
        for sentence in result.sentences:
            for i, token in enumerate(sentence.tokens):
                start_time = format_timestamp(token.start, decimal_marker=",")
                end_time = format_timestamp(
                    token.end
                    if token == sentence.tokens[-1]
                    else sentence.tokens[i + 1].start,
                    decimal_marker=",",
                )
                # Build text with the current word underlined
                text = ""
                for j, inner_token in enumerate(sentence.tokens):
                    if i == j:
                        text += inner_token.text.replace(
                            inner_token.text.strip(),
                            f"<u>{inner_token.text.strip()}</u>",
                        )
                    else:
                        text += inner_token.text
                srt_content.extend([
                    str(entry_index),
                    f"{start_time} --> {end_time}",
                    text.strip(),
                    "",
                ])
                entry_index += 1
    else:
        # Sentence-level: each sentence gets one entry
        for sentence in result.sentences:
            start_time = format_timestamp(sentence.start, decimal_marker=",")
            end_time = format_timestamp(sentence.end, decimal_marker=",")
            srt_content.extend([
                str(entry_index),
                f"{start_time} --> {end_time}",
                sentence.text.strip(),
                "",
            ])
            entry_index += 1
    return "\n".join(srt_content)
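The `to_srt` implementation above relies on a `format_timestamp` helper that lives elsewhere in parakeet_mlx/cli.py and is not shown here. A minimal sketch consistent with the timestamps in the example output (the library's actual implementation may differ):

```python
def format_timestamp(seconds: float, decimal_marker: str = ",") -> str:
    """Render seconds as HH:MM:SS<marker>mmm, e.g. 2.15 -> 00:00:02,150."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{millis:03d}"

print(format_timestamp(2.15))       # 00:00:02,150 (SRT style)
print(format_timestamp(2.15, "."))  # 00:00:02.150 (VTT style)
```

The `decimal_marker` parameter is what lets the same helper serve both SRT (comma) and VTT (period) output.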
VTT Format

WebVTT format for web-based video players.
Example Output
Sentence-level
Word-level
WEBVTT
00:00:00.000 --> 00:00:02.150
Hello world.
00:00:02.150 --> 00:00:05.320
This is a test transcription.
00:00:05.320 --> 00:00:08.100
It contains multiple sentences.
WEBVTT
00:00:00.000 --> 00:00:00.420
<b>Hello</b> world.
00:00:00.420 --> 00:00:00.950
Hello <b>world</b>.
00:00:02.150 --> 00:00:02.450
<b>This</b> is a test transcription.
00:00:02.450 --> 00:00:02.680
This <b>is</b> a test transcription.
CLI Usage
# Sentence-level timestamps (default)
parakeet-mlx audio.mp3 --output-format vtt
# Word-level timestamps
parakeet-mlx audio.mp3 --output-format vtt --highlight-words
Python Usage
from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_vtt
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Sentence-level VTT
vtt_content = to_vtt(result, highlight_words=False)

# Word-level VTT
vtt_word_level = to_vtt(result, highlight_words=True)

# Save to file
with open("subtitles.vtt", "w", encoding="utf-8") as f:
    f.write(vtt_content)
Differences from SRT
Feature          SRT              VTT
Header           None             WEBVTT
Decimal marker   Comma (,)        Period (.)
Word highlight   <u>word</u>      <b>word</b>
Use case         Desktop players  Web browsers
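Because the two formats differ only in these surface details, converting one to the other is mechanical. A rough, illustrative converter (the `srt_to_vtt` helper is hypothetical, not part of parakeet-mlx):

```python
import re

# Hypothetical helper: converts an SRT string to VTT by applying the
# differences listed above. Assumes well-formed SRT input whose cue
# text lines are never bare numbers.
def srt_to_vtt(srt: str) -> str:
    lines = ["WEBVTT", ""]                      # VTT requires a header
    for line in srt.splitlines():
        if re.fullmatch(r"\d+", line.strip()):  # drop SRT cue numbers
            continue
        if "-->" in line:
            line = line.replace(",", ".")       # period as decimal marker
        # swap the word-highlight tag convention
        line = line.replace("<u>", "<b>").replace("</u>", "</b>")
        lines.append(line)
    return "\n".join(lines)

sample = "1\n00:00:00,000 --> 00:00:02,150\n<u>Hello</u> world.\n"
print(srt_to_vtt(sample))
```

In practice, generating each format directly with `to_srt`/`to_vtt` is simpler; the sketch is only meant to make the table concrete.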
Implementation
From the source code (parakeet_mlx/cli.py:100-140):
def to_vtt(result: AlignedResult, highlight_words: bool = False) -> str:
    """Format transcription result as a VTT file."""
    vtt_content = ["WEBVTT", ""]  # VTT header
    if highlight_words:
        # Word-level with bold highlighting
        for sentence in result.sentences:
            for i, token in enumerate(sentence.tokens):
                start_time = format_timestamp(token.start, decimal_marker=".")
                end_time = format_timestamp(
                    token.end
                    if token == sentence.tokens[-1]
                    else sentence.tokens[i + 1].start,
                    decimal_marker=".",
                )
                text_line = ""
                for j, inner_token in enumerate(sentence.tokens):
                    if i == j:
                        text_line += inner_token.text.replace(
                            inner_token.text.strip(),
                            f"<b>{inner_token.text.strip()}</b>",
                        )
                    else:
                        text_line += inner_token.text
                vtt_content.extend([
                    f"{start_time} --> {end_time}",
                    text_line.strip(),
                    "",
                ])
    else:
        # Sentence-level
        for sentence in result.sentences:
            start_time = format_timestamp(sentence.start, decimal_marker=".")
            end_time = format_timestamp(sentence.end, decimal_marker=".")
            vtt_content.extend([
                f"{start_time} --> {end_time}",
                sentence.text.strip(),
                "",
            ])
    return "\n".join(vtt_content)
JSON Format

Structured JSON format with complete timing and confidence information.
Example Output
{
  "text": "Hello world. This is a test.",
  "sentences": [
    {
      "text": "Hello world.",
      "start": 0.0,
      "end": 1.95,
      "duration": 1.95,
      "confidence": 0.943,
      "tokens": [
        {
          "text": "Hello",
          "start": 0.0,
          "end": 0.42,
          "duration": 0.42,
          "confidence": 0.956
        },
        {
          "text": " world",
          "start": 0.42,
          "end": 0.95,
          "duration": 0.53,
          "confidence": 0.931
        },
        {
          "text": ".",
          "start": 0.95,
          "end": 1.95,
          "duration": 1.0,
          "confidence": 0.942
        }
      ]
    },
    {
      "text": "This is a test.",
      "start": 1.95,
      "end": 4.8,
      "duration": 2.85,
      "confidence": 0.962,
      "tokens": [
        {
          "text": " This",
          "start": 1.95,
          "end": 2.31,
          "duration": 0.36,
          "confidence": 0.978
        },
        {
          "text": " is",
          "start": 2.31,
          "end": 2.58,
          "duration": 0.27,
          "confidence": 0.965
        },
        {
          "text": " a",
          "start": 2.58,
          "end": 2.73,
          "duration": 0.15,
          "confidence": 0.941
        },
        {
          "text": " test",
          "start": 2.73,
          "end": 3.42,
          "duration": 0.69,
          "confidence": 0.953
        },
        {
          "text": ".",
          "start": 3.42,
          "end": 4.8,
          "duration": 1.38,
          "confidence": 0.974
        }
      ]
    }
  ]
}
CLI Usage
parakeet-mlx audio.mp3 --output-format json
# Creates: audio.json
Python Usage
import json
from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_json
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Convert to JSON string
json_str = to_json(result)

# Parse as dictionary
data = json.loads(json_str)
print(data["text"])
print(f"Number of sentences: {len(data['sentences'])}")

# Access specific fields
for sentence in data["sentences"]:
    print(f"Sentence: {sentence['text']}")
    print(f"Duration: {sentence['duration']:.2f}s")
    print(f"Confidence: {sentence['confidence']:.2%}")
    print(f"Words: {len(sentence['tokens'])}")

# Save to file
with open("transcript.json", "w", encoding="utf-8") as f:
    f.write(json_str)
Schema

The top level contains:

- text: Full transcription text
- sentences: Array of sentence objects

Each sentence object contains:

- text: Sentence text
- start: Start time in seconds (rounded to 3 decimals)
- end: End time in seconds (rounded to 3 decimals)
- duration: Duration in seconds (rounded to 3 decimals)
- confidence: Confidence score 0-1 (rounded to 3 decimals)
- tokens: Array of word/token objects

Each token object contains:

- text: Token text (may include leading/trailing whitespace)
- start: Start time in seconds (rounded to 3 decimals)
- end: End time in seconds (rounded to 3 decimals)
- duration: Duration in seconds (rounded to 3 decimals)
- confidence: Confidence score 0-1 (rounded to 3 decimals)
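Because the output is plain JSON, downstream analysis needs nothing beyond the standard library. A sketch that computes speaking rate and average sentence confidence (the inline sample matches the schema above; all values are illustrative):

```python
import json

# A minimal result in the schema described above (illustrative values).
data = json.loads("""
{
  "text": "Hello world.",
  "sentences": [
    {"text": "Hello world.", "start": 0.0, "end": 1.95,
     "duration": 1.95, "confidence": 0.943,
     "tokens": [
       {"text": "Hello",  "start": 0.0,  "end": 0.42, "duration": 0.42, "confidence": 0.956},
       {"text": " world", "start": 0.42, "end": 0.95, "duration": 0.53, "confidence": 0.931},
       {"text": ".",      "start": 0.95, "end": 1.95, "duration": 1.0,  "confidence": 0.942}
     ]}
  ]
}
""")

sentences = data["sentences"]
total_seconds = sentences[-1]["end"] - sentences[0]["start"]
# Punctuation-only tokens (like the trailing ".") are not words.
word_tokens = [t for s in sentences for t in s["tokens"] if t["text"].strip(" .,!?")]
avg_confidence = sum(s["confidence"] for s in sentences) / len(sentences)

print(f"Words per minute: {len(word_tokens) / total_seconds * 60:.1f}")
print(f"Average sentence confidence: {avg_confidence:.1%}")
```

The same loop works unchanged on a file produced by `--output-format json`; just replace the inline string with `open("audio.json").read()`.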
Implementation
From the source code (parakeet_mlx/cli.py:143-171):
def to_json(result: AlignedResult) -> str:
    output_dict = {
        "text": result.text,
        "sentences": [
            _aligned_sentence_to_dict(sentence)
            for sentence in result.sentences
        ],
    }
    return json.dumps(output_dict, indent=2, ensure_ascii=False)

def _aligned_sentence_to_dict(sentence: AlignedSentence) -> Dict[str, Any]:
    return {
        "text": sentence.text,
        "start": round(sentence.start, 3),
        "end": round(sentence.end, 3),
        "duration": round(sentence.duration, 3),
        "confidence": round(sentence.confidence, 3),
        "tokens": [_aligned_token_to_dict(token) for token in sentence.tokens],
    }

def _aligned_token_to_dict(token: AlignedToken) -> Dict[str, Any]:
    return {
        "text": token.text,
        "start": round(token.start, 3),
        "end": round(token.end, 3),
        "duration": round(token.duration, 3),
        "confidence": round(token.confidence, 3),
    }
All Formats at Once

CLI
parakeet-mlx audio.mp3 --output-format all
# Creates:
# - audio.txt
# - audio.srt
# - audio.vtt
# - audio.json
Python
from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_txt, to_srt, to_vtt, to_json
from pathlib import Path
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Generate all formats
formats = {
    "txt": to_txt(result),
    "srt": to_srt(result),
    "vtt": to_vtt(result),
    "json": to_json(result),
}

# Save all formats
output_dir = Path("transcripts")
output_dir.mkdir(exist_ok=True)

for ext, content in formats.items():
    output_path = output_dir / f"audio.{ext}"
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(content)
    print(f"Saved: {output_path}")
Custom Formats

You can also work directly with the result objects:
from parakeet_mlx import from_pretrained
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")
# Custom CSV format (use the csv module instead if text may contain quotes or commas)
with open("transcript.csv", "w", encoding="utf-8") as f:
    f.write("start,end,text,confidence\n")
    for sentence in result.sentences:
        f.write(
            f"{sentence.start:.3f},"
            f"{sentence.end:.3f},"
            f'"{sentence.text}",'
            f"{sentence.confidence:.3f}\n"
        )
# Custom Markdown format
with open("transcript.md", "w", encoding="utf-8") as f:
    f.write("# Transcription\n\n")
    for sentence in result.sentences:
        f.write(
            f"**[{sentence.start:.1f}s - {sentence.end:.1f}s]** "
            f"{sentence.text}\n\n"
        )
# Custom HTML format (apply html.escape to sentence text if it may contain markup)
with open("transcript.html", "w", encoding="utf-8") as f:
    f.write("<!DOCTYPE html><html><body>\n")
    f.write("<h1>Transcription</h1>\n")
    for sentence in result.sentences:
        f.write(
            f'<p data-start="{sentence.start}" data-end="{sentence.end}">'
            f'{sentence.text}</p>\n'
        )
    f.write("</body></html>\n")
Format Comparison

Format  Use Case                    Timestamps           Confidence  File Size
TXT     Plain text transcripts      ❌                   ❌          Smallest
SRT     Video subtitles (desktop)   ✅ Sentence or word  ❌          Small
VTT     Video subtitles (web)       ✅ Sentence or word  ❌          Small
JSON    Programmatic access         ✅ Full detail       ✅          Largest
Next Steps
- Python API: Learn how to use the formatting functions
- CLI Usage: Command-line output options
- Chunking: Process long audio files
- Streaming: Real-time transcription