Parakeet MLX supports multiple output formats for different use cases, from plain text to detailed JSON with word-level timestamps.

Supported Formats

TXT

Plain text transcription without timestamps

SRT

SubRip subtitle format with timestamps

VTT

WebVTT subtitle format for web videos

JSON

Structured data with full timing and confidence

Quick Reference

CLI

# Single format
parakeet-mlx audio.mp3 --output-format txt
parakeet-mlx audio.mp3 --output-format srt
parakeet-mlx audio.mp3 --output-format vtt
parakeet-mlx audio.mp3 --output-format json

# All formats at once
parakeet-mlx audio.mp3 --output-format all

Python API

from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_txt, to_srt, to_vtt, to_json

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Format the result
text = to_txt(result)
srt = to_srt(result)
vtt = to_vtt(result)
json_str = to_json(result)

TXT Format

Plain text format containing just the transcribed text.

Example Output

Hello world. This is a test transcription. It contains multiple sentences.

CLI Usage

parakeet-mlx audio.mp3 --output-format txt
# Creates: audio.txt

Python Usage

from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_txt

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Convert to plain text
text = to_txt(result)
print(text)
# Output: "Hello world. This is a test transcription."

# Save to file
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write(text)

Implementation

From the source code (parakeet_mlx/cli.py:47-49):
def to_txt(result: AlignedResult) -> str:
    """Format transcription result as plain text."""
    return result.text.strip()
TXT format strips leading/trailing whitespace but preserves internal formatting.

SRT Format

SubRip (SRT) format for video subtitles with timestamps.

Example Output

1
00:00:00,000 --> 00:00:02,150
Hello world.

2
00:00:02,150 --> 00:00:05,320
This is a test transcription.

3
00:00:05,320 --> 00:00:08,100
It contains multiple sentences.

CLI Usage

# Sentence-level timestamps (default)
parakeet-mlx audio.mp3 --output-format srt

# Word-level timestamps
parakeet-mlx audio.mp3 --output-format srt --highlight-words

Python Usage

from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_srt

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Sentence-level SRT
srt_content = to_srt(result, highlight_words=False)

# Word-level SRT
srt_word_level = to_srt(result, highlight_words=True)

# Save to file
with open("subtitles.srt", "w", encoding="utf-8") as f:
    f.write(srt_content)

Format Specification

Entry number (integer): Sequential number, starting from 1
Timestamp (string): HH:MM:SS,mmm --> HH:MM:SS,mmm (comma as the decimal separator)
Text (string): Subtitle text. For word-level output, the currently spoken word is wrapped in <u></u> tags.
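The HH:MM:SS,mmm timestamps are produced from float seconds. A minimal sketch of such a formatter, assuming the same call shape as the library's format_timestamp helper (seconds plus a decimal_marker argument); the actual implementation in parakeet_mlx may differ in detail:

```python
def fmt_timestamp(seconds: float, decimal_marker: str = ",") -> str:
    """Render seconds as HH:MM:SS<marker>mmm (SRT uses ",", VTT uses ".")."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{millis:03d}"

print(fmt_timestamp(2.15))        # 00:00:02,150
print(fmt_timestamp(125.5, "."))  # 00:02:05.500
```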

Implementation Details

From the source code (parakeet_mlx/cli.py:52-97):
def to_srt(result: AlignedResult, highlight_words: bool = False) -> str:
    """Format transcription result as an SRT file."""
    srt_content = []
    entry_index = 1
    
    if highlight_words:
        # Word-level: Each word gets its own entry with highlighting
        for sentence in result.sentences:
            for i, token in enumerate(sentence.tokens):
                start_time = format_timestamp(token.start, decimal_marker=",")
                end_time = format_timestamp(
                    token.end if token == sentence.tokens[-1]
                    else sentence.tokens[i + 1].start,
                    decimal_marker=",",
                )
                
                # Build text with current word underlined
                text = ""
                for j, inner_token in enumerate(sentence.tokens):
                    if i == j:
                        text += inner_token.text.replace(
                            inner_token.text.strip(),
                            f"<u>{inner_token.text.strip()}</u>",
                        )
                    else:
                        text += inner_token.text
                
                srt_content.extend([
                    str(entry_index),
                    f"{start_time} --> {end_time}",
                    text.strip(),
                    "",
                ])
                entry_index += 1
    else:
        # Sentence-level: Each sentence gets one entry
        for sentence in result.sentences:
            start_time = format_timestamp(sentence.start, decimal_marker=",")
            end_time = format_timestamp(sentence.end, decimal_marker=",")
            
            srt_content.extend([
                str(entry_index),
                f"{start_time} --> {end_time}",
                sentence.text.strip(),
                "",
            ])
            entry_index += 1
    
    return "\n".join(srt_content)

VTT Format

WebVTT format for web-based video players.

Example Output

WEBVTT

00:00:00.000 --> 00:00:02.150
Hello world.

00:00:02.150 --> 00:00:05.320
This is a test transcription.

00:00:05.320 --> 00:00:08.100
It contains multiple sentences.

CLI Usage

# Sentence-level timestamps (default)
parakeet-mlx audio.mp3 --output-format vtt

# Word-level timestamps
parakeet-mlx audio.mp3 --output-format vtt --highlight-words

Python Usage

from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_vtt

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Sentence-level VTT
vtt_content = to_vtt(result, highlight_words=False)

# Word-level VTT  
vtt_word_level = to_vtt(result, highlight_words=True)

# Save to file
with open("subtitles.vtt", "w", encoding="utf-8") as f:
    f.write(vtt_content)

Differences from SRT

Feature          SRT              VTT
Header           None             WEBVTT
Decimal marker   Comma (,)        Period (.)
Word highlight   <u>word</u>      <b>word</b>
Use case         Desktop players  Web browsers
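Because these differences are mechanical for sentence-level files, an SRT file can be converted to WebVTT with simple string operations. A rough sketch (word-level files would also need their <u> tags swapped for <b>, and any comma-decimal timestamp appearing inside subtitle text would be rewritten too):

```python
import re

def srt_to_vtt(srt: str) -> str:
    """Convert sentence-level SRT text to WebVTT.

    Swaps the comma decimal marker for a period on timestamp patterns and
    prepends the WEBVTT header. Numeric cue identifiers are legal in VTT,
    so the SRT entry numbers can stay in place.
    """
    vtt = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt)
    return "WEBVTT\n\n" + vtt

srt = "1\n00:00:00,000 --> 00:00:02,150\nHello world.\n"
print(srt_to_vtt(srt))
```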

Implementation

From the source code (parakeet_mlx/cli.py:100-140):
def to_vtt(result: AlignedResult, highlight_words: bool = False) -> str:
    """Format transcription result as a VTT file."""
    vtt_content = ["WEBVTT", ""]  # VTT header
    
    if highlight_words:
        # Word-level with bold highlighting
        for sentence in result.sentences:
            for i, token in enumerate(sentence.tokens):
                start_time = format_timestamp(token.start, decimal_marker=".")
                end_time = format_timestamp(
                    token.end if token == sentence.tokens[-1]
                    else sentence.tokens[i + 1].start,
                    decimal_marker=".",
                )
                
                text_line = ""
                for j, inner_token in enumerate(sentence.tokens):
                    if i == j:
                        text_line += inner_token.text.replace(
                            inner_token.text.strip(),
                            f"<b>{inner_token.text.strip()}</b>",
                        )
                    else:
                        text_line += inner_token.text
                
                vtt_content.extend([
                    f"{start_time} --> {end_time}",
                    text_line.strip(),
                    "",
                ])
    else:
        # Sentence-level
        for sentence in result.sentences:
            start_time = format_timestamp(sentence.start, decimal_marker=".")
            end_time = format_timestamp(sentence.end, decimal_marker=".")
            
            vtt_content.extend([
                f"{start_time} --> {end_time}",
                sentence.text.strip(),
                "",
            ])
    
    return "\n".join(vtt_content)

JSON Format

Structured JSON format with complete timing and confidence information.

Example Output

{
  "text": "Hello world. This is a test.",
  "sentences": [
    {
      "text": "Hello world.",
      "start": 0.0,
      "end": 1.95,
      "duration": 1.95,
      "confidence": 0.943,
      "tokens": [
        {
          "text": "Hello",
          "start": 0.0,
          "end": 0.42,
          "duration": 0.42,
          "confidence": 0.956
        },
        {
          "text": " world",
          "start": 0.42,
          "end": 0.95,
          "duration": 0.53,
          "confidence": 0.931
        },
        {
          "text": ".",
          "start": 0.95,
          "end": 1.95,
          "duration": 1.0,
          "confidence": 0.942
        }
      ]
    },
    {
      "text": "This is a test.",
      "start": 1.95,
      "end": 4.8,
      "duration": 2.85,
      "confidence": 0.962,
      "tokens": [
        {
          "text": " This",
          "start": 1.95,
          "end": 2.31,
          "duration": 0.36,
          "confidence": 0.978
        },
        {
          "text": " is",
          "start": 2.31,
          "end": 2.58,
          "duration": 0.27,
          "confidence": 0.965
        },
        {
          "text": " a",
          "start": 2.58,
          "end": 2.73,
          "duration": 0.15,
          "confidence": 0.941
        },
        {
          "text": " test",
          "start": 2.73,
          "end": 3.42,
          "duration": 0.69,
          "confidence": 0.953
        },
        {
          "text": ".",
          "start": 3.42,
          "end": 4.8,
          "duration": 1.38,
          "confidence": 0.974
        }
      ]
    }
  ]
}

CLI Usage

parakeet-mlx audio.mp3 --output-format json
# Creates: audio.json

Python Usage

import json
from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_json

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Convert to JSON string
json_str = to_json(result)

# Parse as dictionary
data = json.loads(json_str)
print(data["text"])
print(f"Number of sentences: {len(data['sentences'])}")

# Access specific fields
for sentence in data["sentences"]:
    print(f"Sentence: {sentence['text']}")
    print(f"Duration: {sentence['duration']:.2f}s")
    print(f"Confidence: {sentence['confidence']:.2%}")
    print(f"Words: {len(sentence['tokens'])}")

# Save to file
with open("transcript.json", "w", encoding="utf-8") as f:
    f.write(json_str)
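Because the JSON carries per-token confidence, the parsed dictionary is easy to post-process, for example to flag low-confidence words for manual review. A small sketch over the schema shown above (the 0.8 threshold is an arbitrary choice, not a library default):

```python
def low_confidence_words(data: dict, threshold: float = 0.8) -> list[tuple[str, float]]:
    """Collect (word, confidence) pairs below the threshold from parsed JSON."""
    flagged = []
    for sentence in data["sentences"]:
        for token in sentence["tokens"]:
            if token["confidence"] < threshold:
                flagged.append((token["text"].strip(), token["confidence"]))
    return flagged

# Minimal sample in the same shape as to_json() output
sample = {
    "sentences": [
        {"tokens": [
            {"text": "Hello", "confidence": 0.96},
            {"text": " wrld", "confidence": 0.41},
        ]}
    ]
}
print(low_confidence_words(sample))  # [('wrld', 0.41)]
```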

Schema

text (string): Full transcribed text
sentences (array): Sentence objects, each with text, start, end, duration, confidence (times in seconds, values rounded to 3 decimals), and tokens
tokens (array): Token objects within each sentence, carrying the same text/start/end/duration/confidence fields at word level

Implementation

From the source code (parakeet_mlx/cli.py:143-171):
def to_json(result: AlignedResult) -> str:
    output_dict = {
        "text": result.text,
        "sentences": [
            _aligned_sentence_to_dict(sentence)
            for sentence in result.sentences
        ],
    }
    return json.dumps(output_dict, indent=2, ensure_ascii=False)

def _aligned_sentence_to_dict(sentence: AlignedSentence) -> Dict[str, Any]:
    return {
        "text": sentence.text,
        "start": round(sentence.start, 3),
        "end": round(sentence.end, 3),
        "duration": round(sentence.duration, 3),
        "confidence": round(sentence.confidence, 3),
        "tokens": [_aligned_token_to_dict(token) for token in sentence.tokens],
    }

def _aligned_token_to_dict(token: AlignedToken) -> Dict[str, Any]:
    return {
        "text": token.text,
        "start": round(token.start, 3),
        "end": round(token.end, 3),
        "duration": round(token.duration, 3),
        "confidence": round(token.confidence, 3),
    }

Generate All Formats

CLI

parakeet-mlx audio.mp3 --output-format all
# Creates:
# - audio.txt
# - audio.srt
# - audio.vtt
# - audio.json

Python

from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_txt, to_srt, to_vtt, to_json
from pathlib import Path

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Generate all formats
formats = {
    "txt": to_txt(result),
    "srt": to_srt(result),
    "vtt": to_vtt(result),
    "json": to_json(result),
}

# Save all formats
output_dir = Path("transcripts")
output_dir.mkdir(exist_ok=True)

for ext, content in formats.items():
    output_path = output_dir / f"audio.{ext}"
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(content)
    print(f"Saved: {output_path}")

Custom Formatting

You can also work directly with the result objects:
from parakeet_mlx import from_pretrained

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Custom CSV format
with open("transcript.csv", "w", encoding="utf-8") as f:
    f.write("start,end,text,confidence\n")
    for sentence in result.sentences:
        f.write(
            f"{sentence.start:.3f},"
            f"{sentence.end:.3f},"
            f'"{sentence.text}",'
            f"{sentence.confidence:.3f}\n"
        )

# Custom markdown format
with open("transcript.md", "w", encoding="utf-8") as f:
    f.write("# Transcription\n\n")
    for sentence in result.sentences:
        f.write(
            f"**[{sentence.start:.1f}s - {sentence.end:.1f}s]** "
            f"{sentence.text}\n\n"
        )

# Custom HTML format
with open("transcript.html", "w", encoding="utf-8") as f:
    f.write("<!DOCTYPE html><html><body>\n")
    f.write("<h1>Transcription</h1>\n")
    for sentence in result.sentences:
        f.write(
            f'<p data-start="{sentence.start}" data-end="{sentence.end}">'
            f'{sentence.text}</p>\n'
        )
    f.write("</body></html>\n")

Format Comparison

Format  Use Case                   Timestamps           Confidence  File Size
TXT     Plain text transcripts                                      Smallest
SRT     Video subtitles (desktop)  ✅ Sentence or word              Small
VTT     Video subtitles (web)      ✅ Sentence or word              Small
JSON    Programmatic access        ✅ Full detail       ✅          Largest

Next Steps

Python API

Learn how to use the formatting functions

CLI Usage

Command-line output options

Chunking

Process long audio files

Streaming

Real-time transcription
