Parakeet MLX supports multiple output formats for different use cases, from plain text to detailed JSON with word-level timestamps.
- TXT: Plain text transcription without timestamps
- SRT: SubRip subtitle format with timestamps
- VTT: WebVTT subtitle format for web videos
- JSON: Structured data with full timing and confidence
Quick Reference
CLI
# Single format
parakeet-mlx audio.mp3 --output-format txt
parakeet-mlx audio.mp3 --output-format srt
parakeet-mlx audio.mp3 --output-format vtt
parakeet-mlx audio.mp3 --output-format json
# All formats at once
parakeet-mlx audio.mp3 --output-format all
Python API
from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_txt, to_srt, to_vtt, to_json
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")
# Format the result
text = to_txt(result)
srt = to_srt(result)
vtt = to_vtt(result)
json_str = to_json(result)
TXT Format

Plain text format containing just the transcribed text.
Example Output
Hello world. This is a test transcription. It contains multiple sentences.
CLI Usage
parakeet-mlx audio.mp3 --output-format txt
# Creates: audio.txt
Python Usage
from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_txt
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Convert to plain text
text = to_txt(result)
print(text)
# Output: "Hello world. This is a test transcription."

# Save to file
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write(text)
Implementation
From the source code (parakeet_mlx/cli.py:47-49):
def to_txt(result: AlignedResult) -> str:
    """Format transcription result as plain text."""
    return result.text.strip()
TXT format strips leading/trailing whitespace but preserves internal formatting.
SRT Format

SubRip (SRT) format for video subtitles with timestamps.
Example Output
Sentence-level
Word-level
1
00:00:00,000 --> 00:00:02,150
Hello world.
2
00:00:02,150 --> 00:00:05,320
This is a test transcription.
3
00:00:05,320 --> 00:00:08,100
It contains multiple sentences.
1
00:00:00,000 --> 00:00:00,420
<u>Hello</u> world.
2
00:00:00,420 --> 00:00:00,950
Hello <u>world</u>.
3
00:00:02,150 --> 00:00:02,450
<u>This</u> is a test transcription.
4
00:00:02,450 --> 00:00:02,680
This <u>is</u> a test transcription.
CLI Usage
# Sentence-level timestamps (default)
parakeet-mlx audio.mp3 --output-format srt
# Word-level timestamps
parakeet-mlx audio.mp3 --output-format srt --highlight-words
Python Usage
from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_srt
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Sentence-level SRT
srt_content = to_srt(result, highlight_words=False)

# Word-level SRT
srt_word_level = to_srt(result, highlight_words=True)

# Save to file
with open("subtitles.srt", "w", encoding="utf-8") as f:
    f.write(srt_content)
Each SRT entry has three parts:

- Index: sequential number starting from 1
- Timestamp: HH:MM:SS,mmm --> HH:MM:SS,mmm (comma as decimal separator)
- Text: the subtitle text; in word-level mode, the currently spoken word is wrapped in <u></u> tags
Implementation Details
From the source code (parakeet_mlx/cli.py:52-97):
def to_srt(result: AlignedResult, highlight_words: bool = False) -> str:
    """Format transcription result as an SRT file."""
    srt_content = []
    entry_index = 1
    if highlight_words:
        # Word-level: each word gets its own entry with highlighting
        for sentence in result.sentences:
            for i, token in enumerate(sentence.tokens):
                start_time = format_timestamp(token.start, decimal_marker=",")
                end_time = format_timestamp(
                    token.end
                    if token == sentence.tokens[-1]
                    else sentence.tokens[i + 1].start,
                    decimal_marker=",",
                )
                # Build text with the current word underlined
                text = ""
                for j, inner_token in enumerate(sentence.tokens):
                    if i == j:
                        text += inner_token.text.replace(
                            inner_token.text.strip(),
                            f"<u>{inner_token.text.strip()}</u>",
                        )
                    else:
                        text += inner_token.text
                srt_content.extend([
                    str(entry_index),
                    f"{start_time} --> {end_time}",
                    text.strip(),
                    "",
                ])
                entry_index += 1
    else:
        # Sentence-level: each sentence gets one entry
        for sentence in result.sentences:
            start_time = format_timestamp(sentence.start, decimal_marker=",")
            end_time = format_timestamp(sentence.end, decimal_marker=",")
            srt_content.extend([
                str(entry_index),
                f"{start_time} --> {end_time}",
                sentence.text.strip(),
                "",
            ])
            entry_index += 1
    return "\n".join(srt_content)
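The `to_srt` implementation above relies on a `format_timestamp` helper that lives elsewhere in parakeet_mlx/cli.py and is not shown here. A minimal sketch consistent with the timestamps in the example output (the library's actual implementation may differ):

```python
def format_timestamp(seconds: float, decimal_marker: str = ",") -> str:
    """Render seconds as HH:MM:SS<marker>mmm, e.g. 2.15 -> 00:00:02,150."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{millis:03d}"

print(format_timestamp(2.15))       # 00:00:02,150 (SRT style)
print(format_timestamp(2.15, "."))  # 00:00:02.150 (VTT style)
```

The `decimal_marker` parameter is what lets the same helper serve both SRT (comma) and VTT (period) output.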
VTT Format

WebVTT format for web-based video players.
Example Output
Sentence-level
Word-level
WEBVTT
00:00:00.000 --> 00:00:02.150
Hello world.
00:00:02.150 --> 00:00:05.320
This is a test transcription.
00:00:05.320 --> 00:00:08.100
It contains multiple sentences.
WEBVTT
00:00:00.000 --> 00:00:00.420
<b>Hello</b> world.
00:00:00.420 --> 00:00:00.950
Hello <b>world</b>.
00:00:02.150 --> 00:00:02.450
<b>This</b> is a test transcription.
00:00:02.450 --> 00:00:02.680
This <b>is</b> a test transcription.
CLI Usage
# Sentence-level timestamps (default)
parakeet-mlx audio.mp3 --output-format vtt
# Word-level timestamps
parakeet-mlx audio.mp3 --output-format vtt --highlight-words
Python Usage
from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_vtt
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Sentence-level VTT
vtt_content = to_vtt(result, highlight_words=False)

# Word-level VTT
vtt_word_level = to_vtt(result, highlight_words=True)

# Save to file
with open("subtitles.vtt", "w", encoding="utf-8") as f:
    f.write(vtt_content)
Differences from SRT
Feature          SRT              VTT
Header           None             WEBVTT
Decimal marker   Comma (,)        Period (.)
Word highlight   <u>word</u>      <b>word</b>
Use case         Desktop players  Web browsers
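Because the two formats differ only in these surface details, converting one to the other is mechanical. A rough, illustrative converter (the `srt_to_vtt` helper is hypothetical, not part of parakeet-mlx):

```python
import re

# Hypothetical helper: converts an SRT string to VTT by applying the
# differences listed above. Assumes well-formed SRT input whose cue
# text lines are never bare numbers.
def srt_to_vtt(srt: str) -> str:
    lines = ["WEBVTT", ""]                      # VTT requires a header
    for line in srt.splitlines():
        if re.fullmatch(r"\d+", line.strip()):  # drop SRT cue numbers
            continue
        if "-->" in line:
            line = line.replace(",", ".")       # period as decimal marker
        # swap the word-highlight tag convention
        line = line.replace("<u>", "<b>").replace("</u>", "</b>")
        lines.append(line)
    return "\n".join(lines)

sample = "1\n00:00:00,000 --> 00:00:02,150\n<u>Hello</u> world.\n"
print(srt_to_vtt(sample))
```

In practice, generating each format directly with `to_srt`/`to_vtt` is simpler; the sketch is only meant to make the table concrete.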
Implementation
From the source code (parakeet_mlx/cli.py:100-140):
def to_vtt(result: AlignedResult, highlight_words: bool = False) -> str:
    """Format transcription result as a VTT file."""
    vtt_content = ["WEBVTT", ""]  # VTT header
    if highlight_words:
        # Word-level with bold highlighting
        for sentence in result.sentences:
            for i, token in enumerate(sentence.tokens):
                start_time = format_timestamp(token.start, decimal_marker=".")
                end_time = format_timestamp(
                    token.end
                    if token == sentence.tokens[-1]
                    else sentence.tokens[i + 1].start,
                    decimal_marker=".",
                )
                text_line = ""
                for j, inner_token in enumerate(sentence.tokens):
                    if i == j:
                        text_line += inner_token.text.replace(
                            inner_token.text.strip(),
                            f"<b>{inner_token.text.strip()}</b>",
                        )
                    else:
                        text_line += inner_token.text
                vtt_content.extend([
                    f"{start_time} --> {end_time}",
                    text_line.strip(),
                    "",
                ])
    else:
        # Sentence-level
        for sentence in result.sentences:
            start_time = format_timestamp(sentence.start, decimal_marker=".")
            end_time = format_timestamp(sentence.end, decimal_marker=".")
            vtt_content.extend([
                f"{start_time} --> {end_time}",
                sentence.text.strip(),
                "",
            ])
    return "\n".join(vtt_content)
JSON Format

Structured JSON format with complete timing and confidence information.
Example Output
{
  "text": "Hello world. This is a test.",
  "sentences": [
    {
      "text": "Hello world.",
      "start": 0.0,
      "end": 1.95,
      "duration": 1.95,
      "confidence": 0.943,
      "tokens": [
        {
          "text": "Hello",
          "start": 0.0,
          "end": 0.42,
          "duration": 0.42,
          "confidence": 0.956
        },
        {
          "text": " world",
          "start": 0.42,
          "end": 0.95,
          "duration": 0.53,
          "confidence": 0.931
        },
        {
          "text": ".",
          "start": 0.95,
          "end": 1.95,
          "duration": 1.0,
          "confidence": 0.942
        }
      ]
    },
    {
      "text": "This is a test.",
      "start": 1.95,
      "end": 4.8,
      "duration": 2.85,
      "confidence": 0.962,
      "tokens": [
        {
          "text": " This",
          "start": 1.95,
          "end": 2.31,
          "duration": 0.36,
          "confidence": 0.978
        },
        {
          "text": " is",
          "start": 2.31,
          "end": 2.58,
          "duration": 0.27,
          "confidence": 0.965
        },
        {
          "text": " a",
          "start": 2.58,
          "end": 2.73,
          "duration": 0.15,
          "confidence": 0.941
        },
        {
          "text": " test",
          "start": 2.73,
          "end": 3.42,
          "duration": 0.69,
          "confidence": 0.953
        },
        {
          "text": ".",
          "start": 3.42,
          "end": 4.8,
          "duration": 1.38,
          "confidence": 0.974
        }
      ]
    }
  ]
}
CLI Usage
parakeet-mlx audio.mp3 --output-format json
# Creates: audio.json
Python Usage
import json
from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_json
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Convert to JSON string
json_str = to_json(result)

# Parse as dictionary
data = json.loads(json_str)
print(data["text"])
print(f"Number of sentences: {len(data['sentences'])}")

# Access specific fields
for sentence in data["sentences"]:
    print(f"Sentence: {sentence['text']}")
    print(f"Duration: {sentence['duration']:.2f}s")
    print(f"Confidence: {sentence['confidence']:.2%}")
    print(f"Words: {len(sentence['tokens'])}")

# Save to file
with open("transcript.json", "w", encoding="utf-8") as f:
    f.write(json_str)
Schema

The top level contains:

- text: Full transcription text
- sentences: Array of sentence objects

Each sentence object contains:

- text: Sentence text
- start: Start time in seconds (rounded to 3 decimals)
- end: End time in seconds (rounded to 3 decimals)
- duration: Duration in seconds (rounded to 3 decimals)
- confidence: Confidence score 0-1 (rounded to 3 decimals)
- tokens: Array of word/token objects

Each token object contains:

- text: Token text (may include leading/trailing whitespace)
- start: Start time in seconds (rounded to 3 decimals)
- end: End time in seconds (rounded to 3 decimals)
- duration: Duration in seconds (rounded to 3 decimals)
- confidence: Confidence score 0-1 (rounded to 3 decimals)
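Because the output is plain JSON, downstream analysis needs nothing beyond the standard library. A sketch that computes speaking rate and average sentence confidence (the inline sample matches the schema above; all values are illustrative):

```python
import json

# A minimal result in the schema described above (illustrative values).
data = json.loads("""
{
  "text": "Hello world.",
  "sentences": [
    {"text": "Hello world.", "start": 0.0, "end": 1.95,
     "duration": 1.95, "confidence": 0.943,
     "tokens": [
       {"text": "Hello",  "start": 0.0,  "end": 0.42, "duration": 0.42, "confidence": 0.956},
       {"text": " world", "start": 0.42, "end": 0.95, "duration": 0.53, "confidence": 0.931},
       {"text": ".",      "start": 0.95, "end": 1.95, "duration": 1.0,  "confidence": 0.942}
     ]}
  ]
}
""")

sentences = data["sentences"]
total_seconds = sentences[-1]["end"] - sentences[0]["start"]
# Punctuation-only tokens (like the trailing ".") are not words.
word_tokens = [t for s in sentences for t in s["tokens"] if t["text"].strip(" .,!?")]
avg_confidence = sum(s["confidence"] for s in sentences) / len(sentences)

print(f"Words per minute: {len(word_tokens) / total_seconds * 60:.1f}")
print(f"Average sentence confidence: {avg_confidence:.1%}")
```

The same loop works unchanged on a file produced by `--output-format json`; just replace the inline string with `open("audio.json").read()`.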
Implementation
From the source code (parakeet_mlx/cli.py:143-171):
def to_json(result: AlignedResult) -> str:
    output_dict = {
        "text": result.text,
        "sentences": [
            _aligned_sentence_to_dict(sentence)
            for sentence in result.sentences
        ],
    }
    return json.dumps(output_dict, indent=2, ensure_ascii=False)

def _aligned_sentence_to_dict(sentence: AlignedSentence) -> Dict[str, Any]:
    return {
        "text": sentence.text,
        "start": round(sentence.start, 3),
        "end": round(sentence.end, 3),
        "duration": round(sentence.duration, 3),
        "confidence": round(sentence.confidence, 3),
        "tokens": [_aligned_token_to_dict(token) for token in sentence.tokens],
    }

def _aligned_token_to_dict(token: AlignedToken) -> Dict[str, Any]:
    return {
        "text": token.text,
        "start": round(token.start, 3),
        "end": round(token.end, 3),
        "duration": round(token.duration, 3),
        "confidence": round(token.confidence, 3),
    }
All Formats at Once

CLI
parakeet-mlx audio.mp3 --output-format all
# Creates:
# - audio.txt
# - audio.srt
# - audio.vtt
# - audio.json
Python
from parakeet_mlx import from_pretrained
from parakeet_mlx.cli import to_txt, to_srt, to_vtt, to_json
from pathlib import Path
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")

# Generate all formats
formats = {
    "txt": to_txt(result),
    "srt": to_srt(result),
    "vtt": to_vtt(result),
    "json": to_json(result),
}

# Save all formats
output_dir = Path("transcripts")
output_dir.mkdir(exist_ok=True)

for ext, content in formats.items():
    output_path = output_dir / f"audio.{ext}"
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(content)
    print(f"Saved: {output_path}")
Custom Formats

You can also work directly with the result objects:
from parakeet_mlx import from_pretrained
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("audio.wav")
# Custom CSV format (use the csv module instead if text may contain quotes or commas)
with open("transcript.csv", "w", encoding="utf-8") as f:
    f.write("start,end,text,confidence\n")
    for sentence in result.sentences:
        f.write(
            f"{sentence.start:.3f},"
            f"{sentence.end:.3f},"
            f'"{sentence.text}",'
            f"{sentence.confidence:.3f}\n"
        )
# Custom Markdown format
with open("transcript.md", "w", encoding="utf-8") as f:
    f.write("# Transcription\n\n")
    for sentence in result.sentences:
        f.write(
            f"**[{sentence.start:.1f}s - {sentence.end:.1f}s]** "
            f"{sentence.text}\n\n"
        )
# Custom HTML format (apply html.escape to sentence text if it may contain markup)
with open("transcript.html", "w", encoding="utf-8") as f:
    f.write("<!DOCTYPE html><html><body>\n")
    f.write("<h1>Transcription</h1>\n")
    for sentence in result.sentences:
        f.write(
            f'<p data-start="{sentence.start}" data-end="{sentence.end}">'
            f'{sentence.text}</p>\n'
        )
    f.write("</body></html>\n")
Format Comparison

Format  Use Case                    Timestamps           Confidence  File Size
TXT     Plain text transcripts      ❌                   ❌          Smallest
SRT     Video subtitles (desktop)   ✅ Sentence or word  ❌          Small
VTT     Video subtitles (web)       ✅ Sentence or word  ❌          Small
JSON    Programmatic access         ✅ Full detail       ✅          Largest
Next Steps
- Python API: Learn how to use the formatting functions
- CLI Usage: Command-line output options
- Chunking: Process long audio files
- Streaming: Real-time transcription