Let’s transcribe an audio file in just a few steps. This guide assumes you’ve already installed Parakeet MLX.

Quick Start

Transcribe an audio file with a single command:
parakeet-mlx audio.mp3
This creates audio.srt in the current directory containing a timestamped transcription.
By default, the CLI uses the mlx-community/parakeet-tdt-0.6b-v3 model and writes SRT subtitle output.
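SRT is a plain-text format: numbered cues with `HH:MM:SS,mmm` start/end timestamps. If you want a feel for what the CLI writes, here is a minimal, library-independent sketch of building one cue (the helper names are our own, not part of Parakeet MLX):

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Render one numbered SRT cue block."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

print(srt_cue(1, 0.0, 2.5, "Hello, world."))
```

A full .srt file is just such cues separated by blank lines.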

Common Use Cases

Batch Processing Multiple Files

Pass several files (or a shell glob) to transcribe them in one run; here each file gets a WebVTT subtitle file:
parakeet-mlx *.mp3 --output-format vtt

Long Audio with Chunking

For audio longer than a few minutes, enable chunking to manage memory:
parakeet-mlx long_podcast.mp3 --chunk-duration 120 --overlap-duration 15
The values shown are also the defaults: 120-second chunks with 15 seconds of overlap, which works well for most speech content.
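Conceptually, each chunk starts `chunk - overlap` seconds after the previous one, and the overlapping region lets adjacent transcriptions be stitched together. A small, illustrative sketch of the window arithmetic (not the library's internals):

```python
def chunk_windows(total: float, chunk: float = 120.0, overlap: float = 15.0):
    """Yield (start, end) times in seconds for overlapping chunks covering the audio."""
    step = chunk - overlap  # each window advances by chunk minus overlap
    start = 0.0
    while start < total:
        yield (start, min(start + chunk, total))
        if start + chunk >= total:
            break  # last window already reaches the end
        start += step

print(list(chunk_windows(300.0)))
# windows: (0, 120), (105, 225), (210, 300)
```

A 5-minute file therefore needs three windows at the default settings, with 15-second overlaps at 105–120 s and 210–225 s.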

Real-Time Streaming

For live audio transcription:
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import load_audio

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

# Create streaming context
with model.transcribe_stream(context_size=(256, 256)) as transcriber:
    # Simulate real-time audio chunks
    audio_data = load_audio("audio.mp3", model.preprocessor_config.sample_rate)
    chunk_size = model.preprocessor_config.sample_rate  # 1 second chunks
    
    for i in range(0, len(audio_data), chunk_size):
        chunk = audio_data[i:i+chunk_size]
        transcriber.add_audio(chunk)
        
        # Get current transcription
        print(f"Current: {transcriber.result.text}")

Beam Search for Higher Accuracy

Trade speed for accuracy with beam search:
parakeet-mlx audio.mp3 --decoding beam --beam-size 5
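Beam search keeps the `--beam-size` best partial hypotheses at every decoding step instead of committing to the single most likely token, which is slower but can recover from locally greedy mistakes. A toy, framework-free sketch of the idea (not the library's actual decoder):

```python
import math

def beam_search(step_log_probs, beam_size=2):
    """step_log_probs: one {token: log_prob} dict per decoding step.
    Returns the highest-scoring token sequence."""
    beams = [((), 0.0)]  # (partial sequence, cumulative log-probability)
    for dist in step_log_probs:
        # Extend every surviving hypothesis by every candidate token
        candidates = [
            (seq + (tok,), score + lp)
            for seq, score in beams
            for tok, lp in dist.items()
        ]
        # Keep only the beam_size best partial hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(beams, key=lambda b: b[1])[0]

steps = [
    {"a": math.log(0.6), "b": math.log(0.4)},
    {"x": math.log(0.9), "y": math.log(0.1)},
]
print(beam_search(steps, beam_size=2))  # ('a', 'x')
```

With `beam_size=1` this degenerates to greedy decoding; larger beams explore more alternatives at proportionally higher cost.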

Custom Sentence Splitting

Control how text is split into sentences for subtitles:
from parakeet_mlx import from_pretrained, DecodingConfig, SentenceConfig

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

config = DecodingConfig(
    sentence=SentenceConfig(
        max_words=30,          # Max 30 words per subtitle
        silence_gap=5.0,       # Split on 5+ second silence
        max_duration=40.0      # Max 40 second duration
    )
)

result = model.transcribe("audio.mp3", decoding_config=config)

# Each sentence now follows these constraints
for sentence in result.sentences:
    print(f"[{sentence.start:.2f}s - {sentence.end:.2f}s] {sentence.text}")

Performance Tips

BFloat16 is 2x faster than FP32 with minimal accuracy loss on Apple Silicon:
from mlx.core import bfloat16

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3", dtype=bfloat16)
Reduce memory usage for very long audio files:
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
model.encoder.set_attention_model("rel_pos_local_attn", (256, 256))

result = model.transcribe("very_long_audio.mp3")
Choose a model family based on your speed/accuracy trade-off:
  • TDT models: Best accuracy, beam search support (recommended)
  • RNNT models: Good balance of speed and accuracy
  • CTC models: Fastest, simpler architecture
Load the model once and reuse it for multiple transcriptions:
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

for audio_file in audio_files:
    result = model.transcribe(audio_file)
    # Process result...

Advanced Examples

Low-Level API with Mel Spectrograms

For custom preprocessing pipelines:
from parakeet_mlx import from_pretrained, DecodingConfig
from parakeet_mlx.audio import load_audio, get_logmel

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

# Load and preprocess audio manually
audio = load_audio("audio.mp3", model.preprocessor_config.sample_rate)
mel = get_logmel(audio, model.preprocessor_config)

# Generate transcription
alignments = model.generate(mel, decoding_config=DecodingConfig())

# alignments is a list of AlignedResult
for result in alignments:
    print(result.text)

Streaming with Custom Context Size

Fine-tune streaming performance:
from parakeet_mlx import from_pretrained

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

with model.transcribe_stream(
    context_size=(512, 512),  # Larger context = better accuracy, more memory
    depth=2                   # More layers preserve computation accuracy
) as transcriber:
    # Process audio chunks...
    pass

Troubleshooting

If you installed with uv tool, make sure uv's bin directory is on your PATH:
export PATH="$HOME/.local/bin:$PATH"
Or reinstall with pip:
pip install parakeet-mlx -U
Verify installation in your active Python environment:
pip show parakeet-mlx
python -c "import parakeet_mlx; print(parakeet_mlx.__file__)"
Install FFmpeg for audio file support:
# macOS
brew install ffmpeg

# Linux
sudo apt install ffmpeg
If transcription runs out of memory on long files, try these solutions:
  1. Enable chunking: --chunk-duration 120
  2. Use local attention: --local-attention
  3. Close other applications to free up RAM
  4. Choose a smaller model variant
The first transcription downloads the model (~600 MB) from Hugging Face and caches it locally, so subsequent runs are much faster. You can pre-download models:
from parakeet_mlx import from_pretrained
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

Next Steps

Python API Guide

Learn about advanced Python API features

CLI Usage

Explore all CLI options and workflows

Streaming

Set up real-time audio transcription

Output Formats

Learn about SRT, VTT, JSON, and custom formats
