The Parakeet MLX CLI is a command-line interface for transcribing audio files with NVIDIA's Parakeet ASR models on Apple Silicon.

Installation

Make sure ffmpeg is installed on your system first; otherwise the CLI won't work properly.
uv tool install parakeet-mlx -U
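The ffmpeg prerequisite can be checked from the shell before installing. This is a minimal sketch; `have_cmd` is a hypothetical helper name, not part of parakeet-mlx.

```shell
# Hypothetical helper: succeed if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd ffmpeg; then
  echo "ffmpeg found: $(command -v ffmpeg)"
else
  echo "ffmpeg not found; install it before running parakeet-mlx" >&2
fi
```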

Basic Usage

The basic syntax for the CLI is:
parakeet-mlx <audio_files> [OPTIONS]

Quick Examples

parakeet-mlx audio.mp3

Core Options

Model Selection

--model
string
default:"mlx-community/parakeet-tdt-0.6b-v3"
Hugging Face repository of the model to use. Available models can be found in the mlx-community/parakeet collection.
parakeet-mlx audio.mp3 --model mlx-community/parakeet-tdt-1.1b
Set the PARAKEET_MODEL environment variable to avoid specifying the model every time:
export PARAKEET_MODEL=mlx-community/parakeet-tdt-1.1b

Output Configuration

--output-dir
path
default:"."
Directory to save transcription outputs.
--output-format
string
default:"srt"
Format for output files. Options: txt, srt, vtt, json, all
--output-template
string
default:"{filename}"
Template for output filenames. Available variables:
  • {filename} - Original filename without extension
  • {parent} - Parent directory path
  • {date} - Current date in YYYYMMDD format
  • {index} - File index (1-based)
parakeet-mlx audio.mp3 --output-dir ./transcriptions
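To see how the template variables combine, the sketch below mimics the substitution in plain shell. It is an illustration only, not the CLI's actual implementation: `expand_template` is a hypothetical helper, and treating `{parent}` as the containing directory of the input path is an assumption.

```shell
# Illustrative only: mimic how the documented template variables might expand.
# expand_template is a hypothetical helper, not part of parakeet-mlx.
expand_template() {
  template=$1 file=$2 index=$3
  base=$(basename "$file")
  filename=${base%.*}          # original filename without extension
  parent=$(dirname "$file")    # assumption: {parent} is the containing directory
  today=$(date +%Y%m%d)        # current date as YYYYMMDD
  # Note: paths containing "|" or "&" would confuse this simple sed call.
  printf '%s\n' "$template" | sed \
    -e "s|{filename}|$filename|g" \
    -e "s|{parent}|$parent|g" \
    -e "s|{date}|$today|g" \
    -e "s|{index}|$index|g"
}

expand_template "{index}_{filename}" recordings/interview.mp3 1
# prints: 1_interview
```

With the real CLI, the equivalent template would be passed as `--output-template "{index}_{filename}"`.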

Decoding Options

Decoding Method

--decoding
string
default:"greedy"
Decoding method to use: greedy or beam
Beam decoding is only available for TDT models and is slower but potentially more accurate.
Greedy decoding (fast)
parakeet-mlx audio.mp3 --decoding greedy
Beam decoding (accurate)
parakeet-mlx audio.mp3 --decoding beam --beam-size 5

Beam Search Parameters

These parameters only apply when using --decoding beam:
--beam-size
integer
default:"5"
Number of beams to maintain during search. Higher values increase accuracy but reduce speed.
--length-penalty
float
default:"0.013"
Length penalty for beam search. Set to 0.0 to disable. Higher values favor longer hypotheses.
--patience
float
default:"3.5"
Patience multiplier for beam search. Set to 1.0 to disable. Higher values keep more candidates.
--duration-reward
float
default:"0.67"
TDT-specific: Balance between token and duration logprobs (0.0-1.0).
  • < 0.5: Favor token probabilities
  • > 0.5: Favor duration probabilities
Fine-tuned beam search
parakeet-mlx audio.mp3 \
  --decoding beam \
  --beam-size 10 \
  --length-penalty 0.02 \
  --patience 4.0 \
  --duration-reward 0.7
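Because larger beams trade speed for accuracy, it can help to run the same file at several beam sizes and compare the results. The sketch below uses only the flags documented above; `sweep_beam_sizes` and the `PARAKEET_BIN` variable are hypothetical names introduced for this example, and the `./beam-N` output directories are an arbitrary choice.

```shell
# Hypothetical helper: transcribe one file at several beam sizes,
# writing each run to its own output directory.
# PARAKEET_BIN is an assumption of this sketch; it defaults to parakeet-mlx.
sweep_beam_sizes() {
  audio=$1
  shift
  for size in "$@"; do
    "${PARAKEET_BIN:-parakeet-mlx}" "$audio" \
      --decoding beam \
      --beam-size "$size" \
      --output-dir "./beam-$size"
  done
}

# Dry run: print the commands instead of transcribing.
PARAKEET_BIN=echo
sweep_beam_sizes audio.mp3 3 5 10
```

Unset `PARAKEET_BIN` (or set it to `parakeet-mlx`) to run the transcriptions for real.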

Sentence Splitting Options

Control how the transcription is split into sentences:
--max-words
integer
default:"None"
Maximum number of words per sentence.
--silence-gap
float
default:"None"
Split sentences at silence gaps longer than this duration (in seconds).
--max-duration
float
default:"None"
Maximum sentence duration in seconds.
Limit sentence length
parakeet-mlx audio.mp3 --max-words 20 --max-duration 10.0
Split on silence
parakeet-mlx audio.mp3 --silence-gap 2.0

Performance Options

Precision

--fp32 / --bf16
boolean
default:"bf16"
Choose floating-point precision:
  • --bf16: BFloat16 precision (default, faster, lower memory)
  • --fp32: Float32 precision (slower, higher memory, potentially more accurate)
Use FP32 precision
parakeet-mlx audio.mp3 --fp32

Attention Mechanism

--local-attention / --full-attention
boolean
default:"full-attention"
Attention mechanism to use:
  • --full-attention: Standard full attention (default)
  • --local-attention: Local attention (reduces memory for long audio)
--local-attention-context-size
integer
default:"256"
Context window size for local attention (in frames).
Use local attention for long audio
parakeet-mlx long_audio.mp3 \
  --local-attention \
  --local-attention-context-size 512 \
  --chunk-duration 0
Local attention is most useful when transcribing long audio files without chunking.

Cache Directory

--cache-dir
path
default:"None"
Directory for the Hugging Face model cache. Defaults to ~/.cache/huggingface or the value of HF_HOME/HF_HUB_CACHE.
Custom cache location
parakeet-mlx audio.mp3 --cache-dir /path/to/cache

Subtitle Features

Word-Level Timestamps

--highlight-words
boolean
default:"false"
Generate word-level timestamps in SRT/VTT outputs. Each word appears highlighted as it’s spoken.
parakeet-mlx audio.mp3 --output-format srt --highlight-words

Verbose Mode

--verbose / -v
boolean
default:"false"
Print detailed progress information including:
  • Model loading status
  • Output directory and format
  • Per-file processing progress
  • Sentence-level timestamps and confidence scores
Enable verbose output
parakeet-mlx audio.mp3 -v
Example verbose output:
Loading model: mlx-community/parakeet-tdt-0.6b-v3...
Model loaded successfully.
Output directory: /current/directory
Output format(s): srt
Transcribing 1 file(s)...

Processing file 1/1: audio.mp3
[00:00:00,000 --> 00:00:02,340] (confidence: 95.32%) Hello world.
[00:00:02,340 --> 00:00:05,120] (confidence: 93.18%) This is a test.

Saved SRT: /current/directory/audio.srt

parakeet-tdt-0.6b-v3 transcription complete. Outputs saved in '/current/directory'.
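The per-sentence confidence scores make it possible to flag segments worth reviewing. The sketch below filters lines in the format shown in the example output; `low_confidence` is a hypothetical helper, and it assumes the verbose output has been captured as text in exactly that format.

```shell
# Hypothetical helper: print sentence lines whose confidence is below a threshold.
# Reads verbose output on stdin; lines without a confidence field are ignored.
low_confidence() {
  threshold=$1
  awk -v t="$threshold" '
    match($0, /\(confidence: [0-9.]+%\)/) {
      # Skip the 13-character "(confidence: " prefix and the trailing "%)".
      c = substr($0, RSTART + 13, RLENGTH - 15)
      if (c + 0 < t) print
    }'
}

# Demo using the two sentence lines from the example output.
printf '%s\n' \
  '[00:00:00,000 --> 00:00:02,340] (confidence: 95.32%) Hello world.' \
  '[00:00:02,340 --> 00:00:05,120] (confidence: 93.18%) This is a test.' |
  low_confidence 94
# prints only the "This is a test." line
```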

Environment Variables

The following options can also be set via environment variables:

Option                          Environment Variable
--model                         PARAKEET_MODEL
--output-format                 PARAKEET_OUTPUT_FORMAT
--output-template               PARAKEET_OUTPUT_TEMPLATE
--decoding                      PARAKEET_DECODING
--chunk-duration                PARAKEET_CHUNK_DURATION
--overlap-duration              PARAKEET_OVERLAP_DURATION
--beam-size                     PARAKEET_BEAM_SIZE
--length-penalty                PARAKEET_LENGTH_PENALTY
--patience                      PARAKEET_PATIENCE
--duration-reward               PARAKEET_DURATION_REWARD
--max-words                     PARAKEET_MAX_WORDS
--silence-gap                   PARAKEET_SILENCE_GAP
--max-duration                  PARAKEET_MAX_DURATION
--fp32                          PARAKEET_FP32
--local-attention               PARAKEET_LOCAL_ATTENTION
--local-attention-context-size  PARAKEET_LOCAL_ATTENTION_CTX
--cache-dir                     PARAKEET_CACHE_DIR
Example: Set the default model, output format, and decoding method
export PARAKEET_MODEL=mlx-community/parakeet-tdt-1.1b
export PARAKEET_OUTPUT_FORMAT=vtt
export PARAKEET_DECODING=beam

parakeet-mlx audio.mp3  # Uses environment defaults
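Exported variables persist for the whole shell session. For a one-off setting, standard shell syntax allows a `VAR=value` prefix that applies to a single command only. The sketch below demonstrates that scoping with a stand-in command rather than the CLI itself; `show_parakeet_env` is a hypothetical helper.

```shell
# Hypothetical helper: list the PARAKEET_* variables currently exported.
show_parakeet_env() {
  env | grep '^PARAKEET_' | sort
}

export PARAKEET_OUTPUT_FORMAT=vtt
show_parakeet_env

# A VAR=value prefix applies to that one command only.
PARAKEET_OUTPUT_FORMAT=txt sh -c 'echo "one-off format: $PARAKEET_OUTPUT_FORMAT"'
echo "session format: $PARAKEET_OUTPUT_FORMAT"
```

The same prefix form works with the CLI, for example `PARAKEET_OUTPUT_FORMAT=txt parakeet-mlx audio.mp3`.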

Common Workflows

1. Basic Transcription

parakeet-mlx audio.mp3
Generates audio.srt in the current directory.

2. Batch Processing

parakeet-mlx *.mp3 --output-dir ./transcripts --output-format all
Transcribes all MP3 files and generates all output formats in the transcripts directory.

3. High-Quality Subtitles

parakeet-mlx video.mp4 \
  --output-format vtt \
  --highlight-words \
  --decoding beam \
  --beam-size 10 \
  --max-duration 8.0
Generates word-level VTT subtitles with beam search for maximum accuracy.

4. Long Audio Processing

parakeet-mlx podcast.mp3 \
  --chunk-duration 120 \
  --overlap-duration 15 \
  --output-format json \
  -v
Processes long audio with chunking and verbose output. See the Chunking Guide for details.
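In scripts, the batch workflow benefits from a guard against an empty glob, since an unmatched `*.mp3` is passed to the command literally in most shells. A minimal sketch, where `transcribe_dir` and the `PARAKEET_BIN` variable are hypothetical names introduced for this example:

```shell
# Hypothetical helper: transcribe every .mp3 in a directory with one invocation.
# PARAKEET_BIN is an assumption of this sketch; set it to "echo" for a dry run.
transcribe_dir() {
  src=$1 out=$2
  set -- "$src"/*.mp3
  # An unmatched glob stays literal, so check that the first match exists.
  if [ ! -e "$1" ]; then
    echo "no .mp3 files in $src" >&2
    return 1
  fi
  "${PARAKEET_BIN:-parakeet-mlx}" "$@" --output-dir "$out" --output-format all
}

# Usage:
#   transcribe_dir ./recordings ./transcripts
```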

Troubleshooting

If ffmpeg is missing, install it:
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg
If you run out of memory, try these solutions:
  1. Use BFloat16 precision (default): --bf16
  2. Enable chunking: --chunk-duration 60
  3. Use local attention: --local-attention
  4. Reduce beam size: --beam-size 3
If the model fails to download, check your internet connection and Hugging Face access. You can also:
  1. Pre-download the model using huggingface-cli
  2. Set a custom cache directory: --cache-dir /path/to/cache
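Pre-downloading can be scripted as below. This sketch assumes `huggingface-cli` is installed and uses its `download` subcommand with the default model from this page; `prefetch_model` and the `HF_CLI` variable are hypothetical names introduced for this example.

```shell
# Hypothetical helper: fetch a model into the Hugging Face cache ahead of time.
# HF_CLI is an assumption of this sketch; set it to "echo" for a dry run.
prefetch_model() {
  repo=${1:-mlx-community/parakeet-tdt-0.6b-v3}
  "${HF_CLI:-huggingface-cli}" download "$repo"
}

# Usage:
#   prefetch_model
#   prefetch_model mlx-community/parakeet-tdt-1.1b
```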

Next Steps

Chunking Guide

Learn how to efficiently process long audio files

Output Formats

Understand the different output format options

Python API

Use Parakeet MLX programmatically in your code

Streaming

Real-time transcription with streaming inference
