The Parakeet MLX Python API provides a clean, powerful interface for integrating speech recognition into your applications.
## Installation

Install the package with uv (recommended) or pip.
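The commands below assume the project is published on PyPI under the name `parakeet-mlx`:

```shell
# With uv (recommended)
uv add parakeet-mlx

# Or with pip
pip install parakeet-mlx
```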
## Quick Start

```python
from parakeet_mlx import from_pretrained

# Load a model
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

# Transcribe audio
result = model.transcribe("audio_file.wav")
print(result.text)
```
## Loading Models

### `from_pretrained()`

The `from_pretrained()` function downloads and loads a model from Hugging Face:

```python
import mlx.core as mx
from parakeet_mlx import from_pretrained

# Load with default settings
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

# Load with BFloat16 precision (default)
model = from_pretrained(
    "mlx-community/parakeet-tdt-0.6b-v3",
    dtype=mx.bfloat16,
)

# Load with Float32 precision
model = from_pretrained(
    "mlx-community/parakeet-tdt-0.6b-v3",
    dtype=mx.float32,
)

# Custom cache directory
model = from_pretrained(
    "mlx-community/parakeet-tdt-0.6b-v3",
    cache_dir="/path/to/cache",
)
```

Models are cached in Hugging Face's default cache directory (`~/.cache/huggingface`) or the location specified by the `HF_HOME`/`HF_HUB_CACHE` environment variables.
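For example, to redirect the cache for a whole shell session rather than passing `cache_dir` on every call:

```shell
# Point Hugging Face downloads at a custom cache location
export HF_HOME=/path/to/hf-cache
```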
## Available Models

Browse all available models in the mlx-community/parakeet collection on Hugging Face.

Popular models:

- `mlx-community/parakeet-tdt-0.6b-v3` - Fast, accurate TDT model (recommended)
- `mlx-community/parakeet-tdt-1.1b` - Larger TDT model
- `mlx-community/parakeet-ctc-0.6b` - CTC-based model
- `mlx-community/parakeet-rnnt-0.6b` - RNN-T-based model
## Model Types

The `from_pretrained()` function returns one of these model types:

```python
from parakeet_mlx import (
    BaseParakeet,    # Abstract base class
    ParakeetTDT,     # Token-and-Duration Transducer model
    ParakeetRNNT,    # RNN-Transducer model
    ParakeetCTC,     # CTC model
    ParakeetTDTCTC,  # TDT with auxiliary CTC
)
```

For most use cases, the `BaseParakeet` abstraction is sufficient:

```python
from parakeet_mlx import from_pretrained, BaseParakeet

model: BaseParakeet = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
```
## Basic Transcription

### Simple Transcription

```python
from parakeet_mlx import from_pretrained

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

# Transcribe a file
result = model.transcribe("audio.wav")
print(result.text)
# Output: "Hello world. This is a test."
```

### Working with Timestamps

The `transcribe()` method returns an `AlignedResult` object with detailed timing information:

```python
result = model.transcribe("audio.wav")

# Full text
print(result.text)

# Sentence-level timestamps
for sentence in result.sentences:
    print(f"[{sentence.start:.2f}s - {sentence.end:.2f}s] {sentence.text}")
    print(f"  Duration: {sentence.duration:.2f}s")
    print(f"  Confidence: {sentence.confidence:.2%}")

# Word-level timestamps
for sentence in result.sentences:
    for token in sentence.tokens:
        print(f"  {token.text} [{token.start:.2f}s - {token.end:.2f}s]")
```

Output:

```
[0.20s - 2.15s] Hello world.
  Duration: 1.95s
  Confidence: 94.32%
[2.15s - 4.80s] This is a test.
  Duration: 2.65s
  Confidence: 96.18%
```
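The sentence start/end times map directly onto subtitle formats. As an illustration (not a library feature; see the Output Formats page for the built-in exporters), here is a minimal sketch of formatting a time in seconds as an SRT-style `HH:MM:SS,mmm` timestamp:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT-style HH:MM:SS,mmm timestamp."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(2.15))    # 00:00:02,150
print(srt_timestamp(3725.5))  # 01:02:05,500
```

Each subtitle cue could then be written as `f"{srt_timestamp(sentence.start)} --> {srt_timestamp(sentence.end)}"` followed by `sentence.text`.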
## Result Objects

### AlignedResult

```python
from parakeet_mlx import AlignedResult

result: AlignedResult = model.transcribe("audio.wav")

# Full transcribed text
print(result.text)       # str

# List of sentences with timestamps
print(result.sentences)  # list[AlignedSentence]

# All tokens (flattened from all sentences)
print(result.tokens)     # list[AlignedToken]
```

### AlignedSentence

```python
from parakeet_mlx import AlignedSentence

sentence: AlignedSentence = result.sentences[0]

print(sentence.text)        # str - Sentence text
print(sentence.start)       # float - Start time in seconds
print(sentence.end)         # float - End time in seconds
print(sentence.duration)    # float - Duration in seconds
print(sentence.confidence)  # float - Confidence score (0-1)
print(sentence.tokens)      # list[AlignedToken] - Words in the sentence
```

### AlignedToken

```python
from parakeet_mlx import AlignedToken

token: AlignedToken = sentence.tokens[0]

print(token.text)        # str - Token text
print(token.start)       # float - Start time in seconds
print(token.end)         # float - End time in seconds
print(token.duration)    # float - Duration in seconds
print(token.confidence)  # float - Confidence score (0-1)
print(token.id)          # int - Token ID in the vocabulary
```
## Decoding Configuration

### Greedy Decoding (Default)

Greedy decoding selects the most probable token at each step:

```python
from parakeet_mlx import from_pretrained, DecodingConfig, Greedy

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

config = DecodingConfig(decoding=Greedy())
result = model.transcribe("audio.wav", decoding_config=config)
```

### Beam Search Decoding

Beam search explores multiple hypotheses for better accuracy:

```python
from parakeet_mlx import from_pretrained, DecodingConfig, Beam

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

config = DecodingConfig(
    decoding=Beam(
        beam_size=5,           # Number of beams (default: 5)
        length_penalty=0.013,  # Length penalty (default: 0.013)
        patience=3.5,          # Patience multiplier (default: 3.5)
        duration_reward=0.67,  # TDT: balance token/duration probs (default: 0.67)
    )
)
result = model.transcribe("audio.wav", decoding_config=config)
```

Beam decoding is currently only supported for TDT models and is significantly slower than greedy decoding.

### Sentence Configuration

Control how transcriptions are split into sentences:

```python
from parakeet_mlx import from_pretrained, DecodingConfig, SentenceConfig

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

config = DecodingConfig(
    sentence=SentenceConfig(
        max_words=30,       # Maximum words per sentence
        silence_gap=5.0,    # Split at silences longer than 5 seconds
        max_duration=40.0,  # Maximum sentence duration in seconds
    )
)
result = model.transcribe("audio.wav", decoding_config=config)
```

### Combined Configuration

```python
from parakeet_mlx import from_pretrained, DecodingConfig, Beam, SentenceConfig

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

config = DecodingConfig(
    decoding=Beam(beam_size=10, length_penalty=0.02),
    sentence=SentenceConfig(max_words=20, max_duration=30.0),
)
result = model.transcribe("audio.wav", decoding_config=config)
```
## Chunking for Long Audio

For long audio files, use chunking to process the audio in smaller segments:

```python
from parakeet_mlx import from_pretrained

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

result = model.transcribe(
    "long_audio.wav",
    chunk_duration=120.0,   # 2 minutes per chunk
    overlap_duration=15.0,  # 15 seconds of overlap
)
print(result.text)
```

### Progress Callback

Track chunking progress with a callback:

```python
def progress_callback(current, total):
    progress = (current / total) * 100
    print(f"Progress: {progress:.1f}%", end="\r")

result = model.transcribe(
    "long_audio.wav",
    chunk_duration=120.0,
    overlap_duration=15.0,
    chunk_callback=progress_callback,
)
```

See the Chunking Guide for detailed information.
## Attention Mechanisms

### Local Attention

Reduce memory usage for long audio by using local attention:

```python
from parakeet_mlx import from_pretrained

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

# Enable local attention with a fixed context window
model.encoder.set_attention_model(
    "rel_pos_local_attn",  # Follows NeMo's naming convention
    (256, 256),            # (left_context, right_context) in frames
)

result = model.transcribe("long_audio.wav")
```

Local attention is most effective when processing long audio without chunking.
## Low-Level API

### Direct Mel-Spectrogram Processing

For advanced use cases, you can process mel spectrograms directly:

```python
import mlx.core as mx
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import load_audio, get_logmel

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

# Load and preprocess audio manually
audio = load_audio("audio.wav", model.preprocessor_config.sample_rate)
mel = get_logmel(audio, model.preprocessor_config)

# Generate a transcription from the mel spectrogram
# Input shape: [batch, sequence, features] or [sequence, features]
results = model.generate(mel)  # Returns list[AlignedResult]
print(results[0].text)
```

### Batch Processing

```python
import mlx.core as mx
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import load_audio, get_logmel

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

# Load multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
mel_specs = []
for file in audio_files:
    audio = load_audio(file, model.preprocessor_config.sample_rate)
    mel = get_logmel(audio, model.preprocessor_config)
    mel_specs.append(mel)

# Stack into a batch (requires equal lengths, or padding)
batch_mel = mx.concatenate([mx.expand_dims(m, 0) for m in mel_specs], axis=0)

# Generate for the whole batch
results = model.generate(batch_mel)
for i, result in enumerate(results):
    print(f"File {i + 1}: {result.text}")
```
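Stacking only works when every mel spectrogram has the same number of frames; otherwise, shorter ones must be zero-padded first. A hedged sketch of that padding logic, written with NumPy so it is easy to follow in isolation (`mx.pad` and `mx.stack` in MLX offer analogous numpy-style operations); `pad_to_batch` is an illustrative helper, not part of parakeet-mlx:

```python
import numpy as np

def pad_to_batch(mels: list) -> np.ndarray:
    """Zero-pad [frames, features] arrays to a common length and stack them."""
    max_len = max(m.shape[0] for m in mels)
    padded = [
        np.pad(m, ((0, max_len - m.shape[0]), (0, 0)))  # pad trailing frames
        for m in mels
    ]
    return np.stack(padded, axis=0)  # [batch, frames, features]

# Example with two dummy spectrograms of different lengths
batch = pad_to_batch([np.ones((10, 80)), np.ones((7, 80))])
print(batch.shape)  # (2, 10, 80)
```

Note that zero-padding adds silence-like frames at the end of the shorter inputs, which may produce slightly different results than transcribing each file individually.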
## Precision Control

```python
import mlx.core as mx
from parakeet_mlx import from_pretrained

# Load the model in BFloat16 (default, recommended)
model = from_pretrained(
    "mlx-community/parakeet-tdt-0.6b-v3",
    dtype=mx.bfloat16,
)

# Transcribe in BFloat16
result = model.transcribe("audio.wav", dtype=mx.bfloat16)

# Or use Float32 for potentially higher accuracy
model = from_pretrained(
    "mlx-community/parakeet-tdt-0.6b-v3",
    dtype=mx.float32,
)
result = model.transcribe("audio.wav", dtype=mx.float32)
```

BFloat16 is recommended because it provides a good balance of speed, memory usage, and accuracy.
## Complete Example

Here's a comprehensive example demonstrating common API patterns:

```python
import mlx.core as mx
from parakeet_mlx import (
    from_pretrained,
    DecodingConfig,
    Beam,
    SentenceConfig,
)

# Load model
model = from_pretrained(
    "mlx-community/parakeet-tdt-0.6b-v3",
    dtype=mx.bfloat16,
)

# Configure decoding
config = DecodingConfig(
    decoding=Beam(
        beam_size=5,
        length_penalty=0.013,
        patience=3.5,
        duration_reward=0.67,
    ),
    sentence=SentenceConfig(
        max_words=25,
        silence_gap=3.0,
        max_duration=30.0,
    ),
)

# Transcribe with progress tracking
def show_progress(current, total):
    print(f"Processing: {current}/{total} samples", end="\r")

result = model.transcribe(
    "audio.wav",
    dtype=mx.bfloat16,
    decoding_config=config,
    chunk_duration=120.0,
    overlap_duration=15.0,
    chunk_callback=show_progress,
)

# Display results
print(f"\n\nFull text: {result.text}\n")

for i, sentence in enumerate(result.sentences, 1):
    print(f"Sentence {i}:")
    print(f"  Time: {sentence.start:.2f}s - {sentence.end:.2f}s")
    print(f"  Text: {sentence.text}")
    print(f"  Confidence: {sentence.confidence:.2%}")
    print()

# Export word-level timestamps
for sentence in result.sentences:
    for token in sentence.tokens:
        print(
            f"{token.start:.3f}\t{token.end:.3f}\t"
            f"{token.text}\t{token.confidence:.3f}"
        )
```
## Type Hints

For better IDE support and type checking:

```python
from pathlib import Path

from parakeet_mlx import (
    BaseParakeet,
    AlignedResult,
    AlignedSentence,
    from_pretrained,
)

def transcribe_file(audio_path: Path) -> AlignedResult:
    model: BaseParakeet = from_pretrained(
        "mlx-community/parakeet-tdt-0.6b-v3"
    )
    result: AlignedResult = model.transcribe(str(audio_path))
    return result

def extract_sentences(result: AlignedResult) -> list[str]:
    sentences: list[AlignedSentence] = result.sentences
    return [s.text for s in sentences]

def extract_timestamps(result: AlignedResult) -> list[tuple[float, float, str]]:
    return [
        (sentence.start, sentence.end, sentence.text)
        for sentence in result.sentences
    ]
```
## Next Steps

- **Streaming** - Learn how to do real-time transcription
- **Chunking** - Process long audio files efficiently
- **Output Formats** - Export transcriptions in different formats
- **CLI Usage** - Use the command-line interface