Overview

BaseParakeet is the abstract base class that defines the common interface for all Parakeet model variants. It provides three core methods for transcription:
  • transcribe() - Transcribe audio files
  • transcribe_stream() - Real-time streaming transcription
  • generate() - Low-level mel-spectrogram to text
All model variants (ParakeetTDT, ParakeetRNNT, ParakeetCTC, ParakeetTDTCTC) inherit from this class.

Class Definition

class BaseParakeet(nn.Module):
    def __init__(self, preprocess_args: PreprocessArgs, encoder_args: ConformerArgs):
        ...

Properties

time_ratio

@property
float time_ratio
The duration, in seconds, of one encoder output frame. Used internally to convert frame indices to timestamps. Formula:
time_ratio = (subsampling_factor / sample_rate) * hop_length
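As a worked example, the formula can be evaluated with illustrative numbers. The values below (16 kHz audio, a 160-sample hop, 8x subsampling) are assumptions chosen for the arithmetic, not values read from any particular checkpoint, and `frame_to_seconds` is a hypothetical helper rather than part of the library.

```python
# Illustrative values only; the real ones live on
# model.preprocessor_config and model.encoder_config.
sample_rate = 16_000        # samples per second (assumed)
hop_length = 160            # samples between mel frames (assumed)
subsampling_factor = 8      # mel frames per encoder frame (assumed)

# Same formula as above: seconds covered by one encoder output frame.
time_ratio = (subsampling_factor / sample_rate) * hop_length


def frame_to_seconds(frame_index: int, time_ratio: float) -> float:
    """Convert an encoder frame index to a timestamp in seconds (hypothetical helper)."""
    return frame_index * time_ratio


print(f"{time_ratio:.3f} s per encoder frame")
print(f"frame 125 starts at {frame_to_seconds(125, time_ratio):.2f} s")
```

With these assumed numbers each encoder frame covers 80 ms, so frame 125 maps to the 10-second mark.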

Methods

transcribe()

Transcribe an audio file with optional chunking for long files.
def transcribe(
    self,
    path: Path | str,
    *,
    dtype: mx.Dtype = mx.bfloat16,
    decoding_config: DecodingConfig = DecodingConfig(),
    chunk_duration: Optional[float] = None,
    overlap_duration: float = 15.0,
    chunk_callback: Optional[Callable] = None,
) -> AlignedResult

Parameters

path
Path | str
required
Path to the audio file. Supports WAV, MP3, FLAC, and other formats supported by audiofile.
dtype
mx.Dtype
default:"mx.bfloat16"
Data type for audio processing. Should match the model’s dtype.
decoding_config
DecodingConfig
default:"DecodingConfig()"
Configuration for decoding behavior and sentence splitting. See DecodingConfig.
chunk_duration
float | None
default:"None"
If provided, splits audio into chunks of this duration (in seconds). When None, processes the entire file at once.
Use chunking for:
  • Very long audio files (> 5 minutes)
  • Memory-constrained environments
  • Processing audio that exceeds available RAM
overlap_duration
float
default:"15.0"
Overlap between consecutive chunks in seconds. Only used when chunk_duration is specified.
Higher overlap improves accuracy at chunk boundaries but increases computation time.
chunk_callback
Callable | None
default:"None"
Callback function called after processing each chunk. Receives (current_position, total_length) in samples.
Useful for progress tracking:
def progress(current, total):
    percent = (current / total) * 100
    print(f"Progress: {percent:.1f}%")

# The callback fires once per chunk, so chunking must be enabled
result = model.transcribe("audio.wav", chunk_duration=120.0, chunk_callback=progress)
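Because the callback receives sample counts rather than seconds, it can help to bind the sample rate into the callback. The following is a minimal sketch: `make_progress_callback` is a hypothetical helper, and the sample counts in the demonstration are made up.

```python
from typing import Callable


def make_progress_callback(sample_rate: int) -> Callable[[int, int], None]:
    """Build a callback that reports progress in seconds (hypothetical helper)."""

    def on_chunk(current_position: int, total_length: int) -> None:
        # Both arguments are sample counts, so divide by the sample rate.
        done_s = current_position / sample_rate
        total_s = total_length / sample_rate
        # Guard against an empty file to avoid dividing by zero.
        percent = 100.0 * current_position / total_length if total_length else 100.0
        print(f"{done_s:7.1f}s / {total_s:.1f}s ({percent:5.1f}%)")

    return on_chunk


# Stand-alone demonstration with made-up sample counts at 16 kHz.
callback = make_progress_callback(16_000)
callback(960_000, 1_920_000)
```

In real use, the callback would presumably be built from `model.preprocessor_config.sample_rate` and passed as `chunk_callback` together with a `chunk_duration`.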

Returns

result
AlignedResult
Transcription result with aligned tokens and sentences. See AlignedResult.

Examples

Basic transcription:
from parakeet_mlx import from_pretrained

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("interview.wav")

print(result.text)
for sentence in result.sentences:
    print(f"[{sentence.start:.2f}s - {sentence.end:.2f}s] {sentence.text}")
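Since sentence start and end times are plain seconds, they can be reformatted for subtitle-style output. The helper below is a hypothetical sketch that uses only the standard library; it is not a function provided by parakeet_mlx.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm (hypothetical helper)."""
    # Work in whole milliseconds to avoid floating-point drift.
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


print(srt_timestamp(3725.5))
```

Applied to the loop above, `srt_timestamp(sentence.start)` and `srt_timestamp(sentence.end)` would replace the `:.2f` formatting.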
With chunking for long audio:
result = model.transcribe(
    "long_podcast.wav",
    chunk_duration=120.0,  # 2 minute chunks
    overlap_duration=15.0   # 15 second overlap
)
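To build intuition for how chunk_duration and overlap_duration interact, the sketch below lists the windows that one plausible scheme would produce, where each chunk starts `chunk_duration - overlap_duration` seconds after the previous one. This is an assumption for illustration only; the library's actual chunking and merging logic may differ, and `chunk_spans` is a hypothetical helper.

```python
def chunk_spans(
    total_s: float, chunk_s: float, overlap_s: float
) -> list[tuple[float, float]]:
    """List (start, end) windows in seconds, each overlapping the previous one."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap_duration must be smaller than chunk_duration")
    step = chunk_s - overlap_s  # how far each new chunk advances
    spans = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break  # the last window reached the end of the audio
        start += step
    return spans


# A 5-minute file with the settings from the example above.
for start, end in chunk_spans(300.0, chunk_s=120.0, overlap_s=15.0):
    print(f"{start:6.1f}s -> {end:6.1f}s")
```

Under this scheme a 5-minute file yields three windows, and every boundary is covered twice, which is why a larger overlap costs more computation.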
With custom decoding config:
from parakeet_mlx import DecodingConfig, Beam, SentenceConfig

config = DecodingConfig(
    decoding=Beam(beam_size=5, length_penalty=0.013),
    sentence=SentenceConfig(max_words=25, silence_gap=3.0)
)

result = model.transcribe("audio.wav", decoding_config=config)

transcribe_stream()

Create a streaming context for real-time transcription.
def transcribe_stream(
    self,
    context_size: tuple[int, int] = (256, 256),
    depth: int = 1,
    *,
    keep_original_attention: bool = False,
    decoding_config: DecodingConfig = DecodingConfig(),
) -> StreamingParakeet

Parameters

context_size
tuple[int, int]
default:"(256, 256)"
A pair (left_context, right_context) specifying attention context windows in encoder frames.
  • left_context: How many past frames to attend to
  • right_context: How many future frames to attend to (lookahead)
Larger contexts improve accuracy but increase latency and memory usage.
depth
int
default:"1"
Number of encoder layers that preserve exact computation across chunks.
  • depth=1 (default): Only first layer’s cache matches exactly
  • depth=2: First two layers match exactly
  • depth=N: All N layers match (full equivalence to non-streaming)
Higher depth increases accuracy but requires more memory for caching.
keep_original_attention
bool
default:"False"
Whether to preserve the original attention mechanism.
  • False (default): Switches to local attention for streaming
  • True: Keeps original attention (less suitable for streaming)
decoding_config
DecodingConfig
default:"DecodingConfig()"
Configuration for decoding behavior and sentence splitting.

Returns

streamer
StreamingParakeet
A context manager for streaming inference. Use with Python’s with statement.

Examples

Basic streaming:
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import load_audio

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

with model.transcribe_stream(context_size=(256, 256)) as stream:
    # Simulate real-time audio
    audio = load_audio("audio.wav", model.preprocessor_config.sample_rate)
    chunk_size = model.preprocessor_config.sample_rate  # 1 second chunks
    
    for i in range(0, len(audio), chunk_size):
        chunk = audio[i:i+chunk_size]
        stream.add_audio(chunk)
        
        # Get current transcription
        result = stream.result
        print(f"\rCurrent: {result.text}", end="")
    
    # Get final result
    final = stream.result
    print(f"\nFinal: {final.text}")
With custom depth and context:
with model.transcribe_stream(
    context_size=(512, 512),  # Larger context for better accuracy
    depth=3                    # Cache first 3 layers exactly
) as stream:
    # ... process audio ...
    pass
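Because context_size is expressed in encoder frames, multiplying by time_ratio shows how much audio each window spans, which gives a rough feel for the lookahead latency. The sketch below uses an assumed time_ratio of 0.08 seconds per frame; in real use the value would come from `model.time_ratio`. `context_seconds` is a hypothetical helper.

```python
def context_seconds(
    context_size: tuple[int, int], time_ratio: float
) -> tuple[float, float]:
    """Convert (left, right) context in encoder frames to seconds (hypothetical helper)."""
    left, right = context_size
    return left * time_ratio, right * time_ratio


time_ratio = 0.08  # assumed seconds per encoder frame, for illustration only

for size in [(256, 256), (512, 512)]:
    left_s, right_s = context_seconds(size, time_ratio)
    print(f"{size}: {left_s:.2f}s of past audio, {right_s:.2f}s of lookahead")
```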
Accessing finalized vs draft tokens:
with model.transcribe_stream() as stream:
    # audio_chunk: a slice of audio samples, as in the basic streaming example
    stream.add_audio(audio_chunk)
    
    # Finalized tokens won't change
    print("Finalized:", [t.text for t in stream.finalized_tokens])
    
    # Draft tokens may change with new audio
    print("Draft:", [t.text for t in stream.draft_tokens])
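One way to use this split in a live display is to show finalized text as-is and mark the draft portion as tentative. The sketch below runs on stand-in objects so it works without a model; `FakeToken` and `render_live` are hypothetical, and it assumes each token's `.text` already carries its own leading whitespace so that plain concatenation yields readable text.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FakeToken:
    """Stand-in for the library's token objects; only `.text` is assumed."""
    text: str


def render_live(finalized: Iterable[FakeToken], draft: Iterable[FakeToken]) -> str:
    """Join stable text and wrap the still-changing tail in brackets."""
    stable = "".join(t.text for t in finalized)
    tentative = "".join(t.text for t in draft)
    return f"{stable}[{tentative}]" if tentative else stable


print(render_live([FakeToken("Hello"), FakeToken(" wor")], [FakeToken("ld")]))
```

With a real stream, the same function would be called as `render_live(stream.finalized_tokens, stream.draft_tokens)` after each `add_audio` call.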

generate()

Generate transcription from mel-spectrogram input. This is the low-level interface used by transcribe().
def generate(
    self,
    mel: mx.array,
    *,
    decoding_config: DecodingConfig = DecodingConfig(),
) -> list[AlignedResult]

Parameters

mel
mx.array
required
Mel-spectrogram input with shape:
  • [batch, sequence, mel_dim] for batch processing, or
  • [sequence, mel_dim] for single input
Generate mel-spectrograms using:
from parakeet_mlx.audio import get_logmel, load_audio

audio = load_audio("audio.wav", model.preprocessor_config.sample_rate)
mel = get_logmel(audio, model.preprocessor_config)
decoding_config
DecodingConfig
default:"DecodingConfig()"
Configuration object controlling decoding behavior and sentence splitting.

Returns

results
list[AlignedResult]
List of transcription results with aligned tokens and sentences, one for each input in the batch.

Examples

Single input:
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import get_logmel, load_audio

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

# Load and preprocess audio
audio = load_audio("audio.wav", model.preprocessor_config.sample_rate)
mel = get_logmel(audio, model.preprocessor_config)

# Generate transcription
results = model.generate(mel)
print(results[0].text)
Batch processing:
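import mlx.core as mx
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import get_logmel, load_audio

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")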

# Process multiple files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]

# Load all audio
audios = [load_audio(f, model.preprocessor_config.sample_rate) for f in audio_files]

# Convert to mel-spectrograms
mels = [get_logmel(a, model.preprocessor_config) for a in audios]

# Find max length for padding
max_len = max(m.shape[0] for m in mels)

# Pad and stack
padded = []
for mel in mels:
    pad_len = max_len - mel.shape[0]
    if pad_len > 0:
        padding = mx.zeros((pad_len, mel.shape[1]), dtype=mel.dtype)
        mel = mx.concatenate([mel, padding], axis=0)
    padded.append(mel)

batch_mel = mx.stack(padded)

# Generate transcriptions for all files at once
results = model.generate(batch_mel)

for i, result in enumerate(results):
    print(f"{audio_files[i]}: {result.text}")
With custom decoding:
from parakeet_mlx import DecodingConfig, Beam

config = DecodingConfig(
    decoding=Beam(beam_size=5, length_penalty=0.013)
)

# mel: a mel-spectrogram prepared as in the single-input example
results = model.generate(mel, decoding_config=config)

Configuration Properties

These properties provide access to model configuration:
model.preprocessor_config  # PreprocessArgs - audio preprocessing settings
model.encoder_config       # ConformerArgs - encoder configuration
Useful for:
  • Getting sample rate: model.preprocessor_config.sample_rate
  • Getting hop length: model.preprocessor_config.hop_length
  • Getting subsampling factor: model.encoder_config.subsampling_factor
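These three values are enough to estimate how many frames a clip produces. The sketch below uses stand-in dataclasses with assumed values so it runs without a model; `FakePreprocessArgs`, `FakeConformerArgs`, and `estimate_frames` are hypothetical, and the estimate ignores padding and edge effects, so real counts may differ by a few frames.

```python
from dataclasses import dataclass


@dataclass
class FakePreprocessArgs:
    """Stand-in for model.preprocessor_config (values are assumptions)."""
    sample_rate: int = 16_000
    hop_length: int = 160


@dataclass
class FakeConformerArgs:
    """Stand-in for model.encoder_config (value is an assumption)."""
    subsampling_factor: int = 8


def estimate_frames(
    duration_s: float, pre: FakePreprocessArgs, enc: FakeConformerArgs
) -> tuple[int, int]:
    """Roughly estimate (mel_frames, encoder_frames) for a clip of the given length."""
    num_samples = int(duration_s * pre.sample_rate)
    mel_frames = num_samples // pre.hop_length            # one mel frame per hop
    encoder_frames = mel_frames // enc.subsampling_factor  # encoder downsamples in time
    return mel_frames, encoder_frames


mel_frames, encoder_frames = estimate_frames(10.0, FakePreprocessArgs(), FakeConformerArgs())
print(f"10 s of audio: ~{mel_frames} mel frames, ~{encoder_frames} encoder frames")
```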
