
Overview

BaseParakeet is the abstract base class that defines the common interface for all Parakeet model variants. It provides three core methods for transcription:
  • transcribe() - Transcribe audio files
  • transcribe_stream() - Real-time streaming transcription
  • generate() - Low-level transcription from mel-spectrogram input
All model variants (ParakeetTDT, ParakeetRNNT, ParakeetCTC, ParakeetTDTCTC) inherit from this class.

Class Definition

class BaseParakeet(nn.Module):
    def __init__(self, preprocess_args: PreprocessArgs, encoder_args: ConformerArgs):
        ...

Properties

time_ratio

@property
def time_ratio(self) -> float
The time ratio between encoder output frames and input audio samples. Used internally to convert frame indices to timestamps. Formula:
time_ratio = (subsampling_factor / sample_rate) * hop_length
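As a worked example of the formula, using illustrative values (not necessarily those of any particular checkpoint; read the real ones from model.preprocessor_config and model.encoder_config), the ratio works out to 80 ms per encoder frame, so a frame index converts to a timestamp by simple multiplication:

```python
# Illustrative values only; the actual values come from
# model.preprocessor_config and model.encoder_config.
sample_rate = 16_000        # audio samples per second
hop_length = 160            # samples between STFT frames
subsampling_factor = 8      # encoder downsampling factor

time_ratio = (subsampling_factor / sample_rate) * hop_length
print(time_ratio)           # 0.08 -> each encoder frame spans 80 ms

frame_index = 125
timestamp_s = frame_index * time_ratio
print(timestamp_s)          # 10.0 -> frame 125 starts at 10 seconds
```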

Methods

transcribe()

Transcribe an audio file with optional chunking for long files.
def transcribe(
    self,
    path: Path | str,
    *,
    dtype: mx.Dtype = mx.bfloat16,
    decoding_config: DecodingConfig = DecodingConfig(),
    chunk_duration: Optional[float] = None,
    overlap_duration: float = 15.0,
    chunk_callback: Optional[Callable] = None,
) -> AlignedResult

Parameters

path
Path | str
required
Path to the audio file. Supports WAV, MP3, FLAC, and other formats supported by audiofile.
dtype
mx.Dtype
default:"mx.bfloat16"
Data type for audio processing. Should match the model’s dtype.
decoding_config
DecodingConfig
default:"DecodingConfig()"
Configuration for decoding behavior and sentence splitting. See DecodingConfig.
chunk_duration
float | None
default:"None"
If provided, splits audio into chunks of this duration (in seconds). When None, processes the entire file at once. Use chunking for:
  • Very long audio files (> 5 minutes)
  • Memory-constrained environments
  • Processing audio that exceeds available RAM
overlap_duration
float
default:"15.0"
Overlap between consecutive chunks in seconds. Only used when chunk_duration is specified. Higher overlap improves accuracy at chunk boundaries but increases computation time.
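The boundary arithmetic can be sketched as follows. This is an illustrative model of how overlapping windows tile a file, not the library's actual implementation:

```python
def chunk_windows(total_s: float, chunk_s: float, overlap_s: float):
    """Yield (start, end) windows in seconds; consecutive windows
    overlap by overlap_s so words at a boundary appear in both chunks."""
    step = chunk_s - overlap_s
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        yield (start, end)
        if end >= total_s:
            break
        start += step

# A 5-minute file with 2-minute chunks and 15 s overlap:
print(list(chunk_windows(300.0, 120.0, 15.0)))
# [(0.0, 120.0), (105.0, 225.0), (210.0, 300.0)]
```

Each window starts chunk_duration - overlap_duration after the previous one, which is why larger overlaps mean more chunks and more computation.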
chunk_callback
Callable | None
default:"None"
Callback function called after processing each chunk. Receives (current_position, total_length) in samples. Useful for progress tracking:
def progress(current, total):
    percent = (current / total) * 100
    print(f"Progress: {percent:.1f}%")

result = model.transcribe("audio.wav", chunk_callback=progress)

Returns

result
AlignedResult
Transcription result with aligned tokens and sentences. See AlignedResult.

Examples

Basic transcription:
from parakeet_mlx import from_pretrained

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("interview.wav")

print(result.text)
for sentence in result.sentences:
    print(f"[{sentence.start:.2f}s - {sentence.end:.2f}s] {sentence.text}")
With chunking for long audio:
result = model.transcribe(
    "long_podcast.wav",
    chunk_duration=120.0,  # 2 minute chunks
    overlap_duration=15.0   # 15 second overlap
)
With custom decoding config:
from parakeet_mlx import DecodingConfig, Beam, SentenceConfig

config = DecodingConfig(
    decoding=Beam(beam_size=5, length_penalty=0.013),
    sentence=SentenceConfig(max_words=25, silence_gap=3.0)
)

result = model.transcribe("audio.wav", decoding_config=config)

transcribe_stream()

Create a streaming context for real-time transcription.
def transcribe_stream(
    self,
    context_size: tuple[int, int] = (256, 256),
    depth: int = 1,
    *,
    keep_original_attention: bool = False,
    decoding_config: DecodingConfig = DecodingConfig(),
) -> StreamingParakeet

Parameters

context_size
tuple[int, int]
default:"(256, 256)"
A pair (left_context, right_context) specifying attention context windows in encoder frames.
  • left_context: How many past frames to attend to
  • right_context: How many future frames to attend to (lookahead)
Larger contexts improve accuracy but increase latency and memory usage.
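To put context sizes in units of audio time, multiply by the model's time_ratio. A rough unit-conversion sketch, assuming an 80 ms frame duration (illustrative; the real value depends on the checkpoint's preprocessing and subsampling configuration):

```python
# Rough lookahead estimate: right-context frames -> seconds of audio.
# frame_duration_s is an assumed value; in practice use model.time_ratio.
frame_duration_s = 0.08
left_context, right_context = 256, 256   # the default context_size

lookahead_s = right_context * frame_duration_s
print(lookahead_s)   # 20.48 seconds of future audio the window can attend to
```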
depth
int
default:"1"
Number of encoder layers that preserve exact computation across chunks.
  • depth=1 (default): Only the first layer’s cache matches exactly
  • depth=2: First two layers match exactly
  • depth=N: All N layers match (full equivalence to non-streaming)
Higher depth increases accuracy but requires more memory for caching.
keep_original_attention
bool
default:"False"
Whether to preserve the original attention mechanism.
  • False (default): Switches to local attention for streaming
  • True: Keeps original attention (less suitable for streaming)
decoding_config
DecodingConfig
default:"DecodingConfig()"
Configuration for decoding behavior and sentence splitting.

Returns

streamer
StreamingParakeet
A context manager for streaming inference. Use with Python’s with statement.

Examples

Basic streaming:
import mlx.core as mx
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import load_audio

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

with model.transcribe_stream(context_size=(256, 256)) as stream:
    # Simulate real-time audio
    audio = load_audio("audio.wav", model.preprocessor_config.sample_rate)
    chunk_size = model.preprocessor_config.sample_rate  # 1 second chunks
    
    for i in range(0, len(audio), chunk_size):
        chunk = audio[i:i+chunk_size]
        stream.add_audio(chunk)
        
        # Get current transcription
        result = stream.result
        print(f"\rCurrent: {result.text}", end="")
    
    # Get final result
    final = stream.result
    print(f"\nFinal: {final.text}")
With custom depth and context:
with model.transcribe_stream(
    context_size=(512, 512),  # Larger context for better accuracy
    depth=3                    # Cache first 3 layers exactly
) as stream:
    # ... process audio ...
    pass
Accessing finalized vs draft tokens:
with model.transcribe_stream() as stream:
    stream.add_audio(audio_chunk)
    
    # Finalized tokens won't change
    print("Finalized:", [t.text for t in stream.finalized_tokens])
    
    # Draft tokens may change with new audio
    print("Draft:", [t.text for t in stream.draft_tokens])

generate()

Generate transcription from mel-spectrogram input. This is the low-level interface used by transcribe().
def generate(
    self,
    mel: mx.array,
    *,
    decoding_config: DecodingConfig = DecodingConfig(),
) -> list[AlignedResult]

Parameters

mel
mx.array
required
Mel-spectrogram input with shape:
  • [batch, sequence, mel_dim] for batch processing, or
  • [sequence, mel_dim] for single input
Generate mel-spectrograms using:
from parakeet_mlx.audio import get_logmel, load_audio

audio = load_audio("audio.wav", model.preprocessor_config.sample_rate)
mel = get_logmel(audio, model.preprocessor_config)
decoding_config
DecodingConfig
default:"DecodingConfig()"
Configuration object controlling decoding behavior and sentence splitting.

Returns

results
list[AlignedResult]
List of transcription results with aligned tokens and sentences, one for each input in the batch.

Examples

Single input:
import mlx.core as mx
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import get_logmel, load_audio

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

# Load and preprocess audio
audio = load_audio("audio.wav", model.preprocessor_config.sample_rate)
mel = get_logmel(audio, model.preprocessor_config)

# Generate transcription
results = model.generate(mel)
print(results[0].text)
Batch processing:
import mlx.core as mx
from parakeet_mlx.audio import get_logmel, load_audio

# Process multiple files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]

# Load all audio
audios = [load_audio(f, model.preprocessor_config.sample_rate) for f in audio_files]

# Convert to mel-spectrograms
mels = [get_logmel(a, model.preprocessor_config) for a in audios]

# Find max length for padding
max_len = max(m.shape[0] for m in mels)

# Pad and stack
padded = []
for mel in mels:
    pad_len = max_len - mel.shape[0]
    if pad_len > 0:
        padding = mx.zeros((pad_len, mel.shape[1]), dtype=mel.dtype)
        mel = mx.concatenate([mel, padding], axis=0)
    padded.append(mel)

batch_mel = mx.stack(padded)

# Generate transcriptions for all files at once
results = model.generate(batch_mel)

for i, result in enumerate(results):
    print(f"{audio_files[i]}: {result.text}")
With custom decoding:
from parakeet_mlx import DecodingConfig, Beam

config = DecodingConfig(
    decoding=Beam(beam_size=5, length_penalty=0.013)
)

results = model.generate(mel, decoding_config=config)

Configuration Properties

These properties provide access to model configuration:
model.preprocessor_config  # PreprocessArgs - audio preprocessing settings
model.encoder_config       # ConformerArgs - encoder configuration
Useful for:
  • Getting sample rate: model.preprocessor_config.sample_rate
  • Getting hop length: model.preprocessor_config.hop_length
  • Getting subsampling factor: model.encoder_config.subsampling_factor
