
Overview

BaseParakeet is the abstract base class that defines the common interface for all Parakeet model variants. It provides three core methods for transcription:
  • transcribe() - Transcribe audio files
  • transcribe_stream() - Real-time streaming transcription
  • generate() - Low-level transcription from mel-spectrogram input
All model variants (ParakeetTDT, ParakeetRNNT, ParakeetCTC, ParakeetTDTCTC) inherit from this class.

Class Definition

class BaseParakeet(nn.Module):
    def __init__(self, preprocess_args: PreprocessArgs, encoder_args: ConformerArgs):
        ...

Properties

time_ratio

@property
def time_ratio(self) -> float
The time ratio between encoder output frames and input audio samples. Used internally to convert frame indices to timestamps. Formula:
time_ratio = (subsampling_factor / sample_rate) * hop_length
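As a worked example of the formula, using illustrative values (not necessarily those of any particular checkpoint; read the real ones from model.preprocessor_config and model.encoder_config), the ratio works out to 80 ms per encoder frame, so a frame index converts to a timestamp by simple multiplication:

```python
# Illustrative values only; the actual values come from
# model.preprocessor_config and model.encoder_config.
sample_rate = 16_000        # audio samples per second
hop_length = 160            # samples between STFT frames
subsampling_factor = 8      # encoder downsampling factor

time_ratio = (subsampling_factor / sample_rate) * hop_length
print(time_ratio)           # 0.08 -> each encoder frame spans 80 ms

frame_index = 125
timestamp_s = frame_index * time_ratio
print(timestamp_s)          # 10.0 -> frame 125 starts at 10 seconds
```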

Methods

transcribe()

Transcribe an audio file with optional chunking for long files.
def transcribe(
    self,
    path: Path | str,
    *,
    dtype: mx.Dtype = mx.bfloat16,
    decoding_config: DecodingConfig = DecodingConfig(),
    chunk_duration: Optional[float] = None,
    overlap_duration: float = 15.0,
    chunk_callback: Optional[Callable] = None,
) -> AlignedResult

Parameters

path
Path | str
required
Path to the audio file. Supports WAV, MP3, FLAC, and other formats supported by audiofile.
dtype
mx.Dtype
default:"mx.bfloat16"
Data type for audio processing. Should match the model’s dtype.
decoding_config
DecodingConfig
default:"DecodingConfig()"
Configuration for decoding behavior and sentence splitting. See DecodingConfig.
chunk_duration
float | None
default:"None"
If provided, splits audio into chunks of this duration (in seconds). When None, processes the entire file at once. Use chunking for:
  • Very long audio files (> 5 minutes)
  • Memory-constrained environments
  • Processing audio that exceeds available RAM
overlap_duration
float
default:"15.0"
Overlap between consecutive chunks in seconds. Only used when chunk_duration is specified. Higher overlap improves accuracy at chunk boundaries but increases computation time.
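The boundary arithmetic can be sketched as follows. This is an illustrative model of how overlapping windows tile a file, not the library's actual implementation:

```python
def chunk_windows(total_s: float, chunk_s: float, overlap_s: float):
    """Yield (start, end) windows in seconds; consecutive windows
    overlap by overlap_s so words at a boundary appear in both chunks."""
    step = chunk_s - overlap_s
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        yield (start, end)
        if end >= total_s:
            break
        start += step

# A 5-minute file with 2-minute chunks and 15 s overlap:
print(list(chunk_windows(300.0, 120.0, 15.0)))
# [(0.0, 120.0), (105.0, 225.0), (210.0, 300.0)]
```

Each window starts chunk_duration - overlap_duration after the previous one, which is why larger overlaps mean more chunks and more computation.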
chunk_callback
Callable | None
default:"None"
Callback function called after processing each chunk. Receives (current_position, total_length) in samples. Useful for progress tracking:
def progress(current, total):
    percent = (current / total) * 100
    print(f"Progress: {percent:.1f}%")

result = model.transcribe("audio.wav", chunk_callback=progress)

Returns

result
AlignedResult
Transcription result with aligned tokens and sentences. See AlignedResult.

Examples

Basic transcription:
from parakeet_mlx import from_pretrained

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe("interview.wav")

print(result.text)
for sentence in result.sentences:
    print(f"[{sentence.start:.2f}s - {sentence.end:.2f}s] {sentence.text}")
With chunking for long audio:
result = model.transcribe(
    "long_podcast.wav",
    chunk_duration=120.0,  # 2 minute chunks
    overlap_duration=15.0   # 15 second overlap
)
With custom decoding config:
from parakeet_mlx import DecodingConfig, Beam, SentenceConfig

config = DecodingConfig(
    decoding=Beam(beam_size=5, length_penalty=0.013),
    sentence=SentenceConfig(max_words=25, silence_gap=3.0)
)

result = model.transcribe("audio.wav", decoding_config=config)

transcribe_stream()

Create a streaming context for real-time transcription.
def transcribe_stream(
    self,
    context_size: tuple[int, int] = (256, 256),
    depth: int = 1,
    *,
    keep_original_attention: bool = False,
    decoding_config: DecodingConfig = DecodingConfig(),
) -> StreamingParakeet

Parameters

context_size
tuple[int, int]
default:"(256, 256)"
A pair (left_context, right_context) specifying attention context windows in encoder frames.
  • left_context: How many past frames to attend to
  • right_context: How many future frames to attend to (lookahead)
Larger contexts improve accuracy but increase latency and memory usage.
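To put context sizes in units of audio time, multiply by the model's time_ratio. A rough unit-conversion sketch, assuming an 80 ms frame duration (illustrative; the real value depends on the checkpoint's preprocessing and subsampling configuration):

```python
# Rough lookahead estimate: right-context frames -> seconds of audio.
# frame_duration_s is an assumed value; in practice use model.time_ratio.
frame_duration_s = 0.08
left_context, right_context = 256, 256   # the default context_size

lookahead_s = right_context * frame_duration_s
print(lookahead_s)   # 20.48 seconds of future audio the window can attend to
```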
depth
int
default:"1"
Number of encoder layers that preserve exact computation across chunks.
  • depth=1 (default): Only the first layer’s cache matches exactly
  • depth=2: First two layers match exactly
  • depth=N: All N layers match (full equivalence to non-streaming)
Higher depth increases accuracy but requires more memory for caching.
keep_original_attention
bool
default:"False"
Whether to preserve the original attention mechanism.
  • False (default): Switches to local attention for streaming
  • True: Keeps original attention (less suitable for streaming)
decoding_config
DecodingConfig
default:"DecodingConfig()"
Configuration for decoding behavior and sentence splitting.

Returns

streamer
StreamingParakeet
A context manager for streaming inference. Use with Python’s with statement.

Examples

Basic streaming:
import mlx.core as mx
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import load_audio

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

with model.transcribe_stream(context_size=(256, 256)) as stream:
    # Simulate real-time audio
    audio = load_audio("audio.wav", model.preprocessor_config.sample_rate)
    chunk_size = model.preprocessor_config.sample_rate  # 1 second chunks
    
    for i in range(0, len(audio), chunk_size):
        chunk = audio[i:i+chunk_size]
        stream.add_audio(chunk)
        
        # Get current transcription
        result = stream.result
        print(f"\rCurrent: {result.text}", end="")
    
    # Get final result
    final = stream.result
    print(f"\nFinal: {final.text}")
With custom depth and context:
with model.transcribe_stream(
    context_size=(512, 512),  # Larger context for better accuracy
    depth=3                    # Cache first 3 layers exactly
) as stream:
    # ... process audio ...
    pass
Accessing finalized vs draft tokens:
with model.transcribe_stream() as stream:
    stream.add_audio(audio_chunk)
    
    # Finalized tokens won't change
    print("Finalized:", [t.text for t in stream.finalized_tokens])
    
    # Draft tokens may change with new audio
    print("Draft:", [t.text for t in stream.draft_tokens])

generate()

Generate transcription from mel-spectrogram input. This is the low-level interface used by transcribe().
def generate(
    self,
    mel: mx.array,
    *,
    decoding_config: DecodingConfig = DecodingConfig(),
) -> list[AlignedResult]

Parameters

mel
mx.array
required
Mel-spectrogram input with shape:
  • [batch, sequence, mel_dim] for batch processing, or
  • [sequence, mel_dim] for single input
Generate mel-spectrograms using:
from parakeet_mlx.audio import get_logmel, load_audio

audio = load_audio("audio.wav", model.preprocessor_config.sample_rate)
mel = get_logmel(audio, model.preprocessor_config)
decoding_config
DecodingConfig
default:"DecodingConfig()"
Configuration object controlling decoding behavior and sentence splitting.

Returns

results
list[AlignedResult]
List of transcription results with aligned tokens and sentences, one for each input in the batch.

Examples

Single input:
import mlx.core as mx
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import get_logmel, load_audio

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

# Load and preprocess audio
audio = load_audio("audio.wav", model.preprocessor_config.sample_rate)
mel = get_logmel(audio, model.preprocessor_config)

# Generate transcription
results = model.generate(mel)
print(results[0].text)
Batch processing:
import mlx.core as mx
from parakeet_mlx.audio import get_logmel, load_audio

# Process multiple files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]

# Load all audio
audios = [load_audio(f, model.preprocessor_config.sample_rate) for f in audio_files]

# Convert to mel-spectrograms
mels = [get_logmel(a, model.preprocessor_config) for a in audios]

# Find max length for padding
max_len = max(m.shape[0] for m in mels)

# Pad and stack
padded = []
for mel in mels:
    pad_len = max_len - mel.shape[0]
    if pad_len > 0:
        padding = mx.zeros((pad_len, mel.shape[1]), dtype=mel.dtype)
        mel = mx.concatenate([mel, padding], axis=0)
    padded.append(mel)

batch_mel = mx.stack(padded)

# Generate transcriptions for all files at once
results = model.generate(batch_mel)

for i, result in enumerate(results):
    print(f"{audio_files[i]}: {result.text}")
With custom decoding:
from parakeet_mlx import DecodingConfig, Beam

config = DecodingConfig(
    decoding=Beam(beam_size=5, length_penalty=0.013)
)

results = model.generate(mel, decoding_config=config)

Configuration Properties

These properties provide access to model configuration:
model.preprocessor_config  # PreprocessArgs - audio preprocessing settings
model.encoder_config       # ConformerArgs - encoder configuration
Useful for:
  • Getting sample rate: model.preprocessor_config.sample_rate
  • Getting hop length: model.preprocessor_config.hop_length
  • Getting subsampling factor: model.encoder_config.subsampling_factor
