Overview

Parakeet MLX implements four distinct ASR model architectures, each optimized for different use cases. All models share a common Conformer encoder but differ in their decoding strategies and output layers.

TDT

Token-and-Duration Transducer - Predicts both tokens and durations

RNNT

RNN Transducer - Classic transducer with frame-by-frame decoding

CTC

Connectionist Temporal Classification - Non-autoregressive decoding

TDT-CTC

Hybrid model with both TDT and auxiliary CTC decoders

Shared Architecture

All Parakeet models inherit from BaseParakeet and share these common components:

Conformer Encoder

The encoder transforms mel-spectrogram input into high-level acoustic features using the Conformer architecture:
# From conformer.py:332-366 (structural sketch)
class Conformer(nn.Module):
    # - Subsampling layer (downsampling by a factor of 4-8)
    # - Positional encoding (relative or local)
    # - N Conformer blocks (typically 17-24 layers)
Each ConformerBlock combines:
  • Feed-forward module (with SiLU activation)
  • Multi-head self-attention (with relative positional encoding)
  • Convolution module (depthwise separable convolution)
  • Layer normalization and residual connections
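
The macaron-style ordering of these modules can be sketched as a plain function. This is a toy illustration, not the library's implementation: the module arguments are stand-in callables rather than real layers, and the half-step feed-forward residuals follow the original Conformer design:

```python
def conformer_block(x, ff1, attn, conv, ff2, norm):
    # Macaron structure: half-step FFN, attention, convolution, half-step FFN
    x = x + 0.5 * ff1(x)   # first feed-forward module (half-step residual)
    x = x + attn(x)        # multi-head self-attention (residual)
    x = x + conv(x)        # depthwise separable convolution (residual)
    x = x + 0.5 * ff2(x)   # second feed-forward module (half-step residual)
    return norm(x)         # final layer normalization
```

Each sub-module contributes through a residual connection, so the block degrades gracefully if any one module outputs values near zero.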

Audio Preprocessing

Before encoding, audio undergoes:
  1. Resampling to 16kHz
  2. Log-mel spectrogram extraction (80 mel bins)
  3. Normalization
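
The extraction step can be sketched in plain numpy. This is a minimal illustration of log-mel computation in general; parameter values such as n_fft=512 and hop_length=160 are assumptions for the sketch, not the library's exact settings:

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(sample_rate, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(audio, sample_rate=16000, n_fft=512,
                        hop_length=160, n_mels=80):
    # Windowed STFT -> power spectrum -> mel projection -> log compression
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop_length
    frames = np.stack([audio[i * hop_length:i * hop_length + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sample_rate, n_fft, n_mels).T
    return np.log(mel + 1e-10)  # [frames, n_mels]
```

The result is a [frames, 80] matrix that the Conformer encoder consumes after subsampling.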

Parakeet TDT

Token-and-Duration Transducer extends RNN-T by jointly predicting tokens and their durations, enabling more accurate timestamp alignment.

Architecture

1. Encoder

Conformer encoder produces acoustic features
features, lengths = self.encoder(mel)  # [B, S, d_model]
2. Prediction Network

LSTM-based network maintains language model state
# From rnnt.py:88-116 (structural sketch)
class PredictNetwork:
    # - Embedding layer (vocab_size -> pred_hidden)
    # - Multi-layer LSTM (maintains hidden state)
3. Joint Network

Combines encoder and predictor outputs
# From rnnt.py:119-155
joint_out = self.joint(enc_features, pred_features)
# Output: [batch, time, 1, vocab_size + 1 + num_durations]
The output has vocab_size + 1 token logits (including blank), plus additional logits for the duration predictions.
4. Duration Modeling

TDT predicts discrete duration values
# From parakeet.py:276-277
self.durations = args.decoding.durations  # e.g., [0, 1, 2, 3, 4]
  • Duration 0: emit the token without advancing time
  • Duration N (1-4): emit the token and advance by N frames
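
As a toy illustration (plain Python, independent of the actual model), here is how a scripted stream of (token, duration) predictions maps to frame positions and timestamps. The blank token is represented as -1 and time_ratio stands for an assumed seconds-per-encoder-frame constant:

```python
def tdt_advance(predictions, time_ratio=0.08):
    """Toy walk-through of TDT time advancement.

    `predictions` is a scripted list of (token_id, duration) pairs standing
    in for the joint network's argmax output; blank is token id -1.
    """
    step = 0
    emitted = []
    for token, duration in predictions:
        if token != -1:  # non-blank: emit with the current frame as start
            emitted.append((token, step * time_ratio))
        step += duration  # duration 0 keeps the time index in place
    return emitted, step

# Two tokens at frame 0 (durations 0 and 2), a blank that skips 3 frames,
# then a token at frame 5:
emitted, final_step = tdt_advance([(7, 0), (9, 2), (-1, 3), (4, 1)])
```

Note how duration 0 lets two tokens share the same start time, while a blank with a large duration skips silence in a single step, which is why TDT decoding is typically faster than frame-by-frame RNNT.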

Decoding Process

TDT supports both greedy and beam search decoding:
while step < length:
    # 1. Run prediction network (state is only committed on emission)
    decoder_out, new_hidden_state = self.decoder(last_token, hidden_state)
    
    # 2. Compute joint network output
    joint_out = self.joint(features[:, step:step+1], decoder_out)
    
    # 3. Pick the most likely token and duration
    token = argmax(joint_out[..., :vocab_size+1])
    duration = self.durations[argmax(joint_out[..., vocab_size+1:])]
    
    # 4. Emit non-blank tokens
    if token != blank:
        hypothesis.append(AlignedToken(
            id=token,
            start=step * time_ratio,
            duration=duration * time_ratio,
            confidence=confidence_score
        ))
        last_token = token
        hidden_state = new_hidden_state
    
    # 5. Advance time
    step += duration
Stuck Prevention: If duration == 0 for max_symbols consecutive steps, force step += 1.
Beam search explores multiple hypotheses in parallel:
# Key parameters
beam_size: int = 5           # Top-K hypotheses to maintain
length_penalty: float = 1.0  # Normalize by length^penalty
patience: float = 1.0        # Keep patience*beam_size candidates
duration_reward: float = 0.7 # Weight for duration vs token logprobs
Each hypothesis tracks:
  • score: Combined log probability
  • step: Current time position
  • last_token: Previous emitted token
  • hidden_state: Decoder LSTM state
  • hypothesis: List of AlignedToken
Scoring combines token and duration predictions:
score = (
    token_logprob * (1 - duration_reward) +
    duration_logprob * duration_reward
)
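
Putting the two knobs together, here is a hedged sketch of how one hypothesis's normalized score could be computed; the exact normalization in the library may differ:

```python
def hypothesis_score(token_logprobs, duration_logprobs,
                     duration_reward=0.7, length_penalty=1.0):
    """Illustrative beam-search scoring: blend token and duration
    log-probabilities per step, then length-normalize the sum."""
    steps = [t * (1 - duration_reward) + d * duration_reward
             for t, d in zip(token_logprobs, duration_logprobs)]
    # Length normalization keeps long hypotheses competitive with short ones
    return sum(steps) / (len(steps) ** length_penalty)
```

With length_penalty=1.0 this is the mean per-step score; setting it to 0 disables normalization entirely.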

Configuration

from parakeet_mlx import ParakeetTDTArgs, DecodingConfig, Beam

args = ParakeetTDTArgs(
    preprocessor=PreprocessArgs(...),
    encoder=ConformerArgs(...),
    decoder=PredictArgs(...),
    joint=JointArgs(...),
    decoding=TDTDecodingArgs(
        model_type="tdt",
        durations=[0, 1, 2, 3, 4],
        greedy={"max_symbols": 10}
    )
)

Parakeet RNNT

RNN Transducer uses the classic transducer formulation where each token has an implicit duration of 1 frame.

Key Differences from TDT

No Duration Prediction

Joint network outputs only vocab_size + 1 logits (no duration head)

Fixed Duration

Every emitted token is assigned a fixed duration of one encoder frame (1 * time_ratio seconds)

Greedy Only

Currently only supports greedy decoding (beam search not implemented)

Simpler Decoding

Blank advances time by 1, non-blank emits token with duration=1

Decoding Logic

# From parakeet.py:655-743
while step < length:
    decoder_out, new_hidden_state = self.decoder(last_token, hidden_state)
    joint_out = self.joint(features[:, step:step+1], decoder_out)
    
    pred_token = argmax(joint_out[0, 0])  # No duration dimension
    
    if pred_token != blank:
        hypothesis.append(AlignedToken(
            id=pred_token,
            start=step * time_ratio,
            duration=1 * time_ratio,  # Fixed duration of one frame
            confidence=confidence
        ))
        last_token = pred_token
        hidden_state = new_hidden_state
        # Step is not advanced, so multiple tokens can be emitted at one frame
    else:
        step += 1  # Blank advances time
RNNT can get stuck emitting tokens without advancing. The max_symbols parameter forces step += 1 after too many consecutive non-blank tokens.
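
The loop and its stuck-prevention guard can be exercised with a toy simulation. Here the scripted `predictions` list stands in for successive argmax outputs of the joint network; this is an illustration of the control flow, not the library's code:

```python
from collections import deque

def rnnt_greedy(predictions, length, blank=0, max_symbols=10):
    """Toy greedy RNNT loop over a scripted stream of joint-network outputs.

    Returns (token_id, frame_index) pairs. `max_symbols` caps how many
    non-blank tokens may be emitted at one frame before time is forced on.
    """
    preds = deque(predictions)
    step, emitted_at_step = 0, 0
    hypothesis = []
    while step < length and preds:
        token = preds.popleft()
        if token != blank:
            hypothesis.append((token, step))
            emitted_at_step += 1
            if emitted_at_step >= max_symbols:  # stuck prevention
                step += 1
                emitted_at_step = 0
        else:
            step += 1  # blank advances time
            emitted_at_step = 0
    return hypothesis
```

For example, the stream [5, blank, 6, 7, blank] emits token 5 at frame 0 and tokens 6 and 7 at frame 1, because non-blank emissions do not advance the time index.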

Configuration

args = ParakeetRNNTArgs(
    preprocessor=PreprocessArgs(...),
    encoder=ConformerArgs(...),
    decoder=PredictArgs(...),
    joint=JointArgs(...),
    decoding=RNNTDecodingArgs(
        greedy={"max_symbols": 10}
    )
)

Parakeet CTC

Connectionist Temporal Classification uses a simpler, non-autoregressive approach without a prediction network.

Architecture

1. Encoder

Same Conformer encoder as other models
2. CTC Decoder

Simple 1D convolution projects encoder features to vocabulary
# From ctc.py:19-34
class ConvASRDecoder:
    def __init__(self, args):
        self.decoder_layers = [
            nn.Conv1d(feat_in, num_classes, kernel_size=1, bias=True)
        ]
    
    def __call__(self, x):
        return nn.log_softmax(
            self.decoder_layers[0](x) / self.temperature
        )
3. CTC Decoding

Collapse repeated tokens and remove blanks

Decoding Process

# From parakeet.py:774-889
logits = self.decoder(features)  # [B, S, vocab_size+1]
predictions = argmax(logits, axis=-1)  # Greedy per-frame prediction over the vocabulary

# Collapse repetitions and remove blanks
hypothesis = []
prev_token = -1
for t, token_id in enumerate(predictions):
    if token_id == blank:  # Blank separates repeated tokens
        prev_token = -1
        continue
    if token_id == prev_token:  # Skip repeated frames of the same token
        continue
    
    # Token boundary detected; token_start marks the first frame of the
    # run, and the duration spans the frames until the token changes
    hypothesis.append(AlignedToken(
        id=token_id,
        start=token_start * time_ratio,
        duration=(t - token_start) * time_ratio,
        confidence=avg_confidence_over_frames
    ))
    prev_token = token_id
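
A self-contained way to express the same collapse, tracking runs of identical frame predictions explicitly (a toy sketch with blank assumed to be id 0 and time_ratio an assumed seconds-per-frame constant):

```python
def ctc_collapse(frame_ids, blank=0, time_ratio=0.08):
    """Collapse per-frame argmax ids: merge repeats, drop blanks, and derive
    each token's (id, start, duration) from the run of identical frames."""
    tokens = []
    run_start, run_id = 0, None
    for t, tid in enumerate(frame_ids + [blank]):  # sentinel flushes last run
        if tid != run_id:
            if run_id is not None and run_id != blank:
                tokens.append((run_id, run_start * time_ratio,
                               (t - run_start) * time_ratio))
            run_start, run_id = t, tid
    return tokens
```

Because a blank ends the current run, repeated tokens separated by a blank (e.g. "ll" in "hello") survive as two distinct tokens, while repeats without an intervening blank are merged.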

Advantages & Limitations

Advantages:
  • Fast: no autoregressive decoding loop
  • Simple: no hidden state management
  • Parallelizable: all frame predictions are independent
  • Memory efficient: smaller model (no prediction network)

Limitations:
  • No internal language model, which typically costs accuracy (see WER in the comparison below)
  • Timestamps are inferred from frame runs rather than modeled explicitly

Configuration

args = ParakeetCTCArgs(
    preprocessor=PreprocessArgs(...),
    encoder=ConformerArgs(...),
    decoder=ConvASRDecoderArgs(
        feat_in=512,
        num_classes=1024,  # Set to -1 to use len(vocabulary)
        vocabulary=vocabulary
    ),
    decoding=CTCDecodingArgs(
        greedy={}  # CTC only supports greedy
    )
)

Parakeet TDT-CTC

Hybrid architecture that combines TDT with an auxiliary CTC decoder for multi-task learning.

Architecture

# From parakeet.py:909-917
class ParakeetTDTCTC(ParakeetTDT):
    """Has ConvASRDecoder in .ctc_decoder but .generate uses TDT decoder"""
    
    def __init__(self, args: ParakeetTDTCTCArgs):
        super().__init__(args)  # Initialize TDT components
        self.ctc_decoder = ConvASRDecoder(args.aux_ctc.decoder)
During training, both decoders are used (multi-task learning). During inference, only the TDT decoder is used by default via .generate().
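
In multi-task setups like this, the two objectives are typically combined as a weighted sum. The sketch below illustrates that pattern only; the weighting scheme and the value ctc_weight=0.3 are assumptions, not the library's actual training configuration:

```python
def hybrid_loss(transducer_loss, ctc_loss, ctc_weight=0.3):
    """Illustrative multi-task objective: the auxiliary CTC term
    regularizes the shared encoder alongside the TDT transducer loss.
    (ctc_weight=0.3 is an assumed value for the sketch.)"""
    return (1 - ctc_weight) * transducer_loss + ctc_weight * ctc_loss
```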

Configuration

args = ParakeetTDTCTCArgs(
    # All TDT parameters
    preprocessor=PreprocessArgs(...),
    encoder=ConformerArgs(...),
    decoder=PredictArgs(...),
    joint=JointArgs(...),
    decoding=TDTDecodingArgs(...),
    # Additional CTC decoder
    aux_ctc=AuxCTCArgs(
        decoder=ConvASRDecoderArgs(...)
    )
)

Model Comparison

| Feature | TDT | RNNT | CTC | TDT-CTC |
| --- | --- | --- | --- | --- |
| Decoding | Autoregressive | Autoregressive | Non-autoregressive | Autoregressive (inference) |
| Beam Search | ✅ Yes | ❌ No | ❌ No | ✅ Yes |
| Duration Modeling | ✅ Explicit | ❌ Fixed (1 frame) | ⚠️ Inferred | ✅ Explicit |
| Timestamp Accuracy | ⭐⭐⭐ High | ⭐⭐ Medium | ⭐⭐ Medium | ⭐⭐⭐ High |
| Speed | ⭐⭐ Medium | ⭐⭐ Medium | ⭐⭐⭐ Fast | ⭐⭐ Medium |
| WER | ⭐⭐⭐ Best | ⭐⭐ Good | ⭐ Fair | ⭐⭐⭐ Best |
| Language Model | ✅ Internal | ✅ Internal | ❌ None | ✅ Internal |

Choosing a Model

Use TDT when...

  • You need accurate word-level timestamps
  • Quality is more important than speed
  • You have sufficient compute for beam search
  • Recommended: mlx-community/parakeet-tdt-0.6b-v3

Use RNNT when...

  • You need balanced accuracy and speed
  • Greedy decoding is sufficient
  • Model size is a concern
  • Recommended: mlx-community/parakeet-rnnt-1.1b

Use CTC when...

  • Speed is the top priority
  • You’re processing long audio files
  • Approximate timestamps are acceptable
  • Recommended: mlx-community/parakeet-ctc-1.1b

Use TDT-CTC when...

  • You want the best of both worlds
  • Training with multi-task learning
  • Maximum accuracy is required
  • Note: Uses TDT decoder for inference

Next Steps

Decoding Strategies

Learn about greedy vs beam search decoding

Timestamp System

Understand how timestamps are computed
