Overview
ParakeetTDT implements the Token-and-Duration Transducer architecture, which jointly predicts both tokens and their durations. This is the recommended model variant for most use cases.
Key features:
- Simultaneous token and duration prediction
- Supports both greedy and beam search decoding
- Best accuracy among Parakeet variants
- Suitable for both real-time and offline transcription
Class Definition
Inherited Methods
ParakeetTDT inherits all methods from BaseParakeet:
- transcribe() - Transcribe audio files
- transcribe_stream() - Real-time streaming transcription
- generate() - Low-level mel-spectrogram to text
TDT-Specific Methods
decode()
Low-level decoding method that converts encoder features to aligned tokens.
Parameters
Encoder output features with shape [batch, sequence, feature_dim], typically obtained from the model's encoder.
Valid length of each sequence in the batch. Shape: [batch]. If None, every sequence is assumed to have the full length features.shape[1].
Last predicted token ID for each batch item. Used for stateful decoding in streaming scenarios.
- Pass None for each batch item to start fresh
- Pass token IDs from the previous decode() call for continuity
Hidden state (LSTM hidden and cell) for each batch item. Used for stateful decoding.
- Pass None for each batch item to start fresh
- Pass states from the previous decode() call for continuity
Decoding configuration. TDT supports:
- Greedy() - Fast greedy decoding
- Beam() - Beam search with configurable parameters
Returns
List of token sequences, one per batch item. Each token includes:
- id - Token ID in vocabulary
- text - Decoded text
- start - Start time in seconds
- duration - Duration in seconds
- confidence - Confidence score (0.0 to 1.0)
Updated hidden states for each batch item. Pass these to subsequent decode() calls for streaming.
Examples
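As a rough illustration of how the two return values might be consumed in a streaming loop, the sketch below carries the last token ID and the state from one call into the next. `Token`, `fake_decode`, `stream`, and the integer used as a "hidden state" are hypothetical stand-ins invented for this example; they are not the ParakeetTDT API, and the sketch handles a single batch item only. Only the field names (id, text, start, duration, confidence) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in with the token fields listed above."""
    id: int
    text: str
    start: float       # seconds
    duration: float    # seconds
    confidence: float  # 0.0 to 1.0

    @property
    def end(self) -> float:
        return self.start + self.duration


def fake_decode(chunk, last_token=None, hidden_state=None):
    """Stub, NOT the real decode(): emits one 0.5 s token per word in `chunk`.

    `hidden_state` is a plain counter of tokens emitted so far, standing in
    for the LSTM state; None means "start fresh". `last_token` is accepted
    only to mirror the calling pattern and is ignored by this stub.
    """
    offset = 0 if hidden_state is None else hidden_state
    tokens = [
        Token(id=offset + i, text=word, start=0.5 * (offset + i),
              duration=0.5, confidence=0.9)
        for i, word in enumerate(chunk)
    ]
    return tokens, offset + len(tokens)


def stream(chunks):
    """Carry the last token and state from one call into the next."""
    last_token, hidden_state = None, None
    transcript = []
    for chunk in chunks:
        tokens, hidden_state = fake_decode(chunk, last_token, hidden_state)
        if tokens:
            last_token = tokens[-1].id
        transcript.extend(tokens)
    return transcript


tokens = stream([["hello", "world"], ["again"]])
print(" ".join(t.text for t in tokens))
print([(t.start, t.end) for t in tokens])
```

Because the state is threaded through, the second chunk continues the timeline instead of restarting at zero.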
Basic decoding with greedy search:
Decoding Algorithms
Greedy Decoding
Fast, single-pass decoding that selects the most likely token at each step. Characteristics:
- Fastest inference
- Good accuracy for clear audio
- Deterministic output
- Low memory usage
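The core idea of greedy selection can be sketched in a few lines. This is a deliberately simplified toy, not the library's implementation: it assumes each step already comes with a dictionary of log-probabilities, ignores decoder history and durations, and uses `"_"` as a hypothetical blank symbol.

```python
BLANK = "_"  # hypothetical blank symbol for this toy example


def greedy_decode(step_logprobs):
    """Pick the highest-scoring symbol at every step and drop blanks."""
    output = []
    for scores in step_logprobs:
        best = max(scores, key=scores.get)
        if best != BLANK:
            output.append(best)
    return output


# Made-up log-probabilities for three steps.
steps = [
    {"h": -0.1, "i": -2.5, BLANK: -3.0},
    {"h": -2.0, "i": -0.3, BLANK: -1.9},
    {"h": -3.0, "i": -2.2, BLANK: -0.2},
]
print(greedy_decode(steps))
```

Each step keeps exactly one choice, which is why the approach is fast and light on memory, and also why an early mistake is never revisited.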
Beam Search Decoding
Explores multiple hypotheses simultaneously for better accuracy. Characteristics:
- Higher accuracy, especially for challenging audio
- Slower than greedy
- Non-deterministic (can vary slightly)
- Higher memory usage
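To see why keeping several hypotheses can help, consider the toy search below. It is a generic beam search over a hand-written conditional model (`MODEL` and its probabilities are invented for illustration), not the transducer-specific search used by the library.

```python
import math

# Toy conditional model: maps a prefix to next-symbol probabilities.
MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}


def beam_search(model, steps, beam_size):
    """Keep the `beam_size` best prefixes after every expansion step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for symbol, prob in model[prefix].items():
                candidates.append((prefix + (symbol,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


greedy_like = beam_search(MODEL, steps=2, beam_size=1)[0][0]
wider = beam_search(MODEL, steps=2, beam_size=2)[0][0]
print(greedy_like, wider)
```

With a beam of one, the search commits to "a" and ends at a sequence with probability 0.33. With a beam of two it keeps "b" alive and finds the sequence with probability 0.36, at the cost of scoring twice as many candidates per step.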
beam_size: Number of top hypotheses to maintain. Higher values improve accuracy but increase computation. Typical values: 3-10
length_penalty: Penalty applied based on sequence length. Helps prevent overly short or long predictions.
- 0.0 - No penalty
- < 1.0 - Favors shorter sequences
- > 1.0 - Favors longer sequences
Formula: score / (sequence_length ** length_penalty)
patience: Controls when to stop searching. Search continues until patience × beam_size complete hypotheses are found.
- 1.0 - Stop as soon as beam_size hypotheses complete
- > 1.0 - Continue searching for potentially better hypotheses
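The length normalization and the stopping rule described above can be tried out with a small sketch. Two assumptions are made here and are not confirmed by this page: the score is a cumulative log-probability (so it is negative), and a fractional patience × beam_size is rounded up. The function names are hypothetical.

```python
import math


def normalized_score(score, sequence_length, length_penalty):
    """score / (sequence_length ** length_penalty), as given above."""
    return score / (sequence_length ** length_penalty)


def required_finished(beam_size, patience):
    """Finished hypotheses needed before stopping (rounding up is an assumption)."""
    return math.ceil(patience * beam_size)


# Two made-up hypotheses: (cumulative log-probability, length in tokens).
hypotheses = {"short": (-4.0, 4), "long": (-5.0, 8)}


def pick(length_penalty):
    """Return the name of the hypothesis with the best normalized score."""
    return max(
        hypotheses,
        key=lambda name: normalized_score(*hypotheses[name], length_penalty),
    )


print(pick(0.0), pick(1.0))
print(required_finished(5, 1.0), required_finished(5, 1.5))
```

With no penalty the shorter hypothesis wins on raw score. With a penalty of 1.0 the scores become per-token averages (-1.0 versus -0.625) and the longer hypothesis wins.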
duration_reward: Weight between token and duration predictions (TDT-specific).
- 0.0 - Only use token logprobs
- 1.0 - Only use duration logprobs
- 0.5 - Equal weight
- < 0.5 - Favor token predictions
- > 0.5 - Favor duration predictions
Formula: token_logprob × (1 - duration_reward) + duration_logprob × duration_reward
Model Properties
Architecture Details
TDT decoding process:
- Encoder: Converts mel-spectrogram to features
- Decoder: Predicts next token based on history
- Joint: Combines encoder and decoder outputs
- Decision: Extract token and duration predictions
- Advance: Move forward by predicted duration
  - Non-blank token: Update history, advance by duration
  - Blank token: Advance by duration, keep same history
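The steps above can be sketched as a small greedy loop. Everything here is a toy under stated assumptions: `toy_joint` is a hypothetical stand-in for the decoder and joint networks that reads its answers from a fixed script, durations are counted in encoder frames, and the guard against stalling on a zero duration is a common safeguard in TDT implementations rather than something this page specifies.

```python
BLANK = "_"  # hypothetical blank symbol


def toy_joint(frame_index, history):
    """Stand-in for decoder + joint: returns (token, duration in frames).

    A real joint would condition on `history`; this toy ignores it.
    """
    script = {0: ("he", 2), 2: ("llo", 1), 3: (BLANK, 2), 5: ("world", 3)}
    return script.get(frame_index, (BLANK, 1))


def tdt_greedy(num_frames, joint, max_symbols_per_frame=10):
    t = 0
    history = []       # previously emitted non-blank tokens
    emitted = []       # (token, start_frame, predicted_duration)
    symbols_here = 0   # tokens emitted without advancing
    while t < num_frames:
        token, duration = joint(t, history)
        if token != BLANK:
            history.append(token)            # non-blank: update history
            emitted.append((token, t, duration))
        if duration == 0:
            # Guard (assumption): never stay on one frame forever.
            symbols_here += 1
            if token == BLANK or symbols_here >= max_symbols_per_frame:
                duration = 1
        if duration > 0:
            symbols_here = 0
        t += duration                        # advance by the duration
    return emitted


print(tdt_greedy(8, toy_joint))
```

Unlike a frame-by-frame transducer, the loop visits only frames 0, 2, 3, and 5 here, because each prediction says how far to jump.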
Performance Tips
- Use greedy for real-time: Greedy decoding is 3-5x faster than beam search
- Use beam for accuracy: Beam search improves WER by 5-15% on challenging audio
- Tune duration_reward: Adjust based on your audio characteristics
  - Speech with clear pauses: higher values (0.7-0.8)
  - Fast speech or music: lower values (0.5-0.6)
- Batch when possible: Process multiple files together for better GPU utilization
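When tuning duration_reward, it can help to see how the weighting from the Beam Search Decoding section shifts the ranking of candidates. The sketch below applies that formula to two made-up candidates; the log-probabilities and the names `combined_score`, `candidates`, and `best` are invented for illustration.

```python
def combined_score(token_logprob, duration_logprob, duration_reward):
    """token_logprob * (1 - duration_reward) + duration_logprob * duration_reward"""
    return (token_logprob * (1 - duration_reward)
            + duration_logprob * duration_reward)


# Made-up candidates: (token_logprob, duration_logprob).
candidates = {
    "confident_token": (-0.2, -2.0),     # likely token, uncertain duration
    "confident_duration": (-1.0, -0.4),  # less likely token, likely duration
}


def best(duration_reward):
    """Name of the candidate with the highest combined score."""
    return max(
        candidates,
        key=lambda name: combined_score(*candidates[name], duration_reward),
    )


for reward in (0.0, 0.2, 0.7, 1.0):
    print(reward, best(reward))
```

In this example the preferred candidate flips from the one with the confident token to the one with the confident duration as the weight grows, so a sweep over a few values on representative audio is a reasonable way to choose a setting.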
Related
- BaseParakeet - Common interface and methods
- ParakeetRNNT - RNNT variant (simpler, no duration prediction)
- ParakeetCTC - CTC variant (fastest, less accurate)
- DecodingConfig - Decoding configuration
- AlignedToken - Token structure
- Beam search guide - Detailed beam search tuning