## Overview
ParakeetRNNT implements the RNN-Transducer architecture, a simpler variant than TDT that predicts only tokens (with fixed duration of 1 frame per token).
Key features:
- Standard RNN-T architecture
- Simpler than TDT (no duration prediction)
- Currently supports only greedy decoding
- Good balance between speed and accuracy
- Suitable for streaming applications
## Class Definition
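A minimal interface sketch, assembled from the methods documented on this page; anything beyond the documented method names (argument names, defaults) is an assumption:

```python
class BaseParakeet:
    """Common interface (see the BaseParakeet docs)."""
    def transcribe(self, audio_path): ...        # transcribe audio files
    def transcribe_stream(self, **kwargs): ...   # real-time streaming transcription
    def generate(self, mel): ...                 # low-level mel-spectrogram to text

class ParakeetRNNT(BaseParakeet):
    """RNN-Transducer variant: token prediction only, greedy decoding."""
    def decode(self, features, lengths=None, last_token=None,
               hidden_state=None, *, config=None):
        """Returns (aligned tokens per batch item, updated hidden states)."""
        ...
```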
## Inherited Methods

ParakeetRNNT inherits all methods from BaseParakeet:

- `transcribe()` - Transcribe audio files
- `transcribe_stream()` - Real-time streaming transcription
- `generate()` - Low-level mel-spectrogram to text
## RNNT-Specific Methods

### decode()

Low-level decoding method that converts encoder features to aligned tokens using greedy decoding.

Currently only greedy decoding is supported for RNNT models. Passing `Beam()` in config will raise an assertion error.

#### Parameters
- `features`: Encoder output features with shape `[batch, sequence, feature_dim]`, typically produced by the model's encoder.
- Sequence lengths: valid length of each sequence in the batch, with shape `[batch]`. If `None`, all sequences are assumed to span the full `features.shape[1]` frames.
- `last_token`: Last predicted token ID for each batch item, used for stateful decoding in streaming scenarios. Pass `None` for each batch item to start fresh, or pass the token IDs from the previous `decode()` call for continuity.
- `hidden_state`: Hidden state (LSTM hidden and cell) for each batch item, used for stateful decoding. Pass `None` for each batch item to start fresh, or pass the states from the previous `decode()` call for continuity.
- `config`: Decoding configuration. Must use `Greedy()`; beam search is not yet supported.

#### Returns
- A list of token sequences, one per batch item. Each token includes:
  - `id` - Token ID in the vocabulary
  - `text` - Decoded text
  - `start` - Start time in seconds
  - `duration` - Duration in seconds (always equals `time_ratio` for RNNT)
  - `confidence` - Confidence score (0.0 to 1.0)
- Updated hidden states for each batch item. Pass these to subsequent `decode()` calls for streaming.

#### Examples
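Basic greedy decoding — a runnable sketch of the call shape using a stand-in model (`StubRNNT`, the token values, and the plain-tuple token layout are invented for illustration; the real `decode()` follows the parameter and return descriptions above):

```python
class Greedy:
    """Greedy decoding config (the only supported option for RNNT)."""

class StubRNNT:
    """Stand-in exposing the documented decode() shape."""
    def decode(self, features, lengths=None, last_token=None,
               hidden_state=None, *, config=None):
        assert isinstance(config, Greedy), "RNNT supports only Greedy()"
        batch = len(features)
        # One fake aligned token per item: (id, text, start, duration, confidence)
        tokens = [[(7, "hi", 0.0, 0.08, 0.99)] for _ in range(batch)]
        states = [None] * batch  # the real model returns updated LSTM states here
        return tokens, states

model = StubRNNT()
features = [[[0.0] * 80] * 50]   # [batch=1, sequence=50, feature_dim=80]
tokens, states = model.decode(features, config=Greedy())
print(tokens[0][0][1])           # -> hi
```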
## Decoding Algorithm
### Greedy Decoding

RNNT currently supports only greedy decoding.

Characteristics:

- Fast, single-pass decoding
- Deterministic output
- Low memory usage
- Good accuracy for clear audio
The algorithm:

1. For each encoder frame:
   - Get the decoder prediction for the current history
   - Compute the joint output
   - Select the most likely token (argmax)
   - If non-blank: emit the token, update the history, and stay on the same frame
   - If blank: advance to the next frame
2. Each non-blank token gets a duration of 1 frame
3. Stuck prevention: if too many non-blank tokens are emitted without advancing, force an advance to the next frame
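The per-frame loop above can be sketched in plain Python. This is a toy over precomputed joint logits (a real decoder recomputes them from encoder and prediction-network states), and the `MAX_SYMBOLS` value is an assumed placeholder:

```python
BLANK_ID = 0
MAX_SYMBOLS = 10  # stuck prevention: cap on non-blank emissions per frame (value assumed)

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def greedy_rnnt(joint_logits):
    """joint_logits[t][u]: logit vector at frame t after u emissions on that frame."""
    tokens = []
    for t, frame in enumerate(joint_logits):
        emitted = 0
        while emitted < MAX_SYMBOLS:              # force-advance once the cap is hit
            logits = frame[min(emitted, len(frame) - 1)]
            tok = argmax(logits)
            if tok == BLANK_ID:                   # blank: advance to the next frame
                break
            tokens.append((tok, t))               # non-blank: emit (duration = 1 frame)
            emitted += 1                          # ...and stay on the same frame
    return tokens

# frame 0 emits token 2 then blank; frame 1 is blank immediately
logits = [
    [[0.1, 0.0, 1.0], [1.0, 0.2, 0.0]],
    [[1.0, 0.0, 0.0]],
]
print(greedy_rnnt(logits))   # -> [(2, 0)]
```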
Differences from TDT:

- No duration prediction (always 1 frame per token)
- Simpler joint output (only token logits)
- No `duration_reward` parameter
- Beam search not yet implemented
## Model Properties

### Architecture Details
RNNT decoding process:

1. **Encoder**: Converts mel-spectrogram to features
2. **Decoder**: Predicts the next token based on history
3. **Joint**: Combines encoder and decoder outputs
4. **Decision**: Extracts the token prediction
5. **Advance**:
   - Non-blank token: emit the token (duration = 1), update history, stay on the current frame
   - Blank token: advance to the next frame, keep the same history
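Steps 3-4 can be illustrated with a toy joint and decision step (an element-wise sum stands in for the real joint network, which typically projects each input, adds them, applies a nonlinearity, and projects to vocab-sized logits; the blank index of 0 is assumed):

```python
def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def joint(enc_t, dec_u):
    """Toy joint network: combine one encoder frame with one decoder state."""
    return [e + d for e, d in zip(enc_t, dec_u)]

# One Decision step: index 0 is blank (assumed), so this frame emits token 1
logits = joint([0.1, 0.9, 0.0], [0.0, 0.5, 0.1])
best = argmax(logits)
print(best)   # -> 1 (non-blank: emit and stay on the frame)
```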
Comparison with TDT:

| Feature | RNNT | TDT |
|---|---|---|
| Token prediction | ✓ | ✓ |
| Duration prediction | ✗ | ✓ |
| Beam search | ✗ | ✓ |
| Greedy decoding | ✓ | ✓ |
| Streaming support | ✓ | ✓ |
| Speed | Fast | Moderate |
| Accuracy | Good | Better |
## Max Symbols Prevention

To prevent the model from getting stuck emitting non-blank tokens without advancing, if `max_symbols` consecutive non-blank tokens are emitted on a single frame:

- Force an advance to the next frame
- Reset the emission counter
- Continue decoding
## Performance Tips

- **Use for streaming**: RNNT's simpler architecture makes it well-suited for real-time streaming
- **Batch processing**: Process multiple files together for better throughput
- **State management**: Carefully manage `last_token` and `hidden_state` for streaming
- **Memory efficiency**: RNNT uses less memory than TDT (no duration prediction)
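The state-management tip boils down to threading each call's outputs into the next call. A stand-in decoder (`StubDecoder` is invented here; its counter plays the role of the real LSTM state) shows the pattern:

```python
class StubDecoder:
    """Stand-in for a streaming decoder: each call resumes from the
    previous call's state instead of starting from scratch."""
    def decode(self, features, last_token=None, hidden_state=None):
        frames_seen = 0 if hidden_state is None else hidden_state  # fake "LSTM state"
        frames_seen += len(features)
        return [f"tok{frames_seen}"], frames_seen  # fake token, updated state

model = StubDecoder()
last_token, hidden_state = None, None   # None = start fresh on the first chunk
transcript = []
for chunk in ([[0.0]] * 3, [[0.0]] * 2):      # two chunks of encoder frames
    tokens, hidden_state = model.decode(chunk, last_token=last_token,
                                        hidden_state=hidden_state)
    last_token = tokens[-1]                   # carry the last token forward
    transcript.extend(tokens)
print(transcript)   # -> ['tok3', 'tok5']
```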
## When to Use RNNT

Choose RNNT when:

- You need streaming transcription
- You want a simpler architecture
- Speed is more important than maximum accuracy
- You don't need beam search
- Memory is constrained

Choose TDT instead when:

- You need maximum accuracy
- You want beam search capability
- You can accept slightly slower inference
- Duration prediction is valuable for your use case
## Related
- BaseParakeet - Common interface and methods
- ParakeetTDT - TDT variant (with duration prediction and beam search)
- ParakeetCTC - CTC variant (fastest, different architecture)
- DecodingConfig - Decoding configuration
- AlignedToken - Token structure
- Streaming guide - Real-time transcription patterns