## Overview
ParakeetCTC implements the CTC (Connectionist Temporal Classification) architecture, a simpler and faster alternative to transducer-based models.
Key features:
- Fastest inference among all Parakeet variants
- Simpler architecture (encoder + linear decoder)
- Frame-independent predictions
- Only supports greedy decoding
- No decoder hidden state required
- Good for batch processing of clear audio
## Class Definition

## Inherited Methods
ParakeetCTC inherits all methods from BaseParakeet:

- `transcribe()` - Transcribe audio files
- `transcribe_stream()` - Real-time streaming transcription
- `generate()` - Low-level mel-spectrogram to text
## CTC-Specific Methods
### `decode()`

Low-level decoding method that converts encoder features to aligned tokens using CTC greedy decoding. CTC `decode()` has a simpler signature than TDT/RNNT: no `last_token` or `hidden_state` parameters are needed.

**Parameters**

- Encoder output features with shape `[batch, sequence, feature_dim]`, typically obtained from the model's encoder.
- Valid length of each sequence in the batch, with shape `[batch]`. Unlike TDT/RNNT, this parameter is required for CTC decoding.
- Decoding configuration. Only greedy decoding is supported (`config.decoding` is not used for CTC).
**Returns**

List of token sequences, one per batch item. Each token includes:

- `id` - Token ID in vocabulary
- `text` - Decoded text
- `start` - Start time in seconds
- `duration` - Duration in seconds (span between token boundaries)
- `confidence` - Confidence score (0.0 to 1.0)
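The fields above can be pictured as a small dataclass. This is a hypothetical sketch for illustration only; the library's actual `AlignedToken` may differ in field names and types:

```python
from dataclasses import dataclass


@dataclass
class AlignedToken:
    """Hypothetical shape of one aligned token (illustrative only)."""

    id: int            # Token ID in vocabulary
    text: str          # Decoded text
    start: float       # Start time in seconds
    duration: float    # Duration in seconds
    confidence: float  # Confidence score in [0.0, 1.0]
```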
### Examples

Basic CTC decoding:

## Decoding Algorithm

### CTC Greedy Decoding
CTC uses a frame-independent decoding strategy. Process:

- For each frame, select the most likely token (argmax)
- Remove consecutive duplicates
- Remove blank tokens
- Merge adjacent identical tokens
- Compute token boundaries and confidence

Token boundaries:

- Token start: first frame where the token appears
- Token end: last frame before the next different token
- Duration: time span between start and end

Confidence:

- Computed using an entropy-based method across the token's frames
- Lower entropy = higher confidence
- Formula: `confidence = 1.0 - (avg_entropy / max_entropy)`
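Putting the steps above together, here is a minimal self-contained NumPy sketch of CTC greedy decoding. The blank index, the frame duration, and the exact merge behavior are assumptions for illustration; the library's real implementation may differ:

```python
import numpy as np

BLANK = 0  # assumed blank index for this sketch; models vary


def ctc_greedy_decode(log_probs, frame_dur=0.08):
    """Greedy CTC decode for one utterance.

    log_probs: [T, vocab] log-probabilities from the CTC head.
    frame_dur: assumed seconds per encoder frame (model-dependent).
    Returns a list of (token_id, start_sec, duration_sec, confidence).
    """
    best = log_probs.argmax(axis=-1)  # per-frame argmax
    max_entropy = np.log(log_probs.shape[-1])
    tokens = []
    t, T = 0, len(best)
    while t < T:
        tok = best[t]
        start = t
        while t < T and best[t] == tok:  # collapse consecutive duplicates
            t += 1
        if tok == BLANK:                 # drop blank runs
            continue
        # Boundaries: first frame of the run .. last frame before the
        # next different token; duration is the span between them.
        frames = log_probs[start:t]
        # Entropy-based confidence: average per-frame entropy, normalized.
        probs = np.exp(frames)
        avg_entropy = -(probs * frames).sum(axis=-1).mean()
        confidence = 1.0 - avg_entropy / max_entropy
        tokens.append((int(tok), start * frame_dur,
                       (t - start) * frame_dur, float(confidence)))
    return tokens
```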
## Model Properties

### Architecture Details

CTC pipeline:

- Encoder: converts mel-spectrogram to features
- Decoder: linear projection to vocabulary
- Decoding: collapse and remove blanks

The decoder is just:

- Optional convolutional layers
- Linear layer: `features → vocab_size + 1` (including blank)
- Log-softmax for probabilities
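As a rough illustration of how small the CTC "decoder" is (a sketch, not the library's actual code, and omitting the optional convolutions), the head is just a projection followed by log-softmax:

```python
import numpy as np


def ctc_head(features, weight, bias):
    """Toy CTC decoder head: linear projection + log-softmax.

    features: [batch, T, feat_dim] encoder output
    weight:   [feat_dim, vocab_size + 1] -- the +1 is the blank token
    bias:     [vocab_size + 1]
    Returns log-probabilities of shape [batch, T, vocab_size + 1].
    """
    logits = features @ weight + bias
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
```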
Comparison with TDT/RNNT:

| Feature | CTC | TDT/RNNT |
|---|---|---|
| Architecture | Encoder + Linear | Encoder + Decoder RNN + Joint |
| Hidden state | None | LSTM (h, c) |
| Frame dependency | Independent | Dependent on history |
| Decoding speed | Fastest | Moderate |
| Accuracy | Good | Better |
| Streaming | Supported | Supported |
| Beam search | Not implemented | TDT only |
Why CTC decoding is the fastest:

- No decoder RNN forward pass per frame
- No joint network computation
- Simple argmax + collapse operation
- Can be fully parallelized
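The "fully parallelized" point can be seen in one line: the per-frame argmax for an entire batch is a single vectorized call, with no per-frame decoder step or joint network in the loop. The shapes below are illustrative assumptions:

```python
import numpy as np

# Assumed shapes for illustration: batch of 8 utterances, 200 encoder
# frames each, vocabulary of 1024 tokens plus blank.
rng = np.random.default_rng(0)
log_probs = rng.standard_normal((8, 200, 1025))  # [batch, T, vocab + 1]

# The whole batch's greedy predictions in one vectorized call.
best = log_probs.argmax(axis=-1)                 # [batch, T]
```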
### Token Boundaries

CTC determines token boundaries by tracking when tokens change.

## Performance Tips
- Use for batch processing: CTC excels at processing many files at once
- Best for clear audio: CTC works well when audio quality is good
- Fastest option: Choose CTC when speed is critical
- No state management: Simpler to use than TDT/RNNT (no hidden states)
- Memory efficient: No decoder RNN means less memory usage
Streaming with CTC
While CTC supports streaming, it’s simpler than transducers:- No decoder state to track
- Each frame is predicted independently
- Simpler state management in streaming implementation
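A toy illustration of that simplicity: the only cross-chunk state a streaming CTC collapse needs is the previous frame's label, so duplicates spanning a chunk boundary still merge. This is a sketch, not the library's streaming implementation:

```python
def collapse(frame_labels, blank=0, prev=None):
    """Collapse frame labels CTC-style.

    `prev` carries the last frame's label across chunk boundaries so
    that a token split across two chunks is not emitted twice.
    """
    out = []
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out, prev


# Decoding in two chunks matches decoding the full sequence.
frames = [0, 1, 1, 0, 2, 2, 2, 3]
full, _ = collapse(frames)
first, state = collapse(frames[:4])
rest, _ = collapse(frames[4:], prev=state)
assert full == first + rest
```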
## When to Use CTC

Choose CTC when:

- Speed is the top priority
- Audio quality is good
- You’re doing batch processing
- You don’t need maximum accuracy
- Simpler architecture is preferred
- Memory is very constrained
Consider TDT or RNNT instead when:

- Accuracy is more important than speed
- Audio quality varies
- You need beam search (TDT)
- You need duration predictions (TDT)
- You want better handling of challenging audio
Rough performance expectations:

- CTC: ~2x faster than TDT greedy, ~10x faster than TDT beam
- CTC: 90-95% of TDT accuracy on clear audio
- CTC: Lower relative accuracy on noisy/accented audio
## Limitations
- No beam search: Only greedy decoding is available
- Independence assumption: Each frame is predicted independently, missing some context
- Alignment quality: Can produce less precise alignments than transducers
- Challenging audio: Performance degrades more on noisy/accented audio compared to TDT
## Related
- BaseParakeet - Common interface and methods
- ParakeetTDT - TDT variant (higher accuracy, beam search)
- ParakeetRNNT - RNNT variant (balance of speed and accuracy)
- DecodingConfig - Decoding configuration
- AlignedToken - Token structure
- Performance comparison - Choosing the right model