Beam decoding explores multiple transcription hypotheses simultaneously, selecting the best one based on cumulative scores. This results in higher accuracy compared to greedy decoding, at the cost of increased computation.
Beam decoding is currently available for TDT models only (e.g., parakeet-tdt-0.6b-v3). RNNT and CTC models use greedy decoding.
Greedy vs Beam Decoding
Greedy Decoding (default):
- Selects the most likely token at each step
- Fast and memory-efficient
- May miss globally optimal transcriptions
Beam Decoding:
- Maintains multiple hypotheses (beam)
- Explores alternative paths
- Finds better overall transcriptions
- 2-5x slower, uses more memory
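The difference can be illustrated with a minimal, self-contained sketch (not the library's implementation; the probabilities are made up): greedy commits to the locally best first token, while a beam of size 2 keeps the runner-up alive and recovers the globally better path.

```python
import math

# Per-step log-probs conditioned on the previous token (hypothetical numbers).
# After "a" the model is unsure; after "b" it is very confident, so the path
# starting with "b" scores higher overall even though "a" wins step 1.
step1 = {"a": math.log(0.5), "b": math.log(0.4)}
step2 = {
    "a": {"x": math.log(0.5), "y": math.log(0.5)},
    "b": {"x": math.log(0.9), "y": math.log(0.1)},
}

# Greedy: commit to the single best token at each step.
first = max(step1, key=step1.get)                   # "a"
second = max(step2[first], key=step2[first].get)    # "x" (first of the tie)
greedy_path = (first, second)
greedy_score = step1[first] + step2[first][second]  # log(0.5 * 0.5)

# Beam (size 2): keep both first tokens alive, then rescore all continuations.
beams = [((t,), s) for t, s in step1.items()]
expanded = [
    (path + (t,), score + lp)
    for path, score in beams
    for t, lp in step2[path[-1]].items()
]
beam_path, beam_score = max(expanded, key=lambda x: x[1])

print(greedy_path, round(math.exp(greedy_score), 2))  # ('a', 'x') 0.25
print(beam_path, round(math.exp(beam_score), 2))      # ('b', 'x') 0.36
```

The greedy path has total probability 0.25, while beam search finds the 0.36 path that greedy discarded at the first step.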
Basic Usage
from parakeet_mlx import from_pretrained, DecodingConfig, Beam
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
# Use beam decoding with default parameters
config = DecodingConfig(
    decoding=Beam(
        beam_size=5,
        length_penalty=1.0,
        patience=1.0,
        duration_reward=0.7,
    )
)
result = model.transcribe("audio.wav", decoding_config=config)
print(result.text)
Parameters
beam_size
Number of hypotheses to maintain during decoding.
config = DecodingConfig(
    decoding=Beam(beam_size=5)
)
| Beam Size | Speed | Accuracy | Memory | Recommended For |
|---|---|---|---|---|
| 1 | Fastest | Baseline | Low | Greedy equivalent |
| 3 | Fast | Good | Medium | Quick improvements |
| 5 | Medium | Better | Medium | Recommended default |
| 10 | Slow | Best | High | Maximum quality |
| 20+ | Very slow | Marginal gains | Very high | Research only |
Start with beam_size=5 for most applications. Increase to 10 for critical transcriptions where accuracy matters most.
length_penalty
Controls length normalization of hypothesis scores, applied as: score / (length ** length_penalty). Because scores are sums of log-probabilities (and therefore non-positive), higher values favor longer transcriptions.
config = DecodingConfig(
    decoding=Beam(
        beam_size=5,
        length_penalty=1.0,  # Favor longer sequences
    )
)
| Value | Effect | Use Case |
|---|---|---|
| 0.0 | No length normalization | Raw scores; tends to favor short transcriptions |
| 0.5-0.8 | Partial normalization | Balanced output |
| 1.0 | Full length normalization | Recommended default |
| 1.5+ | Over-normalization | Strongly favors longer output |
Implementation (from parakeet.py:512-520):
length_penalty = config.decoding.length_penalty
best = max(
    finished_hypothesis,
    key=lambda x: x.score / (max(1, len(x.hypothesis)) ** length_penalty),
)
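The effect of this selection rule can be checked with two hypothetical finished hypotheses, one short and one long. Since summed log-prob scores are non-positive, dividing by length ** length_penalty with a larger exponent boosts longer hypotheses:

```python
# Hypothetical finished hypotheses: (token_count, summed log-prob score).
hypotheses = [(4, -4.0), (8, -6.4)]  # short vs. long

def pick(hyps, length_penalty):
    # Mirrors the selection rule: score / (max(1, length) ** length_penalty).
    return max(hyps, key=lambda h: h[1] / (max(1, h[0]) ** length_penalty))

print(pick(hypotheses, 0.0))  # (4, -4.0): raw score, short hypothesis wins
print(pick(hypotheses, 1.0))  # (8, -6.4): per-token score, long wins (-0.8 vs -1.0)
```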
patience
Controls how many candidate hypotheses are explored before pruning. Maximum candidates = round(beam_size * patience)
config = DecodingConfig(
    decoding=Beam(
        beam_size=5,
        patience=1.0,  # Explore up to 5 candidates
    )
)
| Value | Max Candidates | Search Depth | Use Case |
|---|---|---|---|
| 1.0 | beam_size | Minimal | Fast decoding |
| 2.0 | 2 × beam_size | Standard | Balanced |
| 3.5 | 3.5 × beam_size | Extended | Better quality |
| 5.0+ | 5+ × beam_size | Exhaustive | Maximum accuracy |
Implementation (from parakeet.py:326):
max_candidates = round(config.decoding.beam_size * config.decoding.patience)
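A quick sketch (plain Python, illustrative values) of how patience scales the candidate pool, mirroring the formula above:

```python
# Candidate pool size as a function of beam_size and patience.
def max_candidates(beam_size, patience):
    return round(beam_size * patience)

print(max_candidates(5, 1.0))   # 5
print(max_candidates(5, 3.5))   # 18 (Python rounds the tie 17.5 to the even neighbour)
print(max_candidates(10, 2.0))  # 20
```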
duration_reward (TDT-specific)
Balances token prediction and duration prediction in TDT models. Range: 0.0 to 1.0
config = DecodingConfig(
    decoding=Beam(
        beam_size=5,
        duration_reward=0.7,  # Favor duration predictions
    )
)
| Value | Behavior | Use Case |
|---|---|---|
| 0.0 | Only token logprobs | Ignore duration predictions |
| 0.3 | Mostly tokens | Prioritize token accuracy |
| 0.5 | Balanced | Equal weight |
| 0.7 | Mostly duration | Better timing alignment |
| 1.0 | Only duration | Focus on temporal structure |
Implementation (from parakeet.py:436-440):
new_hypothesis.score = (
    hypothesis.score
    + token_logprobs[token] * (1 - config.decoding.duration_reward)
    + duration_logprobs[decision] * config.decoding.duration_reward
)
duration_reward is only available for TDT models. It has no effect on RNNT or CTC models.
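The interpolation above can be reproduced in isolation with made-up log-probs to see how duration_reward shifts the score increment between the two terms:

```python
import math

# Hypothetical log-probs for one (token, duration) expansion.
token_logprob = math.log(0.6)     # confidence in the token
duration_logprob = math.log(0.3)  # confidence in the predicted duration

def blended_increment(duration_reward):
    # Mirrors the scoring rule: interpolate token and duration log-probs.
    return (token_logprob * (1 - duration_reward)
            + duration_logprob * duration_reward)

for r in (0.0, 0.5, 1.0):
    print(r, round(blended_increment(r), 3))
# 0.0 -0.511  (token log-prob only)
# 0.5 -0.857  (equal weight)
# 1.0 -1.204  (duration log-prob only)
```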
Advanced Configuration
from parakeet_mlx import from_pretrained, DecodingConfig, Beam, SentenceConfig
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
# High-quality configuration
config = DecodingConfig(
    decoding=Beam(
        beam_size=10,
        length_penalty=1.0,
        patience=3.5,
        duration_reward=0.67,
    ),
    sentence=SentenceConfig(
        max_words=30,
        silence_gap=2.0,
        max_duration=40.0,
    ),
)
result = model.transcribe("audio.wav", decoding_config=config)
CLI Usage
# Basic beam decoding
parakeet-mlx audio.wav --decoding beam
# Custom beam parameters
parakeet-mlx audio.wav \
    --decoding beam \
    --beam-size 10 \
    --length-penalty 1.0 \
    --patience 3.5 \
    --duration-reward 0.67
# Environment variables
export PARAKEET_DECODING=beam
export PARAKEET_BEAM_SIZE=5
export PARAKEET_LENGTH_PENALTY=1.0
export PARAKEET_PATIENCE=3.5
export PARAKEET_DURATION_REWARD=0.67
parakeet-mlx audio.wav
Speed Benchmark (60-second audio)
| Method | Time | Relative Speed |
|---|---|---|
| Greedy | 2.5s | 1.0× (baseline) |
| Beam (size=3) | 5.2s | 2.1× slower |
| Beam (size=5) | 7.8s | 3.1× slower |
| Beam (size=10) | 14.3s | 5.7× slower |
Memory Usage
| Method | Peak Memory | Relative Usage |
|---|---|---|
| Greedy | 1.2 GB | 1.0× (baseline) |
| Beam (size=5) | 2.1 GB | 1.8× more |
| Beam (size=10) | 3.4 GB | 2.8× more |
Beam Search Algorithm
The implementation explores all possible (token, duration) pairs at each step:
# From parakeet.py:420-467
for token in token_k:
    is_blank = token == len(self.vocabulary)
    for decision in duration_k:
        duration = self.durations[decision]
        stuck = 0 if duration != 0 else hypothesis.stuck + 1
        if self.max_symbols is not None and stuck >= self.max_symbols:
            step = hypothesis.step + 1
            stuck = 0
        else:
            step = hypothesis.step + duration
        new_hypothesis = Hypothesis(
            score=hypothesis.score
            + token_logprobs[token] * (1 - config.decoding.duration_reward)
            + duration_logprobs[decision] * config.decoding.duration_reward,
            step=step,
            last_token=hypothesis.last_token if is_blank else token,
            hidden_state=hypothesis.hidden_state if is_blank else decoder_hidden,
            stuck=stuck,
            hypothesis=hypothesis.hypothesis if is_blank else (
                list(hypothesis.hypothesis) + [AlignedToken(...)]
            ),
        )
When to Use Beam Decoding
Use Beam Decoding When:
- Accuracy is critical (medical, legal transcriptions)
- Processing pre-recorded audio (not real-time)
- Dealing with challenging audio (accents, noise, technical jargon)
- You have sufficient computational resources
- Transcribing important meetings or interviews
Use Greedy Decoding When:
- Real-time transcription is required
- Processing large batches of audio
- Memory is constrained
- Speed is prioritized over accuracy
- Audio quality is high and clear
Best Practices
- Start with defaults: beam_size=5, length_penalty=1.0, patience=1.0, duration_reward=0.7
- Tune for quality: Increase beam_size and patience for better results
- Tune for speed: Decrease beam_size and patience for faster processing
- Monitor memory: Watch memory usage when increasing beam_size
- Combine with chunking: Use beam decoding with chunking for long files
from parakeet_mlx import from_pretrained, DecodingConfig, Beam
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
# High-quality with chunking
config = DecodingConfig(decoding=Beam(beam_size=10, patience=3.5))
result = model.transcribe(
    "long_audio.wav",
    decoding_config=config,
    chunk_duration=120,
    overlap_duration=15,
)