Beam decoding explores multiple transcription hypotheses simultaneously and selects the best one by cumulative score. It typically yields higher accuracy than greedy decoding, at the cost of extra computation.
Beam decoding is currently available for TDT models only (e.g., parakeet-tdt-0.6b-v3). RNNT and CTC models use greedy decoding.

Greedy vs Beam Decoding

Greedy Decoding (default):
  • Selects the most likely token at each step
  • Fast and memory-efficient
  • May miss globally optimal transcriptions
Beam Decoding:
  • Maintains multiple hypotheses (beam)
  • Explores alternative paths
  • Can find globally better transcriptions
  • 2-5× slower, uses more memory
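The difference shows up in a toy two-step search, where the locally best first token leads to a worse overall path. This is a self-contained sketch with invented probabilities; nothing here uses parakeet-mlx:

```python
import math

# Toy 2-step search: log-probabilities for each token at each step,
# conditioned on the previous choice. Values are illustrative only.
step1 = {"A": math.log(0.6), "B": math.log(0.4)}
step2 = {
    "A": {"C": math.log(0.3), "D": math.log(0.3)},
    "B": {"C": math.log(0.9), "D": math.log(0.1)},
}

# Greedy: take the best token at each step independently.
g1 = max(step1, key=step1.get)                  # "A" (0.6 > 0.4)
g2 = max(step2[g1], key=step2[g1].get)
greedy_score = step1[g1] + step2[g1][g2]        # log(0.6 * 0.3) = log(0.18)

# Beam (size 2): keep both step-1 hypotheses, then pick the best total.
beam_score = max(
    step1[t1] + step2[t1][t2]
    for t1 in step1
    for t2 in step2[t1]
)                                               # log(0.4 * 0.9) = log(0.36)

assert beam_score > greedy_score  # beam recovers the globally better path
```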

Basic Usage

```python
from parakeet_mlx import from_pretrained, DecodingConfig, Beam

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

# Use beam decoding with default parameters
config = DecodingConfig(
    decoding=Beam(
        beam_size=5,
        length_penalty=1.0,
        patience=1.0,
        duration_reward=0.7,
    )
)

result = model.transcribe("audio.wav", decoding_config=config)
print(result.text)
```

Parameters

beam_size

Number of hypotheses to maintain during decoding.
```python
config = DecodingConfig(
    decoding=Beam(beam_size=5)
)
```

| Beam Size | Speed     | Accuracy       | Memory    | Recommended For     |
|-----------|-----------|----------------|-----------|---------------------|
| 1         | Fastest   | Baseline       | Low       | Greedy equivalent   |
| 3         | Fast      | Good           | Medium    | Quick improvements  |
| 5         | Medium    | Better         | Medium    | Recommended default |
| 10        | Slow      | Best           | High      | Maximum quality     |
| 20+       | Very slow | Marginal gains | Very high | Research only       |
Start with beam_size=5 for most applications. Increase to 10 for critical transcriptions where accuracy matters most.

length_penalty

Normalizes hypothesis scores by length, applied as: score / (length ** length_penalty). Because cumulative log-probability scores are negative, higher values favor longer transcriptions.
```python
config = DecodingConfig(
    decoding=Beam(
        beam_size=5,
        length_penalty=1.0,  # Favor longer sequences
    )
)
```
| Value   | Effect           | Use Case                      |
|---------|------------------|-------------------------------|
| 0.0     | No penalty       | Short, concise transcriptions |
| 0.5-0.8 | Slight penalty   | Balanced output               |
| 1.0     | Standard penalty | Recommended default           |
| 1.5+    | Strong penalty   | Strongly favor longer output  |
Implementation (from parakeet.py:512-520):
```python
length_penalty = config.decoding.length_penalty

best = max(
    finished_hypothesis,
    key=lambda x: x.score / (max(1, len(x.hypothesis)) ** length_penalty),
)
```
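A quick worked example of this length normalization, with illustrative scores rather than real parakeet-mlx output:

```python
# Hypothetical hypotheses: how length_penalty changes the ranking.
hyps = [
    ("short", -4.0, 4),     # (label, cumulative log-prob, token count)
    ("longer", -7.0, 10),
]

def normalized(score, length, length_penalty):
    # Mirrors the selection rule: score / (max(1, length) ** length_penalty)
    return score / (max(1, length) ** length_penalty)

# length_penalty=0.0: raw scores compared, so the short hypothesis
# wins (-4.0 > -7.0).
best_raw = max(hyps, key=lambda h: normalized(h[1], h[2], 0.0))

# length_penalty=1.0: per-token scores compared (-0.7 > -1.0),
# so the longer hypothesis wins.
best_norm = max(hyps, key=lambda h: normalized(h[1], h[2], 1.0))

assert best_raw[0] == "short"
assert best_norm[0] == "longer"
```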

patience

Controls how many hypotheses to explore before stopping. Maximum candidates = beam_size * patience
```python
config = DecodingConfig(
    decoding=Beam(
        beam_size=5,
        patience=1.0,  # Explore up to 5 candidates
    )
)
```

| Value | Max Candidates  | Search Depth | Use Case         |
|-------|-----------------|--------------|------------------|
| 1.0   | beam_size       | Minimal      | Fast decoding    |
| 2.0   | 2 × beam_size   | Standard     | Balanced         |
| 3.5   | 3.5 × beam_size | Extended     | Better quality   |
| 5.0+  | 5+ × beam_size  | Exhaustive   | Maximum accuracy |
Implementation (from parakeet.py:326):
```python
max_candidates = round(config.decoding.beam_size * config.decoding.patience)
```
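The same arithmetic in isolation. Note that Python's built-in round uses banker's rounding, so .5 values round to the nearest even integer:

```python
# Candidate-pool size as a function of beam_size and patience,
# mirroring max_candidates = round(beam_size * patience).
def max_candidates(beam_size: int, patience: float) -> int:
    return round(beam_size * patience)

print(max_candidates(5, 1.0))  # 5
print(max_candidates(5, 2.0))  # 10
print(max_candidates(5, 3.5))  # 18 (round(17.5) rounds to the even 18)
```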

duration_reward (TDT-specific)

Balances token prediction and duration prediction in TDT models. Range: 0.0 to 1.0
```python
config = DecodingConfig(
    decoding=Beam(
        beam_size=5,
        duration_reward=0.7,  # Favor duration predictions
    )
)
```

| Value | Behavior            | Use Case                    |
|-------|---------------------|-----------------------------|
| 0.0   | Only token logprobs | Ignore duration predictions |
| 0.3   | Mostly tokens       | Prioritize token accuracy   |
| 0.5   | Balanced            | Equal weight                |
| 0.7   | Mostly duration     | Better timing alignment     |
| 1.0   | Only duration       | Focus on temporal structure |
Implementation (from parakeet.py:436-440):
```python
new_hypothesis.score = (
    hypothesis.score
    + token_logprobs[token] * (1 - config.decoding.duration_reward)
    + duration_logprobs[decision] * config.decoding.duration_reward
)
```
duration_reward is only available for TDT models. It has no effect on RNNT or CTC models.
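The interpolation is easy to sanity-check with plain arithmetic. The two log-prob values below are made up for illustration:

```python
# How duration_reward blends token and duration log-probabilities,
# mirroring the scoring rule above. Log-probs here are illustrative.
token_logprob = -0.5
duration_logprob = -2.0

def score_increment(duration_reward: float) -> float:
    return (token_logprob * (1 - duration_reward)
            + duration_logprob * duration_reward)

print(score_increment(0.0))  # -0.5  (token log-probs only)
print(score_increment(1.0))  # -2.0  (duration log-probs only)
print(score_increment(0.7))  # ~-1.55 (default: mostly duration)
```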

Advanced Configuration

```python
from parakeet_mlx import from_pretrained, DecodingConfig, Beam, SentenceConfig

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

# High-quality configuration
config = DecodingConfig(
    decoding=Beam(
        beam_size=10,
        length_penalty=1.0,
        patience=3.5,
        duration_reward=0.67,
    ),
    sentence=SentenceConfig(
        max_words=30,
        silence_gap=2.0,
        max_duration=40.0,
    ),
)

result = model.transcribe("audio.wav", decoding_config=config)
```

CLI Usage

```bash
# Basic beam decoding
parakeet-mlx audio.wav --decoding beam

# Custom beam parameters
parakeet-mlx audio.wav \
  --decoding beam \
  --beam-size 10 \
  --length-penalty 1.0 \
  --patience 3.5 \
  --duration-reward 0.67

# Environment variables
export PARAKEET_DECODING=beam
export PARAKEET_BEAM_SIZE=5
export PARAKEET_LENGTH_PENALTY=1.0
export PARAKEET_PATIENCE=3.5
export PARAKEET_DURATION_REWARD=0.67
parakeet-mlx audio.wav
```

Performance Comparison

Speed Benchmark (60-second audio)

| Method         | Time  | Relative Speed  |
|----------------|-------|-----------------|
| Greedy         | 2.5s  | 1.0× (baseline) |
| Beam (size=3)  | 5.2s  | 2.1× slower     |
| Beam (size=5)  | 7.8s  | 3.1× slower     |
| Beam (size=10) | 14.3s | 5.7× slower     |

Memory Usage

| Method         | Peak Memory | Relative Usage  |
|----------------|-------------|-----------------|
| Greedy         | 1.2 GB      | 1.0× (baseline) |
| Beam (size=5)  | 2.1 GB      | 1.8× more       |
| Beam (size=10) | 3.4 GB      | 2.8× more       |

Beam Search Algorithm

The implementation explores all possible (token, duration) pairs at each step:
```python
# From parakeet.py:420-467
for token in token_k:
    is_blank = token == len(self.vocabulary)
    for decision in duration_k:
        duration = self.durations[decision]
        # Count consecutive zero-duration emissions to detect being stuck.
        stuck = 0 if duration != 0 else hypothesis.stuck + 1

        if self.max_symbols is not None and stuck >= self.max_symbols:
            # Force the time step forward after too many zero-duration tokens.
            step = hypothesis.step + 1
            stuck = 0
        else:
            step = hypothesis.step + duration

        new_hypothesis = Hypothesis(
            # Interpolate token and duration log-probs via duration_reward.
            score=hypothesis.score
                + token_logprobs[token] * (1 - config.decoding.duration_reward)
                + duration_logprobs[decision] * config.decoding.duration_reward,
            step=step,
            # Blank emissions keep the previous token and decoder state.
            last_token=hypothesis.last_token if is_blank else token,
            hidden_state=hypothesis.hidden_state if is_blank else decoder_hidden,
            stuck=stuck,
            hypothesis=hypothesis.hypothesis if is_blank else (
                list(hypothesis.hypothesis) + [AlignedToken(...)]
            ),
        )
```
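Stripped of TDT durations, decoder state, and blank handling, the expand-and-prune loop above reduces to textbook beam search. Here is a minimal self-contained sketch; none of these names are parakeet-mlx API:

```python
import math

# A generic beam search over per-step token log-probabilities: expand
# every surviving hypothesis by every candidate token, then prune back
# down to the best beam_size hypotheses. Illustrative only.
def toy_beam_search(step_logprobs, beam_size):
    """step_logprobs: list of {token: logprob} dicts, one per step."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for logprobs in step_logprobs:
        # Expansion: every (hypothesis, token) pair becomes a candidate.
        candidates = [
            (seq + [tok], score + lp)
            for seq, score in beams
            for tok, lp in logprobs.items()
        ]
        # Pruning: keep only the best beam_size candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]

steps = [
    {"a": math.log(0.6), "b": math.log(0.4)},
    {"a": math.log(0.2), "b": math.log(0.8)},
]
best_seq, best_score = toy_beam_search(steps, beam_size=2)
print(best_seq)  # ['a', 'b'], total probability 0.6 * 0.8 = 0.48
```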

When to Use Beam Decoding

Use Beam Decoding When:
  • Accuracy is critical (medical, legal transcriptions)
  • Processing pre-recorded audio (not real-time)
  • Dealing with challenging audio (accents, noise, technical jargon)
  • You have sufficient computational resources
  • Transcribing important meetings or interviews
Use Greedy Decoding When:
  • Real-time transcription is required
  • Processing large batches of audio
  • Memory is constrained
  • Speed is prioritized over accuracy
  • Audio quality is high and clear

Best Practices

  1. Start with defaults: beam_size=5, length_penalty=1.0, patience=1.0, duration_reward=0.7
  2. Tune for quality: Increase beam_size and patience for better results
  3. Tune for speed: Decrease beam_size and patience for faster processing
  4. Monitor memory: Watch memory usage when increasing beam_size
  5. Combine with chunking: Use beam decoding with chunking for long files
```python
from parakeet_mlx import from_pretrained, DecodingConfig, Beam

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

# High-quality with chunking
config = DecodingConfig(decoding=Beam(beam_size=10, patience=3.5))
result = model.transcribe(
    "long_audio.wav",
    decoding_config=config,
    chunk_duration=120,
    overlap_duration=15,
)
```
