
Overview

Parakeet MLX provides word-level timestamp alignment, tracking when each token starts and ends in the audio. The alignment system produces hierarchical results with token and sentence boundaries.

  • Tokens - word-level, with start/end/duration
  • Sentences - auto-segmented, with aggregate timing
  • Results - the complete transcription with the full hierarchy

Data Structures

AlignedToken

Represents a single word or subword token with precise timing:
# From alignment.py:6-16
@dataclass
class AlignedToken:
    id: int              # Token ID in vocabulary
    text: str            # Decoded text (e.g., " hello")
    start: float         # Start time in seconds
    duration: float      # Duration in seconds
    confidence: float    # Confidence score (0.0 to 1.0)
    end: float = 0.0     # Computed as start + duration
    
    def __post_init__(self):
        self.end = self.start + self.duration
id (int)
Token ID from the model's vocabulary (0 to vocab_size - 1).
text (str)
Decoded text representation. May include leading/trailing spaces for subword tokens. Examples:
  • " Hello" - word with leading space
  • "world" - word without space
  • "." - punctuation
start (float)
Start time in seconds from the beginning of the audio.
duration (float)
Token duration in seconds, computed differently by each model:
  • TDT: explicitly predicted (typically 0-4 frames)
  • RNNT: fixed at 1 frame
  • CTC: inferred from frame repetitions
confidence (float)
Confidence score from 0.0 (uncertain) to 1.0 (certain), computed using entropy:
confidence = 1.0 - (entropy / max_entropy)
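A minimal numpy sketch of this formula (illustrative only; the library computes it over MLX arrays during decoding):
import numpy as np

def entropy_confidence(logits: np.ndarray) -> float:
    # Softmax over the vocabulary, then normalized entropy
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-10))
    max_entropy = np.log(probs.size)  # entropy of a uniform distribution
    return float(1.0 - entropy / max_entropy)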

AlignedSentence

Groups tokens into sentence-level segments:
# From alignment.py:19-35
@dataclass
class AlignedSentence:
    text: str                    # Full sentence text
    tokens: list[AlignedToken]   # Constituent tokens
    start: float = 0.0           # Sentence start (from first token)
    end: float = 0.0             # Sentence end (from last token)
    duration: float = 0.0        # Sentence duration
    confidence: float = 1.0      # Aggregate confidence
    
    def __post_init__(self):
        self.tokens = list(sorted(self.tokens, key=lambda x: x.start))
        self.start = self.tokens[0].start
        self.end = self.tokens[-1].end
        self.duration = self.end - self.start
        
        # Geometric mean of token confidences
        confidences = np.array([t.confidence for t in self.tokens])
        self.confidence = float(np.exp(np.mean(np.log(confidences + 1e-10))))
Sentence confidence uses geometric mean instead of arithmetic mean. This makes the score more sensitive to low-confidence tokens - if any token has very low confidence, the sentence confidence will be low.
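For example, with token confidences of 0.9, 0.9, and 0.1, the arithmetic mean is about 0.63 while the geometric mean drops to about 0.43:
import numpy as np

confidences = np.array([0.9, 0.9, 0.1])  # one very uncertain token
print(confidences.mean())                               # ~0.633 (arithmetic mean)
print(np.exp(np.mean(np.log(confidences + 1e-10))))     # ~0.433 (geometric mean)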

AlignedResult

Top-level transcription result:
# From alignment.py:38-48
@dataclass
class AlignedResult:
    text: str                        # Full transcription
    sentences: list[AlignedSentence] # Sentence segments
    
    def __post_init__(self):
        self.text = self.text.strip()
    
    @property
    def tokens(self) -> list[AlignedToken]:
        # Flatten all tokens from all sentences
        return [token for sentence in self.sentences for token in sentence.tokens]

Time Ratio Calculation

Timestamps are computed by converting encoder frame indices to seconds using the time ratio:
# From parakeet.py:108-113
@property
def time_ratio(self) -> float:
    return (
        self.encoder_config.subsampling_factor
        / self.preprocessor_config.sample_rate
        * self.preprocessor_config.hop_length
    )
time_ratio = (subsampling_factor * hop_length) / sample_rate

timestamp = frame_index * time_ratio
Why this works:
  1. Audio is converted to mel frames with hop_length samples between frames
  2. Encoder subsamples mel frames by subsampling_factor (typically 8)
  3. Each encoder frame represents subsampling_factor * hop_length audio samples
  4. Divide by sample_rate to convert samples to seconds
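With typical preprocessor values (assumed here for illustration: 16 kHz sample rate, hop length of 160 samples, 8x subsampling), each encoder frame spans 80 ms:
# Worked example with assumed, typical preprocessor values
sample_rate = 16_000        # Hz
hop_length = 160            # samples between mel frames
subsampling_factor = 8      # encoder subsampling

time_ratio = subsampling_factor * hop_length / sample_rate
print(time_ratio)                 # 0.08 -> 80 ms per encoder frame

frame_index = 25
print(frame_index * time_ratio)   # 2.0 -> a token at frame 25 starts at 2.0 s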

Model-Specific Alignment

TDT Alignment

TDT explicitly predicts token durations during decoding:
# From parakeet.py:592-600
hypothesis.append(
    AlignedToken(
        int(pred_token),
        start=step * self.time_ratio,                    # Current frame
        duration=self.durations[decision] * self.time_ratio,  # Predicted duration
        confidence=confidence,
        text=tokenizer.decode([pred_token], self.vocabulary),
    )
)
TDT models predict discrete durations from a fixed set:
# From parakeet.py:277
self.durations = args.decoding.durations  # Typically [0, 1, 2, 3, 4]
Duration   Meaning                  Encoder frames   Time (at 80ms/frame)
0          Emit without advancing   0                0ms
1          Short sound              1                80ms
2          Medium sound             2                160ms
3          Long sound               3                240ms
4          Very long sound          4                320ms
Duration 0 allows emitting multiple tokens at the same timestamp, useful for compound words or fast speech.
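A self-contained toy illustration (not the library's decoder) of how predicted durations advance the frame pointer, and how a duration of 0 lets two tokens share a timestamp:
# Toy example: (token text, duration index) pairs a TDT decoder might emit
durations = [0, 1, 2, 3, 4]   # duration vocabulary, in encoder frames
time_ratio = 0.08             # seconds per encoder frame (typical value)
emissions = [(" hel", 0), ("lo", 2), (" world", 3), (".", 1)]  # made-up data

step = 0
for text, d in emissions:
    start = step * time_ratio
    dur = durations[d] * time_ratio
    print(f"{text!r:10} start={start:.2f}s duration={dur:.2f}s")
    step += durations[d]      # duration 0 keeps step in place: the next
                              # token is emitted at the same timestamp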

RNNT Alignment

RNNT assigns a fixed duration of one frame to every token:
# From parakeet.py:720-728
hypothesis.append(
    AlignedToken(
        int(pred_token),
        start=step * self.time_ratio,
        duration=1 * self.time_ratio,  # Always 1 frame
        confidence=confidence,
        text=tokenizer.decode([pred_token], self.vocabulary),
    )
)
RNNT timestamps are less accurate than TDT because:
  1. All tokens have the same duration regardless of actual length
  2. Multiple tokens can be emitted at the same frame (step doesn’t advance)
  3. Alignment depends on when the model emits blank tokens

CTC Alignment

CTC infers token boundaries from frame-level predictions:
# From parakeet.py:798-838 (simplified pseudocode; argmax, blank, decode, and
# avg_confidence_over_frames stand in for the full implementation)
predictions = argmax(logits, axis=1)  # Best token per encoder frame

hypothesis = []
prev_token = -1
prev_start = 0

for t, token_id in enumerate(predictions):
    if token_id == blank:       # Skip blank frames
        continue
    if token_id == prev_token:  # Skip repeated frames of the same token
        continue

    # Token ID changed: a new token starts here, so finalize the previous one
    if prev_token != -1:
        hypothesis.append(AlignedToken(
            prev_token,
            start=prev_start * time_ratio,
            duration=(t - prev_start) * time_ratio,
            confidence=avg_confidence_over_frames,
            text=decode([prev_token]),
        ))

    prev_start = t
    prev_token = token_id

# The last pending token is finalized after the loop in the full code.
CTC collapses repetitions to find token boundaries:
Frame:  0    1    2    3    4    5    6    7
Pred:   H    H    _    e    l    l    _    o

After collapsing:
- "H" at frame 0 (duration: 2 frames, 0-2)
- "e" at frame 3 (duration: 1 frame, 3-4)
- "l" at frame 4 (duration: 2 frames, 4-6)
- "o" at frame 7 (duration: 1+ frames, 7-end)
Token boundaries are detected when:
  1. Token ID changes from previous frame
  2. Current token is not blank
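The same collapse rule can be sketched standalone on the example frame sequence above ("_" is the blank token; frame indices are list positions):
frames = ["H", "H", "_", "e", "l", "l", "_", "o"]

boundaries = []
prev = None
for t, tok in enumerate(frames):
    if tok == "_":        # skip blank frames
        continue
    if tok == prev:       # skip repeated frames of the same token
        continue
    boundaries.append((tok, t))   # new token starts at frame t
    prev = tok

print(boundaries)   # [('H', 0), ('e', 3), ('l', 4), ('o', 7)]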

Sentence Segmentation

Tokens are grouped into sentences using configurable rules:

SentenceConfig

# From alignment.py:51-55
@dataclass
class SentenceConfig:
    max_words: int | None = None        # Split after N words
    silence_gap: float | None = None    # Split after N seconds of silence
    max_duration: float | None = None   # Split after N seconds total
max_words (int | None, default: None)
Maximum number of words per sentence. A "word" is detected by checking for a space in the token text.
# From alignment.py:79-86
is_word_limit = (
    (config.max_words is not None)
    and (idx != len(tokens) - 1)
    and (
        len([x for x in current_tokens if " " in x.text])
        + (1 if " " in tokens[idx + 1].text else 0)
        > config.max_words
    )
)
Set max_words=30 for subtitle-style segmentation
silence_gap (float | None, default: None)
Split sentences if silence between tokens exceeds this duration (in seconds).
# From alignment.py:88-92
is_long_silence = (
    (config.silence_gap is not None)
    and (idx != len(tokens) - 1)
    and (tokens[idx + 1].start - token.end >= config.silence_gap)
)
Useful for detecting natural pauses in speech.
max_duration (float | None, default: None)
Split sentences after this many seconds of audio (regardless of content).
# From alignment.py:93-95
is_over_duration = (
    (config.max_duration is not None) and
    (token.end - current_tokens[0].start >= config.max_duration)
)
Prevents excessively long sentence segments.

Example Usage

from parakeet_mlx import from_pretrained, DecodingConfig, SentenceConfig

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")

# Default: split only on punctuation
result = model.transcribe("audio.wav")

# Split on punctuation OR every 30 words
result = model.transcribe(
    "audio.wav",
    decoding_config=DecodingConfig(
        sentence=SentenceConfig(max_words=30)
    )
)

# Split on punctuation OR 2+ seconds of silence
result = model.transcribe(
    "audio.wav",
    decoding_config=DecodingConfig(
        sentence=SentenceConfig(silence_gap=2.0)
    )
)

# Multiple conditions (split on any)
result = model.transcribe(
    "audio.wav",
    decoding_config=DecodingConfig(
        sentence=SentenceConfig(
            max_words=30,
            silence_gap=2.0,
            max_duration=10.0
        )
    )
)

Chunk Merging

When transcribing long audio with chunking, overlapping regions are merged intelligently:

Merge Strategies

The primary merge strategy, merge_longest_contiguous, finds the longest contiguous matching token subsequence:
# From alignment.py:116-194
def merge_longest_contiguous(
    a: list[AlignedToken],
    b: list[AlignedToken],
    overlap_duration: float
):
    # 1. Extract overlapping regions
    # (b_start and a_end locate the overlap window on the global timeline;
    #  they are computed from the chunk offsets in the full implementation)
    overlap_a = [token for token in a if token.end > b_start - overlap_duration]
    overlap_b = [token for token in b if token.start < a_end + overlap_duration]
    
    # 2. Find longest contiguous match
    best_contiguous = []
    for i in range(len(overlap_a)):
        for j in range(len(overlap_b)):
            if overlap_a[i].id == overlap_b[j].id and \
               abs(overlap_a[i].start - overlap_b[j].start) < overlap_duration / 2:
                # Extend match as far as possible
                current = []
                k, l = i, j
                while k < len(overlap_a) and l < len(overlap_b) and \
                      overlap_a[k].id == overlap_b[l].id:
                    current.append((k, l))
                    k += 1
                    l += 1
                
                if len(current) > len(best_contiguous):
                    best_contiguous = current
    
    # 3. Merge using the contiguous sequence as an anchor:
    #    keep the prefix from a, the matched region, then the suffix from b
    #    (match_start, matched_tokens, and match_end are derived from
    #     best_contiguous in the full implementation)
    result = a[:match_start] + matched_tokens + b[match_end:]
    return result
At least 50% of the overlapping tokens must match; if that threshold is not met, merging falls back to a longest-common-subsequence (LCS) strategy.

Overlap Duration

result = model.transcribe(
    "long_audio.wav",
    chunk_duration=120.0,    # 2-minute chunks
    overlap_duration=15.0    # 15-second overlap
)
Overlap duration controls the merging window:
  • Too small (<5s): Risk of missing matches, poor merging
  • Good range (10-20s): Reliable merging for most speech
  • Too large (>30s): Unnecessary computation, slower processing
Default of 15 seconds works well for natural speech.

Streaming Alignment

Streaming mode produces incremental results with draft and finalized tokens:
with model.transcribe_stream(context_size=(256, 256)) as transcriber:
    for chunk in audio_chunks:
        transcriber.add_audio(chunk)
        
        # Finalized tokens won't change
        finalized = transcriber.finalized_tokens
        
        # Draft tokens may change in next iteration
        draft = transcriber.draft_tokens
        
        # Combined result
        result = transcriber.result
# From parakeet.py:1056-1088
finalized_length = max(0, length - self.drop_size)

# Phase 1: Finalized region (won't change)
finalized_tokens, finalized_state = self.model.decode(
    features,
    mx.array([finalized_length]),
    [self.last_token],
    [self.decoder_hidden]
)
self.finalized_tokens.extend(finalized_tokens[0])

# Phase 2: Draft region (will be reprocessed)
draft_tokens, _ = self.model.decode(
    features[:, finalized_length:],
    mx.array([features.shape[1] - finalized_length]),
    [self.last_token],
    [self.decoder_hidden]
)
self.draft_tokens = draft_tokens[0]  # Replace, don't extend
  • Finalized tokens: processed with enough context, won't change
  • Draft tokens: at the end of the current buffer, may change as more audio arrives
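A small sketch (assuming finalized_tokens and draft_tokens are lists of AlignedToken, as in the loop above) for rendering a live transcript whose tail may still change:
def render_live(transcriber) -> str:
    # Token text already carries leading spaces, so a plain join reads naturally
    stable = "".join(t.text for t in transcriber.finalized_tokens)
    tentative = "".join(t.text for t in transcriber.draft_tokens)
    return f"{stable} [{tentative}]"   # bracket the part that may still change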

Accessing Timestamps

Token Level

result = model.transcribe("audio.wav")

for token in result.tokens:
    print(f"{token.text:15s} [{token.start:6.2f}s - {token.end:6.2f}s] "
          f"confidence: {token.confidence:.3f}")

# Output:
# Hello          [ 0.00s -  0.16s] confidence: 0.987
#  world         [ 0.16s -  0.24s] confidence: 0.954
# .              [ 0.24s -  0.24s] confidence: 0.892

Sentence Level

for idx, sentence in enumerate(result.sentences):
    print(f"[{idx}] {sentence.text}")
    print(f"    Time: {sentence.start:.2f}s - {sentence.end:.2f}s "
          f"(duration: {sentence.duration:.2f}s)")
    print(f"    Confidence: {sentence.confidence:.3f}")
    print(f"    Tokens: {len(sentence.tokens)}")

# Output:
# [0] Hello world.
#     Time: 0.00s - 0.24s (duration: 0.24s)
#     Confidence: 0.945
#     Tokens: 3

Export Formats

Timestamps can be exported to various subtitle formats:
# SRT format
parakeet-mlx audio.mp3 --output-format srt

# VTT format with word-level highlights
parakeet-mlx audio.mp3 --output-format vtt --highlight-words

# JSON with full timestamp data
parakeet-mlx audio.mp3 --output-format json
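The CLI handles these formats directly; as an illustration of how sentence timestamps map onto subtitle cues, here is a hypothetical SRT writer (not the library's own formatter):
def srt_time(seconds: float) -> str:
    # SRT cues use HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(result) -> str:
    cues = []
    for i, sentence in enumerate(result.sentences, start=1):
        cues.append(f"{i}\n{srt_time(sentence.start)} --> {srt_time(sentence.end)}\n"
                    f"{sentence.text.strip()}\n")
    return "\n".join(cues)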

Best Practices

Best timestamp accuracy: TDT models with beam search
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe(
    "audio.wav",
    decoding_config=DecodingConfig(
        decoding=Beam(beam_size=5, duration_reward=0.7)
    )
)
Fast approximate timestamps: CTC models
model = from_pretrained("mlx-community/parakeet-ctc-1.1b")
result = model.transcribe("audio.wav")  # Greedy only
Balanced accuracy and speed: RNNT models

For subtitle generation, use max_words and max_duration:
result = model.transcribe(
    "video.mp4",
    decoding_config=DecodingConfig(
        sentence=SentenceConfig(
            max_words=10,       # Max 10 words per subtitle
            max_duration=5.0,   # Max 5 seconds per subtitle
            silence_gap=1.0     # Split on 1+ second pauses
        )
    )
)
This creates comfortable reading segments for viewers.
For audio longer than 2 minutes, use chunking:
result = model.transcribe(
    "podcast.mp3",
    chunk_duration=120.0,   # 2-minute chunks
    overlap_duration=15.0,  # 15-second overlap for merging
    chunk_callback=lambda pos, total: print(f"{pos}/{total} samples")
)
Chunking:
  • ✅ Prevents memory issues
  • ✅ Enables progress tracking
  • ✅ Maintains timestamp accuracy (with proper overlap)
  • ⚠️ Slightly slower due to overlap processing
Use confidence scores to identify uncertain regions:
# Find low-confidence sentences
uncertain = [s for s in result.sentences if s.confidence < 0.7]

# Find low-confidence tokens
uncertain_tokens = [t for t in result.tokens if t.confidence < 0.5]

# Compute average confidence
avg_confidence = sum(s.confidence for s in result.sentences) / len(result.sentences)

print(f"Average confidence: {avg_confidence:.3f}")
print(f"Uncertain regions: {len(uncertain)} / {len(result.sentences)}")
Low confidence may indicate:
  • Background noise or poor audio quality
  • Accents or unusual pronunciation
  • Technical jargon or rare words
  • Overlapping speech

Next Steps

Model Architectures

Learn how different models compute timestamps

Decoding Strategies

Understand greedy vs beam search impact
