Parakeet MLX provides word-level timestamp alignment, tracking when each token starts and ends in the audio. The alignment system produces hierarchical results with token and sentence boundaries.
```python
# From alignment.py:19-35
@dataclass
class AlignedSentence:
    text: str                   # Full sentence text
    tokens: list[AlignedToken]  # Constituent tokens
    start: float = 0.0          # Sentence start (from first token)
    end: float = 0.0            # Sentence end (from last token)
    duration: float = 0.0       # Sentence duration
    confidence: float = 1.0     # Aggregate confidence

    def __post_init__(self):
        self.tokens = list(sorted(self.tokens, key=lambda x: x.start))
        self.start = self.tokens[0].start
        self.end = self.tokens[-1].end
        self.duration = self.end - self.start
        # Geometric mean of token confidences
        confidences = np.array([t.confidence for t in self.tokens])
        self.confidence = float(np.exp(np.mean(np.log(confidences + 1e-10))))
```
Sentence confidence uses the geometric mean rather than the arithmetic mean. This makes the score more sensitive to low-confidence tokens: if even one token has very low confidence, the sentence confidence drops sharply, whereas an arithmetic mean would largely average it away.
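To see the difference concretely, here is a small standalone comparison (the confidence values are illustrative, not from the library; the geometric mean mirrors the library's `exp(mean(log(x + eps)))` formulation):

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs, eps=1e-10):
    # Same formulation as AlignedSentence: exp(mean(log(x + eps)))
    return math.exp(sum(math.log(x + eps) for x in xs) / len(xs))

confidences = [0.95, 0.97, 0.96, 0.05]  # one very uncertain token

# The arithmetic mean stays fairly high; the geometric mean
# is dragged down by the single low-confidence token.
print(f"arithmetic: {arithmetic_mean(confidences):.3f}")
print(f"geometric:  {geometric_mean(confidences):.3f}")
```

A single 0.05 token pulls the geometric mean below 0.5 while the arithmetic mean remains above 0.7, which is exactly the behavior you want when flagging unreliable sentences.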
```python
# From alignment.py:38-48
@dataclass
class AlignedResult:
    text: str                        # Full transcription
    sentences: list[AlignedSentence] # Sentence segments

    def __post_init__(self):
        self.text = self.text.strip()

    @property
    def tokens(self) -> list[AlignedToken]:
        # Flatten all tokens from all sentences
        return [token for sentence in self.sentences for token in sentence.tokens]
```
The duration_reward parameter controls how much to trust duration predictions vs token predictions. Higher values (0.6-0.7) often improve timestamp accuracy.
```python
# From alignment.py:51-55
@dataclass
class SentenceConfig:
    max_words: int | None = None       # Split after N words
    silence_gap: float | None = None   # Split after N seconds of silence
    max_duration: float | None = None  # Split after N seconds total
```
Maximum number of words per sentence. A "word" is detected by checking for a space in the token text, since subword tokenizers typically mark word boundaries with a leading space.
```python
# From alignment.py:79-86
is_word_limit = (
    (config.max_words is not None)
    and (idx != len(tokens) - 1)
    and (
        len([x for x in current_tokens if " " in x.text])
        + (1 if " " in tokens[idx + 1].text else 0)
        > config.max_words
    )
)
```
Split a sentence if the silence between consecutive tokens exceeds this duration (in seconds).
```python
# From alignment.py:88-92
is_long_silence = (
    (config.silence_gap is not None)
    and (idx != len(tokens) - 1)
    and (tokens[idx + 1].start - token.end >= config.silence_gap)
)
```
Split sentences after this many seconds of audio (regardless of content).
```python
# From alignment.py:93-95
is_over_duration = (
    (config.max_duration is not None)
    and (token.end - current_tokens[0].start >= config.max_duration)
)
```
Prevents excessively long sentence segments.
Sentences are automatically split on sentence-ending punctuation:
```python
# From alignment.py:67-78
is_punctuation = (
    "!" in token.text
    or "?" in token.text
    or "。" in token.text  # Chinese/Japanese period
    or "?" in token.text  # Chinese/Japanese question mark
    or "!" in token.text  # Chinese/Japanese exclamation
    or (
        "." in token.text
        and (idx == len(tokens) - 1 or " " in tokens[idx + 1].text)
    )
)
```
Period (`.`) is only treated as sentence-ending if:

- it is the last token, OR
- the next token contains a space, i.e. begins a new word (avoiding splits inside abbreviations like "Mr. Smith")
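The period rule can be isolated into a small standalone check (a sketch mirroring the condition in `alignment.py`; the token strings below are illustrative):

```python
def period_ends_sentence(tokens: list[str], idx: int) -> bool:
    """A '.' ends a sentence only when it is the final token
    or the next token begins a new word (contains a space)."""
    token = tokens[idx]
    if "." not in token:
        return False
    return idx == len(tokens) - 1 or " " in tokens[idx + 1]

# A decimal point stays inside the sentence: the next token "14"
# contains no space, so no split happens after " 3.".
print(period_ends_sentence(["pi", " is", " 3.", "14"], 2))      # False

# A final period always ends the sentence:
print(period_ends_sentence(["pi", " is", " 3.", "14", "."], 4))  # True
```

This is why decimal numbers like "3.14" survive segmentation intact: the digits after the point are a continuation token with no leading space.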
```python
# From alignment.py:58-109 (simplified)
def tokens_to_sentences(
    tokens: list[AlignedToken], config: SentenceConfig
) -> list[AlignedSentence]:
    sentences = []
    current_tokens = []

    for idx, token in enumerate(tokens):
        current_tokens.append(token)

        # Check split conditions
        should_split = (
            is_punctuation or is_word_limit or is_long_silence or is_over_duration
        )

        if should_split:
            sentence_text = "".join(t.text for t in current_tokens)
            sentence = AlignedSentence(text=sentence_text, tokens=current_tokens)
            sentences.append(sentence)
            current_tokens = []

    # Handle remaining tokens
    if current_tokens:
        sentence = AlignedSentence(
            text="".join(t.text for t in current_tokens), tokens=current_tokens
        )
        sentences.append(sentence)

    return sentences
```
Primary merge strategy that finds the longest contiguous matching subsequence:
```python
# From alignment.py:116-194
def merge_longest_contiguous(
    a: list[AlignedToken], b: list[AlignedToken], overlap_duration: float
):
    # 1. Extract overlapping regions
    overlap_a = [token for token in a if token.end > b_start - overlap_duration]
    overlap_b = [token for token in b if token.start < a_end + overlap_duration]

    # 2. Find longest contiguous match
    best_contiguous = []
    for i in range(len(overlap_a)):
        for j in range(len(overlap_b)):
            if overlap_a[i].id == overlap_b[j].id and \
               abs(overlap_a[i].start - overlap_b[j].start) < overlap_duration / 2:
                # Extend match as far as possible
                current = []
                k, l = i, j
                while k < len(overlap_a) and l < len(overlap_b) and \
                      overlap_a[k].id == overlap_b[l].id:
                    current.append((k, l))
                    k += 1
                    l += 1
                if len(current) > len(best_contiguous):
                    best_contiguous = current

    # 3. Merge using contiguous sequence as anchor:
    #    keep prefix from a, matched region, suffix from b
    result = a[:match_start] + matched_tokens + b[match_end:]
    return result
```
This strategy requires at least 50% of the overlap tokens to match; if the threshold is not met, it raises and the caller falls back to the LCS merge.
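The threshold check can be sketched as follows (a minimal illustration; the function name and exact ratio bookkeeping are assumptions, not the library's internals):

```python
def contiguous_match_is_reliable(best_contiguous, overlap_a, overlap_b,
                                 threshold=0.5):
    """Require the contiguous anchor to cover at least `threshold`
    of the shorter overlap; otherwise the caller should fall back
    to the LCS merge."""
    shorter = min(len(overlap_a), len(overlap_b))
    if shorter == 0:
        return True  # nothing overlaps, so nothing to verify
    return len(best_contiguous) / shorter >= threshold

# 3 matched pairs out of a 4-token overlap: reliable
print(contiguous_match_is_reliable([(0, 0), (1, 1), (2, 2)],
                                   [None] * 4, [None] * 6))  # True

# 1 matched pair out of 4: fall back to LCS
print(contiguous_match_is_reliable([(0, 0)],
                                   [None] * 4, [None] * 6))  # False
```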
Fallback strategy using dynamic programming:
```python
# From alignment.py:197-287
def merge_longest_common_subsequence(
    a: list[AlignedToken], b: list[AlignedToken], overlap_duration: float
):
    # 1. Extract overlapping regions
    overlap_a = [...]
    overlap_b = [...]

    # 2. Dynamic programming for LCS
    dp = [[0 for _ in range(len(overlap_b) + 1)] for _ in range(len(overlap_a) + 1)]
    for i in range(1, len(overlap_a) + 1):
        for j in range(1, len(overlap_b) + 1):
            if overlap_a[i - 1].id == overlap_b[j - 1].id and \
               abs(overlap_a[i - 1].start - overlap_b[j - 1].start) < overlap_duration / 2:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # 3. Backtrack to find LCS
    lcs_pairs = []
    i, j = len(overlap_a), len(overlap_b)
    while i > 0 and j > 0:
        if overlap_a[i - 1].id == overlap_b[j - 1].id:
            lcs_pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif dp[i - 1][j] > dp[i][j - 1]:
            i -= 1
        else:
            j -= 1

    # 4. Merge using LCS as anchor
    return merged_tokens
```
The LCS merge is more flexible than contiguous matching because it tolerates insertions and deletions that break up a contiguous run, at the cost of building a full dynamic-programming table over the overlap regions.
Chunked transcription stitches per-chunk results into a single token stream, trying the contiguous merge first and falling back to LCS when it fails:

```python
# From parakeet.py:180-220 (simplified)
all_tokens = []
for start in range(0, len(audio), chunk_samples - overlap_samples):
    # Transcribe chunk
    chunk_audio = audio[start:end]
    chunk_result = model.generate(chunk_mel)

    # Adjust timestamps relative to full audio
    chunk_offset = start / sample_rate
    for sentence in chunk_result.sentences:
        for token in sentence.tokens:
            token.start += chunk_offset
            token.end = token.start + token.duration

    # Merge with previous chunks
    if all_tokens:
        try:
            all_tokens = merge_longest_contiguous(
                all_tokens, chunk_result.tokens, overlap_duration=overlap_duration
            )
        except RuntimeError:  # Contiguous merge failed
            all_tokens = merge_longest_common_subsequence(
                all_tokens, chunk_result.tokens, overlap_duration=overlap_duration
            )
    else:
        all_tokens = chunk_result.tokens
```
Streaming mode produces incremental results with draft and finalized tokens:
```python
with model.transcribe_stream(context_size=(256, 256)) as transcriber:
    for chunk in audio_chunks:
        transcriber.add_audio(chunk)

        # Finalized tokens won't change
        finalized = transcriber.finalized_tokens

        # Draft tokens may change in next iteration
        draft = transcriber.draft_tokens

        # Combined result
        result = transcriber.result
```
Best timestamp accuracy: TDT models with beam search
```python
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe(
    "audio.wav",
    decoding_config=DecodingConfig(
        decoding=Beam(beam_size=5, duration_reward=0.7)
    ),
)
```
Fast approximate timestamps: CTC models
```python
model = from_pretrained("mlx-community/parakeet-ctc-1.1b")
result = model.transcribe("audio.wav")  # Greedy only
```
Balanced: RNNT models
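Following the same pattern as the TDT and CTC examples above, an RNNT model can be loaded the same way (the checkpoint name below is an assumption for illustration; substitute whichever RNNT repository you use):

```python
# Checkpoint name is illustrative, not a confirmed repository
model = from_pretrained("mlx-community/parakeet-rnnt-1.1b")
result = model.transcribe("audio.wav")
```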
Sentence segmentation for subtitles
For subtitle generation, use max_words and max_duration:
```python
result = model.transcribe(
    "video.mp4",
    decoding_config=DecodingConfig(
        sentence=SentenceConfig(
            max_words=10,      # Max 10 words per subtitle
            max_duration=5.0,  # Max 5 seconds per subtitle
            silence_gap=1.0,   # Split on 1+ second pauses
        )
    ),
)
```
This creates comfortable reading segments for viewers.
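Once segmented, the sentences map directly onto subtitle files. Here is a minimal SRT writer sketch; it only assumes the `start`, `end`, and `text` fields of `AlignedSentence` shown earlier (the helper names are mine, not part of the library):

```python
from types import SimpleNamespace

def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def sentences_to_srt(sentences) -> str:
    """Render objects with .start, .end, .text (e.g. AlignedSentence) as SRT."""
    blocks = []
    for i, sent in enumerate(sentences, start=1):
        blocks.append(
            f"{i}\n"
            f"{to_srt_timestamp(sent.start)} --> {to_srt_timestamp(sent.end)}\n"
            f"{sent.text.strip()}\n"
        )
    return "\n".join(blocks)

# Illustrative stand-in for result.sentences:
demo = [SimpleNamespace(start=0.0, end=2.5, text=" Hello there.")]
print(sentences_to_srt(demo))
```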
Handling long audio
For audio longer than 2 minutes, use chunking:
```python
result = model.transcribe(
    "podcast.mp3",
    chunk_duration=120.0,   # 2-minute chunks
    overlap_duration=15.0,  # 15-second overlap for merging
    chunk_callback=lambda pos, total: print(f"{pos}/{total} samples"),
)
```
Use confidence scores to identify uncertain regions:
```python
# Find low-confidence sentences
uncertain = [s for s in result.sentences if s.confidence < 0.7]

# Find low-confidence tokens
uncertain_tokens = [t for t in result.tokens if t.confidence < 0.5]

# Compute average confidence
avg_confidence = sum(s.confidence for s in result.sentences) / len(result.sentences)
print(f"Average confidence: {avg_confidence:.3f}")
print(f"Uncertain regions: {len(uncertain)} / {len(result.sentences)}")
```