Forced alignment is the process of aligning a known transcript to an audio recording to determine exactly when each word or character was spoken. Unlike end-to-end timestamp models that predict timing as a by-product of ASR, Qwen3-ForcedAligner takes the transcript as a given and focuses all its capacity on precise boundary prediction. The result is significantly more accurate timing data, with an average absolute error of under 53 ms across languages, outperforming WhisperX, NFA, and Monotonic Aligner across all evaluated benchmarks.
What Is Forced Alignment?
Given an (audio, transcript) pair, the aligner produces a list of ForcedAlignItem objects, one per token (CJK character or word), each with a start_time and end_time in seconds. This is useful for:
- Subtitle generation with accurate word-level sync
- Speech data annotation and dataset curation
- Training TTS models that require phoneme-aligned data
- Karaoke-style highlighting in transcription UIs
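As an illustration of the subtitle use case, the sketch below groups aligned words into SubRip (SRT) cues. The Item class is a hypothetical stand-in with the same three pieces of information as an aligned token; it is not the library's class, and the fixed words_per_cue grouping is an assumption chosen for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for an aligned token (not the library class)."""
    text: str
    start_time: float  # seconds
    end_time: float    # seconds


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def items_to_srt(items: List[Item], words_per_cue: int = 4) -> str:
    """Group consecutive items into fixed-size cues and render them as SRT."""
    cues = []
    for index in range(0, len(items), words_per_cue):
        group = items[index:index + words_per_cue]
        line = " ".join(item.text for item in group)
        start = srt_timestamp(group[0].start_time)   # cue starts with its first word
        end = srt_timestamp(group[-1].end_time)      # and ends with its last word
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{line}")
    return "\n\n".join(cues) + "\n"
```

For character-level languages the join separator would likely be an empty string rather than a space; a production subtitle tool would also split cues on pauses or sentence boundaries instead of a fixed word count.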
Supported Languages
Qwen3-ForcedAligner-0.6B supports forced alignment for 11 languages:
| Language | Language | Language |
|---|---|---|
| Chinese | English | Cantonese |
| French | German | Italian |
| Japanese | Korean | Portuguese |
| Russian | Spanish | |
Language names must be passed in canonical Title Case (e.g. "Chinese", "English"). The aligner does not perform language identification; you must specify the language explicitly.
Standalone Usage
You can use Qwen3ForcedAligner independently of Qwen3ASRModel when you already have transcripts and only need timing information.
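A minimal sketch of standalone use, under stated assumptions: the import path qwen_asr and the item attribute name text are assumptions, and the arguments to align are passed positionally in the order this page documents them (audio, transcript, language) rather than by keyword. The helper names load_aligner and align_to_rows are hypothetical.

```python
from typing import Any, List, Tuple


def load_aligner(model_id: str = "Qwen/Qwen3-ForcedAligner-0.6B") -> Any:
    """Load the aligner; imports are local so this module loads without a GPU stack."""
    import torch
    from qwen_asr import Qwen3ForcedAligner  # assumed import path

    return Qwen3ForcedAligner.from_pretrained(
        model_id,
        dtype=torch.bfloat16,
        device_map="cuda:0",
    )


def align_to_rows(
    aligner: Any, audio: Any, text: str, language: str
) -> List[Tuple[str, float, float]]:
    """Align one audio/transcript pair and flatten it to (text, start, end) rows."""
    # Positional order follows this page: audio, transcript, language.
    results = aligner.align(audio, text, language)
    first = results[0]  # one result per input audio
    # `item.text` is an assumed attribute name for the aligned unit.
    return [(item.text, item.start_time, item.end_time) for item in first]
```

Typical use would be `rows = align_to_rows(load_aligner(), "speech.wav", "hello world", "English")`, where the file path is a placeholder.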
from_pretrained Parameters
Model source: a Hugging Face repository ID (e.g. "Qwen/Qwen3-ForcedAligner-0.6B") or a local directory path.
Additional keyword arguments are forwarded to AutoModel.from_pretrained(...). Typical usage: dtype=torch.bfloat16, device_map="cuda:0", attn_implementation="flash_attention_2".
align Parameters
Audio input(s). Accepted formats:
- str: local file path, HTTPS URL, or base64 data URL (data:audio/wav;base64,...)
- (np.ndarray, int): tuple of a waveform array and its sample rate
- list of any of the above for batch processing
Each audio input is limited to 180 s (MAX_FORCE_ALIGN_INPUT_SECONDS); longer audio will not be split automatically and may produce degraded results.
Transcript(s) to align. The aligner tokenizes the text using language-specific rules (character splitting for CJK, word splitting for space-delimited languages, etc.).
Language name(s) for each sample. Must be one of the 11 supported languages in Title Case. A single string is broadcast to the entire batch.
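Because the aligner does no language identification, it can help to validate language names before calling it. The sketch below mirrors the rules stated above (Title Case names from the supported list, a single string broadcast to the whole batch); resolve_languages is a hypothetical helper, not part of the library.

```python
from typing import List, Sequence, Union

# The 11 languages listed on this page, in the required Title Case.
SUPPORTED_LANGUAGES = frozenset({
    "Chinese", "English", "Cantonese", "French", "German", "Italian",
    "Japanese", "Korean", "Portuguese", "Russian", "Spanish",
})


def resolve_languages(
    language: Union[str, Sequence[str]], batch_size: int
) -> List[str]:
    """Broadcast a single language to the batch and validate every entry."""
    if isinstance(language, str):
        languages = [language] * batch_size  # one name applies to all samples
    else:
        languages = list(language)
    if len(languages) != batch_size:
        raise ValueError(f"expected {batch_size} languages, got {len(languages)}")
    for name in languages:
        if name not in SUPPORTED_LANGUAGES:
            raise ValueError(
                f"unsupported language {name!r}; use Title Case, e.g. 'English'"
            )
    return languages
```

Failing early with a clear message is cheaper than discovering a misspelled or lowercase language name after the model has been loaded.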
Using Alignment via Qwen3ASRModel.transcribe
The most convenient way to get timestamps is to load Qwen3ASRModel with a forced_aligner and call transcribe(..., return_time_stamps=True). The model transcribes the audio first, then runs forced alignment and populates ASRTranscription.time_stamps.
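When long audio is processed in chunks, timestamps produced for each chunk are relative to that chunk and have to be shifted by the chunk's start offset before merging. The following is a sketch of that offset correction only, not the library's implementation; Item and merge_chunks are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for an aligned token."""
    text: str
    start_time: float  # seconds, relative to the chunk start
    end_time: float


def merge_chunks(
    chunks: Sequence[Sequence[Item]], offsets: Sequence[float]
) -> List[Item]:
    """Shift each chunk's items by its start offset and concatenate in order."""
    if len(chunks) != len(offsets):
        raise ValueError("need exactly one offset per chunk")
    merged: List[Item] = []
    for items, offset in zip(chunks, offsets):
        for item in items:
            merged.append(
                replace(
                    item,
                    start_time=round(item.start_time + offset, 3),
                    end_time=round(item.end_time + offset, 3),
                )
            )
    return merged
```

For example, an item at 0.25 s in a second chunk that starts at 180 s ends up at 180.25 s in the merged timeline.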
When return_time_stamps=True, audio is processed in chunks of at most 180 seconds (the aligner’s limit). Audio longer than 180 s is still split automatically; timestamps from each chunk are offset-corrected and merged into a single ForcedAlignResult.
ForcedAlignResult and ForcedAlignItem
The align method returns a list[ForcedAlignResult], one entry per input audio.
ForcedAlignResult
Ordered list of aligned token spans for this sample.
ForcedAlignResult is iterable and supports len() and index access.
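To show what iteration, len() and index access look like in practice, the sketch below uses stand-in classes that implement the same container protocol. AlignItem and AlignResult are hypothetical, and the attribute names items and text are assumptions; only start_time and end_time are named on this page.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class AlignItem:
    """Stand-in for one aligned token."""
    text: str
    start_time: float
    end_time: float


@dataclass(frozen=True)
class AlignResult:
    """Stand-in container: iterable, sized, and indexable."""
    items: List[AlignItem]

    def __iter__(self) -> Iterator[AlignItem]:
        return iter(self.items)

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, index: int) -> AlignItem:
        return self.items[index]


result = AlignResult([
    AlignItem("hello", 0.12, 0.48),
    AlignItem("world", 0.52, 0.90),
])

first = result[0]                      # index access
count = len(result)                    # number of aligned tokens
durations = [round(item.end_time - item.start_time, 3) for item in result]  # iteration
```

Code written against this protocol does not need to know how the result stores its items internally.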
ForcedAlignItem
The aligned unit — a single CJK character for Chinese/Japanese/Cantonese, or a word (punctuation stripped) for space-delimited languages.
Start time in seconds, rounded to 3 decimal places.
End time in seconds, rounded to 3 decimal places.
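To get a feel for which units to expect back, the sketch below approximates the splitting described above: one unit per character for Chinese, Japanese and Cantonese, and one unit per whitespace-delimited word with surrounding punctuation stripped otherwise. This is a rough approximation for illustration, not the aligner's actual tokenizer, which may treat mixed scripts, numbers or other edge cases differently; split_units is a hypothetical name.

```python
import unicodedata
from typing import List

# Languages this page describes as aligned per character.
CHARACTER_LEVEL = {"Chinese", "Japanese", "Cantonese"}


def _is_punct(ch: str) -> bool:
    """True for any Unicode punctuation character (general category P*)."""
    return unicodedata.category(ch).startswith("P")


def split_units(text: str, language: str) -> List[str]:
    """Approximate the aligned units for a transcript (illustrative only)."""
    if language in CHARACTER_LEVEL:
        # One unit per character, skipping whitespace and punctuation.
        return [ch for ch in text if not ch.isspace() and not _is_punct(ch)]
    units = []
    for word in text.split():
        start, end = 0, len(word)
        # Strip leading and trailing punctuation but keep inner marks ("don't").
        while start < end and _is_punct(word[start]):
            start += 1
        while end > start and _is_punct(word[end - 1]):
            end -= 1
        if start < end:
            units.append(word[start:end])
    return units
```

Comparing the length of this list with the number of returned items can be a quick sanity check that a transcript was interpreted as intended.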
Batch Alignment
Pass lists to align to process multiple audio/transcript pairs in a single forward pass.
File paths, URLs, and (np.ndarray, sr) tuples are all accepted.
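A hedged sketch of a batch call follows. It checks that the audio and transcript lists line up, then passes them to align positionally in the order documented on this page. The wrapper align_batch and the attribute name text are assumptions; the aligner argument is any object exposing the align method described here.

```python
from typing import Any, Dict, List, Sequence, Union


def align_batch(
    aligner: Any,
    audios: Sequence[Any],
    texts: Sequence[str],
    language: Union[str, Sequence[str]],
) -> List[List[Dict[str, Any]]]:
    """Align several audio/transcript pairs in one call; return plain dict rows."""
    if len(audios) != len(texts):
        raise ValueError(
            f"got {len(audios)} audio inputs but {len(texts)} transcripts"
        )
    # A single language string is broadcast to the whole batch by the aligner.
    results = aligner.align(list(audios), list(texts), language)
    return [
        [
            {"text": item.text, "start": item.start_time, "end": item.end_time}
            for item in result
        ]
        for result in results  # one result per input audio, in input order
    ]
```

Each element of audios may be a path, a URL, or a (waveform, sample_rate) tuple, as listed in the align parameters.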
Audio Input Limits
| Constant | Value | Scope |
|---|---|---|
| MAX_FORCE_ALIGN_INPUT_SECONDS | 180 s | Per chunk in Qwen3ForcedAligner.align |
| MAX_ASR_INPUT_SECONDS | 1200 s | Per sample in Qwen3ASRModel.transcribe (no timestamps) |
| SAMPLE_RATE | 16 000 Hz | Required sample rate for all audio inputs |
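The limits in the table can be checked before calling the aligner directly. The sketch below redefines the three constants locally with the values from the table (rather than importing them, since their import location is not stated here) and shows a naive fixed-length split; the library's own chunking strategy may differ. fits_aligner and chunk_spans are hypothetical helpers.

```python
from typing import List, Tuple

# Values copied from the table above; defined locally for this sketch.
MAX_FORCE_ALIGN_INPUT_SECONDS = 180
MAX_ASR_INPUT_SECONDS = 1200
SAMPLE_RATE = 16_000


def fits_aligner(num_samples: int, sample_rate: int = SAMPLE_RATE) -> bool:
    """True if a waveform of this length is within the aligner's limit."""
    duration = num_samples / sample_rate  # seconds
    return duration <= MAX_FORCE_ALIGN_INPUT_SECONDS


def chunk_spans(
    total_seconds: float, chunk_seconds: float = MAX_FORCE_ALIGN_INPUT_SECONDS
) -> List[Tuple[float, float]]:
    """Split [0, total_seconds) into consecutive spans of at most chunk_seconds."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        start = end
    return spans
```

Cutting at fixed boundaries can split a word in half, so a real pipeline would more likely cut at silences near each boundary.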