Forced alignment aligns a known transcript to an audio recording to determine exactly when each word or character was spoken. Unlike end-to-end timestamp models that predict timing as a by-product of ASR, Qwen3-ForcedAligner takes the transcript as given and devotes all its capacity to precise boundary prediction. The result is more accurate timing data: the average absolute error is under 53 ms across languages, and the aligner outperforms WhisperX, NFA, and Monotonic Aligner on all evaluated benchmarks.

What Is Forced Alignment?

Given an (audio, transcript) pair, the aligner produces a list of ForcedAlignItem objects, one per token (a CJK character or a word), each with a start_time and end_time in seconds. This is useful for:
  • Subtitle generation with accurate word-level sync
  • Speech data annotation and dataset curation
  • Training TTS models that require phoneme-aligned data
  • Karaoke-style highlighting in transcription UIs
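
As an illustration of the subtitle use case, the sketch below turns a list of aligned tokens into SRT cues. It is a minimal example and not part of qwen_asr: `Item` is a hypothetical stand-in that only mirrors the text, start_time, and end_time fields of ForcedAlignItem, and `srt_timestamp`, `items_to_srt`, and the fixed `words_per_cue` grouping are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Item:
    """Hypothetical stand-in mirroring ForcedAlignItem's three fields."""
    text: str
    start_time: float  # seconds
    end_time: float    # seconds


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def items_to_srt(items: Iterable[Item], words_per_cue: int = 4, sep: str = " ") -> str:
    """Group consecutive tokens into cues of at most `words_per_cue` tokens."""
    items = list(items)
    cues: List[str] = []
    for n, i in enumerate(range(0, len(items), words_per_cue), start=1):
        group = items[i:i + words_per_cue]
        start = srt_timestamp(group[0].start_time)   # cue starts with its first token
        end = srt_timestamp(group[-1].end_time)      # and ends with its last token
        text = sep.join(it.text for it in group)
        cues.append(f"{n}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

For character-level output such as Chinese, passing sep="" joins the characters without spaces. A production subtitle pipeline would more likely break cues at pauses or punctuation than at a fixed token count.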

Supported Languages

Qwen3-ForcedAligner-0.6B supports forced alignment for 11 languages:
  • Chinese
  • English
  • Cantonese
  • French
  • German
  • Italian
  • Japanese
  • Korean
  • Portuguese
  • Russian
  • Spanish
Language names must be passed in canonical Title Case (e.g. "Chinese", "English"). The aligner does not perform language identification, so you must specify the language explicitly.
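
Because the language must be supplied explicitly and in Title Case, it can help to normalize user input before calling the aligner. The helper below is a hypothetical sketch: `SUPPORTED_LANGUAGES` and `normalize_language` are names invented for this example (they are not qwen_asr APIs), and the list simply copies the 11 languages named above.

```python
# The 11 languages listed in this section (copied here for the example).
SUPPORTED_LANGUAGES = (
    "Chinese", "English", "Cantonese", "French", "German", "Italian",
    "Japanese", "Korean", "Portuguese", "Russian", "Spanish",
)


def normalize_language(name: str) -> str:
    """Return the Title Case language name, or raise if it is not supported."""
    candidate = name.strip().title()  # " english " -> "English"
    if candidate not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {name!r}; expected one of {SUPPORTED_LANGUAGES}"
        )
    return candidate
```

Failing early with a clear message is usually easier to debug than passing an unexpected string through to the model.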

Standalone Usage

You can use Qwen3ForcedAligner independently of Qwen3ASRModel when you already have transcripts and only need timing information.
1. Load the aligner

import torch
from qwen_asr import Qwen3ForcedAligner

aligner = Qwen3ForcedAligner.from_pretrained(
    "Qwen/Qwen3-ForcedAligner-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    # attn_implementation="flash_attention_2",
)
2. Align a single audio + transcript pair

results = aligner.align(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
    text="甚至出现交易几乎停滞的情况。",
    language="Chinese",
)

print(results[0])
# Iterate over aligned tokens
for item in results[0]:
    print(f"{item.text!r}: {item.start_time}s → {item.end_time}s")

# Or access by index
first = results[0][0]
print(first.text, first.start_time, first.end_time)

from_pretrained Parameters

pretrained_model_name_or_path (str, required)
Hugging Face repository ID (e.g. "Qwen/Qwen3-ForcedAligner-0.6B") or a local directory path.

**kwargs (any)
Forwarded to AutoModel.from_pretrained(...). Typical usage: dtype=torch.bfloat16, device_map="cuda:0", attn_implementation="flash_attention_2".

align Parameters

audio (str | tuple | list, required)
Audio input(s). Accepted formats:
  • str: local file path, HTTPS URL, or base64 data URL (data:audio/wav;base64,...)
  • (np.ndarray, int): tuple of a waveform array and its sample rate
  • list of any of the above for batch processing
All inputs are resampled to mono 16 kHz internally. Maximum audio length is 180 seconds (MAX_FORCE_ALIGN_INPUT_SECONDS); longer audio will not be split automatically and may produce degraded results.

text (str | list[str], required)
Transcript(s) to align. The aligner tokenizes the text using language-specific rules (character splitting for CJK, word splitting for space-delimited languages, etc.).

language (str | list[str], required)
Language name(s) for each sample. Must be one of the 11 supported languages in Title Case. A single string is broadcast to the entire batch.
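
The broadcasting rule for language can be pictured with a small pre-flight check. This is only a sketch of the documented behavior, not the library's own validation code: `prepare_batch` is a hypothetical helper, and it treats anything that is not a list (a path string or an (np.ndarray, int) tuple) as a single sample.

```python
from typing import Any, List, Tuple, Union


def prepare_batch(
    audio: Any,
    text: Union[str, List[str]],
    language: Union[str, List[str]],
) -> Tuple[List[Any], List[str], List[str]]:
    """Normalize single inputs to lists and broadcast a single language."""
    # A str path/URL or a (waveform, sample_rate) tuple counts as one sample.
    audio_list = list(audio) if isinstance(audio, list) else [audio]
    text_list = list(text) if isinstance(text, list) else [text]
    if len(audio_list) != len(text_list):
        raise ValueError(
            f"Got {len(audio_list)} audio inputs but {len(text_list)} transcripts"
        )
    # A single language string applies to every sample in the batch.
    if isinstance(language, str):
        lang_list = [language] * len(audio_list)
    else:
        lang_list = list(language)
    if len(lang_list) != len(audio_list):
        raise ValueError(
            f"Got {len(lang_list)} languages for {len(audio_list)} audio inputs"
        )
    return audio_list, text_list, lang_list
```

Running a check like this before calling align surfaces mismatched list lengths with a message that names the offending argument.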

Using Alignment via Qwen3ASRModel.transcribe

The most convenient way to get timestamps is to load Qwen3ASRModel with a forced_aligner and call transcribe(..., return_time_stamps=True). The model transcribes the audio first, then runs forced alignment and populates ASRTranscription.time_stamps.
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(
        dtype=torch.bfloat16,
        device_map="cuda:0",
    ),
    max_inference_batch_size=32,
    max_new_tokens=256,
)

results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    language="English",
    return_time_stamps=True,
)

r = results[0]
print(r.language, r.text)
for item in r.time_stamps:
    print(f"  {item.text!r}: {item.start_time}s → {item.end_time}s")
When return_time_stamps=True, audio is processed in chunks of at most 180 seconds (the aligner's limit). Longer audio is split automatically; timestamps from each chunk are offset-corrected and merged into a single ForcedAlignResult.
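
To make "offset-corrected and merged" concrete, the sketch below shows the general idea: each chunk's timestamps are relative to the chunk start, so the chunk's start offset is added back before the lists are concatenated. This is an illustration under assumptions, not the library's actual implementation; `Item` and `merge_chunks` are hypothetical names, and `Item` only mirrors the fields of ForcedAlignItem.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Item:
    """Hypothetical stand-in mirroring ForcedAlignItem's three fields."""
    text: str
    start_time: float  # seconds, relative to the start of its chunk
    end_time: float


def merge_chunks(chunks: List[Tuple[float, List[Item]]]) -> List[Item]:
    """Merge (chunk_offset_seconds, items) pairs into one global timeline."""
    merged: List[Item] = []
    # Process chunks in order of their start offset in the original audio.
    for offset, items in sorted(chunks, key=lambda c: c[0]):
        for it in items:
            merged.append(
                Item(
                    text=it.text,
                    # Shift chunk-relative times to absolute times.
                    start_time=round(it.start_time + offset, 3),
                    end_time=round(it.end_time + offset, 3),
                )
            )
    return merged
```

For example, a token at 0.5 s inside a chunk that begins at 180 s ends up at 180.5 s in the merged result.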

ForcedAlignResult and ForcedAlignItem

The align method returns a list[ForcedAlignResult], one entry per input audio.

ForcedAlignResult

items (List[ForcedAlignItem])
Ordered list of aligned token spans for this sample.
ForcedAlignResult is iterable and supports len() and index access:
result = results[0]
len(result)       # number of aligned tokens
result[0]         # first ForcedAlignItem
for item in result:
    ...           # iterate over all items

ForcedAlignItem

text (str)
The aligned unit: a single CJK character for Chinese/Japanese/Cantonese, or a word (punctuation stripped) for space-delimited languages.

start_time (float)
Start time in seconds, rounded to 3 decimal places.

end_time (float)
End time in seconds, rounded to 3 decimal places.

Batch Alignment

Pass lists to align to process multiple audio/transcript pairs in a single forward pass.
import torch
from qwen_asr import Qwen3ForcedAligner

aligner = Qwen3ForcedAligner.from_pretrained(
    "Qwen/Qwen3-ForcedAligner-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

results = aligner.align(
    audio=[
        "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
        "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    ],
    text=[
        "甚至出现交易几乎停滞的情况。",
        "He wasn't even that big when I started listening to him.",
    ],
    language=["Chinese", "English"],
)

for i, r in enumerate(results):
    print(f"[{i}] {len(r)} tokens")
    print(f"  first: {r[0].text!r} {r[0].start_time}s → {r[0].end_time}s")
    print(f"  last : {r[-1].text!r} {r[-1].start_time}s → {r[-1].end_time}s")
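You can also mix input formats in the same batch. URL strings, base64 data URLs, and (np.ndarray, sr) tuples are all accepted:
import base64
import io
import urllib.request

import numpy as np
import soundfile as sf

# Same sample files and transcripts as in the batch example above;
# `aligner` is the Qwen3ForcedAligner instance loaded earlier.
URL_ZH = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav"
URL_EN = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
TEXT_ZH = "甚至出现交易几乎停滞的情况。"
TEXT_EN = "He wasn't even that big when I started listening to him."

# Prepare a (np.ndarray, sr) input
en_bytes = urllib.request.urlopen(URL_EN).read()
en_wav, en_sr = sf.read(io.BytesIO(en_bytes), dtype="float32", always_2d=False)

# Prepare a base64 data URL input
zh_bytes = urllib.request.urlopen(URL_ZH).read()
zh_b64 = "data:audio/wav;base64," + base64.b64encode(zh_bytes).decode()

results = aligner.align(
    audio=[URL_ZH, zh_b64, (np.asarray(en_wav, dtype=np.float32), en_sr)],
    text=[TEXT_ZH, TEXT_ZH, TEXT_EN],
    language=["Chinese", "Chinese", "English"],
)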

Audio Input Limits

The forced aligner has a maximum input length of 180 seconds (MAX_FORCE_ALIGN_INPUT_SECONDS = 180). Audio longer than this limit should be pre-split before calling align. When using Qwen3ASRModel.transcribe(..., return_time_stamps=True), splitting and offset correction are handled automatically.
Constant                        Value       Scope
MAX_FORCE_ALIGN_INPUT_SECONDS   180 s       Per chunk in Qwen3ForcedAligner.align
MAX_ASR_INPUT_SECONDS           1200 s      Per sample in Qwen3ASRModel.transcribe (no timestamps)
SAMPLE_RATE                     16 000 Hz   Target sample rate; all audio inputs are resampled to it internally
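
When calling align directly, one way to respect the 180-second limit is to check the duration up front and compute chunk boundaries yourself. The sketch below is a simplified illustration: `duration_seconds` and `chunk_bounds` are hypothetical helpers, the constants are re-declared locally from the values quoted above rather than imported from qwen_asr, and the fixed-length boundaries are an assumption made to keep the example short.

```python
from typing import List, Tuple

# Values quoted in this section, re-declared for a self-contained example.
MAX_FORCE_ALIGN_INPUT_SECONDS = 180
SAMPLE_RATE = 16_000


def duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Duration of a mono waveform with `num_samples` samples."""
    return num_samples / float(sample_rate)


def chunk_bounds(
    num_samples: int,
    sample_rate: int = SAMPLE_RATE,
    max_seconds: float = MAX_FORCE_ALIGN_INPUT_SECONDS,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for chunks of at most `max_seconds`."""
    step = int(max_seconds * sample_rate)  # samples per full-length chunk
    return [
        (start, min(start + step, num_samples))
        for start in range(0, num_samples, step)
    ]
```

With a NumPy waveform `wav`, each chunk would be `wav[start:end]`, and its timestamps would need to be shifted by `start / sample_rate` afterwards. Fixed boundaries can cut through a word, and the transcript has to be divided to match each chunk, so a real pipeline would more likely split at silences or at known sentence boundaries.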
