Word-Level Timestamps with Qwen3-ForcedAligner-0.6B

Forced alignment matches a known transcript to an audio recording to determine exactly when each word or character was spoken. Unlike end-to-end timestamp models that predict timing as a by-product of ASR, Qwen3-ForcedAligner takes the transcript as given and devotes all of its capacity to precise boundary prediction. The result is significantly more accurate timing data: the average absolute error is under 53 ms across languages, and the model outperforms WhisperX, NFA, and Monotonic Aligner on all evaluated benchmarks.

What Is Forced Alignment?

Given a pair of (audio, transcript), the aligner produces a list of ForcedAlignItem objects, one per token (CJK character or word), each with a start_time and end_time in seconds. This is useful for:

Subtitle generation with accurate word-level sync
Speech data annotation and dataset curation
Training TTS models that require phoneme-aligned data
Karaoke-style highlighting in transcription UIs
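As an illustration of the subtitle use case, the sketch below groups aligned items into fixed-size cues and renders them as SRT text. The `Item` dataclass is a hypothetical stand-in for ForcedAlignItem (it only mirrors the text, start_time, and end_time fields), and `srt_timestamp`, `to_srt`, and `words_per_cue` are names invented for this example rather than part of qwen_asr.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for ForcedAlignItem (text, start_time, end_time)."""
    text: str
    start_time: float
    end_time: float


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(items, words_per_cue: int = 4, joiner: str = " ") -> str:
    """Group aligned items into fixed-size cues and render them as SRT text."""
    items = list(items)  # works for any iterable of item-like objects
    cues = []
    for n, i in enumerate(range(0, len(items), words_per_cue), start=1):
        group = items[i:i + words_per_cue]
        start = srt_timestamp(group[0].start_time)
        end = srt_timestamp(group[-1].end_time)
        text = joiner.join(it.text for it in group)
        cues.append(f"{n}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Made-up timings, for illustration only
demo = [
    Item("He", 0.12, 0.31),
    Item("wasn't", 0.31, 0.64),
    Item("even", 0.64, 0.9),
    Item("that", 0.9, 1.08),
    Item("big", 1.08, 1.5),
]
print(to_srt(demo, words_per_cue=3))
```

For character-level output (for example Chinese), passing `joiner=""` avoids inserting spaces between characters.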

Supported Languages

Qwen3-ForcedAligner-0.6B supports forced alignment for 11 languages:

Chinese, English, Cantonese, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish

Language names must be passed in canonical Title Case (e.g. "Chinese", "English"). The aligner does not perform language identification — you must specify the language explicitly.
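Because the language must be supplied explicitly and in Title Case, it can help to normalize user input before calling the aligner. The following is a minimal sketch; `SUPPORTED_LANGUAGES` and `canonical_language` are hypothetical helper names defined here, not identifiers exported by qwen_asr.

```python
# The 11 languages listed in this section, in canonical Title Case
SUPPORTED_LANGUAGES = (
    "Chinese", "English", "Cantonese", "French", "German", "Italian",
    "Japanese", "Korean", "Portuguese", "Russian", "Spanish",
)
_BY_LOWERCASE = {name.lower(): name for name in SUPPORTED_LANGUAGES}


def canonical_language(name: str) -> str:
    """Return the Title Case language name, or raise if it is not supported."""
    key = name.strip().lower()
    if key not in _BY_LOWERCASE:
        raise ValueError(
            f"Unsupported language {name!r}; expected one of {SUPPORTED_LANGUAGES}"
        )
    return _BY_LOWERCASE[key]


print(canonical_language("english"))
```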

Standalone Usage

You can use Qwen3ForcedAligner independently of Qwen3ASRModel when you already have transcripts and only need timing information.

Load the aligner

import torch
from qwen_asr import Qwen3ForcedAligner

aligner = Qwen3ForcedAligner.from_pretrained(
    "Qwen/Qwen3-ForcedAligner-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    # attn_implementation="flash_attention_2",
)

Align a single audio + transcript pair

results = aligner.align(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
    text="甚至出现交易几乎停滞的情况。",
    language="Chinese",
)

print(results[0])
# Iterate over aligned tokens
for item in results[0]:
    print(f"{item.text!r}: {item.start_time}s → {item.end_time}s")

# Or access by index
first = results[0][0]
print(first.text, first.start_time, first.end_time)

`from_pretrained` Parameters

pretrained_model_name_or_path (str, required)

Hugging Face repository ID (e.g. "Qwen/Qwen3-ForcedAligner-0.6B") or a local directory path.

**kwargs (any)

Forwarded to AutoModel.from_pretrained(...). Typical usage: dtype=torch.bfloat16, device_map="cuda:0", attn_implementation="flash_attention_2".

`align` Parameters

audio (str | tuple | list, required)

Audio input(s). Accepted formats:

str — local file path, HTTPS URL, or base64 data URL (data:audio/wav;base64,...)
(np.ndarray, int) — tuple of a waveform array and its sample rate
list of any of the above for batch processing

All inputs are resampled to mono 16 kHz internally. Maximum audio length is 180 seconds (MAX_FORCE_ALIGN_INPUT_SECONDS); longer audio will not be split automatically and may produce degraded results.

text (str | list[str], required)

Transcript(s) to align. The aligner tokenizes the text using language-specific rules (character splitting for CJK, word splitting for space-delimited languages, etc.).
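To get a feel for how many items to expect for a given transcript, the rough approximation below splits CJK characters individually and groups other alphanumeric runs into words with punctuation dropped. This is not the aligner's own tokenizer, whose language-specific rules may differ; `approximate_units` and the character ranges used are assumptions made for illustration.

```python
import re

# Hiragana/Katakana and CJK Unified Ideographs (an assumed, simplified range)
_CJK_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def approximate_units(text: str) -> list[str]:
    """Roughly split text into per-character CJK units and punctuation-free words."""
    units: list[str] = []
    word = ""

    def flush() -> None:
        # Emit the pending word (if any), trimming stray quote marks
        nonlocal word
        cleaned = word.strip("'")
        if cleaned:
            units.append(cleaned)
        word = ""

    for ch in text:
        if _CJK_CHAR.match(ch):
            flush()
            units.append(ch)      # each CJK character is its own unit
        elif ch.isalnum() or ch == "'":
            word += ch            # keep apostrophes inside words like "wasn't"
        else:
            flush()               # whitespace and punctuation end the word
    flush()
    return units


print(approximate_units("He wasn't even that big."))
print(len(approximate_units("甚至出现交易几乎停滞的情况。")))
```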

language (str | list[str], required)

Language name(s) for each sample. Must be one of the 11 supported languages in Title Case. A single string is broadcast to the entire batch.
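The broadcasting rule can be pictured with a small sketch that pairs up batch arguments the way the description above implies. `prepare_batch` is a hypothetical helper written for this example; it only mirrors the documented behaviour and is not how qwen_asr is implemented internally.

```python
def prepare_batch(audio, text, language):
    """Pair audio, text, and language per sample, broadcasting a single language."""
    # A (waveform, sample_rate) tuple is a single sample, so only lists mean "batch"
    audios = audio if isinstance(audio, list) else [audio]
    texts = text if isinstance(text, list) else [text]
    if len(audios) != len(texts):
        raise ValueError(f"{len(audios)} audio inputs but {len(texts)} transcripts")

    if isinstance(language, str):
        languages = [language] * len(audios)   # one string applies to every sample
    else:
        languages = list(language)
        if len(languages) != len(audios):
            raise ValueError(f"{len(audios)} audio inputs but {len(languages)} languages")

    return list(zip(audios, texts, languages))


batch = prepare_batch(["a.wav", "b.wav"], ["first text", "second text"], "English")
print(batch)
```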

Using Alignment via `Qwen3ASRModel.transcribe`

The most convenient way to get timestamps is to load Qwen3ASRModel with a forced_aligner and call transcribe(..., return_time_stamps=True). The model transcribes the audio first, then runs forced alignment and populates ASRTranscription.time_stamps.

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(
        dtype=torch.bfloat16,
        device_map="cuda:0",
    ),
    max_inference_batch_size=32,
    max_new_tokens=256,
)

results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    language="English",
    return_time_stamps=True,
)

r = results[0]
print(r.language, r.text)
for item in r.time_stamps:
    print(f"  {item.text!r}: {item.start_time}s → {item.end_time}s")

When return_time_stamps=True, audio is processed in chunks of at most 180 seconds (the aligner's limit). Longer audio is split automatically, and the timestamps from each chunk are offset-corrected and merged into a single ForcedAlignResult.
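Offset correction here means shifting each chunk's timestamps by the position at which that chunk starts in the full recording. The sketch below illustrates the idea only; `Span` is a hypothetical stand-in for ForcedAlignItem, and `merge_chunks` is an assumed helper, not the library's internal routine.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for ForcedAlignItem."""
    text: str
    start_time: float
    end_time: float


def merge_chunks(chunks):
    """Merge per-chunk spans into one timeline.

    `chunks` is a list of (offset_seconds, spans) pairs in chronological order,
    where each span's times are relative to the start of its own chunk.
    """
    merged = []
    for offset, spans in chunks:
        for span in spans:
            merged.append(replace(
                span,
                start_time=round(span.start_time + offset, 3),
                end_time=round(span.end_time + offset, 3),
            ))
    return merged


# Made-up timings: the second chunk starts 180 s into the recording
timeline = merge_chunks([
    (0.0, [Span("first", 1.0, 1.4)]),
    (180.0, [Span("next", 0.25, 0.5)]),
])
print(timeline)
```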

ForcedAlignResult and ForcedAlignItem

The align method returns a list[ForcedAlignResult], one entry per input audio.

ForcedAlignResult

items (List[ForcedAlignItem])

Ordered list of aligned token spans for this sample.

ForcedAlignResult is iterable and supports len() and index access:

result = results[0]
len(result)       # number of aligned tokens
result[0]         # first ForcedAlignItem
for item in result:
    ...           # iterate over all items

ForcedAlignItem

text (str)

The aligned unit: a single CJK character for Chinese/Japanese/Cantonese, or a word (punctuation stripped) for space-delimited languages.

start_time (float)

Start time in seconds, rounded to 3 decimal places.

end_time (float)

End time in seconds, rounded to 3 decimal places.
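Since items are ordered and carry start and end times, a playback UI can look up the token to highlight at a given time with a binary search. A minimal sketch follows; the `Item` namedtuple is a hypothetical stand-in for ForcedAlignItem, and `active_index` is a name chosen for this example.

```python
from bisect import bisect_right
from collections import namedtuple

# Hypothetical stand-in exposing the same three fields as ForcedAlignItem
Item = namedtuple("Item", "text start_time end_time")


def active_index(items, t: float):
    """Return the index of the item being spoken at time t, or None during a gap."""
    starts = [it.start_time for it in items]
    i = bisect_right(starts, t) - 1        # last item starting at or before t
    if i >= 0 and t < items[i].end_time:
        return i
    return None


# Made-up timings with a short pause before "even"
items = [
    Item("He", 0.12, 0.31),
    Item("wasn't", 0.31, 0.64),
    Item("even", 0.70, 0.90),
]
print(active_index(items, 0.40))
```

For repeated lookups during playback, the `starts` list could be built once and reused instead of being recomputed on every call.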

Batch Alignment

Pass lists to align to process multiple audio/transcript pairs in a single call.

import torch
from qwen_asr import Qwen3ForcedAligner

aligner = Qwen3ForcedAligner.from_pretrained(
    "Qwen/Qwen3-ForcedAligner-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

results = aligner.align(
    audio=[
        "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
        "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    ],
    text=[
        "甚至出现交易几乎停滞的情况。",
        "He wasn't even that big when I started listening to him.",
    ],
    language=["Chinese", "English"],
)

for i, r in enumerate(results):
    print(f"[{i}] {len(r)} tokens")
    print(f"  first: {r[0].text!r} {r[0].start_time}s → {r[0].end_time}s")
    print(f"  last : {r[-1].text!r} {r[-1].start_time}s → {r[-1].end_time}s")

You can also mix input formats in the same batch — URL strings, base64 data URLs, and (np.ndarray, sr) tuples are all accepted:

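import base64
import io
import urllib.request

import numpy as np
import soundfile as sf

# Sample audio and transcripts from the earlier examples
URL_ZH = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav"
URL_EN = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
TEXT_ZH = "甚至出现交易几乎停滞的情况。"
TEXT_EN = "He wasn't even that big when I started listening to him."

# Prepare a (np.ndarray, sr) input by decoding the downloaded WAV bytes
en_bytes = urllib.request.urlopen(URL_EN).read()
en_wav, en_sr = sf.read(io.BytesIO(en_bytes), dtype="float32", always_2d=False)

# Prepare a base64 data URL input
zh_bytes = urllib.request.urlopen(URL_ZH).read()
zh_b64 = "data:audio/wav;base64," + base64.b64encode(zh_bytes).decode()

# `aligner` is the Qwen3ForcedAligner loaded above; each entry uses a different input format
results = aligner.align(
    audio=[URL_ZH, zh_b64, (np.asarray(en_wav, dtype=np.float32), en_sr)],
    text=[TEXT_ZH, TEXT_ZH, TEXT_EN],
    language=["Chinese", "Chinese", "English"],
)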

Audio Input Limits

The forced aligner has a maximum input length of 180 seconds (MAX_FORCE_ALIGN_INPUT_SECONDS = 180). Audio longer than this limit should be pre-split before calling align. When using Qwen3ASRModel.transcribe(..., return_time_stamps=True), splitting and offset correction are handled automatically.

Constant	Value	Scope
`MAX_FORCE_ALIGN_INPUT_SECONDS`	180 s	Per chunk in `Qwen3ForcedAligner.align`
`MAX_ASR_INPUT_SECONDS`	1200 s	Per sample in `Qwen3ASRModel.transcribe` (no timestamps)
`SAMPLE_RATE`	16 000 Hz	Internal sample rate; all audio inputs are resampled to it
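When pre-splitting audio for standalone use, the chunk boundaries and their time offsets follow directly from the 180-second limit. The sketch below computes fixed-length sample ranges; `chunk_bounds` is a hypothetical helper, and the constant is redefined locally with the value from the table above. Fixed cuts can land in the middle of a word, and each chunk still needs its matching portion of the transcript, so a real pipeline would more likely cut at pauses.

```python
MAX_FORCE_ALIGN_INPUT_SECONDS = 180  # value taken from the table above


def chunk_bounds(num_samples: int, sample_rate: int,
                 max_seconds: int = MAX_FORCE_ALIGN_INPUT_SECONDS):
    """Return (start, end) sample index pairs covering the audio in order."""
    step = max_seconds * sample_rate
    return [(start, min(start + step, num_samples))
            for start in range(0, num_samples, step)]


# A 400-second recording at 16 kHz splits into 180 s + 180 s + 40 s
sample_rate = 16_000
bounds = chunk_bounds(400 * sample_rate, sample_rate)
offsets = [start / sample_rate for start, _ in bounds]
print(bounds)
print(offsets)
```

Each offset is the value to add to the timestamps of the corresponding chunk when stitching results back onto the original timeline.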
