Qwen3-ASR is a family of open-source automatic speech recognition models developed by the Qwen team at Alibaba Cloud. The series includes Qwen3-ASR-1.7B and Qwen3-ASR-0.6B for multilingual speech, singing voice, and song transcription across 30 languages and 22 Chinese dialects, plus Qwen3-ForcedAligner-0.6B for word- and character-level timestamp prediction in 11 languages. Both ASR models support offline and streaming inference from a single checkpoint.

Quickstart

Transcribe your first audio file in under five minutes using the qwen-asr Python package.

Installation

Install the package with pip, set up a conda environment, and optionally enable FlashAttention 2.

Transformers Backend

Run offline batch inference with the HuggingFace Transformers backend and get timestamps.

vLLM Backend

Maximize throughput with the high-performance vLLM backend for production workloads.

Streaming Inference

Transcribe live audio in real time using the vLLM-powered streaming API.

Forced Aligner

Align existing transcripts to audio and obtain precise per-word timestamps.

Model Reference

Compare Qwen3-ASR-1.7B, 0.6B, and Qwen3-ForcedAligner model capabilities and downloads.

Fine-Tuning

Fine-tune Qwen3-ASR on your own audio data with single-GPU or multi-GPU training.

Key Features

52 Languages

30 spoken languages plus 22 Chinese dialects including Cantonese, Sichuan, and Wu.

Language Detection

Automatic language identification alongside transcription — no manual language tag required.

Offline & Streaming

A single model checkpoint handles both offline batch and real-time streaming transcription.

Music & Song Support

Transcribes singing voice and full songs with background music, not just clean speech.

OpenAI-Compatible API

vLLM serving exposes an OpenAI-compatible endpoint for easy integration with existing tooling.

Word Timestamps

Qwen3-ForcedAligner produces character- or word-level timestamps for 11 languages.

Quick Example

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
)

print(results[0].language)  # "English"
print(results[0].text)      # transcribed text
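
As a small sketch of post-processing, the returned list can be turned into one line per result. It relies only on the `language` and `text` attributes shown above; the helper name `summarize_results` and the stand-in objects are illustrative and not part of qwen-asr.

```python
from types import SimpleNamespace


def summarize_results(results):
    """Return one "index: [language] text" line per transcription result."""
    lines = []
    for index, result in enumerate(results):
        # Only .language and .text are used, matching the example above.
        text = result.text.strip()
        lines.append(f"{index}: [{result.language}] {text}")
    return lines


# Stand-in objects so the sketch runs without a GPU or a model download.
fake_results = [
    SimpleNamespace(language="English", text=" hello world "),
    SimpleNamespace(language="Chinese", text="你好"),
]
for line in summarize_results(fake_results):
    print(line)
```

With a loaded model, the same helper could be called as `summarize_results(model.transcribe(audio=...))`.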

Getting Started

1. Install the package

pip install -U qwen-asr
Add [vllm] for the high-performance vLLM backend: pip install -U qwen-asr[vllm]
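
To confirm that the install succeeded, one option is to query the installed distribution with the standard library. This is a minimal sketch; the helper name `installed_version` is an illustration, and it assumes the distribution is registered under the same name used in the pip command, `qwen-asr`.

```python
from importlib import metadata


def installed_version(distribution: str = "qwen-asr"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(f"qwen-asr version: {version}" if version else "qwen-asr is not installed")
```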

2. Load a model

from qwen_asr import Qwen3ASRModel
import torch

model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

3. Transcribe audio

Pass a local file path, URL, base64 string, or a (numpy_array, sample_rate) tuple to transcribe().
results = model.transcribe(audio="path/to/audio.wav")
print(results[0].language, results[0].text)
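
For in-memory audio, the `(numpy_array, sample_rate)` form can be built from WAV bytes with the standard library and NumPy. The sketch below assumes mono float32 samples scaled to [-1, 1] are acceptable input; this page does not state the expected dtype or channel layout, so treat that as an assumption. The helper names `make_test_wav` and `wav_bytes_to_tuple` are illustrative.

```python
import io
import wave

import numpy as np


def make_test_wav(seconds=0.1, sample_rate=16000, freq=440.0) -> bytes:
    """Create a short 16-bit mono sine tone as WAV bytes (test data only)."""
    t = np.arange(int(seconds * sample_rate)) / sample_rate
    pcm = (0.5 * np.sin(2 * np.pi * freq * t) * 32767).astype("<i2")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
    return buf.getvalue()


def wav_bytes_to_tuple(wav_bytes: bytes):
    """Decode 16-bit PCM WAV bytes into (float32 mono samples, sample_rate)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wf.getframerate()
        channels = wf.getnchannels()
        pcm = np.frombuffer(wf.readframes(wf.getnframes()), dtype="<i2")
    samples = pcm.astype(np.float32) / 32768.0
    if channels > 1:
        # Assumption: downmix to mono by averaging the channels.
        samples = samples.reshape(-1, channels).mean(axis=1)
    return samples, sample_rate


audio_tuple = wav_bytes_to_tuple(make_test_wav())
print(audio_tuple[0].shape, audio_tuple[1])
```

The resulting tuple would then be passed as `model.transcribe(audio=audio_tuple)`.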

4. Enable timestamps (optional)

Add the Qwen3-ForcedAligner to get word-level timestamps alongside transcription. See the Forced Aligner guide for details.
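
Once word-level timestamps are available, a common next step is turning them into subtitle cues. The sketch below is independent of qwen-asr: the `Word` dataclass is a hypothetical container with start and end times in seconds, not the type the aligner returns, so its fields would need to be mapped from the actual output described in the Forced Aligner guide.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical container; not the type returned by Qwen3-ForcedAligner.
    text: str
    start: float  # seconds
    end: float    # seconds


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=5):
    """Group consecutive words into SRT cues of at most max_words words."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        cues.append(
            f"{len(cues) + 1}\n"
            f"{srt_time(chunk[0].start)} --> {srt_time(chunk[-1].end)}\n"
            f"{' '.join(w.text for w in chunk)}\n"
        )
    return "\n".join(cues)


print(words_to_srt([Word("hello", 0.0, 0.4), Word("world", 0.5, 1.0)]))
```

Joining with spaces suits space-delimited languages; for character-level timestamps (for example Chinese), joining with an empty string would likely be more appropriate.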
