Offline Batch Transcription with the Transformers Backend

The Transformers backend is the simplest way to run Qwen3-ASR. It relies entirely on the standard Hugging Face transformers stack, so you can get up and running with a single pip install qwen-asr. It is the recommended starting point for experimentation, fine-tuning workflows, and deployments where installing vLLM is not practical.

When to Use the Transformers Backend

Use Transformers when…

You need a minimal, dependency-light setup
You are running on a single GPU or CPU
You are prototyping or evaluating the model
You need multi-device placement through device_map

Consider vLLM when…

You need maximum throughput at scale
You are serving concurrent requests
You need streaming transcription support
Batch sizes regularly exceed 64 items

Loading a Model

Use Qwen3ASRModel.from_pretrained to load the model with the Transformers backend. Model weights are downloaded automatically from Hugging Face on first use.

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    # attn_implementation="flash_attention_2",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

Parameters

pretrained_model_name_or_path

str

required

Hugging Face repository ID (e.g. "Qwen/Qwen3-ASR-1.7B") or a local directory path containing the model weights and config.

forced_aligner

str

default:"None"

Repository ID or local path of a Qwen3ForcedAligner model (e.g. "Qwen/Qwen3-ForcedAligner-0.6B"). Required when you intend to call transcribe(..., return_time_stamps=True). If omitted, timestamp requests will raise a ValueError.

forced_aligner_kwargs

dict

default:"None"

Keyword arguments forwarded verbatim to Qwen3ForcedAligner.from_pretrained(...). Accepts the same keys as **kwargs here, such as dtype, device_map, and attn_implementation.

max_inference_batch_size

int

default:"32"

Maximum number of audio chunks processed in a single forward pass. Set to -1 to remove the limit and process all chunks at once. Reduce this value when encountering GPU out-of-memory errors, especially with long audio inputs.

max_new_tokens

int

default:"512"

Maximum number of tokens the decoder may generate per chunk. Increase this for very long audio inputs or dense speech; reduce it to speed up inference on short clips (the loading example above passes 256).

**kwargs

any

All remaining keyword arguments are forwarded directly to AutoModel.from_pretrained(...). Common options include dtype (e.g. torch.bfloat16), device_map (e.g. "cuda:0" or "auto"), and attn_implementation (e.g. "flash_attention_2").

Basic Transcription

Pass a single audio file as a URL, local path, base64 data URL, or a (np.ndarray, sr) waveform tuple. The result is always a list of ASRTranscription objects, one per input.

results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    language=None,  # set "English" to force the language
)

print(results[0].language)  # e.g. "English"
print(results[0].text)      # transcribed text

`transcribe` Parameters

audio

str | tuple | list

required

Audio input. Accepted formats:

str — local file path, HTTPS URL, or base64 data URL (data:audio/wav;base64,...)
(np.ndarray, int) — tuple of a mono or multi-channel waveform and its sample rate
list of any of the above for batch inference

All inputs are resampled to mono 16 kHz internally. Audio shorter than 0.5 s is zero-padded; audio longer than 1200 s is automatically split into chunks.

context

str | list[str]

default:"" (empty string)

Optional context string(s) prepended to the system prompt. Useful for domain hints or vocabulary biasing. A single string is broadcast to the full batch.

language

str | list[str | None] | None

default:"None"

Optional language override. When provided, the prompt is modified to force text-only output and skip language identification. Must be a canonical name from model.get_supported_languages() (e.g. "Chinese", "English"). Pass None for automatic language detection.

return_time_stamps

bool

default:"False"

When True, the model runs forced alignment after transcription and populates ASRTranscription.time_stamps with a ForcedAlignResult. Requires forced_aligner to have been provided at initialization.
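
The broadcasting rule described for context and language can be pictured with a small helper. This is only an illustrative sketch of the documented behaviour, not the library's own code: normalize_per_sample is a hypothetical name, and the length check is an assumption about how a mismatched list would reasonably be rejected.

```python
from typing import List, Optional, Sequence, Union

PerSample = Union[Optional[str], Sequence[Optional[str]]]


def normalize_per_sample(value: PerSample, batch_size: int, name: str) -> List[Optional[str]]:
    """Expand a single value to one entry per audio input, or validate a list."""
    # A lone string (or None) applies to every item in the batch.
    if value is None or isinstance(value, str):
        return [value] * batch_size
    values = list(value)
    # Assumption: a per-sample list must match the number of audio inputs.
    if len(values) != batch_size:
        raise ValueError(f"{name}: expected {batch_size} entries, got {len(values)}")
    return values


print(normalize_per_sample("medical terms", 3, "context"))
print(normalize_per_sample([None, "Chinese", "English"], 3, "language"))
```

Normalizing arguments this way before calling transcribe makes length mismatches fail early, with a message that names the offending parameter.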

Batch Transcription

Pass a list of audio inputs to process multiple files in a single call. You can mix URL strings, base64 data URLs, and (np.ndarray, sr) tuples freely in the same batch. Per-sample context and language overrides are also supported.

import base64, io, urllib.request
import numpy as np
import soundfile as sf

URL_ZH = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav"
URL_EN = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"

# Download and prepare different input formats
zh_bytes = urllib.request.urlopen(URL_ZH).read()
zh_b64 = "data:audio/wav;base64," + base64.b64encode(zh_bytes).decode()

en_bytes = urllib.request.urlopen(URL_EN).read()
en_wav, en_sr = sf.read(io.BytesIO(en_bytes), dtype="float32", always_2d=False)
en_wav = np.asarray(en_wav, dtype=np.float32)

results = model.transcribe(
    audio=[URL_ZH, zh_b64, (en_wav, en_sr)],
    context=["", "交易 停滞", ""],
    language=[None, "Chinese", "English"],
)

for i, r in enumerate(results):
    print(f"[{i}] {r.language}: {r.text}")

max_inference_batch_size controls how many audio chunks are forwarded in a single pass. If you have a large batch of long files, reduce this value to stay within GPU memory limits.
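
To make the effect of the cap concrete, the sketch below groups a list of chunks into forward-pass batches. It is a conceptual illustration only: split_into_batches is a hypothetical helper, and the library's internal batching may differ in detail.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(chunks: Sequence[T], max_batch_size: int) -> List[List[T]]:
    """Group chunks into forward-pass batches; -1 means no limit."""
    if max_batch_size != -1 and max_batch_size < 1:
        raise ValueError("max_batch_size must be -1 or a positive integer")
    if not chunks:
        return []
    if max_batch_size == -1:
        return [list(chunks)]  # everything in a single pass
    return [
        list(chunks[i:i + max_batch_size])
        for i in range(0, len(chunks), max_batch_size)
    ]


batches = split_into_batches(list(range(70)), 32)
print([len(b) for b in batches])  # [32, 32, 6]
```

With 70 chunks and a cap of 32, the model would run three forward passes. Lowering the cap trades throughput for a smaller peak memory footprint per pass.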

Forcing a Language

Set language to a canonical language name to skip language identification and request plain-text transcription output. This is slightly faster and avoids occasional misidentification on short clips.

results = model.transcribe(
    audio=URL_ZH,
    language="Chinese",
)
print(results[0].text)

Call model.get_supported_languages() to retrieve the full list of 30 supported languages and 22 Chinese dialects.
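
Because language must be a canonical name, it can help to validate user input before calling transcribe. The helper below is a hedged sketch: resolve_language is a hypothetical name, the case-insensitive matching is a convenience added here, and it assumes get_supported_languages() returns an iterable of strings. A short stand-in list is used so the snippet runs without a loaded model.

```python
from typing import Iterable, Optional


def resolve_language(name: Optional[str], supported: Iterable[str]) -> Optional[str]:
    """Map a user-supplied name to its canonical spelling, or raise."""
    if name is None:
        return None  # keep automatic language detection
    by_lower = {lang.lower(): lang for lang in supported}
    key = name.strip().lower()
    if key not in by_lower:
        raise ValueError(f"Unsupported language: {name!r}")
    return by_lower[key]


# Stand-in list for illustration; in practice pass model.get_supported_languages().
supported = ["Chinese", "English", "Cantonese"]
print(resolve_language(" english ", supported))  # English
```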

Getting Timestamps

To obtain word- or character-level timestamps, load the model with a forced_aligner and set return_time_stamps=True in transcribe. The aligner runs after ASR and populates result.time_stamps with a ForcedAlignResult.

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    # attn_implementation="flash_attention_2",
    max_inference_batch_size=32,
    max_new_tokens=256,
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(
        dtype=torch.bfloat16,
        device_map="cuda:0",
        # attn_implementation="flash_attention_2",
    ),
)

results = model.transcribe(
    audio=[
        "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
        "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    ],
    language=["Chinese", "English"],
    return_time_stamps=True,
)

for r in results:
    print(r.language, r.text)
    # Iterate over aligned tokens
    for item in r.time_stamps:
        print(f"  {item.text!r}: {item.start_time}s → {item.end_time}s")

When return_time_stamps=True, the maximum audio length per chunk is capped at 180 seconds (MAX_FORCE_ALIGN_INPUT_SECONDS) instead of the usual 1200 seconds, because the forced aligner has a shorter input limit. Long audio is still split automatically.
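
The difference between the two limits is easy to quantify. The sketch below counts how many chunks a one-hour recording would produce under each cap. It assumes simple fixed-length cuts at 16 kHz; plan_chunks is a hypothetical helper, and the library may choose its actual split points differently.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000  # inputs are resampled to 16 kHz mono


def plan_chunks(num_samples: int, max_seconds: float) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges, each at most max_seconds long."""
    max_len = int(max_seconds * SAMPLE_RATE)
    if max_len < 1:
        raise ValueError("max_seconds is too small")
    return [
        (start, min(start + max_len, num_samples))
        for start in range(0, num_samples, max_len)
    ]


one_hour = 3600 * SAMPLE_RATE
print(len(plan_chunks(one_hour, 1200)))  # 3 chunks without timestamps
print(len(plan_chunks(one_hour, 180)))   # 20 chunks with timestamps
```

More chunks means more forward passes, so enabling timestamps on long recordings makes max_inference_batch_size more relevant.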

ASRTranscription Result Object

Each call to transcribe returns a list[ASRTranscription], one entry per input audio.

@dataclass
class ASRTranscription:
    language: str         # detected or forced language, e.g. "English" or "Chinese,English"
    text: str             # transcribed text
    time_stamps: Optional[ForcedAlignResult]  # populated only when return_time_stamps=True
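
As one possible way to consume time_stamps, the sketch below turns aligned items into SubRip (SRT) text, one cue per item. It relies only on the text, start_time, and end_time attributes used in the loop above and assumes the times are in seconds; format_srt_time and to_srt are hypothetical helpers, and SimpleNamespace objects stand in for real aligned items so the snippet runs on its own.

```python
from types import SimpleNamespace
from typing import Iterable, List


def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(items: Iterable) -> str:
    """Build SRT text with one numbered cue per aligned item."""
    blocks: List[str] = []
    for index, item in enumerate(items, start=1):
        start = format_srt_time(item.start_time)
        end = format_srt_time(item.end_time)
        blocks.append(f"{index}\n{start} --> {end}\n{item.text}")
    return "\n\n".join(blocks) + ("\n" if blocks else "")


# Stand-in items for illustration; in practice iterate over result.time_stamps.
demo = [
    SimpleNamespace(text="hello", start_time=0.0, end_time=0.42),
    SimpleNamespace(text="world", start_time=0.42, end_time=1.0),
]
print(to_srt(demo))
```

For readable subtitles you would normally merge several consecutive items into one cue; the one-cue-per-item layout here just keeps the sketch short.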

Memory and Performance Tips

Use bfloat16

Load the model with dtype=torch.bfloat16. This halves memory compared to float32 with negligible accuracy impact on modern GPUs.
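
The halving follows directly from the bytes used per parameter. The back-of-the-envelope sketch below covers weights only (activations and generation caches are extra) and assumes a nominal 1.7 billion parameters, as implied by the model name; weight_memory_gb is a hypothetical helper.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 1.7e9  # assumption: nominal size implied by "1.7B"
for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: {weight_memory_gb(NUM_PARAMS, dtype):.1f} GB")
# float32: 6.8 GB
# bfloat16: 3.4 GB
```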

Enable FlashAttention 2

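Install FlashAttention 2 and pass attn_implementation="flash_attention_2" to both from_pretrained and forced_aligner_kwargs. FlashAttention 2 only runs with half-precision weights, so load the model in torch.bfloat16 or torch.float16. It lowers attention memory use and speeds up inference, with the largest gains on long audio.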

pip install -U flash-attn --no-build-isolation
# On machines with limited RAM / many CPU cores:
# MAX_JOBS=4 pip install -U flash-attn --no-build-isolation

Tune max_inference_batch_size

The default of 32 is a reasonable starting point. Reduce it if you hit OOM errors with long recordings. Set it to -1 to process all audio in one shot when you have ample VRAM and small inputs.
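
Since max_inference_batch_size is fixed when the model is loaded, one way to recover from OOM at call time is to feed transcribe smaller input lists yourself. The sketch below shows a generic halve-and-retry loop. It is an assumption-laden illustration rather than a library feature: run_with_backoff is a hypothetical helper, run stands for any callable such as lambda batch: model.transcribe(audio=list(batch)), and MemoryError is a stand-in for the error type you would catch with PyTorch (for example torch.cuda.OutOfMemoryError).

```python
from typing import Callable, List, Sequence, Tuple, Type


def run_with_backoff(
    run: Callable[[Sequence], List],
    items: Sequence,
    start_size: int,
    oom_errors: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> List:
    """Process items in sub-batches, halving the sub-batch size on OOM."""
    size = max(1, start_size)
    results: List = []
    pos = 0
    while pos < len(items):
        batch = items[pos:pos + size]
        try:
            results.extend(run(batch))
        except oom_errors:
            if size == 1:
                raise  # a single item does not fit; nothing left to shrink
            size = max(1, size // 2)
            continue  # retry from the same position with a smaller batch
        pos += len(batch)
    return results


def fake_run(batch: Sequence[int]) -> List[int]:
    """Toy stand-in for transcription that 'runs out of memory' above 2 items."""
    if len(batch) > 2:
        raise MemoryError("simulated OOM")
    return [x * 10 for x in batch]


print(run_with_backoff(fake_run, list(range(7)), start_size=8))
# [0, 10, 20, 30, 40, 50, 60]
```

With a real GPU you would usually also free cached memory between retries; that step is omitted here to keep the sketch dependency-free.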

Increase max_new_tokens for long audio

Each chunk is decoded independently. The default 512 tokens is sufficient for roughly 60 seconds of speech at a normal speaking pace. For longer chunks or dense content, raise this value (e.g. 1024).
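
The rule of thumb above can be turned into a rough budget estimate. The sketch below scales "512 tokens for about 60 seconds" linearly with chunk length. This is a heuristic extrapolation, not a guarantee: suggest_max_new_tokens is a hypothetical helper, and dense or fast speech may need extra headroom.

```python
import math


def suggest_max_new_tokens(
    chunk_seconds: float,
    base_tokens: int = 512,
    base_seconds: float = 60.0,
) -> int:
    """Scale the '512 tokens per ~60 s' rule of thumb to another chunk length."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    return math.ceil(chunk_seconds * base_tokens / base_seconds)


for seconds in (30, 60, 120):
    print(seconds, suggest_max_new_tokens(seconds))
# 30 256
# 60 512
# 120 1024
```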
