Qwen3-ASR supports real-time streaming transcription, allowing you to feed raw PCM audio incrementally and read partial transcription results after each chunk. This makes it suitable for live microphone capture, telephony pipelines, and any scenario where you need low first-word latency rather than waiting for a complete recording.

Overview and Limitations

Streaming transcription is only available with the vLLM backend (Qwen3ASRModel.LLM). The Transformers backend does not support streaming. Use Qwen3ASRModel.from_pretrained for offline batch transcription instead.
Streaming mode does not support timestamps. The forced aligner cannot operate on partial audio, so return_time_stamps has no effect during streaming. For timestamped output, use the offline transcribe method with return_time_stamps=True.
Additional constraints to be aware of:
  • Single stream only — one audio stream per ASRStreamingState object. To transcribe multiple concurrent streams, create a separate state for each.
  • No batching — each streaming_transcribe call processes a single chunk sequentially.
  • 16 kHz mono PCM — audio must be a 1-D np.ndarray of float32 (or int16) samples at 16 000 Hz. The model does not resample streaming input automatically.
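The input format above can be enforced before audio reaches the model. The following is a minimal sketch, not part of qwen-asr: `to_mono_float32` is a hypothetical helper that down-mixes multi-channel audio and converts int16 samples to float32. The 1/32768 scaling is an assumed convention, and since int16 input is accepted directly, the conversion is optional. The sample rate cannot be inferred from the array, so it still has to be checked at the capture source.

```python
import numpy as np


def to_mono_float32(samples: np.ndarray) -> np.ndarray:
    """Return a 1-D float32 array; int16 input is scaled to [-1.0, 1.0)."""
    arr = np.asarray(samples)
    if arr.dtype == np.int16:
        # Assumed convention: map the int16 range onto [-1.0, 1.0).
        arr = arr.astype(np.float32) / 32768.0
    else:
        arr = arr.astype(np.float32, copy=False)
    if arr.ndim == 2:
        # (frames, channels) -> (frames,) by averaging the channels.
        arr = arr.mean(axis=1)
    elif arr.ndim != 1:
        raise ValueError(f"expected 1-D or 2-D audio, got shape {arr.shape}")
    return arr
```

Calling this helper on every captured block keeps the array shape and dtype consistent regardless of what the audio source delivers.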

Prerequisites

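Install qwen-asr with the vLLM extra and initialize the model inside an if __name__ == '__main__': guard (vLLM can start worker processes that re-import the main script, so unguarded model creation may run more than once):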
pip install -U qwen-asr[vllm]
from qwen_asr import Qwen3ASRModel

if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model="Qwen/Qwen3-ASR-1.7B",
        gpu_memory_utilization=0.8,
        max_new_tokens=32,  # keep small for streaming responsiveness
    )

Initializing Streaming State

Each audio stream is managed through an ASRStreamingState object. Create one with init_streaming_state before you start feeding audio.
state = model.init_streaming_state(
    context="",
    language=None,         # or e.g. "English" to force language
    unfixed_chunk_num=2,
    unfixed_token_num=5,
    chunk_size_sec=2.0,
)
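Because each state object tracks exactly one stream, handling several streams means keeping one state per stream and routing audio to the right one. The sketch below shows one way to do that. `StreamRegistry` and its method names are hypothetical. It relies only on the three calls described in this guide (`init_streaming_state`, `streaming_transcribe`, `finish_streaming_transcribe`) and on `state.text`. Calls are still processed one at a time, so this interleaves streams rather than batching them.

```python
from typing import Any, Dict, Hashable


class StreamRegistry:
    """Hypothetical helper: keeps one streaming state per stream ID."""

    def __init__(self, model: Any, **state_kwargs: Any) -> None:
        self._model = model
        self._state_kwargs = state_kwargs  # forwarded to init_streaming_state
        self._states: Dict[Hashable, Any] = {}

    def feed(self, stream_id: Hashable, pcm16k: Any) -> str:
        """Feed new samples to one stream and return its latest partial text."""
        state = self._states.get(stream_id)
        if state is None:
            # First audio for this stream: create its own state lazily.
            state = self._model.init_streaming_state(**self._state_kwargs)
            self._states[stream_id] = state
        self._model.streaming_transcribe(pcm16k, state)
        return state.text

    def close(self, stream_id: Hashable) -> str:
        """Flush one stream, drop its state, and return the final text."""
        state = self._states.pop(stream_id)
        self._model.finish_streaming_transcribe(state)
        return state.text
```

A typical use would be `registry = StreamRegistry(model, chunk_size_sec=2.0)`, then `registry.feed(call_id, block)` for each incoming block and `registry.close(call_id)` when that stream ends.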

Parameters

context (str, default: "")
Optional context string prepended to the system prompt. Works the same as the context argument in transcribe().

language (str | None, default: None)
Optional language override. When set (e.g. "English"), the prompt forces text-only output and skips language identification — consistent with how transcribe(language=...) behaves in offline mode.

unfixed_chunk_num (int, default: 2)
For the first N chunks the prefix prompt is reset to an empty string, preventing early noise or silence from anchoring incorrect prefixes. Increase this value if the first few seconds of audio tend to be unreliable.

unfixed_token_num (int, default: 5)
After the unfixed_chunk_num warmup phase, the last K tokens are rolled back from the accumulated output before it is used as a prefix for the next chunk. This reduces boundary jitter at chunk edges.

chunk_size_sec (float, default: 2.0)
Chunk duration in seconds. Audio is buffered internally and a decode step is triggered each time the buffer accumulates this many samples at 16 kHz. Smaller values give lower latency at the cost of more frequent (and potentially less accurate) partial results.
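To see how chunk_size_sec interacts with the size of the arrays you feed, the buffering rule described above can be simulated with plain arithmetic. This is a sketch of the described behavior, not the library's implementation: `simulate_decodes` is a hypothetical function, and it assumes a decode fires as soon as the buffer holds `chunk_size_sec * 16000` samples.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000  # streaming input rate stated in this guide


def simulate_decodes(
    total_samples: int, step_samples: int, chunk_size_sec: float = 2.0
) -> Tuple[List[int], int]:
    """Return (call numbers that complete a chunk, samples left for the final flush)."""
    chunk = int(round(chunk_size_sec * SAMPLE_RATE))
    if chunk <= 0 or step_samples <= 0:
        raise ValueError("chunk size and step size must be positive")
    buffered = 0
    decode_calls: List[int] = []
    call_id = 0
    pos = 0
    while pos < total_samples:
        n = min(step_samples, total_samples - pos)
        pos += n
        call_id += 1
        buffered += n
        # Assumed rule: every full chunk in the buffer triggers one decode.
        while buffered >= chunk:
            buffered -= chunk
            decode_calls.append(call_id)
    return decode_calls, buffered
```

For 5 s of audio fed in 500 ms steps with the default 2.0 s chunk, `simulate_decodes(80_000, 8_000)` reports decodes on calls 4 and 8 and 16 000 samples (1 s) left over, which is the tail handled by the final flush.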

The Streaming Loop

Step 1: Feed audio with streaming_transcribe

Call streaming_transcribe(pcm16k, state) with any amount of new 16 kHz mono PCM samples. Internally, the function buffers samples until a full chunk is ready, then runs a decode step. You may call it with very small arrays (e.g. 10 ms of audio) — the buffer accumulates samples transparently.
# pcm16k: 1-D np.ndarray of float32 or int16 at 16 kHz
state = model.streaming_transcribe(pcm16k, state)

# Read the latest partial result
print(f"language={state.language!r}  text={state.text!r}")

Step 2: Flush remaining audio with finish_streaming_transcribe

When the audio stream ends, call finish_streaming_transcribe(state) to flush any samples remaining in the internal buffer (the tail that did not fill a complete chunk). This performs one final decode and updates state.language and state.text with the definitive result.
state = model.finish_streaming_transcribe(state)
print(f"[final] language={state.language!r}  text={state.text!r}")
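The two steps can be wrapped into a single generator that accepts any iterable of PCM blocks and yields text only when it changes. This is a minimal sketch under stated assumptions: `stream_partials` is a hypothetical name, `model` and `state` are the objects created earlier, and only `streaming_transcribe`, `finish_streaming_transcribe`, and `state.text` from this guide are used.

```python
from typing import Any, Iterable, Iterator, Optional, Tuple


def stream_partials(
    model: Any, state: Any, blocks: Iterable[Any]
) -> Iterator[Tuple[bool, str]]:
    """Yield (is_final, text): changed partials first, then the final result."""
    last: Optional[str] = None
    for block in blocks:
        # Step 1: feed new samples; a decode runs only when a chunk is full.
        model.streaming_transcribe(block, state)
        if state.text and state.text != last:
            last = state.text
            yield False, last
    # Step 2: flush the tail that did not fill a complete chunk.
    model.finish_streaming_transcribe(state)
    yield True, state.text
```

Skipping unchanged text keeps the output quiet when blocks are much shorter than a chunk, since most calls then only add samples to the buffer.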

Reading Results from State

After each streaming_transcribe call (and after finish_streaming_transcribe), read the latest transcription from the state object:
state.language   # str — latest detected/forced language, e.g. "English"
state.text       # str — latest partial transcription
Both fields are updated in place after every completed chunk decode; calls that only add samples to the buffer leave them unchanged. They reflect the full transcription so far, not just the latest chunk.
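Because state.text always holds the whole transcript and its tail can be revised between decodes, a live display usually needs to know which part stayed the same and which part changed. A minimal sketch of that comparison follows; `split_update` is a hypothetical helper that works on plain strings and does not touch the model.

```python
from typing import Tuple


def split_update(previous: str, current: str) -> Tuple[int, str]:
    """Return (length of the unchanged prefix, text to draw after it)."""
    stable = 0
    for old_char, new_char in zip(previous, current):
        if old_char != new_char:
            break
        stable += 1
    return stable, current[stable:]
```

A display layer can keep the first `stable` characters on screen and redraw only the returned suffix. For example, `split_update("the cat sad", "the cat sat on the mat")` keeps 10 characters and redraws `"t on the mat"`.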

Full Working Example

The following example downloads an audio file, resamples it to 16 kHz, and streams it through the model in configurable step sizes — simulating a real-time microphone feed.
import io
import urllib.request

import numpy as np
import soundfile as sf

from qwen_asr import Qwen3ASRModel

ASR_MODEL_PATH = "Qwen/Qwen3-ASR-1.7B"
URL_EN = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"


def _resample_to_16k(wav: np.ndarray, sr: int) -> np.ndarray:
    """Resample 1-D audio to 16 kHz by linear interpolation (no anti-aliasing filter)."""
    wav = wav.astype(np.float32, copy=False)
    if sr == 16000:
        return wav
    dur = wav.shape[0] / float(sr)
    n16 = int(round(dur * 16000))
    if n16 <= 0:
        return np.zeros((0,), dtype=np.float32)
    x_old = np.linspace(0.0, dur, num=wav.shape[0], endpoint=False)
    x_new = np.linspace(0.0, dur, num=n16, endpoint=False)
    return np.interp(x_new, x_old, wav).astype(np.float32)


def run_streaming(model: Qwen3ASRModel, wav16k: np.ndarray, step_ms: int) -> None:
    sr = 16000
    step = int(round(step_ms / 1000.0 * sr))

    print(f"\n===== streaming step = {step_ms} ms =====")
    state = model.init_streaming_state(
        unfixed_chunk_num=2,
        unfixed_token_num=5,
        chunk_size_sec=2.0,
    )

    pos = 0
    call_id = 0
    while pos < wav16k.shape[0]:
        seg = wav16k[pos : pos + step]
        pos += seg.shape[0]
        call_id += 1
        model.streaming_transcribe(seg, state)
        print(f"[call {call_id:03d}] language={state.language!r} text={state.text!r}")

    model.finish_streaming_transcribe(state)
    print(f"[final] language={state.language!r} text={state.text!r}")


if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model=ASR_MODEL_PATH,
        gpu_memory_utilization=0.8,
        max_new_tokens=32,
    )

    req = urllib.request.Request(URL_EN, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        audio_bytes = resp.read()

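    with io.BytesIO(audio_bytes) as f:
        wav, sr = sf.read(f, dtype="float32", always_2d=False)
    wav = np.asarray(wav, dtype=np.float32)
    if wav.ndim == 2:
        # Down-mix multi-channel audio; streaming input must be 1-D mono.
        wav = wav.mean(axis=1)
    wav16k = _resample_to_16k(wav, sr)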

    run_streaming(model, wav16k, step_ms=500)
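When the audio already exists as a local 16 kHz mono 16-bit WAV file, it can be read in fixed-size blocks with the standard library instead of being downloaded and resampled. The sketch below assumes that file format and rejects anything else. `iter_wav_blocks` is a hypothetical helper, and each block it yields is an int16 array of the kind streaming_transcribe is documented to accept.

```python
import wave
from typing import Iterator

import numpy as np


def iter_wav_blocks(path: str, block_ms: int = 100) -> Iterator[np.ndarray]:
    """Yield successive int16 blocks from a 16 kHz mono 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wf:
        if (wf.getframerate(), wf.getnchannels(), wf.getsampwidth()) != (16000, 1, 2):
            raise ValueError("expected 16 kHz, mono, 16-bit PCM")
        frames_per_block = int(16000 * block_ms / 1000)
        while True:
            raw = wf.readframes(frames_per_block)
            if not raw:
                break
            # WAV PCM samples are little-endian signed 16-bit integers.
            yield np.frombuffer(raw, dtype="<i2")
```

The blocks can be passed one by one to `model.streaming_transcribe(block, state)` in place of the slices taken from `wav16k` above; the last block may be shorter than the others.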

Tuning Tips

chunk_size_sec

Controls the latency/accuracy trade-off. A 2.0 s chunk gives the model enough context for accurate partial results. Reduce to 1.0 s or 0.5 s for lower latency in voice-assistant scenarios, at the cost of slightly higher WER.

unfixed_chunk_num

Increase from 2 to 3 or 4 if the beginning of your stream is often noisy or silent. This prevents early garbage from anchoring an incorrect prefix in subsequent chunks.

unfixed_token_num

The rollback window of 5 tokens handles most boundary jitter. Increase to 8–10 if you observe repeated words or phrases at chunk boundaries.
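The interplay of unfixed_chunk_num and unfixed_token_num can be illustrated on a plain list of token IDs. This is a sketch of the behavior described in the Parameters section, not the library's code: `next_prefix` is a hypothetical function, and treating `chunk_index` as the 0-based index of the chunk about to be decoded is an assumption.

```python
from typing import List, Sequence


def next_prefix(
    tokens: Sequence[int],
    chunk_index: int,
    unfixed_chunk_num: int = 2,
    unfixed_token_num: int = 5,
) -> List[int]:
    """Return the tokens reused as the prefix when decoding chunk `chunk_index`."""
    if chunk_index < unfixed_chunk_num:
        # Warmup phase: nothing decoded so far is trusted as a prefix.
        return []
    # After warmup: drop the last K tokens so the model can re-decode them.
    keep = max(0, len(tokens) - unfixed_token_num)
    return list(tokens[:keep])
```

With 12 accumulated tokens and the defaults, chunk index 2 reuses the first 7 tokens and re-decodes the last 5. A larger unfixed_token_num lets the model re-decode a longer tail at each boundary, which is the mechanism behind the advice above.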

max_new_tokens

Keep max_new_tokens small (e.g. 32–64) for streaming. Each chunk is short, so large values just waste decode budget without improving accuracy.
