Qwen3-ASR supports real-time streaming transcription, allowing you to feed raw PCM audio incrementally and read partial transcription results after each chunk. This makes it suitable for live microphone capture, telephony pipelines, and any scenario where you need low first-word latency rather than waiting for a complete recording.
Overview and Limitations
Streaming transcription is only available with the vLLM backend (Qwen3ASRModel.LLM). The Transformers backend does not support streaming. Use Qwen3ASRModel.from_pretrained for offline batch transcription instead.
- Single stream only: one audio stream per ASRStreamingState object. To transcribe multiple concurrent streams, create a separate state for each.
- No batching: each streaming_transcribe call processes a single chunk sequentially.
- 16 kHz mono PCM: audio must be a 1-D np.ndarray of float32 (or int16) samples at 16 000 Hz. The model does not resample streaming input automatically.
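Because streaming input is not resampled automatically, it can help to validate audio before feeding it. The following is a minimal sketch, not part of the library: prepare_pcm and TARGET_SR are hypothetical names, and the (num_samples, num_channels) layout for multi-channel input is an assumption about your capture source.

```python
import numpy as np

TARGET_SR = 16_000  # sample rate required for streaming input, per the list above


def prepare_pcm(pcm, sample_rate):
    """Return 1-D mono samples (float32 or int16) or raise if the input is unusable.

    Hypothetical helper. Assumes multi-channel input is shaped
    (num_samples, num_channels).
    """
    if sample_rate != TARGET_SR:
        raise ValueError(
            f"expected {TARGET_SR} Hz audio, got {sample_rate} Hz; resample first"
        )
    pcm = np.asarray(pcm)
    if np.issubdtype(pcm.dtype, np.floating):
        # Many audio loaders return float64; narrow it to float32.
        pcm = pcm.astype(np.float32)
    elif pcm.dtype != np.int16:
        raise TypeError(f"unsupported sample dtype: {pcm.dtype}")
    if pcm.ndim == 2:
        # Downmix by averaging channels in float32 so int16 sums cannot overflow.
        mixed = pcm.astype(np.float32).mean(axis=1)
        pcm = np.round(mixed).astype(np.int16) if pcm.dtype == np.int16 else mixed
    if pcm.ndim != 1:
        raise ValueError("expected a 1-D array of samples")
    return np.ascontiguousarray(pcm)
```

Audio recorded at another rate (for example 44.1 kHz or 48 kHz) is rejected here rather than silently resampled; resample it with your audio library of choice before calling the helper.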
Prerequisites
Install qwen-asr with the vLLM extra and initialize the model inside an if __name__ == '__main__': guard.
Initializing Streaming State
Each audio stream is managed through an ASRStreamingState object. Create one with init_streaming_state before you start feeding audio.
Parameters
- context: Optional context string prepended to the system prompt. Works the same as the context argument in transcribe().
- language: Optional language override. When set (e.g. "English"), the prompt forces text-only output and skips language identification, consistent with how transcribe(language=...) behaves in offline mode.
- unfixed_chunk_num: For the first N chunks the prefix prompt is reset to an empty string, preventing early noise or silence from anchoring incorrect prefixes. Increase this value if the first few seconds of audio tend to be unreliable.
- unfixed_token_num: After the unfixed_chunk_num warmup phase, the last K tokens are rolled back from the accumulated output before it is used as a prefix for the next chunk. This reduces boundary jitter at chunk edges.
- chunk_size_sec: Chunk duration in seconds. Audio is buffered internally and a decode step is triggered each time the buffer accumulates this much audio (chunk_size_sec × 16 000 samples). Smaller values give lower latency at the cost of more frequent (and potentially less accurate) partial results.
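The interplay between the warmup phase and the token rollback can be sketched in a few lines. This is a simplified model of the behavior described above, not the library's code: build_prefix is a hypothetical function, the zero-based chunk indexing is an assumption, and the defaults of 2 and 5 mirror the values mentioned under Tuning Tips.

```python
def build_prefix(tokens, chunk_index, unfixed_chunk_num=2, unfixed_token_num=5):
    """Return the tokens carried over as the prefix for the next decode step.

    tokens:      tokens accumulated from earlier decode steps
    chunk_index: zero-based index of the chunk about to be decoded (assumption)
    """
    if chunk_index < unfixed_chunk_num:
        # Warmup phase: the prefix is reset to empty.
        return []
    if unfixed_token_num <= 0:
        # Nothing to roll back; keep the whole output as the prefix.
        return list(tokens)
    # Drop the last K tokens, which are the most likely to change.
    return list(tokens[:-unfixed_token_num])


accumulated = list(range(12))            # stand-in for 12 decoded token ids
print(build_prefix(accumulated, 1))      # still in warmup
print(build_prefix(accumulated, 2))      # last 5 tokens rolled back
```

With the values above, the first call returns an empty prefix and the second keeps the first seven tokens, leaving the final five free to be re-decoded with the next chunk.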
The Streaming Loop
Feed audio with streaming_transcribe
Call streaming_transcribe(pcm16k, state) with any amount of new 16 kHz mono PCM samples. Internally, the function buffers samples until a full chunk is ready, then runs a decode step. You may call it with very small arrays (e.g. 10 ms of audio); the buffer accumulates samples transparently.
Flush remaining audio with finish_streaming_transcribe
When the audio stream ends, call finish_streaming_transcribe(state) to flush any samples remaining in the internal buffer (the tail that did not fill a complete chunk). This performs one final decode and updates state.language and state.text with the definitive result.
Reading Results from State
After each streaming_transcribe call (and after finish_streaming_transcribe), read the latest transcription from the state object: state.language holds the language and state.text holds the transcript so far.
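The feed, flush, and read pattern can be tried without a GPU using a stand-in model. The sketch below is not the real implementation: FakeStreamingASR, FakeState, and stream are hypothetical names, the stub only counts samples and writes placeholder text, and plain Python lists stand in for the np.ndarray input the real API expects. Calling the three methods on a model object is also an assumption; only the method names come from this page.

```python
from dataclasses import dataclass, field

SAMPLE_RATE = 16_000


@dataclass
class FakeState:
    """Stand-in for ASRStreamingState with only the fields this sketch needs."""
    chunk_samples: int
    buffer: list = field(default_factory=list)
    decodes: int = 0
    language: str = ""
    text: str = ""


class FakeStreamingASR:
    """Hypothetical stub that mimics the buffering described above."""

    def init_streaming_state(self, chunk_size_sec=2.0):
        return FakeState(chunk_samples=int(round(chunk_size_sec * SAMPLE_RATE)))

    def _decode(self, state):
        # Placeholder for a real decode step.
        state.decodes += 1
        state.language = "English"
        state.text = f"<partial after {state.decodes} decode(s)>"

    def streaming_transcribe(self, pcm16k, state):
        state.buffer.extend(pcm16k)
        # Decode once per full chunk; leftover samples stay buffered.
        while len(state.buffer) >= state.chunk_samples:
            del state.buffer[:state.chunk_samples]
            self._decode(state)

    def finish_streaming_transcribe(self, state):
        # Flush the tail that never filled a complete chunk.
        if state.buffer:
            state.buffer.clear()
            self._decode(state)


def stream(asr, samples, step_samples, chunk_size_sec=2.0):
    """Feed samples in fixed-size steps, then flush, and return the final state."""
    state = asr.init_streaming_state(chunk_size_sec=chunk_size_sec)
    for pos in range(0, len(samples), step_samples):
        asr.streaming_transcribe(samples[pos:pos + step_samples], state)
        # state.language and state.text can be read here after every call.
    asr.finish_streaming_transcribe(state)
    return state


# 5 s of silence fed in 0.5 s steps with 2 s chunks.
final_state = stream(FakeStreamingASR(), [0.0] * (5 * SAMPLE_RATE),
                     step_samples=SAMPLE_RATE // 2)
print(final_state.decodes, final_state.text)  # prints: 3 <partial after 3 decode(s)>
```

Two decodes come from full 2 s chunks and the third from the 1 s tail flushed at the end, which is why skipping the final flush would drop the last second of audio in this model.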
Full Working Example
The following example downloads an audio file, resamples it to 16 kHz, and streams it through the model in configurable step sizes, simulating a real-time microphone feed.
Tuning Tips
chunk_size_sec
Controls the latency/accuracy trade-off. A 2.0 s chunk gives the model enough context for accurate partial results. Reduce to 1.0 s or 0.5 s for lower latency in voice-assistant scenarios, at the cost of slightly higher WER.
unfixed_chunk_num
Increase from 2 to 3 or 4 if the beginning of your stream is often noisy or silent. This prevents early garbage from anchoring an incorrect prefix in subsequent chunks.
unfixed_token_num
The rollback window of 5 tokens handles most boundary jitter. Increase to 8–10 if you observe repeated words or phrases at chunk boundaries.
max_new_tokens
Keep max_new_tokens small (e.g. 32–64) for streaming. Each chunk is short, so large values just waste decode budget without improving accuracy.
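To put numbers on the chunk_size_sec trade-off, the arithmetic below shows how chunk duration translates into buffered samples and decode steps. It is a back-of-the-envelope sketch: chunk_stats is a hypothetical helper, and the figures cover buffering only, not model decode time.

```python
SAMPLE_RATE = 16_000


def chunk_stats(chunk_size_sec, stream_sec=60.0):
    """Return buffering figures for a given chunk duration (hypothetical helper)."""
    samples_per_chunk = int(round(chunk_size_sec * SAMPLE_RATE))
    full_chunks = int(stream_sec * SAMPLE_RATE) // samples_per_chunk
    return {
        "samples_per_chunk": samples_per_chunk,
        # No partial result can appear before one full chunk has been buffered.
        "min_wait_sec": chunk_size_sec,
        "decodes_per_stream": full_chunks,
    }


for size in (2.0, 1.0, 0.5):
    print(size, chunk_stats(size))
```

Halving the chunk duration halves the minimum wait for the first partial result but doubles the number of decode steps over the same stretch of audio.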