Real-Time Streaming ASR Transcription with Qwen3-ASR

Qwen3-ASR supports real-time streaming transcription, allowing you to feed raw PCM audio incrementally and read partial transcription results after each chunk. This makes it suitable for live microphone capture, telephony pipelines, and any scenario where you need low first-word latency rather than waiting for a complete recording.

Overview and Limitations

Streaming transcription is only available with the vLLM backend (Qwen3ASRModel.LLM). The Transformers backend (Qwen3ASRModel.from_pretrained) does not support streaming; use it for offline batch transcription instead.

Streaming mode does not support timestamps. The forced aligner cannot operate on partial audio, so return_time_stamps has no effect during streaming. For timestamped output, use the offline transcribe method with return_time_stamps=True.

Additional constraints to be aware of:

Single stream only — one audio stream per ASRStreamingState object. To transcribe multiple concurrent streams, create a separate state for each.
No batching — each streaming_transcribe call processes a single chunk sequentially.
16 kHz mono PCM — audio must be a 1-D np.ndarray of float32 (or int16) samples at 16 000 Hz. The model does not resample streaming input automatically.
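
Because streaming input is not resampled or reshaped for you, it can help to normalize audio before the first streaming_transcribe call. The following is a minimal sketch using only NumPy; the helper names to_mono_float32 and check_sample_rate are hypothetical and not part of qwen-asr, and the (frames, channels) layout for multi-channel input is an assumption.

```python
import numpy as np

TARGET_SR = 16_000  # sample rate the streaming API expects, per the constraints above


def to_mono_float32(samples: np.ndarray) -> np.ndarray:
    """Return a 1-D float32 array; int16 PCM is scaled into [-1.0, 1.0)."""
    x = np.asarray(samples)
    if x.dtype == np.int16:
        # Scale 16-bit PCM to floating point before any averaging.
        x = x.astype(np.float32) / 32768.0
    else:
        x = x.astype(np.float32, copy=False)
    if x.ndim == 2:
        # Assumed layout: (frames, channels). Average channels to get mono.
        x = x.mean(axis=1)
    elif x.ndim != 1:
        raise ValueError(f"expected 1-D or 2-D audio, got shape {x.shape}")
    return np.ascontiguousarray(x, dtype=np.float32)


def check_sample_rate(sr: int) -> None:
    """Fail early instead of feeding audio at the wrong rate."""
    if sr != TARGET_SR:
        raise ValueError(f"expected {TARGET_SR} Hz audio, got {sr} Hz; resample first")
```

Converting int16 to float32 up front is optional, since both dtypes are accepted, but it keeps every chunk in one format.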

Prerequisites

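Install qwen-asr with the vLLM extra, then initialize the model inside an if __name__ == '__main__': guard. vLLM can start worker processes that re-import the main module, and the guard keeps model construction from running again inside them.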

pip install -U "qwen-asr[vllm]"  # quotes keep shells such as zsh from expanding the brackets

from qwen_asr import Qwen3ASRModel

if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model="Qwen/Qwen3-ASR-1.7B",
        gpu_memory_utilization=0.8,
        max_new_tokens=32,  # keep small for streaming responsiveness
    )

Initializing Streaming State

Each audio stream is managed through an ASRStreamingState object. Create one with init_streaming_state before you start feeding audio.

state = model.init_streaming_state(
    context="",
    language=None,         # or e.g. "English" to force language
    unfixed_chunk_num=2,
    unfixed_token_num=5,
    chunk_size_sec=2.0,
)
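
Since each concurrent stream needs its own state, a server handling several callers needs some way to map a stream identifier to its state. One possible sketch is below; StreamRegistry is a hypothetical helper, not part of qwen-asr, and it takes the state factory as a plain callable so it makes no assumptions about the state object itself.

```python
from typing import Callable, Dict, Generic, Optional, TypeVar

S = TypeVar("S")


class StreamRegistry(Generic[S]):
    """Keep exactly one streaming state per stream identifier."""

    def __init__(self, make_state: Callable[[], S]) -> None:
        self._make_state = make_state
        self._states: Dict[str, S] = {}

    def get(self, stream_id: str) -> S:
        # Create the state lazily on the first chunk of a new stream.
        if stream_id not in self._states:
            self._states[stream_id] = self._make_state()
        return self._states[stream_id]

    def close(self, stream_id: str) -> Optional[S]:
        # Remove and return the state so the caller can flush it.
        return self._states.pop(stream_id, None)
```

In use, the factory would wrap the call shown above, for example StreamRegistry(lambda: model.init_streaming_state(chunk_size_sec=2.0)), and the state returned by close would be passed to finish_streaming_transcribe.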

Parameters

context (str, default: "")

Optional context string prepended to the system prompt. Works the same as the context argument in transcribe().

language (str | None, default: None)

Optional language override. When set (e.g. "English"), the prompt forces text-only output and skips language identification — consistent with how transcribe(language=...) behaves in offline mode.

unfixed_chunk_num (int, default: 2)

For the first N chunks the prefix prompt is reset to an empty string, preventing early noise or silence from anchoring incorrect prefixes. Increase this value if the first few seconds of audio tend to be unreliable.

unfixed_token_num (int, default: 5)

After the unfixed_chunk_num warmup phase, the last K tokens are rolled back from the accumulated output before it is used as a prefix for the next chunk. This reduces boundary jitter at chunk edges.

chunk_size_sec (float, default: 2.0)

Chunk duration in seconds. Audio is buffered internally and a decode step is triggered each time the buffer accumulates this many samples at 16 kHz. Smaller values give lower latency at the cost of more frequent (and potentially less accurate) partial results.
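
To make the chunk_size_sec arithmetic concrete, the sketch below computes how many samples make up one chunk and how a recording of a given length splits into full chunks plus a tail. The function names are hypothetical, and rounding the chunk length to the nearest sample is an assumption about how the boundary is computed.

```python
from typing import Tuple

SAMPLE_RATE = 16_000  # Hz, fixed for streaming input


def samples_per_chunk(chunk_size_sec: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of samples that must accumulate before one decode step."""
    return int(round(chunk_size_sec * sample_rate))


def split_into_chunks(total_samples: int, chunk_size_sec: float) -> Tuple[int, int]:
    """Return (full_chunks, tail_samples) for a stream of total_samples."""
    return divmod(total_samples, samples_per_chunk(chunk_size_sec))


# A 7.5 s recording (120 000 samples) with the default 2.0 s chunk size:
full_chunks, tail = split_into_chunks(120_000, 2.0)
print(samples_per_chunk(2.0), full_chunks, tail)  # 32000 3 24000
```

The 24 000-sample tail in this example is the part that only gets decoded once the stream is flushed at the end.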

The Streaming Loop

Feed audio with streaming_transcribe

Call streaming_transcribe(pcm16k, state) with any amount of new 16 kHz mono PCM samples. Internally, the function buffers samples until a full chunk is ready, then runs a decode step. You may call it with very small arrays (e.g. 10 ms of audio) — the buffer accumulates samples transparently.

# pcm16k: 1-D np.ndarray of float32 or int16 at 16 kHz
state = model.streaming_transcribe(pcm16k, state)

# Read the latest partial result
print(f"language={state.language!r}  text={state.text!r}")

Flush remaining audio with finish_streaming_transcribe

When the audio stream ends, call finish_streaming_transcribe(state) to flush any samples remaining in the internal buffer (the tail that did not fill a complete chunk). This performs one final decode and updates state.language and state.text with the definitive result.

state = model.finish_streaming_transcribe(state)
print(f"[final] language={state.language!r}  text={state.text!r}")
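
The buffering and flushing behaviour described in this section can be pictured with a small stand-alone buffer. This is an illustrative sketch only: ChunkBuffer is a hypothetical class, not the library's internal implementation, and it only mimics the "accumulate until a full chunk, then emit" logic.

```python
from typing import List

import numpy as np


class ChunkBuffer:
    """Accumulate PCM samples and emit fixed-size chunks."""

    def __init__(self, chunk_samples: int) -> None:
        if chunk_samples <= 0:
            raise ValueError("chunk_samples must be positive")
        self.chunk_samples = chunk_samples
        self._buf = np.zeros((0,), dtype=np.float32)

    def push(self, pcm: np.ndarray) -> List[np.ndarray]:
        """Append samples; return every complete chunk now available."""
        self._buf = np.concatenate([self._buf, np.asarray(pcm, dtype=np.float32)])
        chunks = []
        while self._buf.shape[0] >= self.chunk_samples:
            chunks.append(self._buf[: self.chunk_samples])
            self._buf = self._buf[self.chunk_samples :]
        return chunks

    def flush(self) -> np.ndarray:
        """Return the leftover tail and reset the buffer."""
        tail, self._buf = self._buf, np.zeros((0,), dtype=np.float32)
        return tail
```

With a 32 000-sample chunk (2.0 s at 16 kHz), pushing 10 ms arrays of 160 samples yields nothing for the first 199 calls and one chunk on the 200th, which is why most streaming_transcribe calls with small inputs leave state.text unchanged.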

Reading Results from State

After each streaming_transcribe call (and after finish_streaming_transcribe), read the latest transcription from the state object:

state.language   # str — latest detected/forced language, e.g. "English"
state.text       # str — latest partial transcription

Both fields are updated in-place after every completed chunk decode. They reflect the full transcription seen so far, not just the latest chunk.
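
Because state.text holds the whole transcript so far and its trailing words may be revised by later chunks, a display layer may want to know which part of the text is unchanged since the previous read. A minimal sketch, assuming plain character-level comparison is good enough for your UI (split_stable is a hypothetical helper):

```python
from typing import Tuple


def split_stable(previous: str, current: str) -> Tuple[str, str]:
    """Split current into (unchanged_prefix, new_or_revised_suffix)."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return current[:n], current[n:]


stable, changed = split_stable("hello word", "hello world today")
print(repr(stable), repr(changed))  # 'hello wor' 'ld today'
```

A caller would keep the previous value of state.text, call split_stable after each update, and redraw only the second element.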

Full Working Example

The following example downloads an audio file, resamples it to 16 kHz, and streams it through the model in configurable step sizes — simulating a real-time microphone feed.

import io
import urllib.request

import numpy as np
import soundfile as sf

from qwen_asr import Qwen3ASRModel

ASR_MODEL_PATH = "Qwen/Qwen3-ASR-1.7B"
URL_EN = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"


def _resample_to_16k(wav: np.ndarray, sr: int) -> np.ndarray:
    if sr == 16000:
        return wav.astype(np.float32, copy=False)
    wav = wav.astype(np.float32, copy=False)
    dur = wav.shape[0] / float(sr)
    n16 = int(round(dur * 16000))
    if n16 <= 0:
        return np.zeros((0,), dtype=np.float32)
    x_old = np.linspace(0.0, dur, num=wav.shape[0], endpoint=False)
    x_new = np.linspace(0.0, dur, num=n16, endpoint=False)
    return np.interp(x_new, x_old, wav).astype(np.float32)


def run_streaming(model: Qwen3ASRModel, wav16k: np.ndarray, step_ms: int) -> None:
    sr = 16000
    step = int(round(step_ms / 1000.0 * sr))

    print(f"\n===== streaming step = {step_ms} ms =====")
    state = model.init_streaming_state(
        unfixed_chunk_num=2,
        unfixed_token_num=5,
        chunk_size_sec=2.0,
    )

    pos = 0
    call_id = 0
    while pos < wav16k.shape[0]:
        seg = wav16k[pos : pos + step]
        pos += seg.shape[0]
        call_id += 1
        model.streaming_transcribe(seg, state)
        print(f"[call {call_id:03d}] language={state.language!r} text={state.text!r}")

    model.finish_streaming_transcribe(state)
    print(f"[final] language={state.language!r} text={state.text!r}")


if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model=ASR_MODEL_PATH,
        gpu_memory_utilization=0.8,
        max_new_tokens=32,
    )

    req = urllib.request.Request(URL_EN, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        audio_bytes = resp.read()

    with io.BytesIO(audio_bytes) as f:
        wav, sr = sf.read(f, dtype="float32", always_2d=False)
    if wav.ndim > 1:
        wav = wav.mean(axis=1)  # downmix multi-channel audio to mono before resampling
    wav16k = _resample_to_16k(np.asarray(wav, dtype=np.float32), sr)

    run_streaming(model, wav16k, step_ms=500)

Tuning Tips

chunk_size_sec

Controls the latency/accuracy trade-off. A 2.0 s chunk gives the model enough context for accurate partial results. Reduce to 1.0 s or 0.5 s for lower latency in voice-assistant scenarios, at the cost of slightly higher WER.

unfixed_chunk_num

Increase from 2 to 3 or 4 if the beginning of your stream is often noisy or silent. This prevents early garbage from anchoring an incorrect prefix in subsequent chunks.

unfixed_token_num

The rollback window of 5 tokens handles most boundary jitter. Increase to 8–10 if you observe repeated words or phrases at chunk boundaries.
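
The interaction between unfixed_chunk_num and unfixed_token_num can be pictured with a short sketch. This illustrates the behaviour described on this page and is not the library's code; next_prefix is a hypothetical function, and the 0-based chunk_index and the use of integer token IDs are assumptions.

```python
from typing import List, Sequence


def next_prefix(
    tokens: Sequence[int],
    chunk_index: int,
    unfixed_chunk_num: int = 2,
    unfixed_token_num: int = 5,
) -> List[int]:
    """Choose the prefix tokens to carry into the decode of chunk_index."""
    if chunk_index < unfixed_chunk_num:
        # Warmup phase: start from an empty prefix.
        return []
    if unfixed_token_num <= 0:
        # No rollback requested: keep everything decoded so far.
        return list(tokens)
    # Drop the last K tokens so the model can re-decide them with more audio.
    return list(tokens[:-unfixed_token_num])


decoded = list(range(10))       # ten token IDs produced so far
print(next_prefix(decoded, 1))  # []
print(next_prefix(decoded, 2))  # [0, 1, 2, 3, 4]
```

A larger unfixed_token_num shortens the fixed prefix, which gives the model more room to correct words near a chunk boundary but means more tokens are re-decoded on every step.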

max_new_tokens

Keep max_new_tokens small (e.g. 32–64) for streaming. Each chunk is short, so large values just waste decode budget without improving accuracy.
