The WebSocket demo provides a web-based interface for real-time streaming text-to-speech synthesis with VibeVoice.
Quick Start
Launch the Server
Start the WebSocket server using the demo launcher:
python demo/vibevoice_realtime_demo.py \
--port 3000 \
--model_path microsoft/VibeVoice-Realtime-0.5B \
--device cuda
Open the Web Interface
Navigate to http://localhost:3000 in your browser.
Generate Speech
- Select a voice from the dropdown
- Enter your text
- Adjust generation parameters (optional)
- Click generate to hear the result
Server Configuration
Command-Line Arguments
- --port: Port number for the web server
- --model_path (string, default: "default_model"): Path to the HuggingFace model directory or model ID (e.g., microsoft/VibeVoice-Realtime-0.5B)
- --device: Device for inference. Options: cuda, mps, cpu. The mpx typo is automatically corrected to mps.
- --reload: Enable auto-reload for development (pass the flag to enable it).
Launch Examples
python demo/vibevoice_realtime_demo.py \
--port 3000 \
--model_path microsoft/VibeVoice-Realtime-0.5B \
--device cuda
WebSocket API
The demo exposes a WebSocket endpoint for streaming audio generation.
Endpoint
ws://localhost:3000/stream
Query Parameters
- text: The text to synthesize into speech
- voice: Voice preset name. Available voices are loaded from demo/voices/streaming_model/
- cfg: Classifier-Free Guidance scale. Higher values increase prompt adherence
- steps: Number of diffusion inference steps. More steps may improve quality but increase latency
Connection Example
const ws = new WebSocket(
  'ws://localhost:3000/stream?text=Hello%20world&voice=Wayne&cfg=1.5&steps=5'
);
ws.binaryType = 'arraybuffer';

ws.onmessage = (event) => {
  if (typeof event.data === 'string') {
    // JSON log messages
    const message = JSON.parse(event.data);
    console.log(message.event, message.data);
  } else {
    // Binary audio data (PCM16)
    const audioChunk = new Int16Array(event.data);
    // Play or buffer the audio chunk
  }
};
The WebSocket streams audio in the following format:
- Sample Rate: 24,000 Hz
- Encoding: PCM 16-bit signed integer
- Channels: Mono
- Byte Order: Little-endian
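The same decoding works outside the browser. A minimal Python sketch (stdlib only, no VibeVoice dependency) that converts a raw little-endian PCM16 chunk to floats in [-1.0, 1.0), mirroring the JavaScript conversion below:

```python
import struct

def pcm16le_to_floats(chunk: bytes) -> list[float]:
    """Decode little-endian 16-bit signed PCM bytes to floats in [-1.0, 1.0)."""
    n = len(chunk) // 2
    samples = struct.unpack("<%dh" % n, chunk[: n * 2])
    return [s / 32768.0 for s in samples]

# Example: two samples, 0 and -32768 (the minimum 16-bit value)
print(pcm16le_to_floats(b"\x00\x00\x00\x80"))  # [0.0, -1.0]
```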
Playing Audio in Browser
const audioContext = new AudioContext({ sampleRate: 24000 });
const chunks = [];

ws.binaryType = 'arraybuffer'; // ensure binary frames arrive as ArrayBuffer
ws.onmessage = (event) => {
  if (event.data instanceof ArrayBuffer) {
    const pcm16 = new Int16Array(event.data);
    // Convert PCM16 to Float32 for the Web Audio API
    const float32 = new Float32Array(pcm16.length);
    for (let i = 0; i < pcm16.length; i++) {
      float32[i] = pcm16[i] / 32768.0;
    }
    chunks.push(float32);
  }
};

ws.onclose = () => {
  // Concatenate all chunks and play
  const totalLength = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const audioData = new Float32Array(totalLength);
  let offset = 0;
  chunks.forEach(chunk => {
    audioData.set(chunk, offset);
    offset += chunk.length;
  });

  const audioBuffer = audioContext.createBuffer(1, audioData.length, 24000);
  audioBuffer.getChannelData(0).set(audioData);

  const source = audioContext.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioContext.destination);
  source.start();
};
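If you are consuming the stream from a script rather than a browser, the raw chunks can be written straight to a WAV file instead of played. A sketch using only Python's stdlib wave module; chunks here stands for a hypothetical list of raw PCM16 byte strings collected from the socket:

```python
import wave

def save_pcm16_chunks(chunks: list[bytes], path: str, sample_rate: int = 24000) -> None:
    """Write raw little-endian PCM16 mono chunks to a playable WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)

# Example: 0.1 s of silence in total, split across two chunks
silence = b"\x00\x00" * 1200
save_pcm16_chunks([silence, silence], "output.wav")
```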
Log Messages
The WebSocket sends JSON log messages during generation:
Event Types
{
  "type": "log",
  "event": "backend_request_received",
  "data": {
    "text_length": 42,
    "cfg_scale": 1.5,
    "inference_steps": 5,
    "voice": "Wayne"
  },
  "timestamp": "2026-03-03 14:22:15.123"
}
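The same text/binary split shown in the JavaScript handler applies to any client. A Python sketch of the dispatch logic; the frame payloads here are hand-constructed for illustration:

```python
import json

def handle_frame(frame, audio_chunks: list) -> str:
    """Route a WebSocket frame: str payloads are JSON logs, bytes are PCM16 audio."""
    if isinstance(frame, str):
        message = json.loads(frame)
        return f"log: {message['event']}"
    audio_chunks.append(frame)
    return f"audio: {len(frame)} bytes"

chunks = []
print(handle_frame('{"type": "log", "event": "backend_request_received", "data": {}}', chunks))
print(handle_frame(b"\x00\x01" * 512, chunks))
```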
REST Endpoints
Get Available Voices
curl http://localhost:3000/config
Response:
{
  "voices": [
    "en-WHTest_man",
    "Wayne",
    "Speaker01",
    "Speaker02"
  ],
  "default_voice": "en-WHTest_man"
}
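A client can use this response to pick a voice before opening the stream. A small sketch that parses the sample payload above and falls back to default_voice; it uses the example response verbatim rather than a live request:

```python
import json

config = json.loads("""
{
  "voices": ["en-WHTest_man", "Wayne", "Speaker01", "Speaker02"],
  "default_voice": "en-WHTest_man"
}
""")

def choose_voice(config: dict, preferred: str) -> str:
    """Use the preferred voice if the server offers it, else the server default."""
    if preferred in config["voices"]:
        return preferred
    return config["default_voice"]

print(choose_voice(config, "Wayne"))    # Wayne
print(choose_voice(config, "Missing"))  # en-WHTest_man
```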
Serve Static Files
The root endpoint / serves the web interface HTML:
curl http://localhost:3000/
Concurrency Control
The WebSocket server uses a lock to serialize requests, handling one at a time. If a generation is in progress, new connections receive a backend_busy message and are closed with code 1013:
if lock.locked():
    busy_message = {
        "type": "log",
        "event": "backend_busy",
        "data": {"message": "Please wait for the other requests to complete."},
        "timestamp": get_timestamp(),
    }
    await ws.send_text(json.dumps(busy_message))
    await ws.close(code=1013, reason="Service busy")
For production use with multiple concurrent users, consider deploying multiple instances behind a load balancer.
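On the client side, a 1013 close can be treated as retryable. A sketch of an exponential-backoff schedule a reconnecting client might use; the base, cap, and attempt count are illustrative, not part of the demo:

```python
def backoff_delays(base: float = 0.5, cap: float = 8.0, attempts: int = 6) -> list[float]:
    """Exponential backoff: base * 2^n seconds per attempt, capped at `cap`."""
    return [min(base * (2 ** n), cap) for n in range(attempts)]

print(backoff_delays())  # [0.5, 1.0, 2.0, 4.0, 8.0, 8.0]
```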
Environment Variables
The server reads configuration from environment variables:
- MODEL_PATH: Model path (set automatically from --model_path)
- MODEL_DEVICE: Device for inference (set automatically from --device)
- VOICE_PRESET: Default voice preset name (optional)
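Reading these follows the standard os.environ pattern. A sketch; the fallback values here are examples for illustration, not the demo's actual defaults:

```python
def load_config(env: dict) -> dict:
    """Read the demo's settings from an environment mapping.

    The fallbacks below are illustrative, not the demo's real defaults.
    """
    return {
        "model_path": env.get("MODEL_PATH", "microsoft/VibeVoice-Realtime-0.5B"),
        "device": env.get("MODEL_DEVICE", "cpu"),
        "voice": env.get("VOICE_PRESET"),  # None when unset
    }

print(load_config({"MODEL_DEVICE": "cuda"}))
```

In real use you would pass os.environ instead of a literal dict.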
Troubleshooting
Server Won’t Start
Check if the port is already in use:
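The exact command depends on your OS (lsof -i :3000 on macOS/Linux, netstat -ano on Windows); a portable sketch using Python's stdlib that simply tries to connect to the port:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

print(port_in_use(3000))
```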
Use a different port:
python demo/vibevoice_realtime_demo.py --port 8080
MPS Not Available Warning
If you see:
Warning: MPS not available. Falling back to CPU.
This means your system doesn’t support MPS. Use CUDA or CPU instead:
python demo/vibevoice_realtime_demo.py --device cpu
Voice Directory Not Found
If voice files aren’t loading:
RuntimeError: Voices directory not found: /path/to/demo/voices/streaming_model
Ensure voice .pt files exist in demo/voices/streaming_model/.