The WebSocket demo provides a web-based interface for real-time streaming text-to-speech synthesis with VibeVoice.

Quick Start

1. Launch the Server

Start the WebSocket server using the demo launcher:
python demo/vibevoice_realtime_demo.py \
  --port 3000 \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --device cuda

2. Open the Web Interface

Navigate to the demo in your browser:
http://localhost:3000

3. Generate Speech

  • Select a voice from the dropdown
  • Enter your text
  • Adjust generation parameters (optional)
  • Click generate to hear the result

Server Configuration

Command-Line Arguments

port (integer, default: 3000)
  Port number for the web server.

model_path (string, default: "default_model")
  Path to the HuggingFace model directory or model ID (e.g., microsoft/VibeVoice-Realtime-0.5B).

device (string, default: "cuda")
  Device for inference. Options: cuda, mps, cpu. The mpx typo is automatically corrected to mps.

reload (boolean, default: false)
  Enable auto-reload for development. Pass the --reload flag to enable it.

Launch Examples

# Default GPU launch
python demo/vibevoice_realtime_demo.py \
  --port 3000 \
  --model_path microsoft/VibeVoice-Realtime-0.5B \
  --device cuda

# CPU-only launch on an alternate port, with auto-reload for development
python demo/vibevoice_realtime_demo.py --port 8080 --device cpu --reload

WebSocket API

The demo exposes a WebSocket endpoint for streaming audio generation.

Endpoint

ws://localhost:3000/stream

Query Parameters

text (string, required)
  The text to synthesize into speech.

voice (string)
  Voice preset name. Available voices are loaded from demo/voices/streaming_model/.

cfg (float, default: 1.5)
  Classifier-Free Guidance scale. Higher values increase prompt adherence.

steps (integer, default: 5)
  Number of diffusion inference steps. More steps may improve quality but increase latency.
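
The query string can also be assembled programmatically. A minimal Python sketch (the build_stream_url helper and its defaults are illustrative, not part of the demo):

```python
from urllib.parse import urlencode

def build_stream_url(text, voice=None, cfg=1.5, steps=5,
                     host="localhost", port=3000):
    """Assemble the /stream WebSocket URL with URL-encoded query parameters."""
    params = {"text": text, "cfg": cfg, "steps": steps}
    if voice is not None:
        params["voice"] = voice
    return f"ws://{host}:{port}/stream?{urlencode(params)}"
```

urlencode takes care of escaping spaces and special characters in the text parameter.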

Connection Example

const ws = new WebSocket(
  'ws://localhost:3000/stream?text=Hello%20world&voice=Wayne&cfg=1.5&steps=5'
);

ws.binaryType = 'arraybuffer';

ws.onmessage = (event) => {
  if (typeof event.data === 'string') {
    // JSON log messages
    const message = JSON.parse(event.data);
    console.log(message.event, message.data);
  } else {
    // Binary audio data (PCM16)
    const audioChunk = new Int16Array(event.data);
    // Play or buffer the audio chunk
  }
};

Audio Format

The WebSocket streams audio in the following format:
  • Sample Rate: 24,000 Hz
  • Encoding: PCM 16-bit signed integer
  • Channels: Mono
  • Byte Order: Little-endian
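
A client in any language can decode these chunks by interpreting each pair of bytes as a little-endian signed 16-bit sample and scaling to [-1.0, 1.0). A standard-library Python sketch (the helper name is illustrative):

```python
import array
import sys

def pcm16_to_float(chunk: bytes) -> list:
    """Decode little-endian PCM16 bytes into floats in [-1.0, 1.0)."""
    samples = array.array("h")   # signed 16-bit integers
    samples.frombytes(chunk)
    if sys.byteorder == "big":
        samples.byteswap()       # the stream is little-endian
    return [s / 32768.0 for s in samples]
```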

Playing Audio in Browser

const audioContext = new AudioContext({ sampleRate: 24000 });
const chunks = [];
ws.binaryType = 'arraybuffer'; // ensure binary frames arrive as ArrayBuffer, not Blob

ws.onmessage = (event) => {
  if (event.data instanceof ArrayBuffer) {
    const pcm16 = new Int16Array(event.data);
    
    // Convert PCM16 to Float32 for Web Audio API
    const float32 = new Float32Array(pcm16.length);
    for (let i = 0; i < pcm16.length; i++) {
      float32[i] = pcm16[i] / 32768.0;
    }
    
    chunks.push(float32);
  }
};

ws.onclose = () => {
  // Concatenate all chunks and play
  const totalLength = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const audioData = new Float32Array(totalLength);
  
  let offset = 0;
  chunks.forEach(chunk => {
    audioData.set(chunk, offset);
    offset += chunk.length;
  });
  
  const audioBuffer = audioContext.createBuffer(1, audioData.length, 24000);
  audioBuffer.getChannelData(0).set(audioData);
  
  const source = audioContext.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioContext.destination);
  source.start();
};
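
If you want to keep the audio rather than play it, the raw PCM stream can be wrapped in a WAV container with Python's standard wave module (a sketch; the helper name is illustrative):

```python
import wave

def save_pcm_to_wav(pcm_bytes: bytes, path: str, sample_rate: int = 24000) -> None:
    """Wrap raw mono little-endian PCM16 audio in a WAV container."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)  # 24 kHz, per the format above
        wav.writeframes(pcm_bytes)
```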

Log Messages

The WebSocket sends JSON log messages during generation:

Event Types

{
  "type": "log",
  "event": "backend_request_received",
  "data": {
    "text_length": 42,
    "cfg_scale": 1.5,
    "inference_steps": 5,
    "voice": "Wayne"
  },
  "timestamp": "2026-03-03 14:22:15.123"
}
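
Because the socket interleaves JSON text frames with binary audio frames, clients typically branch on the frame type, as the JavaScript example above does. An equivalent Python sketch (handle_frame is an illustrative helper):

```python
import json

def handle_frame(frame):
    """Route a WebSocket frame: text frames are JSON logs, bytes are audio."""
    if isinstance(frame, (bytes, bytearray)):
        return ("audio", bytes(frame))
    message = json.loads(frame)
    return (message.get("event"), message.get("data"))
```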

REST Endpoints

Get Available Voices

curl http://localhost:3000/config
Response:
{
  "voices": [
    "en-WHTest_man",
    "Wayne",
    "Speaker01",
    "Speaker02"
  ],
  "default_voice": "en-WHTest_man"
}
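
A client can validate its voice choice against this response before connecting. A Python sketch using the sample payload above (pick_voice is an illustrative helper, not part of the demo):

```python
import json

def pick_voice(config: dict, requested: str = None) -> str:
    """Use the requested voice if the server offers it, else the default."""
    if requested in config["voices"]:
        return requested
    return config["default_voice"]

# Sample /config response from above
config = json.loads(
    '{"voices": ["en-WHTest_man", "Wayne", "Speaker01", "Speaker02"],'
    ' "default_voice": "en-WHTest_man"}'
)
```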

Serve Static Files

The root endpoint / serves the web interface HTML:
curl http://localhost:3000/

Concurrency Control

The WebSocket server holds a lock so that only one generation request runs at a time.
If a generation is already in progress, new connections receive a backend_busy message and are closed with code 1013 (Try Again Later).
if lock.locked():
    busy_message = {
        "type": "log",
        "event": "backend_busy",
        "data": {"message": "Please wait for the other requests to complete."},
        "timestamp": get_timestamp(),
    }
    await ws.send_text(json.dumps(busy_message))
    await ws.close(code=1013, reason="Service busy")
For production use with multiple concurrent users, consider deploying multiple instances behind a load balancer.
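
A client that receives a 1013 close can simply wait and reconnect. A sketch of an exponential backoff schedule for such retries (the helper and its constants are illustrative):

```python
def backoff_delays(attempts: int, base: float = 0.5, cap: float = 8.0):
    """Delays (in seconds) to sleep before each retry, doubling up to a cap."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]
```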

Environment Variables

The server reads configuration from environment variables:
  • MODEL_PATH: Model path (set automatically from --model_path)
  • MODEL_DEVICE: Device for inference (set automatically from --device)
  • VOICE_PRESET: Default voice preset name (optional)
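
Reading these variables follows the usual os.environ pattern. A Python sketch (the function and its fallback defaults are illustrative; variable names are the ones listed above):

```python
import os

def read_server_config(env=os.environ):
    """Collect server settings from an environment mapping."""
    return {
        "model_path": env.get("MODEL_PATH", "default_model"),
        "device": env.get("MODEL_DEVICE", "cuda"),
        "voice_preset": env.get("VOICE_PRESET"),  # optional; None when unset
    }
```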

Troubleshooting

Server Won’t Start

Check if the port is already in use:
lsof -i :3000
Use a different port:
python demo/vibevoice_realtime_demo.py --port 8080
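
You can also check port availability programmatically before launching, which avoids needing lsof. A Python sketch (the helper name is illustrative):

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind the port on this host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```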

MPS Not Available Warning

If you see:
Warning: MPS not available. Falling back to CPU.
This means your system doesn’t support MPS. Use CUDA or CPU instead:
python demo/vibevoice_realtime_demo.py --device cpu

Voice Directory Not Found

If voice files aren’t loading:
RuntimeError: Voices directory not found: /path/to/demo/voices/streaming_model
Ensure voice .pt files exist in demo/voices/streaming_model/.
