Multimodal Live API - Generative AI on Google Cloud

The Gemini Live API enables low-latency bidirectional voice and video interactions with Gemini. The Live API can process text, audio, and video input in real-time, providing text and audio output for natural conversational experiences.

Overview

The Live API provides WebSocket-based streaming for real-time multimodal conversations with sub-second latency:

Real-Time Audio

Stream audio input and receive natural speech responses with native audio processing

Video Streaming

Send video frames for visual understanding in real-time conversations

Low Latency

Sub-second response times for natural, interactive experiences

Function Calling

Integrate tools and APIs during live conversations

Key Features

Bidirectional streaming: Send and receive data simultaneously
Native audio processing: Raw 16-bit PCM audio, with 24kHz output (input is natively 16kHz)
Barge-in support: Interrupt model responses naturally
Multimodal input: Combine text, audio, and video in the same session
Tool integration: Call functions during conversations

Getting Started

Install Dependencies

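Install the WebSocket library. The examples pass additional_headers to websockets.connect, which recent releases accept (older releases named this parameter extra_headers):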

pip install --upgrade websockets

Set Up Authentication

Configure your Google Cloud project:

import subprocess

PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
MODEL_ID = "gemini-live-2.5-flash-native-audio"

# Generate a short-lived access token with the gcloud CLI
result = subprocess.run(
    ["gcloud", "auth", "print-access-token"],
    capture_output=True,
    text=True,
    check=True,  # raise if gcloud fails instead of using an empty token
)
access_token = result.stdout.strip()

Establish WebSocket Connection

Connect to the Live API endpoint:

import websockets
import json

api_host = f"{LOCATION}-aiplatform.googleapis.com"
service_url = (
    f"wss://{api_host}/ws/google.cloud.aiplatform.v1.LlmBidiService/"
    f"BidiGenerateContent"
)

headers = {
    "Authorization": f"Bearer {access_token}",
    "Content-Type": "application/json"
}

async with websockets.connect(service_url, additional_headers=headers) as ws:
    print("Connected to Gemini Live API")

Session Establishment

The Live API follows a strict WebSocket sub-protocol with four phases:

1. Handshake

Establish the WebSocket connection with OAuth 2.0 authentication:

import websockets

api_host = "us-central1-aiplatform.googleapis.com"
service_url = (
    f"wss://{api_host}/ws/google.cloud.aiplatform.v1.LlmBidiService/"
    f"BidiGenerateContent"
)

headers = {
    "Authorization": f"Bearer {access_token}",
    "Content-Type": "application/json"
}

async with websockets.connect(service_url, additional_headers=headers) as ws:
    print("Handshake complete")

2. Setup

Configure the session with model parameters:

setup = {
    "setup": {
        "model": f"projects/{PROJECT_ID}/locations/{LOCATION}/publishers/google/models/{MODEL_ID}",
        "generation_config": {
            "response_modalities": ["AUDIO"],
            "speech_config": {
                "voice_config": {
                    "prebuilt_voice_config": {
                        "voice_name": "Aoede"
                    }
                }
            }
        },
        "system_instruction": {
            "parts": [{"text": "You are a helpful assistant."}]
        }
    }
}

# Send setup message
await ws.send(json.dumps(setup))
setup_response = await ws.recv()
print("Setup complete")
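The setup payload above can also be assembled by a small helper, so the voice, modalities, and system instruction vary per session. This is a minimal sketch: `build_setup_message` and its parameters are hypothetical names, not part of the API, and the field layout simply mirrors the setup message shown above.

```python
import json


def build_setup_message(project_id, location, model_id,
                        voice_name="Aoede", modalities=("AUDIO",),
                        system_text=None):
    """Return the setup payload as a dict (hypothetical helper)."""
    model_path = (
        f"projects/{project_id}/locations/{location}"
        f"/publishers/google/models/{model_id}"
    )
    setup = {
        "model": model_path,
        "generation_config": {
            "response_modalities": list(modalities),
            "speech_config": {
                "voice_config": {
                    "prebuilt_voice_config": {"voice_name": voice_name}
                }
            },
        },
    }
    # Only include a system instruction when one is supplied
    if system_text is not None:
        setup["system_instruction"] = {"parts": [{"text": system_text}]}
    return {"setup": setup}


payload = build_setup_message(
    "your-project-id", "us-central1", "gemini-live-2.5-flash-native-audio",
    system_text="You are a helpful assistant.",
)
print(json.dumps(payload, indent=2))
```

The resulting dict can be passed to json.dumps and sent with ws.send exactly like the hand-written setup message.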

3. Session Loop

Run bidirectional send and receive loops concurrently:

import asyncio
import base64
import numpy as np

async def main():
    async with websockets.connect(service_url, additional_headers=headers) as ws:
        # Send setup
        await ws.send(json.dumps(setup))
        await ws.recv()
        
        # Define send loop
        async def send_loop():
            try:
                while True:
                    # Simulate the pacing of 20ms PCM16 audio chunks
                    # In production, read from the microphone and send each chunk
                    await asyncio.sleep(0.02)
            except asyncio.CancelledError:
                pass
        
        # Define receive loop
        async def receive_loop():
            try:
                async for message in ws:
                    response = json.loads(message)
                    
                    # Handle audio output
                    if "serverContent" in response:
                        parts = response["serverContent"].get("modelTurn", {}).get("parts", [])
                        for part in parts:
                            if "inlineData" in part:
                                # Audio data is base64-encoded PCM
                                pcm_data = base64.b64decode(part["inlineData"]["data"])
                                # Play audio or buffer for playback
                                
                    # Handle turn completion
                    if response.get("serverContent", {}).get("turnComplete"):
                        print("Turn complete")
                        
                    # Handle interruption (barge-in); the flag is part of serverContent
                    if response.get("serverContent", {}).get("interrupted"):
                        print("Model interrupted - stop playback")
                        
            except websockets.exceptions.ConnectionClosed:
                print("Connection closed")
        
        # Run both loops concurrently
        await asyncio.gather(send_loop(), receive_loop())

# Top-level await works in a notebook; in a script, use asyncio.run(main())
await main()
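One detail the session loop leaves open is shutdown: asyncio.gather waits for every coroutine, so an endless send loop keeps running after the receive loop has ended. The sketch below shows one possible way to stop both loops together, using asyncio.wait with FIRST_COMPLETED. The helper `run_until_first_done` and the two stand-in loops are hypothetical and only simulate the real send and receive loops.

```python
import asyncio


async def run_until_first_done(*coros):
    """Run coroutines concurrently; when one finishes, cancel the rest."""
    tasks = [asyncio.create_task(c) for c in coros]
    done, pending = await asyncio.wait(
        tasks, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    # Let cancelled tasks unwind before returning
    await asyncio.gather(*pending, return_exceptions=True)
    # Surface any exception raised by a finished task
    for task in done:
        task.result()


async def demo():
    log = []

    async def fake_send_loop():
        # Stand-in for the microphone send loop: runs until cancelled
        try:
            while True:
                await asyncio.sleep(0.01)
        except asyncio.CancelledError:
            log.append("send cancelled")
            raise

    async def fake_receive_loop():
        # Stand-in for a receive loop that ends when the socket closes
        await asyncio.sleep(0.05)
        log.append("receive finished")

    await run_until_first_done(fake_send_loop(), fake_receive_loop())
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a notebook, replace the asyncio.run call with `await demo()`, since an event loop is already running there.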

4. Termination

Close the WebSocket connection:

await ws.close()
print("Session terminated")

Message Types

Client Messages

Send text messages to the model:

async def send_text(ws, text_input: str):
    msg = {
        "client_content": {
            "turns": [
                {
                    "role": "user",
                    "parts": [{"text": text_input}]
                }
            ],
            "turn_complete": True
        }
    }
    await ws.send(json.dumps(msg))

await send_text(ws, "Hello, Gemini!")

Stream audio chunks in real-time:

async def send_audio(ws, audio_chunk: bytes):
    # audio_chunk should be raw PCM16, mono. The rate in the MIME type must
    # match the chunk's sample rate (16kHz is the API's native input rate).
    encoded = base64.b64encode(audio_chunk).decode('utf-8')
    
    msg = {
        "realtime_input": {
            "media_chunks": [
                {
                    "mime_type": "audio/pcm;rate=16000",
                    "data": encoded
                }
            ]
        }
    }
    await ws.send(json.dumps(msg))

Send video frames for visual understanding:

async def send_video_frame(ws, frame_data: bytes):
    # frame_data should be JPEG-encoded image
    encoded = base64.b64encode(frame_data).decode('utf-8')
    
    msg = {
        "realtime_input": {
            "media_chunks": [
                {
                    "mime_type": "image/jpeg",
                    "data": encoded
                }
            ]
        }
    }
    await ws.send(json.dumps(msg))

Send function call results back to the model:

async def send_tool_response(ws, function_name: str, result: dict):
    msg = {
        "tool_response": {
            "function_responses": [
                {
                    "name": function_name,
                    "response": result
                }
            ]
        }
    }
    await ws.send(json.dumps(msg))

Server Messages

Handle different types of responses from the server:

async def handle_server_message(ws, message: str):
    # play_audio, stop_audio_playback, and execute_function are placeholders
    # for your application's own playback and tool code.
    response = json.loads(message)
    
    # Model's text/audio output
    if "serverContent" in response:
        server_content = response["serverContent"]
        model_turn = server_content.get("modelTurn", {})
        
        for part in model_turn.get("parts", []):
            # Text output
            if "text" in part:
                print(f"Model: {part['text']}")
            
            # Audio output (PCM16 at 24kHz)
            if "inlineData" in part:
                audio_data = base64.b64decode(part["inlineData"]["data"])
                # Play audio through speakers
                play_audio(audio_data)
        
        # Check if model finished speaking
        if server_content.get("turnComplete"):
            print("Model finished turn")
        
        # Model was interrupted (user started speaking)
        if server_content.get("interrupted"):
            print("Barge-in detected - stopping playback")
            stop_audio_playback()
    
    # Model wants to call a function
    if "toolCall" in response:
        for call in response["toolCall"].get("functionCalls", []):
            function_name = call["name"]
            args = call.get("args", {})
            
            # Execute function
            result = execute_function(function_name, args)
            
            # Send result back (ws is passed in so the handler can reply)
            await send_tool_response(ws, function_name, result)
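The handler above calls execute_function without defining it. One possible shape for it is a registry that maps declared function names to local callables. This is a sketch under assumptions: `TOOL_REGISTRY`, `register_tool`, and the `get_weather` stub with its return values are hypothetical, and returning an error dict for unknown or failing calls is a design choice rather than something the API requires.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: declared function name -> local callable
TOOL_REGISTRY: Dict[str, Callable[..., Dict[str, Any]]] = {}


def register_tool(name: str):
    """Decorator that adds a callable to the registry under `name`."""
    def decorator(func):
        TOOL_REGISTRY[name] = func
        return func
    return decorator


@register_tool("get_weather")
def get_weather(location: str) -> Dict[str, Any]:
    # Placeholder data; a real implementation would call a weather service
    return {"location": location, "temp": 21, "condition": "sunny"}


def execute_function(name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Look up and run a registered tool, reporting failures as data."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return {"error": f"unknown function: {name}"}
    try:
        return handler(**args)
    except Exception as exc:  # bad arguments or a failing tool
        return {"error": str(exc)}


print(execute_function("get_weather", {"location": "Paris"}))
```

Returning errors as ordinary results keeps the receive loop alive and lets the model see why a call failed.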

Complete Example: Text to Speech

A simple text-to-speech example:

import asyncio
import websockets
import json
import base64
import numpy as np
from IPython.display import Audio, display

async def text_to_speech(text_input: str):
    async with websockets.connect(service_url, additional_headers=headers) as ws:
        # Setup
        await ws.send(json.dumps(setup))
        await ws.recv()
        
        # Send text
        msg = {
            "client_content": {
                "turns": [{"role": "user", "parts": [{"text": text_input}]}],
                "turn_complete": True
            }
        }
        await ws.send(json.dumps(msg))
        
        # Collect audio response
        audio_data = []
        async for message in ws:
            response = json.loads(message)
            
            # Extract audio
            if "serverContent" in response:
                parts = response["serverContent"].get("modelTurn", {}).get("parts", [])
                for part in parts:
                    if "inlineData" in part:
                        pcm_data = base64.b64decode(part["inlineData"]["data"])
                        audio_data.append(np.frombuffer(pcm_data, dtype=np.int16))
            
            # Check if complete
            if response.get("serverContent", {}).get("turnComplete"):
                break
        
        # Play audio
        if audio_data:
            display(Audio(np.concatenate(audio_data), rate=24000, autoplay=True))

# Use it
await text_to_speech("Hello! How are you today?")
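The example above plays audio through IPython, which only works in a notebook. Outside a notebook, the collected PCM bytes can be wrapped in a WAV container with the standard library's wave module. The sketch below assumes 16-bit mono audio at 24kHz, as stated for the model output; `pcm16_to_wav_bytes` is a hypothetical helper name.

```python
import io
import wave


def pcm16_to_wav_bytes(pcm: bytes, sample_rate: int = 24000,
                       channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


# Usage: 20 ms of silence at 24 kHz (480 samples, 960 bytes)
wav_bytes = pcm16_to_wav_bytes(bytes(960))
print(len(wav_bytes))
```

The returned bytes can be written to a .wav file with open(path, "wb") and played by any audio player.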

Function Calling in Live Sessions

Integrate tools and APIs during conversations:

# Define tools in setup
setup_with_tools = {
    "setup": {
        "model": f"projects/{PROJECT_ID}/locations/{LOCATION}/publishers/google/models/{MODEL_ID}",
        "tools": [
            {
                "function_declarations": [
                    {
                        "name": "get_weather",
                        "description": "Get current weather for a location",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "location": {
                                    "type": "string",
                                    "description": "City name"
                                }
                            },
                            "required": ["location"]
                        }
                    }
                ]
            }
        ]
    }
}

# Handle function calls in receive loop (pass in the open WebSocket)
async def receive_with_tools(ws):
    async for message in ws:
        response = json.loads(message)
        
        if "toolCall" in response:
            for call in response["toolCall"].get("functionCalls", []):
                if call["name"] == "get_weather":
                    location = call["args"]["location"]
                    weather = fetch_weather(location)  # Your implementation
                    
                    # Send result back
                    await send_tool_response(ws, "get_weather", {
                        "temperature": weather["temp"],
                        "condition": weather["condition"]
                    })
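A toolCall message can carry several function calls, so it may be convenient to answer them all in a single tool_response. The sketch below builds that payload as a pure function. It assumes each call may carry an `id` field that should be echoed back when present; `build_tool_response` and the `execute` callback are hypothetical names.

```python
def build_tool_response(tool_call: dict, execute) -> dict:
    """Build one tool_response payload covering every call in tool_call.

    `execute(name, args)` is any callable that returns a JSON-serializable
    dict for the named function.
    """
    responses = []
    for call in tool_call.get("functionCalls", []):
        entry = {
            "name": call["name"],
            "response": execute(call["name"], call.get("args", {})),
        }
        # Assumption: echo the call id when the server supplied one
        if "id" in call:
            entry["id"] = call["id"]
        responses.append(entry)
    return {"tool_response": {"function_responses": responses}}


example_call = {
    "functionCalls": [
        {"id": "call-1", "name": "get_weather", "args": {"location": "Oslo"}}
    ]
}
print(build_tool_response(
    example_call, lambda name, args: {"temperature": 5, "condition": "snow"}
))
```

The returned dict is sent with `await ws.send(json.dumps(...))`, the same way as the other client messages.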

Use Cases

Voice Assistants

Build natural voice interfaces for customer support, information retrieval, and task automation

Real-Time Translation

Provide live translation services with audio input and output

Gaming NPCs

Create interactive game characters with natural voice responses

Visual Q&A

Answer questions about live video feeds or camera input

Customer Service

Handle customer inquiries with voice and screen sharing

Education

Interactive tutoring with multimodal explanations

Best Practices

Audio Format Requirements

Format: PCM16 (linear 16-bit PCM, little-endian)
Sample rate: 16kHz native for input; output is 24kHz
Channels: Mono (1 channel)
Chunk size: ~20ms recommended (320 samples at 16kHz, 480 samples at 24kHz)
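The chunk size follows from sample rate times duration: 20 ms at 24kHz is 480 samples, or 960 bytes of 16-bit mono audio, and 20 ms at 16kHz is 320 samples, or 640 bytes. A small sketch of that arithmetic, with a generator that slices a PCM buffer into chunks; both function names are hypothetical.

```python
def chunk_size_bytes(sample_rate_hz: int, chunk_ms: int = 20,
                     bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes needed for chunk_ms of PCM audio at the given sample rate."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * bytes_per_sample * channels


def iter_pcm_chunks(pcm: bytes, sample_rate_hz: int, chunk_ms: int = 20):
    """Yield consecutive chunks of a PCM16 mono buffer; the last may be short."""
    size = chunk_size_bytes(sample_rate_hz, chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


print(chunk_size_bytes(24000))  # 480 samples * 2 bytes
print(chunk_size_bytes(16000))  # 320 samples * 2 bytes
```

Each yielded chunk can be base64-encoded and sent as one realtime_input message.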

Handle Barge-In Properly

Always stop audio playback immediately when you receive a message whose serverContent is marked interrupted. Continuing to play after an interruption creates a poor user experience.

if response.get("serverContent", {}).get("interrupted"):
    audio_queue.clear()
    stop_playback()
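One way to make this rule easy to follow is to route every server message through a single function that owns the playback queue. The sketch below is illustrative only: `PlaybackBuffer` and `apply_server_message` are hypothetical names, and the function accepts the interrupted flag either inside serverContent or at the top level of the message, since the exact placement is an assumption here.

```python
import base64
from collections import deque


class PlaybackBuffer:
    """FIFO of decoded PCM chunks waiting to be played."""

    def __init__(self):
        self._chunks = deque()

    def push(self, pcm: bytes) -> None:
        self._chunks.append(pcm)

    def pop(self):
        # Returns None when nothing is queued
        return self._chunks.popleft() if self._chunks else None

    def clear(self) -> int:
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped

    def __len__(self) -> int:
        return len(self._chunks)


def apply_server_message(response: dict, buffer: PlaybackBuffer) -> str:
    """Update the buffer from one parsed server message; return the event kind."""
    content = response.get("serverContent", {})
    if content.get("interrupted") or response.get("interrupted"):
        buffer.clear()  # drop queued audio so stale speech is never played
        return "interrupted"
    for part in content.get("modelTurn", {}).get("parts", []):
        if "inlineData" in part:
            buffer.push(base64.b64decode(part["inlineData"]["data"]))
    return "turn_complete" if content.get("turnComplete") else "content"
```

A playback task can then call pop() in a loop, so clearing the buffer silences the model within one chunk.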

Performance Tips

Use asyncio: Run send and receive loops concurrently for lowest latency
Buffer management: Keep audio buffers small to minimize delay
Error handling: Implement reconnection logic for network issues
Token expiration: Refresh OAuth tokens before they expire (default 60 minutes)

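# Refresh token before expiration
import time

token_expiry = time.time() + 3600  # 1 hour

if time.time() > token_expiry - 300:  # 5 min before expiry
    # get_new_token() is a placeholder, e.g. rerun `gcloud auth print-access-token`
    access_token = get_new_token()
    token_expiry = time.time() + 3600  # reset the expiry for the new token
    # Reconnect with new token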
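For the reconnection logic mentioned under Error handling, a common pattern is to retry with exponential backoff. The sketch below is one possible shape, not a prescribed approach: `backoff_delays` and `connect_with_retry` are hypothetical helpers, `connect` is any coroutine function that opens the session, and catching only OSError is an assumption that should be widened to whatever errors your WebSocket client raises.

```python
import asyncio
import random


def backoff_delays(base=0.5, cap=30.0, attempts=6, jitter=0.0,
                   rng=random.random):
    """Yield exponentially growing delays, capped, with optional jitter."""
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        yield delay + jitter * delay * rng()


async def connect_with_retry(connect, delays, sleep=asyncio.sleep):
    """Call `connect()` until it succeeds or the delays run out."""
    last_error = None
    for delay in delays:
        try:
            return await connect()
        except OSError as exc:  # assumption: network failures surface as OSError
            last_error = exc
            await sleep(delay)
    raise ConnectionError("gave up reconnecting") from last_error


print(list(backoff_delays(base=0.5, cap=4.0, attempts=5)))
```

A little jitter (for example jitter=0.1) helps avoid many clients reconnecting at the same instant after an outage.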

Supported Models

The Live API supports specific Gemini models optimized for real-time interaction:

gemini-live-2.5-flash-native-audio: Best for voice interactions
gemini-2.0-flash-exp: Experimental with multimodal support

See the official documentation for the latest model availability.

Next Steps

WebSocket Demo App

Complete reference implementation with React frontend

Native Audio SDK

Higher-level SDK for audio interactions

Function Calling

Learn more about integrating tools

Pricing

View Live API pricing details
