Overview
The Realtime class is an abstract base class for LLMs that can receive and process both audio and video in real time. It extends OmniLLM to provide a full multimodal interface with an event-driven architecture.
Location: vision_agents.core.llm.realtime.Realtime
Usage
Constructor
The number of video frames per second to send to the model (for implementations that support setting fps).
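Since the concrete constructor signature is not shown on this page, here is a minimal sketch; the base class below is a stand-in for the real one, and the keyword argument name fps is an assumption based on the fps property documented further down:

```python
# Illustrative only: RealtimeBase is a stand-in for the real
# vision_agents.core.llm.realtime.Realtime class, and the keyword
# argument name `fps` is an assumption based on the fps property.
class RealtimeBase:
    def __init__(self, fps: int = 1):
        self._fps = fps

    @property
    def fps(self) -> int:
        # Video frames per second being sent to the model.
        return self._fps


class MyRealtime(RealtimeBase):
    def __init__(self):
        # Send one video frame per second to the model.
        super().__init__(fps=1)


print(MyRealtime().fps)  # 1
```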
Abstract Methods
Establish a connection to the real-time API. Implementations should:
- Connect to the provider’s WebSocket or streaming API
- Call _emit_connected_event() when ready
- Set up message handlers
Process incoming audio and generate a response.

Parameters:
- pcm (PcmData): PCM audio data to process
- participant (Optional[Participant]): Participant who sent the audio

Implementations should:
- Call _emit_audio_input_event() when receiving audio
- Forward audio to the provider’s API
- Call _emit_audio_output_event() when generating response audio
- Call _emit_audio_output_done_event() when complete
Close the connection and clean up resources. Implementations should:
- Close WebSocket/streaming connections
- Cancel background tasks
- Call _emit_disconnected_event()
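The three abstract methods above can be sketched as follows. This is an illustration, not the real implementation: RealtimeStub stands in for the actual base class, the emit-helper names follow the descriptions on this page, and the audio method name is an assumption.

```python
# Illustrative sketch of the three abstract methods. RealtimeStub stands
# in for the real base class; the emit-helper names follow the
# documentation above, and the audio method name is an assumption.
import asyncio


class RealtimeStub:
    def __init__(self):
        self.events = []
        self._connected = False

    # Emit helpers (provided by the real base class).
    def _emit_connected_event(self, session_config=None, capabilities=None):
        self.events.append("connected")

    def _emit_audio_input_event(self, audio_data, user_metadata=None):
        self.events.append("audio_input")

    def _emit_audio_output_event(self, audio_data, response_id=None,
                                 user_metadata=None):
        self.events.append("audio_output")

    def _emit_audio_output_done_event(self, response_id=None,
                                      user_metadata=None):
        self.events.append("audio_output_done")

    def _emit_disconnected_event(self, reason=None, was_clean=True):
        self.events.append("disconnected")


class ExampleRealtime(RealtimeStub):
    async def connect(self):
        # 1. Connect to the provider's WebSocket/streaming API (stubbed).
        await asyncio.sleep(0)
        self._connected = True
        # 2. Emit the connected event once the session is ready.
        self._emit_connected_event(capabilities={"audio": True, "video": True})
        # 3. Set up message handlers (omitted in this sketch).

    async def simple_audio_response(self, pcm, participant=None):
        # Report incoming audio, forward it, then report the reply.
        self._emit_audio_input_event(pcm)
        # ... forward pcm to the provider's API (stubbed) ...
        self._emit_audio_output_event(audio_data=pcm, response_id="resp-1")
        self._emit_audio_output_done_event(response_id="resp-1")

    async def close(self):
        # Close connections, cancel tasks, then emit the disconnect.
        self._connected = False
        self._emit_disconnected_event(reason="client close", was_clean=True)


async def demo():
    rt = ExampleRealtime()
    await rt.connect()
    await rt.simple_audio_response(pcm=b"\x00\x00")  # fake PCM bytes
    await rt.close()
    return rt.events


print(asyncio.run(demo()))
```

The ordering of the emitted events mirrors the contract described above: connected, audio input, audio output, audio output done, disconnected.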
Properties
Whether the connection is currently active.
UUID identifying this session. Automatically generated.
Name of the provider (e.g., “gemini_realtime”, “openai_realtime”).
Video frames per second being sent to the model.
Event Emission Methods
The Realtime class provides helper methods for emitting structured events.

Connection Events
Emit a connected event when the session starts.

Parameters:
- session_config (Optional[dict]): Session configuration details
- capabilities (Optional[dict]): API capabilities (audio, video, etc.)
Emit a disconnected event when the session ends.

Parameters:
- reason (Optional[str]): Reason for disconnection
- was_clean (bool): Whether the disconnection was clean. Default: True
Audio Events
Emit an event when audio input is received.

Parameters:
- audio_data (PcmData): The audio data
- user_metadata (Optional[dict]): User metadata
Emit an event when audio output is generated.

Parameters:
- audio_data (PcmData): The audio data
- response_id (Optional[str]): Response identifier
- user_metadata (Optional[dict]): User metadata
Emit an event when audio output is complete.

Parameters:
- response_id (Optional[str]): Response identifier
- user_metadata (Optional[dict]): User metadata
Response Events
Emit a text response event.

Parameters:
- text (str): The response text
- response_id (Optional[str]): Response identifier
- is_complete (bool): Whether the response is complete. Default: True
- conversation_item_id (Optional[str]): Conversation item ID
- user_metadata (Optional[dict]): User metadata
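The is_complete flag supports streaming: partial chunks can be emitted as they arrive, followed by a final complete event. The sketch below illustrates that pattern; the stub class and helper name are stand-ins for the real base class.

```python
# Illustrative: streaming a text response as partial chunks followed by
# a final complete event, using the documented is_complete flag. The
# stub and helper name are stand-ins for the real base class.
class ResponseStub:
    def __init__(self):
        self.responses = []

    def _emit_response_event(self, text, response_id=None, is_complete=True,
                             conversation_item_id=None, user_metadata=None):
        self.responses.append((text, is_complete))


llm = ResponseStub()
# Partial chunks arrive with is_complete=False...
for chunk in ["Hel", "lo!"]:
    llm._emit_response_event(chunk, response_id="r1", is_complete=False)
# ...then a final event marks the response complete.
llm._emit_response_event("Hello!", response_id="r1", is_complete=True)
print(llm.responses)  # [('Hel', False), ('lo!', False), ('Hello!', True)]
```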
Transcription Events
Emit a user speech transcription event.

Parameters:
- text (str): Transcribed text
- original (Optional[Any]): Original provider response
Emit an agent speech transcription event.

Parameters:
- text (str): Transcribed text
- original (Optional[Any]): Original provider response
Error Events
Emit an error event.

Parameters:
- error (Exception): The error that occurred
- context (str): Error context. Default: ""
- user_metadata (Optional[dict]): User metadata
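A common pattern is to wrap provider calls and report failures through the emit helper rather than letting exceptions escape. In this sketch the helper name is assumed from the naming pattern above, and ErrorStub stands in for the real base class.

```python
# Illustrative error-handling pattern: wrap provider calls and report
# failures through the emit helper. The helper name is assumed from the
# naming pattern above; ErrorStub stands in for the real base class.
class ErrorStub:
    def __init__(self):
        self.errors = []

    def _emit_error_event(self, error, context="", user_metadata=None):
        self.errors.append((type(error).__name__, context))


rt = ErrorStub()
try:
    raise ConnectionError("websocket dropped")  # simulated provider failure
except Exception as exc:
    rt._emit_error_event(exc, context="audio forwarding")

print(rt.errors)  # [('ConnectionError', 'audio forwarding')]
```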
Plugin Example
Here’s how the Gemini Realtime plugin implements this interface:
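The actual plugin listing is omitted from this page. As an illustrative stand-in (not the real Gemini plugin code), a provider integration typically maps decoded provider messages onto the emit helpers; the message shapes and helper names below are assumptions:

```python
# Not the actual Gemini plugin: an illustrative sketch of dispatching
# decoded provider messages to the Realtime emit helpers. Message
# shapes and helper names are assumptions.
class TranscriptStub:
    def __init__(self):
        self.log = []

    def _emit_user_transcription_event(self, text, original=None):
        self.log.append(("user_transcript", text))

    def _emit_agent_transcription_event(self, text, original=None):
        self.log.append(("agent_transcript", text))

    def _emit_response_event(self, text, response_id=None, is_complete=True,
                             conversation_item_id=None, user_metadata=None):
        self.log.append(("response", text))


class GeminiLikeRealtime(TranscriptStub):
    def _on_provider_message(self, message):
        # Route each decoded provider message to the matching emit helper.
        kind = message.get("type")
        if kind == "input_transcription":
            self._emit_user_transcription_event(message["text"], original=message)
        elif kind == "output_transcription":
            self._emit_agent_transcription_event(message["text"], original=message)
        elif kind == "text":
            self._emit_response_event(message["text"])


plugin = GeminiLikeRealtime()
plugin._on_provider_message({"type": "input_transcription", "text": "hello"})
plugin._on_provider_message({"type": "text", "text": "Hi there!"})
print(plugin.log)  # [('user_transcript', 'hello'), ('response', 'Hi there!')]
```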
Events

Realtime implementations emit the following events:
- RealtimeConnectedEvent - When session connects
- RealtimeDisconnectedEvent - When session disconnects
- RealtimeAudioInputEvent - When audio input is received
- RealtimeAudioOutputEvent - When audio output is generated
- RealtimeAudioOutputDoneEvent - When audio output completes
- RealtimeResponseEvent - Text responses
- RealtimeUserSpeechTranscriptionEvent - User speech transcriptions
- RealtimeAgentSpeechTranscriptionEvent - Agent speech transcriptions
- RealtimeErrorEvent - Errors during processing