Both in-app voice and Twilio phone calls are handled by the same AUDIO_AGENT, a RealtimeAgent built on OpenAI's realtime API, but they differ in transport and audio encoding.
The voice orchestrator agent
The voice orchestrator is defined in connectors/voice_orchestrator.py as a RealtimeAgent rather than a standard Agent.
RealtimeAgent streams audio bidirectionally with low latency using OpenAI’s realtime model. It does not batch text responses — it generates speech tokens incrementally as it processes the request.
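As a rough sketch of the shape of that definition: the RealtimeAgent class below is a local stand-in (in the real code it comes from the OpenAI Agents SDK), and the field names and instruction text are assumptions, not the actual voice_orchestrator.py contents.

```python
from dataclasses import dataclass, field

# Stand-in for the SDK's RealtimeAgent, to illustrate the shape of the
# definition in connectors/voice_orchestrator.py. Field names are assumptions.
@dataclass
class RealtimeAgent:
    name: str
    instructions: str
    tools: list = field(default_factory=list)

AUDIO_AGENT = RealtimeAgent(
    name="voice_orchestrator",
    instructions=(
        "You are a voice assistant. Keep replies short and speakable; "
        "avoid markdown and other formatting characters."
    ),
    tools=[],  # populated with the curated voice tool subset
)
```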
Voice orchestrator tool set
The voice orchestrator uses a curated subset of the available tools. Tools that require screen rendering (e.g. Google Docs, Google Calendar event creation), involve multi-step confirmation flows or rich formatted output, or are too slow for conversational use are excluded. Accordingly, Google Calendar, Google Docs, Gmail, Plaid, and Amadeus are not part of the voice tool set.
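A minimal sketch of how such a curated subset could be derived. The tool names and exclusion prefixes here are hypothetical, not the actual tool registry:

```python
# Hypothetical tool registry; names are illustrative only.
ALL_TOOLS = [
    "flightaware_lookup", "yelp_search", "weather_current",
    "google_calendar_create_event", "google_docs_write",
    "gmail_send", "plaid_balances", "amadeus_book_flight",
]

# Integrations excluded from voice: multi-step confirmations,
# rich formatted output, or too slow for conversational use.
EXCLUDED_PREFIXES = ("google_calendar", "google_docs", "gmail", "plaid", "amadeus")

VOICE_TOOLS = [t for t in ALL_TOOLS if not t.startswith(EXCLUDED_PREFIXES)]
print(VOICE_TOOLS)  # ['flightaware_lookup', 'yelp_search', 'weather_current']
```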
Text agents vs. voice agents
- Text agents
- Voice agents (REALTIME_*)
Text agents are standard Agent instances that use GPT-4o. They are optimized for rich, formatted output:
- Full feature set: all 35+ agents available as handoffs
- Markdown-formatted responses with lists, bold, and code
- Support for complex multi-step operations
- Can return long-form content
- Responses are streamed character-by-character over WebSocket
- Token usage tracked per interaction
Voice agents exist as a REALTIME_* variant alongside their standard version (e.g. REALTIME_FLIGHTAWARE_AGENT, YELP_REALTIME_AGENT). These variants carry voice-tuned system prompts that instruct the agent to keep responses brief and avoid formatting characters.
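A minimal sketch of deriving a voice-tuned prompt from a standard agent's instructions. The wrapper text is an assumption about what "voice-tuned" means here, not the actual system prompt:

```python
# Assumed voice-tuning suffix appended to each REALTIME_* agent's prompt.
VOICE_STYLE = (
    "You are speaking aloud. Keep responses to one or two short sentences. "
    "Do not use markdown, bullet points, code, or any formatting characters."
)

def make_realtime_prompt(base_instructions: str) -> str:
    """Derive a REALTIME_* variant's prompt from the standard agent's prompt."""
    return f"{base_instructions}\n\n{VOICE_STYLE}"

FLIGHTAWARE_PROMPT = "You look up live flight status via FlightAware."
REALTIME_FLIGHTAWARE_PROMPT = make_realtime_prompt(FLIGHTAWARE_PROMPT)
```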
In-app voice (WebRTC)
In-app voice chat connects the browser directly to ODAI over a WebSocket. Audio is encoded as 16-bit PCM and transferred as base64. Endpoint:
WSS /app/voice/stream/{session_id}?token={auth_token}
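As a sketch of the transport, a client-side audio frame might look like the following. The exact message schema (the "type" and "data" field names) is an assumption; only the encoding (16-bit PCM, base64) is specified here:

```python
import base64
import json

def make_audio_frame(pcm16_bytes: bytes) -> str:
    """Wrap raw PCM16 audio in a JSON frame for the voice WebSocket.

    The field names are hypothetical; the doc only specifies that audio
    travels as base64-encoded 16-bit PCM.
    """
    return json.dumps({
        "type": "audio",
        "data": base64.b64encode(pcm16_bytes).decode("ascii"),
    })

# Example: 160 little-endian PCM16 samples (10 ms at 16 kHz).
frame = make_audio_frame(b"\x00\x01" * 160)
```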
Session lifecycle
Connect
The client opens a WebSocket to /app/voice/stream/{session_id}. The server instantiates a RealtimeWebSocketManager, creates a RealtimeRunner with the AUDIO_AGENT, and enters the session.
Greeting
The server sends an automatic greeting message to the realtime session:
"Greet the user with 'Hello! Welcome to the O-die Voice Assistant...'". The model synthesizes speech and pushes audio chunks back to the client immediately.Audio exchange
The client sends audio frames as JSON. The server decodes the base64 payload and forwards the raw bytes to the OpenAI realtime session via session.send_audio(audio_bytes).
Receiving responses
The server pushes response events back to the client as JSON; audio responses are base64-encoded. Other events include agent_start, agent_end, tool_start, tool_end, audio_interrupted, and audio_end.
Realtime session configuration (in-app)
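At the wire level, an in-app session configuration plausibly resembles the OpenAI realtime API's session.update message below. This is a sketch: the specific values (modalities, server-side VAD) are assumptions rather than ODAI's actual settings; pcm16 matches the 16-bit PCM transport described above.

```python
# Assumed wire-level session configuration for the in-app channel,
# following the OpenAI realtime API's session.update shape.
SESSION_UPDATE = {
    "type": "session.update",
    "session": {
        "modalities": ["audio", "text"],
        "input_audio_format": "pcm16",   # in-app audio is 16-bit PCM
        "output_audio_format": "pcm16",
        "turn_detection": {"type": "server_vad"},
    },
}
```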
Twilio phone call voice
Phone calls are routed through Twilio Media Streams. Audio is encoded as G.711 µ-law at 8 kHz to match Twilio's native format, eliminating re-encoding overhead.
Call flow
Incoming call
Twilio receives an inbound call and hits the ODAI webhook. The server responds with TwiML that instructs Twilio to open a Media Stream WebSocket back to ODAI.
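The webhook response can be sketched as plain TwiML with a Connect/Stream verb; the stream URL below is a placeholder, not the real ODAI endpoint:

```python
def twiml_for_stream(ws_url: str) -> str:
    """Build the TwiML that tells Twilio to open a Media Stream WebSocket.

    ws_url is a placeholder for the actual ODAI media endpoint.
    """
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        "<Connect>"
        f'<Stream url="{ws_url}"/>'
        "</Connect>"
        "</Response>"
    )

twiml = twiml_for_stream("wss://example.com/twilio/media")
```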
WebSocket connection
Twilio connects to the ODAI WebSocket endpoint. The server instantiates a TwilioHandler, starts a RealtimeRunner with G.711 µ-law audio formats, and accepts the WebSocket.
Stream start
Twilio sends a start event containing the streamSid and callSid. The handler fetches caller info, extracts the phone number, and sends a greeting message to the realtime session.
Audio buffering
Incoming audio frames from Twilio are buffered in 50 ms chunks (400 bytes at 8 kHz). When the buffer reaches threshold or a 100 ms timeout elapses, the buffer is flushed to the OpenAI realtime session.
AI audio response
The realtime session emits audio events. The handler base64-encodes the audio and sends it back to Twilio as a media event. A mark event is sent after each audio chunk for playback tracking.
Interruption handling
Semantic VAD is used for phone calls. When the model detects the caller is speaking mid-response, it emits audio_interrupted. The handler sends a Twilio clear event to stop queued audio playback immediately.
Realtime session configuration (Twilio)
interrupt_response=True allows the caller to interrupt the assistant mid-response.
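The Twilio-side messages referenced in the call flow (media, mark, clear) follow Twilio's Media Streams wire format; a sketch, with placeholder streamSid and mark-name values:

```python
import base64
import json

def media_event(stream_sid: str, ulaw_bytes: bytes) -> str:
    """Outbound audio chunk: base64 µ-law in Twilio's media message."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(ulaw_bytes).decode("ascii")},
    })

def mark_event(stream_sid: str, name: str) -> str:
    """Sent after each audio chunk so Twilio reports playback progress."""
    return json.dumps({"event": "mark", "streamSid": stream_sid,
                       "mark": {"name": name}})

def clear_event(stream_sid: str) -> str:
    """Sent on audio_interrupted to drop queued audio immediately."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```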