ODAI supports two voice interaction modes: in-app voice chat over WebRTC and phone call voice chat via Twilio. Both modes share the same underlying AUDIO_AGENT — a RealtimeAgent built on OpenAI’s realtime API — but differ in transport and audio encoding.

The voice orchestrator agent

The voice orchestrator is defined in connectors/voice_orchestrator.py as a RealtimeAgent rather than a standard Agent:
AUDIO_AGENT = RealtimeAgent(
    name="ODAI-Voice",
    instructions=RECOMMENDED_PROMPT_PREFIX + SYSTEM_MESSAGE,
    tools=[hangup_call, *VOICE_ORCHESTRATOR_TOOLS],
)
RealtimeAgent streams audio bidirectionally with low latency using OpenAI’s realtime model. It does not batch text responses — it generates speech tokens incrementally as it processes the request.

Voice orchestrator tool set

The voice orchestrator uses a curated subset of all available tools. Tools that require screen rendering (e.g. Google Docs, Google Calendar event creation), involve multi-step confirmation flows, or are too slow for conversational use are excluded.
VOICE_ORCHESTRATOR_TOOLS = [
    *COINMARKETCAP_TOOLS,
    *FINNHUB_TOOLS,
    *TRIPADVISOR_TOOLS,
    *GOOGLE_SHOPPING_TOOLS,
    *GOOGLE_NEWS_TOOLS,
    *GOOGLE_SEARCH_TOOLS,
    *AMTRAK_TOOLS,
    *TICKETMASTER_TOOLS,
    *WEATHERAPI_TOOLS,
    *EASYPOST_TOOLS,
    *MOVIEGLU_TOOLS,
    *YELP_TOOLS,
    *FETCH_WEBSITE_TOOLS,
    *FLIGHTAWARE_TOOLS,
]
Google Calendar, Google Docs, Gmail, Plaid, and Amadeus are excluded from the voice tool set: these integrations require multi-step confirmation flows, produce richly formatted output, or are otherwise poorly suited to real-time voice delivery.

Text agents vs. voice agents

Text agents are standard Agent instances that use GPT-4o. They are optimized for rich, formatted output.
  • Full feature set: all 35+ agents available as handoffs
  • Markdown-formatted responses with lists, bold, and code
  • Support for complex multi-step operations
  • Can return long-form content
  • Responses are streamed character-by-character over WebSocket
  • Token usage tracked per interaction
ORCHESTRATOR_AGENT = Agent(
    name="ODAI",
    model="gpt-4o",
    handoffs=[YELP_AGENT, GMAIL_AGENT, PLAID_AGENT, ...],
    model_settings=ModelSettings(include_usage=True)
)
Several individual agents expose a REALTIME_* variant alongside their standard version (e.g. REALTIME_FLIGHTAWARE_AGENT, YELP_REALTIME_AGENT). These variants carry voice-tuned system prompts that instruct the agent to keep responses brief and avoid formatting characters.

In-app voice (WebRTC)

In-app voice chat connects the browser directly to ODAI over a WebSocket. Audio is encoded as 16-bit PCM and transferred as base64.

Endpoint: WSS /app/voice/stream/{session_id}?token={auth_token}

Session lifecycle

1. Connect

The client opens a WebSocket to /app/voice/stream/{session_id}. The server instantiates a RealtimeWebSocketManager, creates a RealtimeRunner with the AUDIO_AGENT, and enters the session.
2. Greeting

The server sends an automatic greeting message to the realtime session: "Greet the user with 'Hello! Welcome to the O-die Voice Assistant...'". The model synthesizes speech and pushes audio chunks back to the client immediately.
3. Audio exchange

The client sends audio frames as JSON:
{ "event": "audio", "payload": "<base64-encoded PCM>" }
The server decodes the base64 payload and forwards the raw bytes to the OpenAI realtime session via session.send_audio(audio_bytes).
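The framing in this step can be sketched with the standard library. The function names below (`encode_audio_frame`, `decode_audio_frame`) are illustrative, not part of the actual codebase; only the JSON shape comes from the protocol above.

```python
import base64
import json

def encode_audio_frame(pcm_bytes: bytes) -> str:
    """Wrap raw 16-bit PCM bytes in the JSON frame the client sends."""
    payload = base64.b64encode(pcm_bytes).decode("ascii")
    return json.dumps({"event": "audio", "payload": payload})

def decode_audio_frame(message: str) -> bytes:
    """Recover the raw PCM bytes the server forwards to session.send_audio."""
    frame = json.loads(message)
    if frame["event"] != "audio":
        raise ValueError(f"unexpected event: {frame['event']!r}")
    return base64.b64decode(frame["payload"])
```

The two functions round-trip: `decode_audio_frame(encode_audio_frame(pcm))` returns the original bytes.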
4. Receiving responses

The server pushes response events back to the client as JSON. Audio responses are base64-encoded:
{ "event": "audio", "payload": "<base64-encoded PCM>" }
Other events include agent_start, agent_end, tool_start, tool_end, audio_interrupted, and audio_end.
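A client might dispatch on the event field as follows. This is a sketch only; the event names come from the protocol above, but the reactions (and the return-value convention) are illustrative assumptions, not the actual client code.

```python
import base64
import json

def handle_server_event(message: str) -> str:
    """Describe how a client might react to each server event."""
    event = json.loads(message)
    kind = event["event"]
    if kind == "audio":
        # Decode and hand the PCM bytes to the playback queue.
        pcm = base64.b64decode(event["payload"])
        return f"play {len(pcm)} bytes of PCM"
    if kind == "audio_interrupted":
        return "flush local playback queue"
    if kind == "audio_end":
        return "utterance complete"
    if kind in ("agent_start", "agent_end", "tool_start", "tool_end"):
        return f"update UI: {kind}"
    return f"ignore unknown event {kind!r}"
```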
5. Disconnect

When the WebSocket closes, RealtimeWebSocketManager.disconnect exits the session context, records the call duration via Segment analytics, and cleans up all session state.

Realtime session configuration (in-app)

REALTIME_RUN_CONFIG = RealtimeRunConfig(
    model_settings=RealtimeSessionModelSettings(
        voice="sage",
        turn_detection=RealtimeTurnDetectionConfig(
            type='server_vad',
            threshold=0.8,
            interrupt_response=False,
            silence_duration_ms=250
        )
    )
)
Server-side VAD (voice activity detection) is used with a 250 ms silence threshold. Interruptions are disabled so the assistant completes its current utterance before accepting new input.

Twilio phone call voice

Phone calls are routed through Twilio Media Streams. Audio is encoded as G.711 µ-law at 8 kHz to match Twilio’s native format, eliminating re-encoding overhead.

Call flow

1. Incoming call

Twilio receives an inbound call and hits the ODAI webhook. The server responds with TwiML that instructs Twilio to open a Media Stream WebSocket back to ODAI.
2. WebSocket connection

Twilio connects to the ODAI WebSocket endpoint. The server instantiates a TwilioHandler, starts a RealtimeRunner with G.711 µ-law audio formats, and accepts the WebSocket.
3. Stream start

Twilio sends a start event containing the streamSid and callSid. The handler fetches caller info, extracts the phone number, and sends a greeting message to the realtime session.
4. Audio buffering

Incoming audio frames from Twilio are buffered in 50 ms chunks (400 bytes at 8 kHz, since µ-law uses one byte per sample). When the buffer reaches the threshold or a 100 ms timeout elapses, the buffer is flushed to the OpenAI realtime session.
self.CHUNK_LENGTH_S = 0.05   # 50ms chunks
self.SAMPLE_RATE = 8000      # Twilio uses 8kHz for g711_ulaw
self.BUFFER_SIZE_BYTES = int(self.SAMPLE_RATE * self.CHUNK_LENGTH_S)
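The threshold side of this policy can be sketched as follows. `AudioBuffer` is a hypothetical class, not the actual TwilioHandler; the timeout-driven flush is left to the caller, which would call `force_flush()` when the 100 ms timer fires.

```python
class AudioBuffer:
    """Accumulate G.711 mu-law bytes and emit fixed-size 50 ms chunks."""

    CHUNK_LENGTH_S = 0.05   # 50ms chunks
    SAMPLE_RATE = 8000      # 8kHz, one byte per mu-law sample
    BUFFER_SIZE_BYTES = int(SAMPLE_RATE * CHUNK_LENGTH_S)  # 400 bytes

    def __init__(self) -> None:
        self._buf = bytearray()

    def add(self, frame: bytes) -> list[bytes]:
        """Append a frame; return any full 400-byte chunks ready to send."""
        self._buf.extend(frame)
        chunks = []
        while len(self._buf) >= self.BUFFER_SIZE_BYTES:
            chunks.append(bytes(self._buf[:self.BUFFER_SIZE_BYTES]))
            del self._buf[:self.BUFFER_SIZE_BYTES]
        return chunks

    def force_flush(self) -> bytes:
        """Drain whatever is buffered (for the 100 ms timeout path)."""
        out = bytes(self._buf)
        self._buf.clear()
        return out
```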
5. AI audio response

The realtime session emits audio events. The handler base64-encodes the audio and sends it back to Twilio as a media event:
{
  "event": "media",
  "streamSid": "<stream_sid>",
  "media": { "payload": "<base64-encoded g711 audio>" }
}
A mark event is sent after each audio chunk for playback tracking.
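Building the two outbound messages can be sketched like this. The message shapes follow the Twilio Media Streams format shown above; the function names and the mark naming scheme are illustrative assumptions.

```python
import base64
import json

def media_event(stream_sid: str, ulaw_bytes: bytes) -> str:
    """Build the 'media' message carrying base64 G.711 audio back to Twilio."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(ulaw_bytes).decode("ascii")},
    })

def mark_event(stream_sid: str, name: str) -> str:
    """Build the 'mark' message sent after each chunk for playback tracking."""
    return json.dumps({
        "event": "mark",
        "streamSid": stream_sid,
        "mark": {"name": name},
    })
```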
6. Interruption handling

Semantic VAD is used for phone calls. When the model detects the caller is speaking mid-response, it emits audio_interrupted. The handler sends a Twilio clear event to stop queued audio playback immediately.
7. Call end

When Twilio sends a stop event, the handler records call duration and analytics via Segment and closes the session.

Realtime session configuration (Twilio)

REALTIME_MODEL_CONFIG = RealtimeModelConfig(
    initial_model_settings=RealtimeSessionModelSettings(
        voice="sage",
        turn_detection=RealtimeTurnDetectionConfig(
            type='semantic_vad',
            interrupt_response=True,
        ),
        input_audio_format='g711_ulaw',
        output_audio_format='g711_ulaw'
    )
)
Semantic VAD is preferred for phone calls because it understands natural pauses in speech better than energy-based VAD, reducing false interruptions over noisy phone lines. interrupt_response=True allows the caller to interrupt the assistant mid-response.
