ODAI supports two voice interaction modes: in-app voice chat over WebRTC and phone call voice chat via Twilio. Both modes share the same underlying AUDIO_AGENT — a RealtimeAgent built on OpenAI’s realtime API — but differ in transport and audio encoding.

The voice orchestrator agent

The voice orchestrator is defined in connectors/voice_orchestrator.py as a RealtimeAgent rather than a standard Agent:
AUDIO_AGENT = RealtimeAgent(
    name="ODAI-Voice",
    instructions=RECOMMENDED_PROMPT_PREFIX + SYSTEM_MESSAGE,
    tools=[hangup_call, *VOICE_ORCHESTRATOR_TOOLS],
)
RealtimeAgent streams audio bidirectionally with low latency using OpenAI’s realtime model. It does not batch text responses — it generates speech tokens incrementally as it processes the request.

Voice orchestrator tool set

The voice orchestrator uses a curated subset of all available tools. Tools that require screen rendering (e.g. Google Docs, Google Calendar event creation), involve multi-step confirmation flows, or are too slow for conversational use are excluded.
VOICE_ORCHESTRATOR_TOOLS = [
    *COINMARKETCAP_TOOLS,
    *FINNHUB_TOOLS,
    *TRIPADVISOR_TOOLS,
    *GOOGLE_SHOPPING_TOOLS,
    *GOOGLE_NEWS_TOOLS,
    *GOOGLE_SEARCH_TOOLS,
    *AMTRAK_TOOLS,
    *TICKETMASTER_TOOLS,
    *WEATHERAPI_TOOLS,
    *EASYPOST_TOOLS,
    *MOVIEGLU_TOOLS,
    *YELP_TOOLS,
    *FETCH_WEBSITE_TOOLS,
    *FLIGHTAWARE_TOOLS,
]
Google Calendar, Google Docs, Gmail, Plaid, and Amadeus are excluded from the voice tool set: these integrations require multi-step confirmation flows, produce richly formatted output, or are otherwise poorly suited to real-time voice delivery.

Text agents vs. voice agents

Text agents are standard Agent instances that use GPT-4o. They are optimized for rich, formatted output.
  • Full feature set: all 35+ agents available as handoffs
  • Markdown-formatted responses with lists, bold, and code
  • Support for complex multi-step operations
  • Can return long-form content
  • Responses are streamed character-by-character over WebSocket
  • Token usage tracked per interaction
ORCHESTRATOR_AGENT = Agent(
    name="ODAI",
    model="gpt-4o",
    handoffs=[YELP_AGENT, GMAIL_AGENT, PLAID_AGENT, ...],
    model_settings=ModelSettings(include_usage=True)
)
Several individual agents expose a REALTIME_* variant alongside their standard version (e.g. REALTIME_FLIGHTAWARE_AGENT, YELP_REALTIME_AGENT). These variants carry voice-tuned system prompts that instruct the agent to keep responses brief and avoid formatting characters.

In-app voice (WebRTC)

In-app voice chat connects the browser directly to ODAI over a WebSocket. Audio is encoded as 16-bit PCM and transferred as base64.

Endpoint: WSS /app/voice/stream/{session_id}?token={auth_token}

Session lifecycle

1. Connect

The client opens a WebSocket to /app/voice/stream/{session_id}. The server instantiates a RealtimeWebSocketManager, creates a RealtimeRunner with the AUDIO_AGENT, and enters the session.
2. Greeting

The server sends an automatic greeting message to the realtime session: "Greet the user with 'Hello! Welcome to the O-die Voice Assistant...'". The model synthesizes speech and pushes audio chunks back to the client immediately.
3. Audio exchange

The client sends audio frames as JSON:
{ "event": "audio", "payload": "<base64-encoded PCM>" }
The server decodes the base64 payload and forwards the raw bytes to the OpenAI realtime session via session.send_audio(audio_bytes).
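The framing in this step can be sketched with the standard library. The function names below (`encode_audio_frame`, `decode_audio_frame`) are illustrative, not part of the actual codebase; only the JSON shape comes from the protocol above.

```python
import base64
import json

def encode_audio_frame(pcm_bytes: bytes) -> str:
    """Wrap raw 16-bit PCM bytes in the JSON frame the client sends."""
    payload = base64.b64encode(pcm_bytes).decode("ascii")
    return json.dumps({"event": "audio", "payload": payload})

def decode_audio_frame(message: str) -> bytes:
    """Recover the raw PCM bytes the server forwards to session.send_audio."""
    frame = json.loads(message)
    if frame["event"] != "audio":
        raise ValueError(f"unexpected event: {frame['event']!r}")
    return base64.b64decode(frame["payload"])
```

The two functions round-trip: `decode_audio_frame(encode_audio_frame(pcm))` returns the original bytes.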
4. Receiving responses

The server pushes response events back to the client as JSON. Audio responses are base64-encoded:
{ "event": "audio", "payload": "<base64-encoded PCM>" }
Other events include agent_start, agent_end, tool_start, tool_end, audio_interrupted, and audio_end.
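A client might dispatch on the event field as follows. This is a sketch only; the event names come from the protocol above, but the reactions (and the return-value convention) are illustrative assumptions, not the actual client code.

```python
import base64
import json

def handle_server_event(message: str) -> str:
    """Describe how a client might react to each server event."""
    event = json.loads(message)
    kind = event["event"]
    if kind == "audio":
        # Decode and hand the PCM bytes to the playback queue.
        pcm = base64.b64decode(event["payload"])
        return f"play {len(pcm)} bytes of PCM"
    if kind == "audio_interrupted":
        return "flush local playback queue"
    if kind == "audio_end":
        return "utterance complete"
    if kind in ("agent_start", "agent_end", "tool_start", "tool_end"):
        return f"update UI: {kind}"
    return f"ignore unknown event {kind!r}"
```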
5. Disconnect

When the WebSocket closes, RealtimeWebSocketManager.disconnect exits the session context, records the call duration via Segment analytics, and cleans up all session state.

Realtime session configuration (in-app)

REALTIME_RUN_CONFIG = RealtimeRunConfig(
    model_settings=RealtimeSessionModelSettings(
        voice="sage",
        turn_detection=RealtimeTurnDetectionConfig(
            type='server_vad',
            threshold=0.8,
            interrupt_response=False,
            silence_duration_ms=250
        )
    )
)
Server-side VAD (voice activity detection) is used with a 250 ms silence threshold. Interruptions are disabled so the assistant completes its current utterance before accepting new input.

Twilio phone call voice

Phone calls are routed through Twilio Media Streams. Audio is encoded as G.711 µ-law at 8 kHz to match Twilio’s native format, eliminating re-encoding overhead.

Call flow

1. Incoming call

Twilio receives an inbound call and hits the ODAI webhook. The server responds with TwiML that instructs Twilio to open a Media Stream WebSocket back to ODAI.
2. WebSocket connection

Twilio connects to the ODAI WebSocket endpoint. The server instantiates a TwilioHandler, starts a RealtimeRunner with G.711 µ-law audio formats, and accepts the WebSocket.
3. Stream start

Twilio sends a start event containing the streamSid and callSid. The handler fetches caller info, extracts the phone number, and sends a greeting message to the realtime session.
4. Audio buffering

Incoming audio frames from Twilio are buffered in 50 ms chunks (400 bytes at 8 kHz, since µ-law uses one byte per sample). When the buffer reaches the threshold or a 100 ms timeout elapses, the buffer is flushed to the OpenAI realtime session.
self.CHUNK_LENGTH_S = 0.05   # 50ms chunks
self.SAMPLE_RATE = 8000      # Twilio uses 8kHz for g711_ulaw
self.BUFFER_SIZE_BYTES = int(self.SAMPLE_RATE * self.CHUNK_LENGTH_S)
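The threshold side of this policy can be sketched as follows. `AudioBuffer` is a hypothetical class, not the actual TwilioHandler; the timeout-driven flush is left to the caller, which would call `force_flush()` when the 100 ms timer fires.

```python
class AudioBuffer:
    """Accumulate G.711 mu-law bytes and emit fixed-size 50 ms chunks."""

    CHUNK_LENGTH_S = 0.05   # 50ms chunks
    SAMPLE_RATE = 8000      # 8kHz, one byte per mu-law sample
    BUFFER_SIZE_BYTES = int(SAMPLE_RATE * CHUNK_LENGTH_S)  # 400 bytes

    def __init__(self) -> None:
        self._buf = bytearray()

    def add(self, frame: bytes) -> list[bytes]:
        """Append a frame; return any full 400-byte chunks ready to send."""
        self._buf.extend(frame)
        chunks = []
        while len(self._buf) >= self.BUFFER_SIZE_BYTES:
            chunks.append(bytes(self._buf[:self.BUFFER_SIZE_BYTES]))
            del self._buf[:self.BUFFER_SIZE_BYTES]
        return chunks

    def force_flush(self) -> bytes:
        """Drain whatever is buffered (for the 100 ms timeout path)."""
        out = bytes(self._buf)
        self._buf.clear()
        return out
```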
5. AI audio response

The realtime session emits audio events. The handler base64-encodes the audio and sends it back to Twilio as a media event:
{
  "event": "media",
  "streamSid": "<stream_sid>",
  "media": { "payload": "<base64-encoded g711 audio>" }
}
A mark event is sent after each audio chunk for playback tracking.
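Building the two outbound messages can be sketched like this. The message shapes follow the Twilio Media Streams format shown above; the function names and the mark naming scheme are illustrative assumptions.

```python
import base64
import json

def media_event(stream_sid: str, ulaw_bytes: bytes) -> str:
    """Build the 'media' message carrying base64 G.711 audio back to Twilio."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(ulaw_bytes).decode("ascii")},
    })

def mark_event(stream_sid: str, name: str) -> str:
    """Build the 'mark' message sent after each chunk for playback tracking."""
    return json.dumps({
        "event": "mark",
        "streamSid": stream_sid,
        "mark": {"name": name},
    })
```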
6. Interruption handling

Semantic VAD is used for phone calls. When the model detects the caller is speaking mid-response, it emits audio_interrupted. The handler sends a Twilio clear event to stop queued audio playback immediately.
7. Call end

When Twilio sends a stop event, the handler records call duration and analytics via Segment and closes the session.

Realtime session configuration (Twilio)

REALTIME_MODEL_CONFIG = RealtimeModelConfig(
    initial_model_settings=RealtimeSessionModelSettings(
        voice="sage",
        turn_detection=RealtimeTurnDetectionConfig(
            type='semantic_vad',
            interrupt_response=True,
        ),
        input_audio_format='g711_ulaw',
        output_audio_format='g711_ulaw'
    )
)
Semantic VAD is preferred for phone calls because it understands natural pauses in speech better than energy-based VAD, reducing false interruptions over noisy phone lines. interrupt_response=True allows the caller to interrupt the assistant mid-response.
