OpenAI Realtime is a low-latency, stateful API for speech-to-speech and multimodal sessions. This page explains how to wire it up with Rokid Glasses — covering the two-plane architecture, three integration patterns, the backend SDP broker, Android client setup, image injection for vision-augmented responses, and the tool loop.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/RealComputer/GlassKit/llms.txt
Use this file to discover all available pages before exploring further.
How It Works
Think in two planes:- Media plane — Android’s WebRTC peer connection carries microphone audio, optional camera media, and remote assistant audio back to the glasses speaker.
- Control plane — JSON events move over the
oai-eventsWebRTC data channel and a backend WebSocket sideband channel attached to the same Realtime call.
https://api.openai.com/v1/realtime/calls with a local API key, but do not ship that pattern — it embeds the key in the APK.
The important objects are:
- Session — model, voice, instructions, audio config, tools, turn detection, and output modalities.
- Conversation item — a user message, assistant message, tool call, tool output, image input, or audio item.
- Response — one model turn. In the default path Realtime creates a response automatically after VAD detects the end of speech. Use
response.createwhen the backend adds an item that needs a model answer. - Sideband — a server-side control channel attached to the call, used for tools, workflow state, guardrails, and image injection.
Use
gpt-realtime-2 for new integrations. Start with reasoning.effort: "low" for responsive speech-to-speech behavior, then raise it only for workflows that need deeper multi-step planning.Pattern Matrix
Three patterns cover most Rokid Glasses use cases. Choose based on whether Android owns the mic, whether vision augmentation is required, and whether the backend drives each spoken turn.| Pattern | Android media into Realtime | Session input config | Response trigger | Backend sideband role |
|---|---|---|---|---|
| Direct assistant | Mic audio, optional camera video | semantic_vad; transcription enabled when client renders user text | Automatic VAD response | Handle tools and optional observability |
| Backend-augmented vision | Mic audio only; camera goes to backend vision service | Same as direct assistant, but create_response = False | Backend waits for committed user audio item, injects image/context, then sends response.create | Store latest vision result, inject context, handle tools |
| Output-only backend speech | No mic track; receive-only audio transceiver | Usually omit audio.input; configure only output voice | Backend creates text items and sends response.create | Own workflow state, cancel/replace speech, handle tools |
Session Configuration
- Direct Assistant
- Backend-Augmented Vision
- Backend-Controlled Speech
Use this when the model can own the conversation. Android streams microphone audio and optionally camera media. Keep automatic turn creation enabled.
System Instructions
Realtime instructions define the assistant’s role, speaking style, visual grounding rules, and tool policy. For smart glasses, keep spoken output short and explicitly handle unclear audio or poor framing.Backend SDP Broker
The backend accepts the SDP offer from Android, forwards it to OpenAI Realtime as a multipart form, and returns the SDP answer. It also opens the sideband WebSocket for control messages.Endpoint
Realtime Call Creation
Sideband Connection
session.update, call tools, insert conversation items, cancel active speech, and create responses.
Android Client Contract
Create the data channel
Use
oai-events as the data channel name. Create it before generating the offer.Add tracks and transceivers
Create data channels, local tracks, and receive-only transceivers before calling
createOffer. For direct assistant mode, add a local microphone audio track and set OfferToReceiveAudio to "true". For output-only mode, add a receive-only audio transceiver instead of a mic track.Complete the SDP exchange
Wait for ICE gathering, POST the full local SDP to your backend
/session/{id}/realtime endpoint, normalize the answer SDP, then set the remote description.Handle events and transcripts
For direct assistant mode parse these events from
oai-events:conversation.item.input_audio_transcription.completed— user textresponse.output_audio_transcript.done— final assistant textresponse.output_audio_transcript.delta— live captions only
speech_epoch changes.event_id where possible:
Backend-Controlled Speech
Use this pattern when a backend service owns workflow state and decides the exact spoken line:- Set
openai_response_active = Truewhen sendingresponse.createor receivingresponse.created. - Set it back to
Falseonresponse.done. - If
response.cancelreturns an error with coderesponse_cancel_not_active, treat it as benign and clear the flag.
Tool Loop
Keep tools on the backend. The sideband receives completed function calls fromresponse.done. Send a function output item, then continue only when the model should keep reasoning or speaking from that output.
Intermediate tools (lookup, list) should generally continue. Terminal action tools should not — the backend has already updated workflow state and will speak the next line itself.
Image Injection
For vision-augmented sessions, insert the latest frame after Realtime commits the user’s audio item, then create the response exactly once.Event Sequence
Wait for the item to appear
On
conversation.item.added, check that the item id is pending and that the item is a user message containing input_audio.