OpenAI Realtime Integration for Rokid Glasses Apps

OpenAI Realtime is a low-latency, stateful API for speech-to-speech and multimodal sessions. This page explains how to wire it up with Rokid Glasses — covering the two-plane architecture, three integration patterns, the backend SDP broker, Android client setup, image injection for vision-augmented responses, and the tool loop.

How It Works

Think in two planes:

Media plane — Android’s WebRTC peer connection carries microphone audio, optional camera media, and remote assistant audio back to the glasses speaker.
Control plane — JSON events move over the oai-events WebRTC data channel and a backend WebSocket sideband channel attached to the same Realtime call.

Android sends an SDP offer to your backend. The backend creates the Realtime call and forwards the SDP answer back to Android. For personal or hackathon work you can POST the offer directly to https://api.openai.com/v1/realtime/calls with a local API key, but do not ship that pattern, because it embeds the key in the APK.

The important objects are:

Session — model, voice, instructions, audio config, tools, turn detection, and output modalities.
Conversation item — a user message, assistant message, tool call, tool output, image input, or audio item.
Response — one model turn. In the default path Realtime creates a response automatically after VAD detects the end of speech. Use response.create when the backend adds an item that needs a model answer.
Sideband — a server-side control channel attached to the call, used for tools, workflow state, guardrails, and image injection.

Use gpt-realtime-2 for new integrations. Start with reasoning.effort: "low" for responsive speech-to-speech behavior, then raise it only for workflows that need deeper multi-step planning.

Pattern Matrix

Three patterns cover most Rokid Glasses use cases. Choose based on whether Android owns the mic, whether vision augmentation is required, and whether the backend drives each spoken turn.

Pattern	Android media into Realtime	Session input config	Response trigger	Backend sideband role
Direct assistant	Mic audio, optional camera video	`semantic_vad`; transcription enabled when client renders user text	Automatic VAD response	Handle tools and optional observability
Backend-augmented vision	Mic audio only; camera goes to backend vision service	Same as direct assistant, but `create_response = False`	Backend waits for committed user audio item, injects image/context, then sends `response.create`	Store latest vision result, inject context, handle tools
Output-only backend speech	No mic track; receive-only audio transceiver	Usually omit `audio.input`; configure only output voice	Backend creates text items and sends `response.create`	Own workflow state, cancel/replace speech, handle tools

Session Configuration

Direct Assistant

Use this when the model can own the conversation. Android streams microphone audio and optionally camera media. Keep automatic turn creation enabled.

session_config = {
    "type": "realtime",
    "model": "gpt-realtime-2",
    "reasoning": {"effort": "low"},
    "audio": {
        "input": {
            "noise_reduction": {"type": "near_field"},
            "transcription": {"language": "en", "model": "whisper-1"},
            "turn_detection": {"type": "semantic_vad"},
        },
        "output": {"voice": "marin"},
    },
    "instructions": SESSION_INSTRUCTIONS,
    "tools": [...],
}

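Backend-Augmented Vision

Android runs two links: audio to OpenAI Realtime, brokered by the backend, and camera video to a backend vision service. Disable automatic response creation so the backend can inject image context first.

# Same as the direct assistant config, except for the turn_detection flags.
session_config = {
    "type": "realtime",
    "model": "gpt-realtime-2",
    "reasoning": {"effort": "low"},
    "audio": {
        "input": {
            "noise_reduction": {"type": "near_field"},
            "transcription": {"language": "en", "model": "whisper-1"},
            "turn_detection": {
                "type": "semantic_vad",
                "create_response": False,
                "interrupt_response": False,
            },
        },
        "output": {"voice": "marin"},
    },
    "instructions": SESSION_INSTRUCTIONS,
    "tools": [...],
}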

Backend-Controlled Speech

Use this for server-authoritative workflows where the backend decides each step and the exact spoken line. Add a receive-only audio transceiver on Android so the SDP offer has an audio section.

session_config = {
    "type": "realtime",
    "model": "gpt-realtime-2",
    "reasoning": {"effort": "low"},
    "audio": {"output": {"voice": "cedar"}},
    "instructions": SESSION_INSTRUCTIONS,
    "tools": [...],
}

The backend sends text conversation items such as "Speak exactly this line: ..." followed by response.create.
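
The three configurations above differ only in the audio input block, so a backend could plausibly build them from one function. The following is a minimal sketch under that assumption; `build_session_config`, the `pattern` selector values, and `BASE_INPUT` are hypothetical names and not part of the Realtime API.

```python
from copy import deepcopy

# Input settings shared by the direct and vision patterns (copied from the configs above).
BASE_INPUT = {
    "noise_reduction": {"type": "near_field"},
    "transcription": {"language": "en", "model": "whisper-1"},
}

# Hypothetical selector values, one per row of the pattern matrix.
PATTERNS = ("direct", "vision", "output_only")


def build_session_config(pattern: str, instructions: str, tools: list,
                         voice: str = "marin") -> dict:
    """Return a session config dict for one of the three integration patterns."""
    if pattern not in PATTERNS:
        raise ValueError(f"unknown pattern: {pattern!r}")

    config = {
        "type": "realtime",
        "model": "gpt-realtime-2",
        "reasoning": {"effort": "low"},
        "audio": {"output": {"voice": voice}},
        "instructions": instructions,
        "tools": tools,
    }
    if pattern == "output_only":
        # No mic track, so audio.input is omitted entirely.
        return config

    turn_detection = {"type": "semantic_vad"}
    if pattern == "vision":
        # The backend injects image context, then sends response.create itself.
        turn_detection["create_response"] = False
        turn_detection["interrupt_response"] = False

    audio_input = deepcopy(BASE_INPUT)
    audio_input["turn_detection"] = turn_detection
    config["audio"]["input"] = audio_input
    return config
```

Keeping the pattern choice in one place makes it harder for the vision flags to leak into a direct assistant session by accident.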

System Instructions

Realtime instructions define the assistant’s role, speaking style, visual grounding rules, and tool policy. For smart glasses, keep spoken output short and explicitly handle unclear audio or poor framing.

SESSION_INSTRUCTIONS = """
# Role
- You are a voice assistant running on smart glasses.
- Help the user complete the current real-world task using speech, tool results, and the latest visual context.

# Speaking Style
- Be concise, concrete, and actionable.
- Use no more than two short sentences per response unless the user asks for detail.
- Do not use sound effects, filler, or stage directions.

# Visual Grounding
- Treat the camera view as the user's current field of view.
- If the image is unclear, blocked, or missing the relevant object, ask the user to adjust their view.
- Do not claim that you can see an object unless the current visual context supports it.

# Tools and Backend State
- Call backend tools for private data, workflow decisions, or external actions.
- Do not invent step progression when the backend owns the workflow state.
- If the user's message starts with `Speak exactly this line:`, speak that line exactly and do not add commentary.
""".strip()

Backend SDP Broker

The backend accepts the SDP offer from Android, forwards it to OpenAI Realtime as a multipart form, and returns the SDP answer. It also opens the sideband WebSocket for control messages.

Endpoint

@app.post("/session/{session_id}/realtime")
async def create_realtime_session(session_id: str, request: Request) -> Response:
    offer_sdp = (await request.body()).decode()
    if not offer_sdp.strip():
        raise HTTPException(status_code=422, detail="offer SDP must not be empty")

    answer_sdp = await session_manager.create_realtime_session(session_id, offer_sdp)
    return Response(content=answer_sdp, media_type="application/sdp")

Realtime Call Creation

session_config = {
    "type": "realtime",
    "model": "gpt-realtime-2",
    "reasoning": {"effort": "low"},
    "audio": {
        "input": {
            "noise_reduction": {"type": "near_field"},
            "transcription": {"language": "en", "model": "whisper-1"},
            "turn_detection": {
                "type": "semantic_vad",
                "create_response": False,
                "interrupt_response": False,
            },
        },
        "output": {"voice": "cedar"},
    },
    "instructions": SESSION_INSTRUCTIONS,
}

form = {
    "sdp": (None, offer_sdp),
    "session": (None, json.dumps(session_config)),
}

upstream = await openai_http.post(
    "https://api.openai.com/v1/realtime/calls",
    headers={"Authorization": f"Bearer {openai_api_key}"},
    files=form,
)
upstream.raise_for_status()

answer_sdp = normalize_sdp(upstream.text)
# Use .get() so a missing Location header yields an empty call_id instead of a KeyError.
call_id = upstream.headers.get("location", "").rstrip("/").split("/")[-1]

Validate both outputs before returning to Android:

if not call_id or not answer_sdp.startswith("v="):
    raise HTTPException(
        status_code=502,
        detail="OpenAI Realtime response missing call_id or valid answer SDP",
    )
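
This page calls `normalize_sdp` without defining it. One plausible reading, assumed here, is that it enforces CRLF line endings and drops blank lines so the Android WebRTC stack parses the answer cleanly. The sketch below implements that assumption, plus a hypothetical `extract_call_id` helper that mirrors the Location header parsing above and tolerates a missing header.

```python
from typing import Optional


def normalize_sdp(sdp: str) -> str:
    """Assumed behavior: CRLF line endings, no blank lines, trailing CRLF."""
    lines = [line.strip() for line in sdp.replace("\r\n", "\n").split("\n")]
    return "".join(f"{line}\r\n" for line in lines if line)


def extract_call_id(location: Optional[str]) -> str:
    """Return the last path segment of the Location header, or "" if absent."""
    if not location:
        return ""
    return location.rstrip("/").split("/")[-1]
```

For example, a Location value shaped like `/v1/realtime/calls/<id>` (an illustrative shape, not a documented guarantee) yields `<id>`, and a missing header yields an empty string that the validation step rejects with a 502.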

Sideband Connection

sideband_url = f"wss://api.openai.com/v1/realtime?call_id={call_id}"
async with websockets.connect(
    sideband_url,
    additional_headers={"Authorization": f"Bearer {openai_api_key}"},
) as openai_sideband:
    async for raw in openai_sideband:
        event = json.loads(raw)
        ...

The sideband is the backend’s control channel. It can monitor session events, send session.update, call tools, insert conversation items, cancel active speech, and create responses.
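
Later snippets on this page call `send_openai_event(session, event)` without showing it. A minimal sketch of what such a helper might look like follows, assuming the session object holds the open sideband connection and that the connection exposes an async `send` method, as `websockets` connections do. The reduced `SessionState` and the `FakeSideband` test double are hypothetical.

```python
import asyncio
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SessionState:
    # Reduced stand-in for the real session object; only the sideband is modeled.
    openai_sideband: Optional[Any] = None


async def send_openai_event(session: SessionState, event: dict) -> bool:
    """Serialize one event to JSON and send it; return False if no sideband is attached."""
    if session.openai_sideband is None:
        return False
    await session.openai_sideband.send(json.dumps(event))
    return True


class FakeSideband:
    """Test double that records what would have been sent."""

    def __init__(self) -> None:
        self.sent = []

    async def send(self, raw: str) -> None:
        self.sent.append(raw)


if __name__ == "__main__":
    fake = FakeSideband()
    asyncio.run(send_openai_event(SessionState(fake), {"type": "response.create"}))
    print(fake.sent)
```

Returning a boolean rather than raising keeps callers simple when the sideband has not connected yet; whether to drop or queue events in that case is a design choice this sketch leaves open.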

Android Client Contract

Create the data channel

Use oai-events as the data channel name. Create it before generating the offer.

val eventsChannel = peerConnection.createDataChannel(
    "oai-events",
    DataChannel.Init()
)

Add tracks and transceivers

Create data channels, local tracks, and receive-only transceivers before calling createOffer. For direct assistant mode, add a local microphone audio track and set OfferToReceiveAudio to "true". For output-only mode, add a receive-only audio transceiver instead of a mic track.

Complete the SDP exchange

Wait for ICE gathering, POST the full local SDP to your backend /session/{id}/realtime endpoint, normalize the answer SDP, then set the remote description.

Handle events and transcripts

For direct assistant mode, parse these events from oai-events:

conversation.item.input_audio_transcription.completed — user text
response.output_audio_transcript.done — final assistant text
response.output_audio_transcript.delta — live captions only

For output-only backend speech, render transcript deltas only for the current speech item and clear stale text when the backend speech_epoch changes.

Deduplicate server events by event_id where possible:

private fun shouldIgnoreEvent(json: JSONObject): Boolean {
    val eventId = json.optString("event_id", "")
    if (eventId.isBlank()) return false
    synchronized(seenEventIds) {
        if (seenEventIds.contains(eventId)) return true
        seenEventIds.add(eventId)
    }
    return false
}

Backend-Controlled Speech

Use this pattern when a backend service owns workflow state and decides the exact spoken line:

async def speak_line(session: SessionState, text: str) -> None:
    line = text.strip()
    if not line or session.openai_sideband is None:
        return

    if session.openai_response_active:
        await send_openai_event(session, {"type": "response.cancel"})
        session.openai_response_active = False

    session.speech_epoch += 1
    await publish_client_state(session)
    await send_openai_event(
        session,
        {
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": f"Speak exactly this line: {line}",
                    }
                ],
            },
        },
    )
    await send_openai_event(session, {"type": "response.create"})
    session.openai_response_active = True

Increment speech_epoch before replacing speech. Android should treat that value as the transcript freshness key and discard stale caption text.

Track active responses from sideband events:

Set openai_response_active = True when sending response.create or receiving response.created.
Set it back to False on response.done.
If response.cancel returns an error with code response_cancel_not_active, treat it as benign and clear the flag.
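
The three rules above can be sketched as a pure function over sideband events. This is an illustration only: `update_response_state` is a hypothetical name, and the sketch assumes error events carry their code under `error.code`.

```python
def update_response_state(active: bool, event: dict) -> bool:
    """Return the new value of openai_response_active after one sideband event."""
    event_type = event.get("type")
    if event_type == "response.created":
        return True
    if event_type == "response.done":
        return False
    if event_type == "error":
        code = (event.get("error") or {}).get("code")
        if code == "response_cancel_not_active":
            # Benign: there was nothing to cancel, so no response is active.
            return False
    # Any other event leaves the flag unchanged.
    return active
```

Keeping the transition logic free of I/O makes it straightforward to unit test against recorded event streams.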

Tool Loop

Keep tools on the backend. The sideband receives completed function calls from response.done. Send a function output item, then continue only when the model should keep reasoning or speaking from that output.

async def send_tool_output(
    session: SessionState,
    *,
    call_id: str,
    result: object,
    continue_response: bool,
) -> None:
    await send_openai_event(
        session,
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": json.dumps(result),
            },
        },
    )
    if continue_response:
        await send_openai_event(session, {"type": "response.create"})

Intermediate tools (lookup, list) should generally continue. Terminal action tools should not — the backend has already updated workflow state and will speak the next line itself.
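
To make the receiving side of the loop concrete, here is a minimal sketch of pulling completed function calls out of a `response.done` event. It assumes the calls appear as `function_call` items in `response.output` with `call_id`, `name`, and a JSON-encoded `arguments` string. The `TERMINAL_TOOLS` set and the tool name in it are hypothetical and only illustrate the continue-or-stop policy.

```python
import json

# Hypothetical tool name; terminal tools do not trigger a follow-up response.create.
TERMINAL_TOOLS = {"complete_step"}


def extract_function_calls(event: dict) -> list:
    """Return [{"call_id", "name", "arguments"}] for each function call in response.done."""
    if event.get("type") != "response.done":
        return []
    calls = []
    for item in (event.get("response") or {}).get("output") or []:
        if item.get("type") != "function_call":
            continue
        try:
            arguments = json.loads(item.get("arguments") or "{}")
        except json.JSONDecodeError:
            # Malformed arguments: hand the tool an empty dict rather than crash the loop.
            arguments = {}
        calls.append({
            "call_id": item.get("call_id"),
            "name": item.get("name"),
            "arguments": arguments,
        })
    return calls


def should_continue(tool_name: str) -> bool:
    """Intermediate tools continue the response; terminal action tools do not."""
    return tool_name not in TERMINAL_TOOLS
```

The result of `should_continue(call["name"])` would then be passed as `continue_response` to `send_tool_output`.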

Image Injection

For vision-augmented sessions, insert the latest frame after Realtime commits the user’s audio item, then create the response exactly once.

Event Sequence

Track committed turns

On input_audio_buffer.committed, store the item_id as a pending user turn.

Wait for the item to appear

On conversation.item.added, check that the item id is pending and that the item is a user message containing input_audio.

Inject the image

Insert the latest input_image with previous_item_id set to the audio item id, then send exactly one response.create.

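pending_turns: set[str] = set()
sent_images: set[str] = set()

# Hypothetical wrapper: call this for every event read from the sideband loop.
async def handle_sideband_event(openai_sideband, event: dict) -> None:
    if event["type"] == "input_audio_buffer.committed":
        # Remember the audio item so the matching conversation.item.added can be recognized.
        pending_turns.add(event["item_id"])

    if event["type"] == "conversation.item.added":
        item = event["item"]
        item_id = item["id"]
        if item_id in pending_turns and item_id not in sent_images:
            pending_turns.discard(item_id)
            if _is_user_audio_item(item):
                # Mark before sending so a repeated event cannot inject twice.
                sent_images.add(item_id)
                await send_latest_frame(openai_sideband, item_id)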

def _is_user_audio_item(item) -> bool:
    if item.get("type") != "message" or item.get("role") != "user":
        return False
    content = item.get("content") or []
    return any(part.get("type") == "input_audio" for part in content)

The frame insertion event:

await send_openai_event(
    session,
    {
        "type": "conversation.item.create",
        "previous_item_id": user_item_id,
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {
                    "type": "input_image",
                    "image_url": latest_frame_data_uri,
                    "detail": "high",
                }
            ],
        },
    },
)
await send_openai_event(session, {"type": "response.create"})

Do not inject on every conversation.item.added. Tool outputs, image items, and assistant items also appear there. Use the pending_turns / sent_images gate to inject exactly once per user audio turn.
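
The `latest_frame_data_uri` value in the insertion event is a data URI. A minimal sketch of building one from raw frame bytes follows, assuming the vision service hands the backend JPEG-encoded frames; `frame_to_data_uri` is a hypothetical helper name.

```python
import base64


def frame_to_data_uri(jpeg_bytes: bytes) -> str:
    """Encode JPEG bytes as a base64 data URI suitable for an input_image part."""
    if not jpeg_bytes:
        raise ValueError("frame is empty")
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{encoded}"
```

Rejecting empty frames early avoids sending an image item the model cannot use, which would otherwise waste the single `response.create` for that turn.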
