OpenAI Realtime is a low-latency, stateful API for speech-to-speech and multimodal sessions. This page explains how to wire it up with Rokid Glasses — covering the two-plane architecture, three integration patterns, the backend SDP broker, Android client setup, image injection for vision-augmented responses, and the tool loop.

How It Works

Think in two planes:
  • Media plane — Android’s WebRTC peer connection carries microphone audio, optional camera media, and remote assistant audio back to the glasses speaker.
  • Control plane — JSON events move over the oai-events WebRTC data channel and a backend WebSocket sideband channel attached to the same Realtime call.
Android sends an SDP offer to your backend. The backend creates the Realtime call and returns the SDP answer to Android. For personal or hackathon work you can POST the offer directly to https://api.openai.com/v1/realtime/calls with a local API key, but do not ship that pattern, because it embeds the key in the APK.

The important objects are:
  • Session — model, voice, instructions, audio config, tools, turn detection, and output modalities.
  • Conversation item — a user message, assistant message, tool call, tool output, image input, or audio item.
  • Response — one model turn. In the default path Realtime creates a response automatically after VAD detects the end of speech. Use response.create when the backend adds an item that needs a model answer.
  • Sideband — a server-side control channel attached to the call, used for tools, workflow state, guardrails, and image injection.
Use gpt-realtime-2 for new integrations. Start with reasoning.effort: "low" for responsive speech-to-speech behavior, then raise it only for workflows that need deeper multi-step planning.

Pattern Matrix

Three patterns cover most Rokid Glasses use cases. Choose based on whether Android owns the mic, whether vision augmentation is required, and whether the backend drives each spoken turn.
| Pattern | Android media into Realtime | Session input config | Response trigger | Backend sideband role |
| --- | --- | --- | --- | --- |
| Direct assistant | Mic audio, optional camera video | semantic_vad; transcription enabled when client renders user text | Automatic VAD response | Handle tools and optional observability |
| Backend-augmented vision | Mic audio only; camera goes to backend vision service | Same as direct assistant, but create_response = False | Backend waits for committed user audio item, injects image/context, then sends response.create | Store latest vision result, inject context, handle tools |
| Output-only backend speech | No mic track; receive-only audio transceiver | Usually omit audio.input; configure only output voice | Backend creates text items and sends response.create | Own workflow state, cancel/replace speech, handle tools |
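The input-config column can be condensed into a small helper. This is only a sketch: the function name `build_audio_config` and the pattern labels `"direct"`, `"vision"`, and `"output_only"` are hypothetical, and the field values are taken from the matrix and the session configs on this page.

```python
from typing import Any


def build_audio_config(pattern: str, voice: str = "marin") -> dict[str, Any]:
    """Return the session "audio" block for one of the three patterns."""
    if pattern not in {"direct", "vision", "output_only"}:
        raise ValueError(f"unknown pattern: {pattern}")

    audio: dict[str, Any] = {"output": {"voice": voice}}
    if pattern == "output_only":
        # No mic track, so audio.input is omitted and only the voice is set.
        return audio

    turn_detection: dict[str, Any] = {"type": "semantic_vad"}
    if pattern == "vision":
        # The backend sends response.create itself after injecting context.
        turn_detection["create_response"] = False
    audio["input"] = {"turn_detection": turn_detection}
    return audio
```

Keeping the three variants in one function makes the difference between them visible: only the vision pattern disables automatic responses, and only the output-only pattern drops the input block.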

Session Configuration

Use this configuration for the direct assistant pattern, where the model can own the conversation. Android streams microphone audio and optionally camera media. Keep automatic turn creation enabled.
session_config = {
    "type": "realtime",
    "model": "gpt-realtime-2",
    "reasoning": {"effort": "low"},
    "audio": {
        "input": {
            "noise_reduction": {"type": "near_field"},
            "transcription": {"language": "en", "model": "whisper-1"},
            "turn_detection": {"type": "semantic_vad"},
        },
        "output": {"voice": "marin"},
    },
    "instructions": SESSION_INSTRUCTIONS,
    "tools": [...],
}

System Instructions

Realtime instructions define the assistant’s role, speaking style, visual grounding rules, and tool policy. For smart glasses, keep spoken output short and explicitly handle unclear audio or poor framing.
SESSION_INSTRUCTIONS = """
# Role
- You are a voice assistant running on smart glasses.
- Help the user complete the current real-world task using speech, tool results, and the latest visual context.

# Speaking Style
- Be concise, concrete, and actionable.
- Use no more than two short sentences per response unless the user asks for detail.
- Do not use sound effects, filler, or stage directions.

# Visual Grounding
- Treat the camera view as the user's current field of view.
- If the image is unclear, blocked, or missing the relevant object, ask the user to adjust their view.
- Do not claim that you can see an object unless the current visual context supports it.

# Tools and Backend State
- Call backend tools for private data, workflow decisions, or external actions.
- Do not invent step progression when the backend owns the workflow state.
- If the user's message starts with `Speak exactly this line:`, speak that line exactly and do not add commentary.
""".strip()

Backend SDP Broker

The backend accepts the SDP offer from Android, forwards it to OpenAI Realtime as a multipart form, and returns the SDP answer. It also opens the sideband WebSocket for control messages.

Endpoint

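# Assumes FastAPI; session_manager is the application's own session registry, defined elsewhere.
from fastapi import FastAPI, HTTPException, Request, Response

app = FastAPI()

@app.post("/session/{session_id}/realtime")
async def create_realtime_session(session_id: str, request: Request) -> Response:
    # The request body is the raw SDP offer text sent by Android.
    offer_sdp = (await request.body()).decode()
    if not offer_sdp.strip():
        raise HTTPException(status_code=422, detail="offer SDP must not be empty")

    answer_sdp = await session_manager.create_realtime_session(session_id, offer_sdp)
    return Response(content=answer_sdp, media_type="application/sdp")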

Realtime Call Creation

session_config = {
    "type": "realtime",
    "model": "gpt-realtime-2",
    "reasoning": {"effort": "low"},
    "audio": {
        "input": {
            "noise_reduction": {"type": "near_field"},
            "transcription": {"language": "en", "model": "whisper-1"},
            "turn_detection": {
                "type": "semantic_vad",
                # Backend-augmented vision variant: automatic responses are off
                # because the backend sends response.create after injecting context.
                "create_response": False,
                "interrupt_response": False,
            },
        },
        "output": {"voice": "cedar"},
    },
    "instructions": SESSION_INSTRUCTIONS,
}

form = {
    "sdp": (None, offer_sdp),
    "session": (None, json.dumps(session_config)),
}

upstream = await openai_http.post(
    "https://api.openai.com/v1/realtime/calls",
    headers={"Authorization": f"Bearer {openai_api_key}"},
    files=form,
)
upstream.raise_for_status()

answer_sdp = normalize_sdp(upstream.text)
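# Use .get so a missing Location header reaches the validation below instead of raising KeyError.
call_id = upstream.headers.get("location", "").rstrip("/").split("/")[-1]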
Validate both outputs before returning to Android:
if not call_id or not answer_sdp.startswith("v="):
    raise HTTPException(
        status_code=502,
        detail="OpenAI Realtime response missing call_id or valid answer SDP",
    )
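`normalize_sdp` is not defined on this page. One plausible version is sketched below, assuming its only job is to drop blank lines and enforce the CRLF line endings that the SDP specification calls for. `extract_call_id` is a hypothetical helper name that wraps the `Location` header parsing shown above so a missing header yields an empty string.

```python
from typing import Optional


def normalize_sdp(sdp: str) -> str:
    """Drop blank lines and join the rest with CRLF, ending with a final CRLF."""
    lines = [line for line in sdp.splitlines() if line.strip()]
    if not lines:
        return ""
    return "\r\n".join(lines) + "\r\n"


def extract_call_id(location: Optional[str]) -> str:
    """Return the last path segment of a Location header, or "" if absent."""
    if not location:
        return ""
    return location.rstrip("/").split("/")[-1]
```

Returning empty strings instead of raising lets the single validation check above cover both failure cases.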

Sideband Connection

sideband_url = f"wss://api.openai.com/v1/realtime?call_id={call_id}"
async with websockets.connect(
    sideband_url,
    additional_headers={"Authorization": f"Bearer {openai_api_key}"},
) as openai_sideband:
    async for raw in openai_sideband:
        event = json.loads(raw)
        ...
The sideband is the backend’s control channel. It can monitor session events, send session.update, call tools, insert conversation items, cancel active speech, and create responses.
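The later examples call `send_openai_event(session, event)` without defining it. A minimal sketch follows, assuming `SessionState` is a plain dataclass holding the sideband connection and that the connection exposes an async `send(str)` method, as `websockets` connections do. The field list is inferred from how `session` is used on this page and is not the project's actual class.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SessionState:
    # Hypothetical shape: only the fields referenced on this page.
    openai_sideband: Optional[Any] = None  # object with an async send(str) method
    openai_response_active: bool = False
    speech_epoch: int = 0


async def send_openai_event(session: SessionState, event: dict) -> bool:
    """Serialize one client event as JSON and send it over the sideband.

    Returns False without sending when no sideband is connected.
    """
    if session.openai_sideband is None:
        return False
    await session.openai_sideband.send(json.dumps(event))
    return True
```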

Android Client Contract

1. Create the data channel

Use oai-events as the data channel name. Create it before generating the offer.
val eventsChannel = peerConnection.createDataChannel(
    "oai-events",
    DataChannel.Init()
)
2. Add tracks and transceivers

Create data channels, local tracks, and receive-only transceivers before calling createOffer. For direct assistant mode, add a local microphone audio track and set OfferToReceiveAudio to "true". For output-only mode, add a receive-only audio transceiver instead of a mic track.
3. Complete the SDP exchange

Wait for ICE gathering, POST the full local SDP to your backend /session/{id}/realtime endpoint, normalize the answer SDP, then set the remote description.
4. Handle events and transcripts

For direct assistant mode, parse these events from oai-events:
  • conversation.item.input_audio_transcription.completed — user text
  • response.output_audio_transcript.done — final assistant text
  • response.output_audio_transcript.delta — live captions only
For output-only backend speech, render transcript deltas only for the current speech item and clear stale text when the backend speech_epoch changes.
Deduplicate server events by event_id where possible:
private fun shouldIgnoreEvent(json: JSONObject): Boolean {
    val eventId = json.optString("event_id", "")
    if (eventId.isBlank()) return false
    synchronized(seenEventIds) {
        if (seenEventIds.contains(eventId)) return true
        seenEventIds.add(eventId)
    }
    return false
}

Backend-Controlled Speech

Use this pattern when a backend service owns workflow state and decides the exact spoken line:
async def speak_line(session: SessionState, text: str) -> None:
    line = text.strip()
    if not line or session.openai_sideband is None:
        return

    if session.openai_response_active:
        await send_openai_event(session, {"type": "response.cancel"})
        session.openai_response_active = False

    session.speech_epoch += 1
    await publish_client_state(session)
    await send_openai_event(
        session,
        {
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": f"Speak exactly this line: {line}",
                    }
                ],
            },
        },
    )
    await send_openai_event(session, {"type": "response.create"})
    session.openai_response_active = True
Increment speech_epoch before replacing speech. Android should treat that value as the transcript freshness key and discard stale caption text.
Track active responses from sideband events:
  • Set openai_response_active = True when sending response.create or receiving response.created.
  • Set it back to False on response.done.
  • If response.cancel returns an error with code response_cancel_not_active, treat it as benign and clear the flag.
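The three rules above can be applied in one place as events arrive on the sideband. This is a sketch under assumptions: `track_response_state` is a hypothetical name, `session` is any object with an `openai_response_active` attribute, and the error event is assumed to carry its code at `event["error"]["code"]`.

```python
from typing import Any


def track_response_state(session: Any, event: dict) -> None:
    """Update session.openai_response_active from one sideband event."""
    event_type = event.get("type")
    if event_type == "response.created":
        session.openai_response_active = True
    elif event_type == "response.done":
        session.openai_response_active = False
    elif event_type == "error":
        code = (event.get("error") or {}).get("code")
        if code == "response_cancel_not_active":
            # Benign: the response had already finished when cancel arrived.
            session.openai_response_active = False
```

Other error codes leave the flag untouched, so an unrelated failure does not hide a response that is still speaking.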

Tool Loop

Keep tools on the backend. The sideband receives completed function calls from response.done. Send a function output item, then continue only when the model should keep reasoning or speaking from that output.
async def send_tool_output(
    session: SessionState,
    *,
    call_id: str,
    result: object,
    continue_response: bool,
) -> None:
    await send_openai_event(
        session,
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": json.dumps(result),
            },
        },
    )
    if continue_response:
        await send_openai_event(session, {"type": "response.create"})
Intermediate tools (lookup, list) should generally continue. Terminal action tools should not — the backend has already updated workflow state and will speak the next line itself.
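Before `send_tool_output` can run, the backend has to pull the function calls out of `response.done` and decide whether to continue. The sketch below assumes the event lists output items under `response.output`, with each function call carrying `call_id`, `name`, and a JSON-encoded `arguments` string. The tool name in `TERMINAL_TOOLS` is made up for illustration.

```python
import json
from typing import Iterator

# Hypothetical tool name: after a terminal tool the backend speaks the next line itself.
TERMINAL_TOOLS = {"complete_step"}


def iter_function_calls(event: dict) -> Iterator[tuple[str, str, dict]]:
    """Yield (call_id, name, arguments) for each function call in response.done."""
    if event.get("type") != "response.done":
        return
    for item in (event.get("response") or {}).get("output") or []:
        if item.get("type") != "function_call":
            continue
        try:
            arguments = json.loads(item.get("arguments") or "{}")
        except json.JSONDecodeError:
            arguments = {}  # malformed arguments: let the tool reject the call
        yield item["call_id"], item["name"], arguments


def should_continue(tool_name: str) -> bool:
    """Intermediate tools continue the response; terminal tools do not."""
    return tool_name not in TERMINAL_TOOLS
```

The result of `should_continue(name)` can be passed as `continue_response` when sending the tool output.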

Image Injection

For vision-augmented sessions, insert the latest frame after Realtime commits the user’s audio item, then create the response exactly once.

Event Sequence

1. Track committed turns

On input_audio_buffer.committed, store the item_id as a pending user turn.
2. Wait for the item to appear

On conversation.item.added, check that the item id is pending and that the item is a user message containing input_audio.
3. Inject the image

Insert the latest input_image with previous_item_id set to the audio item id, then send exactly one response.create.
pending_turns: set[str] = set()
sent_images: set[str] = set()

def _is_user_audio_item(item: dict) -> bool:
    """True for a user message that contains at least one input_audio part."""
    if item.get("type") != "message" or item.get("role") != "user":
        return False
    content = item.get("content") or []
    return any(part.get("type") == "input_audio" for part in content)

# Hypothetical wrapper name: call this for every event read in the sideband loop.
async def handle_turn_event(openai_sideband, event: dict) -> None:
    if event["type"] == "input_audio_buffer.committed":
        # Remember the committed audio item until it appears in the conversation.
        pending_turns.add(event["item_id"])
    elif event["type"] == "conversation.item.added":
        item = event["item"]
        item_id = item["id"]
        if item_id in pending_turns and item_id not in sent_images:
            pending_turns.discard(item_id)
            if _is_user_audio_item(item):
                sent_images.add(item_id)
                await send_latest_frame(openai_sideband, item_id)
The frame insertion event:
await send_openai_event(
    session,
    {
        "type": "conversation.item.create",
        "previous_item_id": user_item_id,
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {
                    "type": "input_image",
                    "image_url": latest_frame_data_uri,
                    "detail": "high",
                }
            ],
        },
    },
)
await send_openai_event(session, {"type": "response.create"})
Do not inject on every conversation.item.added. Tool outputs, image items, and assistant items also appear there. Use the pending_turns / sent_images gate to inject exactly once per user audio turn.
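The `latest_frame_data_uri` value in the insertion event is a base64 data URI. A minimal sketch of building one, assuming the camera frames arrive as JPEG bytes; `jpeg_to_data_uri` is a hypothetical helper name.

```python
import base64


def jpeg_to_data_uri(jpeg_bytes: bytes) -> str:
    """Encode raw JPEG bytes as a data URI suitable for an input_image part."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{encoded}"
```

Encoding the frame once when it arrives from the vision service, not at injection time, keeps the path between the committed audio item and `response.create` short.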
