The rokid-overshoot-openai-realtime example turns Rokid Glasses into a proactive drink-making assistant. Instead of waiting for the wearer to ask questions, the backend scans what is on the table, picks the best-matching recipe from a data-driven catalog, and guides each step with short spoken instructions while watching what the wearer does. If the wearer picks up the wrong bottle, the assistant corrects them in real time. The interaction is meant to feel more like a helpful person standing next to you than a voice assistant waiting for prompts.

It is also the most architecturally complete GlassKit example. It combines seven concurrent connections across three services, a server-authoritative workflow state machine, Overshoot VLM inference with per-step prompt switching, OpenAI Realtime sideband speech control, and a data-driven recipe format.

What the App Does

1. Scan ingredients

   At session start the backend activates the inventory detector prompt on Overshoot. It waits for two consecutive identical normalized ingredient arrays before proceeding, filtering out transient detections.

2. Select a recipe

   Once the inventory stabilizes, the backend asks OpenAI Realtime to choose a recipe by calling list_recipes then activate_recipe using the detected ingredient names and the recipe filename keywords.

3. Guide each step

   The backend loads the chosen recipe JSON, switches to the first guided step, and patches the active Overshoot prompt to the step’s detector. It evaluates structured VLM outputs to decide whether to advance, correct, or speak progress updates.

4. Correct mistakes

   If the wearer picks up the wrong ingredient, the backend detects the mismatch and sends an exact correction line to OpenAI Realtime via the sideband WebSocket. OpenAI speaks it over WebRTC.

5. Deliver spoken guidance

   OpenAI Realtime speaks to the glasses over WebRTC audio. The glasses render the latest transcript on the HUD. When the backend starts a new speech turn (speech_epoch changes), stale transcript text is cleared.
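
The inventory check in step 1 can be sketched as follows. This is a minimal illustration, not the backend's actual code: normalize_ingredients and InventoryStabilizer are hypothetical names, and the normalization rule (trim, lowercase, de-duplicate, sort) is an assumption, since the text only says the arrays are normalized.

```python
from typing import Iterable, Optional, Tuple


def normalize_ingredients(raw: Iterable[str]) -> Tuple[str, ...]:
    """Assumed normalization: trim, lowercase, de-duplicate, and sort."""
    return tuple(sorted({name.strip().lower() for name in raw if name.strip()}))


class InventoryStabilizer:
    """Reports an inventory once the same normalized array is seen twice in a row."""

    def __init__(self, required_repeats: int = 2) -> None:
        self.required_repeats = required_repeats
        self._last: Optional[Tuple[str, ...]] = None
        self._streak = 0

    def observe(self, raw: Iterable[str]) -> Optional[Tuple[str, ...]]:
        current = normalize_ingredients(raw)
        if current == self._last:
            self._streak += 1
        else:
            # A different reading restarts the streak, which filters out flicker.
            self._last = current
            self._streak = 1
        # Assumption: an empty detection never counts as a stable inventory.
        if current and self._streak >= self.required_repeats:
            return current
        return None


stabilizer = InventoryStabilizer()
first = stabilizer.observe(["Lime", "orange juice"])    # not stable yet
second = stabilizer.observe(["orange juice", "lime "])  # same set, now stable
```

Here first is None and second is the stable inventory ('lime', 'orange juice'). A single differing reading in between would reset the streak, which is how a transient detection gets dropped.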

Architecture

Connection Graph

Rokid Glasses (Android)
  ├── Control WebSocket ←→ Backend (FastAPI)
  │     session lifecycle, HUD state, debug gestures
  ├── HTTP → Backend (FastAPI)
  │     SDP offer setup for vision and audio links
  ├── Vision WebRTC (video) → Overshoot (direct, brokered by backend)
  └── Audio WebRTC (audio + data) ←→ OpenAI Realtime (direct, brokered by backend)

Backend (FastAPI)
  ├── HTTP → Overshoot  (stream create, prompt patch)
  ├── WebSocket ←→ Overshoot  (inference events, keepalive)
  └── WebSocket ←→ OpenAI Realtime sideband
        (recipe selection tool calls, exact speech lines)
The glasses own only the media connections and the HUD renderer. The backend is authoritative for all workflow decisions.

End-to-End Session Flow

| Step | What happens |
| --- | --- |
| 1. App launch | Rokid opens the backend control WebSocket and receives a server-created session_id. |
| 2. User tap | Rokid sends session.start on the control socket. |
| 3. Media setup | Rokid sends vision SDP offer to /session/{id}/vision and audio SDP offer to /session/{id}/realtime. |
| 4. Stream ownership | Backend creates the Overshoot stream and OpenAI Realtime call; starts the Overshoot WebSocket + keepalive and the OpenAI sideband WebSocket. |
| 5. Inventory scan | Backend activates the inventory detector prompt on Overshoot and waits for two consecutive identical normalized ingredient arrays. |
| 6. Recipe selection | Backend asks OpenAI Realtime to call list_recipes then activate_recipe; loads the chosen recipe JSON and switches to the first guided step. |
| 7. Guided workflow | Backend patches the active Overshoot prompt for each step, evaluates structured results, and decides whether to advance, correct, or speak progress. Sends hud.state updates to Rokid and exact speech instructions to the OpenAI sideband. |
| 8. Speech delivery | OpenAI Realtime speaks to Rokid over WebRTC; Rokid renders only the latest transcript, keyed by speech_epoch. |
| 9. App background / close | Rokid closes the control WebSocket on onStop; backend destroys session state and tears down both media runtimes. The next foreground reconnect gets a fresh session_id. |
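
Steps 1 and 9 imply that session identity is tied to the lifetime of the control WebSocket. A rough sketch of that idea is shown here. Session and SessionRegistry are hypothetical names, and the use of a random UUID as the session_id is an assumption; the real ControlSession dataclass holds considerably more state.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Session:
    # Hypothetical minimal record; the ID is minted by the server, never the client.
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started: bool = False


class SessionRegistry:
    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}

    def open(self) -> Session:
        """Control WebSocket connected: create a fresh server-side session."""
        session = Session()
        self._sessions[session.session_id] = session
        return session

    def close(self, session_id: str) -> None:
        """Control WebSocket closed: drop all state for that session."""
        self._sessions.pop(session_id, None)

    def get(self, session_id: str) -> Optional[Session]:
        return self._sessions.get(session_id)
```

Because close discards the record, a reconnect after the app returns to the foreground goes through open again and receives a new session_id rather than resuming the old one.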

Implementation Contracts

The Android client is intentionally thin:
  • Owns HUD rendering, gesture input, runtime permission handling, and the two WebRTC links.
  • Must not choose recipes, interpret vision results, advance workflow steps, or decide what speech to play.
  • Must render only the latest transcript and clear stale text when speech_epoch changes.
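
The transcript rule in the last bullet is small enough to sketch. The real client is an Android app, so this Python version only illustrates the logic. TranscriptView and its method names are hypothetical, and it assumes that speech_epoch arrives with hud.state updates while transcript fragments arrive separately.

```python
from typing import Optional


class TranscriptView:
    """Holds only the transcript of the most recent speech turn."""

    def __init__(self) -> None:
        self.speech_epoch: Optional[int] = None
        self.text = ""

    def on_hud_state(self, speech_epoch: int) -> None:
        # A changed epoch means the backend started a new speech turn,
        # so anything still on the HUD is stale and gets cleared.
        if speech_epoch != self.speech_epoch:
            self.speech_epoch = speech_epoch
            self.text = ""

    def on_transcript_delta(self, delta: str) -> None:
        # Fragments are appended to whatever the current turn has so far.
        self.text += delta
```

Repeated hud.state updates with an unchanged epoch leave the text alone, so ordinary HUD refreshes do not wipe a transcript that is still being spoken.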

Requirements

  • Rokid Glasses + dev cable
  • Android Studio with adb
  • Python 3.12 with uv
  • Overshoot API key (OVERSHOOT_API_KEY)
  • OpenAI API key (OPENAI_API_KEY)

Configuration

1. Configure the glasses app

   Set the backend base URL in rokid/local.properties:

   BACKEND_BASE_URL=http://<YOUR_BACKEND>

2. Configure the backend

   cd backend
   cp .env.example .env
   # Set OVERSHOOT_API_KEY and OPENAI_API_KEY in .env

Optional Backend Overrides

| Variable | Default | Description |
| --- | --- | --- |
| OVERSHOOT_API_URL | https://api.overshoot.ai/v0.2 | Overshoot API base URL. |
| OVERSHOOT_MODEL | Qwen/Qwen3.5-27B | Overshoot model identifier. |
| OPENAI_REALTIME_MODEL | gpt-realtime-1.5 | OpenAI Realtime model identifier. |
Default values for these and the processing config are defined in backend/session_constants.py.
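
One plausible way to resolve these overrides is to read each variable from the environment and fall back to its default. The sketch below is an assumption about the pattern, not a copy of backend/session_constants.py. DEFAULTS and load_overrides are hypothetical names, and treating an empty value as unset is a choice made for this example.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults copied from the table above.
DEFAULTS: Dict[str, str] = {
    "OVERSHOOT_API_URL": "https://api.overshoot.ai/v0.2",
    "OVERSHOOT_MODEL": "Qwen/Qwen3.5-27B",
    "OPENAI_REALTIME_MODEL": "gpt-realtime-1.5",
}


def load_overrides(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the effective value of each optional setting."""
    source = os.environ if env is None else env
    # `or default` means an empty string falls back to the default too.
    return {name: source.get(name) or default for name, default in DEFAULTS.items()}


settings = load_overrides({"OVERSHOOT_MODEL": "my-test-model"})
```

With that input, settings["OVERSHOOT_MODEL"] is "my-test-model" and the other two keys keep their defaults.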

Run the Backend

cd backend
uv run --env-file .env fastapi dev main.py --host 0.0.0.0

Run the Glasses App

1. Connect Rokid Glasses and enable Wi-Fi

   adb devices
   adb shell cmd wifi status
   adb shell cmd wifi set-wifi-enabled enabled
   adb shell 'cmd wifi connect-network "NAME" wpa2 "PASSWORD"'
   adb shell cmd wifi status

2. Optional: wireless ADB

   adb shell ip -f inet addr show wlan0
   ping -c 5 -W 3 <IP>
   adb tcpip 5555
   adb connect <IP>
   adb devices

3. Build and run

   Open the rokid/ directory in Android Studio, select Rokid Glasses as the target device, and run the app. To build the debug APK from the command line:

   cd rokid && ./gradlew :app:assembleDebug

Gesture Controls

| Gesture | KeyEvent | Action |
| --- | --- | --- |
| Temple tap | KEYCODE_ENTER | Start or stop the session. |
| Swipe forward | KEYCODE_DPAD_UP | Advance one internal debug step. |
| Swipe backward | KEYCODE_DPAD_DOWN | Move back one internal debug step. |

Recipe Files

Recipes live in backend/recipes/. Each recipe is a JSON file, and OpenAI Realtime uses the keywords in its filename to match it against the detected ingredients. The current example recipe is orange-juice-blue-gatorade-lime-mocktail.json.

A recipe file defines a display name, a starting step ID, a flat task list, a set of named detectors (each a VLM prompt targeting one structured output field), and an ordered list of steps. Each step references a detector, specifies an evaluation mode, and provides speech lines for step entry, mismatch, and success. An abridged example (the detector prompt is truncated and only the first step is shown):
{
  "id": "orange-juice-blue-gatorade-lime-mocktail",
  "display_name": "Orange Blue Mocktail",
  "start_step_id": "pick_orange_bottle",
  "tasks": [
    { "id": "orange_juice", "text": "Add orange juice to halfway" },
    { "id": "ice",          "text": "Add a few scoops of ice" },
    { "id": "gatorade",     "text": "Fill with Gatorade" },
    { "id": "lime",         "text": "Top with a lime wheel" }
  ],
  "detectors": {
    "bottle_color": {
      "field": "color",
      "prompt": "Return JSON with a single `color` field. Return the bottle color as orange, red, or blue only when the bottle is clearly being held by a hand in the air..."
    }
  },
  "steps": [
    {
      "id": "pick_orange_bottle",
      "task_id": "orange_juice",
      "evaluation_mode": "match_value",
      "detector_key": "bottle_color",
      "expected_value": "orange",
      "ignored_values": ["unknown"],
      "speak_on_observation_change_only": true,
      "on_enter_speech": "Let's make a mocktail! Grab the orange bottle to get us started.",
      "mismatch_speech": "That's not the orange bottle. Grab the orange bottle.",
      "success_speech": "Nice. Pour the orange juice into the glass.",
      "next_step_id": "pour_orange_juice"
    }
  ]
}
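
Since steps refer to tasks, detectors, and other steps by ID, a loader can check those cross-references before a session starts. The sketch below shows one way to do that. validate_recipe and load_recipe are hypothetical helpers, not functions from backend/recipe_catalog.py, and the rules they enforce (every step has a known task_id and detector_key, and a final step has no next_step_id) are assumptions about the format.

```python
import json
from pathlib import Path
from typing import List


def validate_recipe(recipe: dict) -> List[str]:
    """Return cross-reference problems; an empty list means the IDs line up."""
    problems: List[str] = []
    task_ids = {task["id"] for task in recipe.get("tasks", [])}
    detector_keys = set(recipe.get("detectors", {}))
    step_ids = {step["id"] for step in recipe.get("steps", [])}

    start = recipe.get("start_step_id")
    if start not in step_ids:
        problems.append(f"start_step_id {start!r} is not a step")
    for step in recipe.get("steps", []):
        if step.get("task_id") not in task_ids:
            problems.append(f"{step['id']}: unknown task_id {step.get('task_id')!r}")
        if step.get("detector_key") not in detector_keys:
            problems.append(f"{step['id']}: unknown detector_key {step.get('detector_key')!r}")
        next_id = step.get("next_step_id")
        # Assumption: a missing or null next_step_id marks the final step.
        if next_id is not None and next_id not in step_ids:
            problems.append(f"{step['id']}: unknown next_step_id {next_id!r}")
    return problems


def load_recipe(path: Path) -> dict:
    """Parse a recipe file and fail loudly if its references are inconsistent."""
    recipe = json.loads(path.read_text(encoding="utf-8"))
    problems = validate_recipe(recipe)
    if problems:
        raise ValueError(f"{path.name}: " + "; ".join(problems))
    return recipe
```

Run against the excerpt above, this check would report exactly one problem, the next_step_id pour_orange_juice, because that step is not part of the excerpt.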

Evaluation Modes

Evaluation modes are defined in backend/session_constants.py and control how the backend interprets the VLM’s structured output for each step:
| Mode | When to use |
| --- | --- |
| match_value | Advance when the detector field equals expected_value. |
| numeric_threshold_with_progress_once | Advance once a numeric field crosses a threshold; speak progress updates along the way. |
| count_rising_edges_true | Count discrete true events (e.g., a momentary action repeated N times). |
| enum_progress_once_then_complete | Speak on each new enum value seen, complete on a terminal value. |
| momentary_true_complete | Advance immediately on the first true result. |
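
Two of these modes are easy to illustrate. The sketch below is an interpretation of the descriptions above, not the backend's evaluator. evaluate_match_value and RisingEdgeCounter are hypothetical names, and the result labels "ignore", "advance", and "mismatch" are invented for this example.

```python
from dataclasses import dataclass
from typing import Any


def evaluate_match_value(step: dict, observed: Any) -> str:
    """match_value: compare one detector reading against the step's expectation."""
    if observed is None or observed in step.get("ignored_values", []):
        return "ignore"    # e.g. "unknown": neither advance nor correct
    if observed == step["expected_value"]:
        return "advance"   # would trigger success_speech and next_step_id
    return "mismatch"      # would trigger mismatch_speech


@dataclass
class RisingEdgeCounter:
    """count_rising_edges_true: count false-to-true transitions up to a target."""
    target: int
    count: int = 0
    previous: bool = False

    def observe(self, value: bool) -> bool:
        # Only a change from False to True counts; a held True is one event.
        if value and not self.previous:
            self.count += 1
        self.previous = value
        return self.count >= self.target


step = {"expected_value": "orange", "ignored_values": ["unknown"]}
results = [evaluate_match_value(step, c) for c in ("unknown", "blue", "orange")]

scoops = RisingEdgeCounter(target=2)
done = [scoops.observe(v) for v in (True, True, False, True)]
```

Here results is ["ignore", "mismatch", "advance"] and done is [False, False, False, True]. Counting edges rather than raw true readings keeps a single long action from being counted once per inference frame.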

Key Backend Files

| File | Description |
| --- | --- |
| backend/main.py | FastAPI lifecycle and the control / vision / realtime routes. |
| backend/session_manager.py | Composition layer wiring SessionWorkflowMixin and SessionRuntimeMixin with shared HTTP clients and session registry. |
| backend/session_workflow.py | Workflow state machine: recipe activation, step evaluation, and HUD publishing. |
| backend/session_runtime.py | Overshoot and OpenAI runtime creation, sideband transport, keepalive, and speech / event sending. |
| backend/recipe_catalog.py | Recipe JSON loading and catalog indexing. |
| backend/session_types.py | ControlSession and StepRuntimeState dataclasses. |
| backend/session_constants.py | Shared constants: phase names, default model IDs, inventory scan prompt, OpenAI session instructions, and evaluation mode literals. |
| backend/session_helpers.py | Pure helper functions used by the orchestrator. |
| backend/recipes/*.json | Data-driven workflow definitions. Filename keywords drive recipe selection. |
