The rokid-overshoot-openai-realtime example turns Rokid Glasses into a proactive drink-making assistant. Instead of waiting for the wearer to ask questions, the backend scans what is on the table, picks the best matching recipe from a data-driven catalog, and then guides each step with short spoken instructions while watching what the wearer does. If the wearer picks up the wrong bottle, the assistant corrects them in real time. The interaction is meant to feel more like a helpful person standing next to you than a voice assistant waiting for prompts.
It is also the most architecturally complete GlassKit example, combining seven concurrent connections across three services, a server-authoritative workflow state machine, Overshoot VLM inference with prompt switching per step, OpenAI Realtime sideband speech control, and a data-driven recipe format.
What the App Does
Scan ingredients
At session start the backend activates the inventory detector prompt on Overshoot. It waits for two consecutive identical normalized ingredient arrays before proceeding, filtering out transient detections.
Select a recipe
Once the inventory stabilizes, the backend asks OpenAI Realtime to choose a recipe by calling list_recipes and then activate_recipe, using the detected ingredient names and the recipe filename keywords.
Guide each step
The backend loads the chosen recipe JSON, switches to the first guided step, and patches the active Overshoot prompt to the step’s detector. It evaluates structured VLM outputs to decide whether to advance, correct, or speak progress updates.
Correct mistakes
If the wearer picks up the wrong ingredient, the backend detects the mismatch and sends an exact correction line to OpenAI Realtime via the sideband WebSocket. OpenAI speaks it over WebRTC.
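The stabilization rule from the scan step can be sketched as follows. This is a minimal illustration, not the backend's actual code: the normalization shown here (trim, lowercase, de-duplicate, sort) and the names normalize_ingredients and InventoryStabilizer are assumptions.

```python
from typing import Iterable, Optional, Tuple


def normalize_ingredients(raw: Iterable[str]) -> Tuple[str, ...]:
    """Assumed normalization: trim, lowercase, drop blanks and duplicates, sort."""
    cleaned = {name.strip().lower() for name in raw if name and name.strip()}
    return tuple(sorted(cleaned))


class InventoryStabilizer:
    """Reports an inventory once two consecutive normalized results are identical."""

    def __init__(self) -> None:
        self._previous: Optional[Tuple[str, ...]] = None

    def observe(self, raw: Iterable[str]) -> Optional[Tuple[str, ...]]:
        current = normalize_ingredients(raw)
        # Stable only if this result matches the one immediately before it.
        stable = self._previous is not None and current == self._previous
        self._previous = current
        return current if stable else None
```

Feeding each VLM result to observe() returns None until two results in a row normalize to the same tuple, so a single transient detection resets the comparison instead of triggering recipe selection.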
Architecture
Connection Graph
End-to-End Session Flow
| Step | What happens |
|---|---|
| 1. App launch | Rokid opens the backend control WebSocket and receives a server-created session_id. |
| 2. User tap | Rokid sends session.start on the control socket. |
| 3. Media setup | Rokid sends vision SDP offer to /session/{id}/vision and audio SDP offer to /session/{id}/realtime. |
| 4. Stream ownership | Backend creates the Overshoot stream and OpenAI Realtime call; starts the Overshoot WebSocket + keepalive and the OpenAI sideband WebSocket. |
| 5. Inventory scan | Backend activates the inventory detector prompt on Overshoot and waits for two consecutive identical normalized ingredient arrays. |
| 6. Recipe selection | Backend asks OpenAI Realtime to call list_recipes then activate_recipe; loads the chosen recipe JSON and switches to the first guided step. |
| 7. Guided workflow | Backend patches the active Overshoot prompt for each step, evaluates structured results, and decides whether to advance, correct, or speak progress. Sends hud.state updates to Rokid and exact speech instructions to the OpenAI sideband. |
| 8. Speech delivery | OpenAI Realtime speaks to Rokid over WebRTC; Rokid renders only the latest transcript, keyed by speech_epoch. |
| 9. App background / close | Rokid closes the control WebSocket on onStop; backend destroys session state and tears down both media runtimes. The next foreground reconnect gets a fresh session_id. |
Implementation Contracts
Client contract
The Android client is intentionally thin:
- Owns HUD rendering, gesture input, runtime permission handling, and the two WebRTC links.
- Must not choose recipes, interpret vision results, advance workflow steps, or decide what speech to play.
- Must render only the latest transcript and clear stale text when speech_epoch changes.
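The transcript rule can be illustrated with a small sketch. It is written in Python for consistency with the backend, although the real client is an Android app. The TranscriptView class, the delta-based callback, and the assumption that speech_epoch is an increasing integer are illustrative and not taken from the source.

```python
from typing import Optional


class TranscriptView:
    """Keeps only the transcript that belongs to the most recent speech_epoch."""

    def __init__(self) -> None:
        self.epoch: Optional[int] = None
        self.text = ""

    def on_transcript_delta(self, speech_epoch: int, delta: str) -> str:
        if self.epoch is not None and speech_epoch < self.epoch:
            # Late text from an older utterance: ignore it.
            return self.text
        if speech_epoch != self.epoch:
            # New utterance: clear the stale text before appending.
            self.epoch = speech_epoch
            self.text = ""
        self.text += delta
        return self.text
```

Keying the display on the epoch means the client never has to reason about which utterance is current; it only compares numbers supplied by the backend.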
Requirements
- Rokid Glasses + dev cable
- Android Studio with adb
- Python 3.12 with uv
- Overshoot API key (OVERSHOOT_API_KEY)
- OpenAI API key (OPENAI_API_KEY)
Configuration
Optional Backend Overrides
| Variable | Default | Description |
|---|---|---|
| OVERSHOOT_API_URL | https://api.overshoot.ai/v0.2 | Overshoot API base URL. |
| OVERSHOOT_MODEL | Qwen/Qwen3.5-27B | Overshoot model identifier. |
| OPENAI_REALTIME_MODEL | gpt-realtime-1.5 | OpenAI Realtime model identifier. |
The default model IDs are defined in backend/session_constants.py.
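One way to resolve these overrides is sketched below. The variable names and default values come from the table above; the BackendConfig dataclass, the load_config helper, and the choice to treat an empty variable as unset are assumptions, not the backend's actual loader.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BackendConfig:
    overshoot_api_url: str
    overshoot_model: str
    openai_realtime_model: str


# Defaults copied from the overrides table.
DEFAULTS = {
    "OVERSHOOT_API_URL": "https://api.overshoot.ai/v0.2",
    "OVERSHOOT_MODEL": "Qwen/Qwen3.5-27B",
    "OPENAI_REALTIME_MODEL": "gpt-realtime-1.5",
}


def load_config(env: Mapping[str, str] = os.environ) -> BackendConfig:
    def pick(name: str) -> str:
        # Assumption: an unset or empty variable falls back to the default.
        return env.get(name) or DEFAULTS[name]

    return BackendConfig(
        overshoot_api_url=pick("OVERSHOOT_API_URL"),
        overshoot_model=pick("OVERSHOOT_MODEL"),
        openai_realtime_model=pick("OPENAI_REALTIME_MODEL"),
    )
```

Passing the environment in as a mapping keeps the helper easy to test without touching the real process environment.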
Run the Backend
Run the Glasses App
Gesture Controls
| Gesture | KeyEvent | Action |
|---|---|---|
| Temple tap | KEYCODE_ENTER | Start or stop the session. |
| Swipe forward | KEYCODE_DPAD_UP | Advance one internal debug step. |
| Swipe backward | KEYCODE_DPAD_DOWN | Move back one internal debug step. |
Recipe Files
Recipes live in backend/recipes/. Each recipe is a JSON file whose filename keywords are used by OpenAI Realtime to select the right recipe from the detected ingredients. The current example recipe is orange-juice-blue-gatorade-lime-mocktail.json.
A recipe file defines a display name, a starting step ID, a flat task list, a set of named detectors (each a VLM prompt targeting one structured output field), and an ordered list of steps. Each step references a detector, specifies an evaluation mode, and provides speech lines for entering the step, for mismatches, and for success.
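To make the structure described above concrete, here is a hypothetical recipe expressed as a Python dictionary, with a small structural check. Every key name and value is an assumption chosen to mirror the prose; the authoritative schema is whatever the shipped file in backend/recipes/ uses.

```python
from typing import List

# Hypothetical recipe shape. All key names and values are illustrative only.
EXAMPLE_RECIPE = {
    "display_name": "Example Mocktail",
    "start_step_id": "pick_up_orange_juice",
    "tasks": ["Pour orange juice", "Add lime"],
    "detectors": {
        "held_item": {
            "prompt": "Which ingredient is the wearer holding?",
            "output_field": "held_item",
        },
    },
    "steps": [
        {
            "id": "pick_up_orange_juice",
            "detector": "held_item",
            "evaluation_mode": "match_value",
            "expected_value": "orange_juice",
            "speech": {
                "enter": "Pick up the orange juice.",
                "mismatch": "That is not the orange juice.",
                "success": "Good, that is the orange juice.",
            },
        },
    ],
}


def validate_recipe(recipe: dict) -> List[str]:
    """Return structural problems found; an empty list means none were found."""
    problems = []
    step_ids = {step["id"] for step in recipe["steps"]}
    if recipe["start_step_id"] not in step_ids:
        problems.append("start_step_id does not match any step")
    for step in recipe["steps"]:
        if step["detector"] not in recipe["detectors"]:
            problems.append(f"step {step['id']} references an unknown detector")
    return problems
```

A check like this catches a dangling detector reference when the file is loaded, before a session reaches that step.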
Evaluation Modes
Evaluation modes are defined in backend/session_constants.py and control how the backend interprets the VLM’s structured output for each step:
| Mode | When to use |
|---|---|
| match_value | Advance when the detector field equals expected_value. |
| numeric_threshold_with_progress_once | Advance once a numeric field crosses a threshold; speak progress updates along the way. |
| count_rising_edges_true | Count discrete true events (e.g., a momentary action repeated N times). |
| enum_progress_once_then_complete | Speak on each new enum value seen, complete on a terminal value. |
| momentary_true_complete | Advance immediately on the first true result. |
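Two of these modes are simple enough to sketch directly. The code below is an interpretation of the table descriptions, not the backend's evaluator: the function and class names, and the target parameter for the number of required events, are assumptions.

```python
from dataclasses import dataclass
from typing import Any


def match_value(observed: Any, expected_value: Any) -> bool:
    """match_value: complete when the detector field equals expected_value."""
    return observed == expected_value


@dataclass
class RisingEdgeCounter:
    """count_rising_edges_true: count False -> True transitions up to a target."""

    target: int
    count: int = 0
    previous: bool = False

    def observe(self, value: bool) -> bool:
        # Only a transition from False to True counts as a new event, so a
        # detector that reports True on several consecutive frames counts once.
        if value and not self.previous:
            self.count += 1
        self.previous = value
        return self.count >= self.target
```

With RisingEdgeCounter(target=2), the sequence True, True, False, True completes on the fourth observation, because the second True is a continuation of the first event and not a new one.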
Key Backend Files
| File | Description |
|---|---|
| backend/main.py | FastAPI lifecycle and the control / vision / realtime routes. |
| backend/session_manager.py | Composition layer wiring SessionWorkflowMixin and SessionRuntimeMixin with shared HTTP clients and session registry. |
| backend/session_workflow.py | Workflow state machine: recipe activation, step evaluation, and HUD publishing. |
| backend/session_runtime.py | Overshoot and OpenAI runtime creation, sideband transport, keepalive, and speech / event sending. |
| backend/recipe_catalog.py | Recipe JSON loading and catalog indexing. |
| backend/session_types.py | ControlSession and StepRuntimeState dataclasses. |
| backend/session_constants.py | Shared constants: phase names, default model IDs, inventory scan prompt, OpenAI session instructions, and evaluation mode literals. |
| backend/session_helpers.py | Pure helper functions used by the orchestrator. |
| backend/recipes/*.json | Data-driven workflow definitions. Filename keywords drive recipe selection. |
Related Examples
- Live Scene Reader (Overshoot) — minimal Overshoot-only example; the streaming and relay patterns used here come directly from it.
- IKEA Assembly Assistant — simpler OpenAI Realtime example with tool calls but no Overshoot vision.