Every GlassKit app is organized around four cooperating layers: the Rokid Glasses Android app, a WebRTC media transport, a backend that drives AI logic, and the feedback loop that surfaces results back to the wearer. This page explains what each layer owns, how data flows between them, and where offline-capable pieces fit.
The Four Layers
Rokid Glasses (Android)
Captures camera frames and microphone audio, handles touchpad gestures, and renders the monochrome HUD.
WebRTC Transport
Carries live media between the glasses, your backend, and upstream AI services over a peer-to-peer or relayed connection.
Backend
Coordinates session setup, workflow state, model calls, tool calls, and app-specific decisions.
Wearer Feedback
Delivers real-time guidance via the HUD display, speakers, or data-channel events driving UI updates.
Layer 1 — Rokid Glasses (Android)
The Rokid Glasses Android app is the on-device half of your system. It owns:
- Camera capture: rear-facing 1024×768 @ 15 fps via CameraX, with an Application-level back-camera limiter to avoid front-camera validation retries.
- Microphone capture: 16 kHz mono PCM16 via `AudioRecord`, or `JavaAudioDeviceModule` when WebRTC manages the audio path.
- Touchpad input: four gestures (tap, double-tap, swipe forward, swipe backward) mapped to `KeyEvent` codes and `OnBackPressedCallback`.
- HUD rendering: a 480×640 portrait monochrome display. Black pixels are transparent; white pixels appear green. Content is rendered with Android views inside a `HudViewportLayout`.
Rokid Glasses have less CPU and RAM than phones. Keep on-device logic efficient. Avoid heavy per-frame processing unless the feature specifically requires local inference.
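To make the resource constraint concrete, the capture formats listed above can be turned into rough data rates. The sketch below is back-of-the-envelope arithmetic only; the YUV 4:2:0 frame layout (1.5 bytes per pixel) is an assumption about how frames are delivered, not something this page specifies.

```python
# Rough data-rate arithmetic for the capture formats described above.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # PCM16
CHANNELS = 1          # mono

WIDTH, HEIGHT, FPS = 1024, 768, 15
BYTES_PER_PIXEL = 1.5  # assumption: uncompressed YUV 4:2:0 frames


def audio_bytes_per_second() -> int:
    """Raw microphone throughput before any encoding."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


def audio_chunk_bytes(chunk_ms: int) -> int:
    """Size of one audio chunk of the given duration."""
    return audio_bytes_per_second() * chunk_ms // 1000


def raw_frame_bytes() -> int:
    """Size of one uncompressed camera frame."""
    return int(WIDTH * HEIGHT * BYTES_PER_PIXEL)


def raw_video_bytes_per_second() -> int:
    """Uncompressed video throughput at the capture frame rate."""
    return raw_frame_bytes() * FPS


def frame_budget_ms() -> float:
    """Time available to process one frame without falling behind."""
    return 1000 / FPS
```

Under these assumptions, any per-frame work on the device has to handle about 1.2 MB of pixel data within roughly 67 ms, which is why local processing is best kept for features that genuinely need it.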
Layer 2 — WebRTC Transport
WebRTC is the media highway between the glasses and everything else. The Android side creates a `PeerConnection`, adds camera and microphone tracks, then exchanges SDP with the backend to establish the session.
Two integration shapes are common:
- Backend Media Receiver
- Backend Service Broker
In the Backend Media Receiver shape, Android sends an SDP offer directly to your Python or Node backend. The backend terminates the WebRTC connection, receives the video and audio tracks, runs inference or processing on them, and sends results back on a data channel.
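The offer/answer exchange can be sketched as a small request handler. This is a minimal illustration, not GlassKit's actual endpoint: `handle_offer` and the `create_answer` hook are hypothetical names, and the real WebRTC stack that produces the answer SDP is left to the caller. The `{"type": ..., "sdp": ...}` payload follows the standard session description shape.

```python
import json
from typing import Callable


def handle_offer(request_body: str, create_answer: Callable[[str], str]) -> str:
    """Validate an SDP offer payload and return a JSON answer payload.

    `create_answer` is a hypothetical hook: it receives the offer SDP and
    returns the answer SDP produced by whichever WebRTC library the
    backend uses.
    """
    payload = json.loads(request_body)
    if (
        not isinstance(payload, dict)
        or payload.get("type") != "offer"
        or not isinstance(payload.get("sdp"), str)
    ):
        raise ValueError('expected a JSON body like {"type": "offer", "sdp": "..."}')

    answer_sdp = create_answer(payload["sdp"])
    return json.dumps({"type": "answer", "sdp": answer_sdp})
```

Keeping the WebRTC library behind a single callable makes the signaling route easy to test without opening a real peer connection.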
Layer 3 — Backend
The backend is where AI logic lives. Depending on the app, it may:
- Receive live video frames from the WebRTC track and run a vision model (e.g., RF-DETR object detection, Overshoot scene description).
- Connect to a speech-to-text or voice model (e.g., OpenAI Realtime) and broker audio between the glasses and the model.
- Maintain workflow state — for example, tracking assembly steps, recognizing when one step is complete, and advancing to the next.
- Invoke tools — querying a database, updating records, calling external APIs — and surface the results as HUD updates.
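The workflow-state responsibility can be illustrated with a small step tracker. The sketch below is one possible shape, not the implementation used by the example apps: the `Workflow` class and the `required_hits` debounce (requiring several consecutive positive results before advancing) are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Workflow:
    """Tracks progress through an ordered list of steps."""

    steps: List[str]
    required_hits: int = 3  # assumed debounce: consecutive confirmations needed
    index: int = 0
    _hits: int = 0

    @property
    def current(self) -> Optional[str]:
        """Name of the active step, or None once every step is done."""
        return self.steps[self.index] if self.index < len(self.steps) else None

    def observe(self, step_done: bool) -> bool:
        """Feed one model result; return True when the workflow advances."""
        if self.current is None:
            return False
        # A negative result resets the streak so one noisy frame cannot advance.
        self._hits = self._hits + 1 if step_done else 0
        if self._hits < self.required_hits:
            return False
        self.index += 1
        self._hits = 0
        return True
```

A backend could call `observe()` once per inference result and push a HUD update whenever it returns `True`.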
Layer 4 — Wearer Feedback
The feedback layer closes the loop. Results from the backend reach the wearer through:
- HUD text and graphics: updated via data-channel events that drive `ScreenController.render()` calls on the Android side.
- Spoken audio: either streamed from a voice model over the WebRTC audio track, or played locally from the Android speaker.
- Navigation state: the current active screen and available touchpad actions update as the workflow advances.
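On the backend side, a data-channel event that drives a HUD update might be serialized as shown below. The message schema (`type`, `title`, `lines`, `actions`) and the `build_hud_event` helper are hypothetical; the page does not define the wire format, so treat this as one plausible sketch. The gesture names mirror the four touchpad gestures described earlier.

```python
import json
from typing import Dict, List, Optional

# The four touchpad gestures the glasses support.
GESTURES = {"tap", "double_tap", "swipe_forward", "swipe_backward"}


def build_hud_event(
    title: str,
    lines: List[str],
    actions: Optional[Dict[str, str]] = None,
) -> str:
    """Serialize a HUD update for sending over the data channel.

    The schema is an assumption for illustration: `actions` maps a gesture
    name to the label the wearer sees for it.
    """
    actions = actions or {}
    unknown = set(actions) - GESTURES
    if unknown:
        raise ValueError(f"unknown gestures: {sorted(unknown)}")
    return json.dumps(
        {"type": "hud.update", "title": title, "lines": lines, "actions": actions}
    )
```

Carrying the available touchpad actions in the same event keeps the HUD content and the navigation state in step as the workflow advances.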
Data Flow at a Glance
Offline-Capable Pieces
Not every feature requires a live backend connection. Some pieces can run fully on the device:
Vosk Voice Commands
Offline speech recognition for short command words (start, stop, confirm). Useful as a fallback when Wi-Fi is unavailable or as a complement to touchpad navigation.
Local Vision / Privacy Processing
CameraX frames can be processed on-device for privacy filtering or simple CV tasks before streaming, so sensitive data never leaves the glasses.
Rokid Glasses do not have cellular networking. All networked features depend on device Wi-Fi or a phone companion app. Design offline fallbacks for environments with unreliable connectivity.
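The command-word matching behind offline voice control can be sketched in a few lines. The example assumes the recognizer returns a JSON string with a `text` field, which is how Vosk reports final results; the `parse_command` helper and the exact command set are illustrative, and on the glasses the same logic would live in the Android app.

```python
import json
from typing import Optional

# Short command words suited to offline recognition.
COMMANDS = ("start", "stop", "confirm")


def parse_command(result_json: str) -> Optional[str]:
    """Return the first known command word in a recognizer result, if any.

    Assumes `result_json` looks like '{"text": "please start"}'.
    """
    text = json.loads(result_json).get("text", "")
    for word in text.lower().split():
        if word in COMMANDS:
            return word
    return None
```

Matching against a small fixed vocabulary keeps the fallback predictable: anything outside the command set is ignored rather than guessed at.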
Example Apps
The GlassKit repository includes several reference implementations that show this architecture in action:

| Example | Layers in use |
|---|---|
| `rokid-feature-demo` | Android layer only: touchpad, camera, mic, HUD, offline Vosk |
| `rokid-overshoot` | Android + WebRTC + backend receiver (Overshoot scene description) |
| `rokid-openai-realtime` | Android + WebRTC + backend broker (OpenAI Realtime voice) |
| `rokid-rfdetr` | Android + WebRTC + backend receiver (RF-DETR object detection) |
| `rokid-overshoot-openai-realtime` | All four layers: video inference, voice, workflow state, HUD guidance |