GlassKit Application Architecture: Four-Layer Design

Every GlassKit app is organized around four cooperating layers: the Rokid Glasses Android app, a WebRTC media transport, a backend that drives AI logic, and the feedback loop that surfaces results back to the wearer. This page explains what each layer owns, how data flows between them, and where offline-capable pieces fit.

The Four Layers

Rokid Glasses (Android)

Captures camera frames and microphone audio, handles touchpad gestures, and renders the monochrome HUD.

WebRTC Transport

Carries live media between the glasses, your backend, and upstream AI services over a peer-to-peer or relayed connection.

Backend

Coordinates session setup, workflow state, model calls, tool calls, and app-specific decisions.

Wearer Feedback

Delivers real-time guidance via the HUD display, speakers, or data-channel events driving UI updates.

Layer 1 — Rokid Glasses (Android)

The Rokid Glasses Android app is the on-device half of your system. It owns:

Camera capture — rear-facing 1024×768 @ 15 fps via CameraX, with an Application-level back-camera limiter to avoid front-camera validation retries.
Microphone capture — 16 kHz mono PCM16 via AudioRecord or JavaAudioDeviceModule when WebRTC manages the audio path.
Touchpad input — four gestures (tap, double-tap, swipe forward, swipe backward) mapped to KeyEvent codes and OnBackPressedCallback.
HUD rendering — a 480×640 portrait monochrome display. Black pixels are transparent; white pixels appear green. Content is rendered with Android views inside a HudViewportLayout.

The Android app is intentionally thin. It streams media outbound, handles input, and updates the HUD based on instructions that arrive from the backend over a data channel or WebSocket. It does not run AI models by default.

Rokid Glasses have less CPU and RAM than phones. Keep on-device logic efficient. Avoid heavy per-frame processing unless the feature specifically requires local inference.
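To make the cost of per-frame work concrete, a rough raw-media budget can be worked out from the capture formats listed above. The sketch below assumes uncompressed YUV 4:2:0 frames (1.5 bytes per pixel) and 20 ms audio chunks; both are illustrative assumptions, not GlassKit settings, and the function names are hypothetical.

```python
"""Rough raw-media budget for the capture formats described on this page."""

# Values taken from the capture description above.
WIDTH, HEIGHT, FPS = 1024, 768, 15
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # PCM16, mono

# Assumptions for illustration only (not GlassKit settings).
YUV420_BYTES_PER_PIXEL = 1.5
AUDIO_CHUNK_MS = 20


def raw_frame_bytes(width: int = WIDTH, height: int = HEIGHT) -> int:
    """Size of one uncompressed YUV 4:2:0 frame in bytes."""
    return int(width * height * YUV420_BYTES_PER_PIXEL)


def raw_video_bytes_per_second(fps: int = FPS) -> int:
    """Uncompressed video data rate at the capture frame rate."""
    return raw_frame_bytes() * fps


def audio_chunk_bytes(chunk_ms: int = AUDIO_CHUNK_MS) -> int:
    """Size of one PCM16 mono audio chunk of the given duration."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000


def frame_budget_ms(fps: int = FPS) -> float:
    """Time available to process one frame before the next arrives."""
    return 1000 / fps
```

Under these assumptions, any on-device step that touches every pixel has about 66 ms to process roughly 1.2 MB per frame, which is why heavy per-frame processing is best left to the backend.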

Layer 2 — WebRTC Transport

WebRTC is the media highway between the glasses and everything else. The Android side creates a PeerConnection, adds camera and microphone tracks, then exchanges SDP with the backend to establish the session. Two integration shapes are common:

Backend Media Receiver
Backend Service Broker

Backend Media Receiver: Android sends an SDP offer directly to your Python or Node backend. The backend terminates the WebRTC connection, receives the video and audio tracks, runs inference or processing on them, and sends results back on a data channel.

Glasses ──SDP offer──► Backend (aiortc)
        ◄─SDP answer──
        ══video/audio═►
        ◄══data events═

Backend Service Broker: Android sends an SDP offer to your backend. The backend forwards it (or adapts it) to an upstream provider like OpenAI Realtime, then relays the provider’s answer SDP back to Android. The backend also bridges control events between Android and the provider.

Glasses ──SDP offer──► Backend ──session create──► Provider
        ◄─SDP answer──         ◄─answer SDP──────
        ══video/audio════════════════════════════►
        ◄══events/audio════════════════════════
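The control-event bridging in the broker shape can be pictured as a small routing function on the backend. This is only a sketch: the event type names and the `route_event` helper are hypothetical and are not taken from GlassKit or from any provider's API.

```python
from typing import Optional

# Hypothetical event names, for illustration only.
TO_PROVIDER = {"wearer.gesture", "wearer.text"}
TO_GLASSES = {"provider.transcript", "provider.tool_result"}


def route_event(source: str, event: dict) -> Optional[str]:
    """Return the destination for an event, or None to drop it.

    `source` is either "glasses" or "provider". Only event types on the
    allow-list for that direction are forwarded.
    """
    kind = event.get("type")
    if source == "glasses" and kind in TO_PROVIDER:
        return "provider"
    if source == "provider" and kind in TO_GLASSES:
        return "glasses"
    return None
```

An allow-list per direction keeps the broker from blindly relaying provider-internal events to the glasses, where every message costs HUD and battery budget.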

WebRTC uses non-trickle signaling in GlassKit examples: ICE gathering completes before the offer is sent, so both sides receive a complete SDP in one round-trip.
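With non-trickle signaling, the offer should already carry its ICE candidates as `a=candidate:` lines when it is posted. The following sketch checks for that before wrapping the SDP in the usual `{"type", "sdp"}` session-description shape. The helper names are hypothetical, and how GlassKit's examples actually package the offer is not shown on this page.

```python
import json


def has_gathered_candidates(sdp: str) -> bool:
    """True if the SDP carries at least one ICE candidate line."""
    return any(line.startswith("a=candidate:") for line in sdp.splitlines())


def build_offer_payload(sdp: str) -> str:
    """Wrap a complete (non-trickle) offer SDP as a JSON signaling body.

    Raises ValueError if no candidates are present, which usually means
    the offer was read before ICE gathering finished.
    """
    if not has_gathered_candidates(sdp):
        raise ValueError("offer has no ICE candidates; wait for gathering to complete")
    return json.dumps({"type": "offer", "sdp": sdp})
```

A check like this turns a silent connection failure into an immediate error on the sending side.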

Layer 3 — Backend

The backend is where AI logic lives. Depending on the app, it may:

Receive live video frames from the WebRTC track and run a vision model (e.g., RF-DETR object detection, Overshoot scene description).
Connect to a speech-to-text or voice model (e.g., OpenAI Realtime) and broker audio between the glasses and the model.
Maintain workflow state — for example, tracking assembly steps, recognizing when one step is complete, and advancing to the next.
Invoke tools — querying a database, updating records, calling external APIs — and surface the results as HUD updates.

The backend communicates results to Android over a WebRTC data channel (for sessions already connected) or a separate WebSocket for out-of-band control events.
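The workflow-state responsibility described above can be sketched as a small step tracker that advances when a vision model reports the label a step is waiting for, then produces a message for the data channel. The class name, the label strings, and the `hud.update` message shape are all hypothetical; they illustrate the idea rather than GlassKit's actual schema.

```python
import json
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class AssemblyWorkflow:
    """Tracks progress through (instruction, required_label) steps."""

    steps: List[Tuple[str, str]]
    index: int = 0

    @property
    def done(self) -> bool:
        return self.index >= len(self.steps)

    def current_instruction(self) -> str:
        return "All steps complete" if self.done else self.steps[self.index][0]

    def observe(self, detected_labels: Set[str]) -> bool:
        """Advance one step if the current step's label was detected."""
        if self.done:
            return False
        _, required = self.steps[self.index]
        if required in detected_labels:
            self.index += 1
            return True
        return False

    def hud_message(self) -> str:
        """JSON text the backend could send over the data channel."""
        return json.dumps(
            {"type": "hud.update", "step": self.index, "text": self.current_instruction()}
        )
```

A usage example, with made-up detection labels:

```python
wf = AssemblyWorkflow([("Attach bracket", "bracket_attached"),
                       ("Insert screw", "screw_inserted")])
if wf.observe({"bracket_attached"}):
    message = wf.hud_message()  # send this string on the data channel
```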

Python backends typically use aiortc to terminate WebRTC media. FastAPI works well as the HTTP + WebSocket host. See the WebRTC Streaming page for full backend patterns.

Layer 4 — Wearer Feedback

The feedback layer closes the loop. Results from the backend reach the wearer through:

HUD text and graphics — updated via data-channel events that drive ScreenController.render() calls on the Android side.
Spoken audio — either streamed from a voice model over the WebRTC audio track, or played locally from the Android speaker.
Navigation state — the current active screen and available touchpad actions update as the workflow advances.

The wearer interacts back through touchpad gestures or microphone input, which feeds into the next backend decision cycle.
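One way to picture this loop is a render event that carries both the HUD content and the touchpad actions available on that screen, so a gesture can be resolved to the next screen. The sketch below uses the four gestures named on this page; the event schema and function names are hypothetical, and how `ScreenController.render()` consumes such a message on Android is not shown here.

```python
import json
from typing import Dict, List, Optional

# The four touchpad gestures described for the Android layer.
GESTURES = ("tap", "double_tap", "swipe_forward", "swipe_backward")


def render_event(screen: str, lines: List[str], actions: Dict[str, str]) -> str:
    """Build a hypothetical render event as JSON.

    `actions` maps a gesture name to the screen it leads to.
    """
    unknown = set(actions) - set(GESTURES)
    if unknown:
        raise ValueError(f"unknown gestures: {sorted(unknown)}")
    return json.dumps(
        {"type": "render", "screen": screen, "lines": lines, "actions": actions}
    )


def next_screen(event_json: str, gesture: str) -> Optional[str]:
    """Resolve a gesture against the actions advertised by the event."""
    return json.loads(event_json)["actions"].get(gesture)
```

Sending the available actions with each screen keeps navigation state in one place, so the backend and the glasses cannot disagree about what a swipe means.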

Data Flow at a Glance

┌─────────────────────────────────────┐
│         Rokid Glasses (Android)     │
│                                     │
│  Camera ──► WebRTC video track ─────┼──► Backend
│  Mic    ──► WebRTC audio track ─────┼──►   │
│  Touchpad ─► ScreenController ──────┼──►   │ (AI / models / tools)
│                                     │◄─────┘
│  HUD ◄── render() ◄── data channel  │
│  Speaker ◄── remote audio track     │
└─────────────────────────────────────┘

Offline-Capable Pieces

Not every feature requires a live backend connection. Some pieces can run fully on the device:

Vosk Voice Commands

Offline speech recognition for short command words (start, stop, confirm). Useful as a fallback when Wi-Fi is unavailable or as a complement to touchpad navigation.
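For a small command vocabulary, Vosk recognizers can be given a grammar as a JSON list of phrases, and they return results as JSON with a `text` field. The sketch below builds such a grammar and maps a result to a command using only the standard library. It is written in Python for illustration; the helper names are hypothetical, and the demo app's actual Android integration may differ.

```python
import json
from typing import Optional, Sequence

# Command words from the description above.
COMMANDS = ("start", "stop", "confirm")


def build_grammar(commands: Sequence[str] = COMMANDS) -> str:
    """JSON phrase list for a Vosk grammar; "[unk]" absorbs other speech."""
    return json.dumps(list(commands) + ["[unk]"])


def parse_command(result_json: str, commands: Sequence[str] = COMMANDS) -> Optional[str]:
    """Return the recognized command, or None if the result is not one."""
    text = json.loads(result_json).get("text", "").strip()
    return text if text in commands else None
```

Restricting the grammar to a few words tends to make short commands more reliable than open-vocabulary recognition, at the cost of ignoring everything else.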

Local Vision / Privacy Processing

CameraX frames can be processed on-device for privacy filtering or simple CV tasks before streaming, so sensitive data never leaves the glasses.
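As an illustration of privacy filtering before streaming, the sketch below blanks a rectangular region of a frame in place. It assumes a row-major 8-bit grayscale buffer and a bounding box supplied by some earlier detection step; both are assumptions, and `mask_region` is a hypothetical helper. On the glasses this logic would live in the Android app; Python is used here only to show the idea.

```python
from typing import Tuple


def mask_region(frame: bytearray, width: int, height: int,
                box: Tuple[int, int, int, int]) -> None:
    """Zero out box = (left, top, right, bottom) in place.

    `frame` is a row-major 8-bit grayscale image of width * height bytes.
    The right and bottom edges are exclusive; the box is clamped to the frame.
    """
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if right <= left:
        return
    blank = bytes(right - left)  # a run of zero bytes, one per masked pixel
    for y in range(top, bottom):
        start = y * width + left
        frame[start:start + (right - left)] = blank
```

Masking before the frame is handed to the encoder means the blanked pixels are never transmitted.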

Rokid Glasses do not have cellular networking. All networked features depend on device Wi-Fi or a phone companion app. Design offline fallbacks for environments with unreliable connectivity.

Example Apps

The GlassKit repository includes several reference implementations that show this architecture in action:

Example	Layers in use
`rokid-feature-demo`	Android layer only — touchpad, camera, mic, HUD, offline Vosk
`rokid-overshoot`	Android + WebRTC + backend receiver (Overshoot scene description)
`rokid-openai-realtime`	Android + WebRTC + backend broker (OpenAI Realtime voice)
`rokid-rfdetr`	Android + WebRTC + backend receiver (RF-DETR object detection)
`rokid-overshoot-openai-realtime`	All four layers — video inference, voice, workflow state, HUD guidance

Each example directory contains a README with setup steps and environment variables.
