Proactive glasses apps continuously observe real-world context and react when something meaningful changes; they are not chat flows waiting for user prompts. This page explains the core loop, perception providers, observation contract, stabilization rules, workflow authority, and a concrete example you can copy. Use this pattern when camera, audio, sensor, or backend observations should guide, alert, adapt, or trigger actions without requiring an explicit command for every step.
Core Loop
The proactive perception loop has five stages that run continuously while the app is active.
Perception providers should report observations. The app controller should own workflow changes and effects. Keeping these responsibilities separate makes providers swappable.
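The split between providers and the controller can be sketched as follows. This is a minimal illustration, not GlassKit's actual API: `Observation`, `PerceptionProvider`, `FakeDetector`, `AppController`, the state names, and the 0.6 confidence cutoff are all hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Observation:
    kind: str          # hypothetical label, e.g. "object_present"
    value: str
    confidence: float


class PerceptionProvider(Protocol):
    """Anything that can report observations; it never changes app state."""

    def poll(self) -> list[Observation]:
        ...


class FakeDetector:
    """Stand-in provider that replays canned observations, one batch per poll."""

    def __init__(self, frames):
        self._frames = list(frames)

    def poll(self):
        return self._frames.pop(0) if self._frames else []


class AppController:
    """Owns workflow state and effects; the provider only supplies input."""

    def __init__(self, provider, speak):
        self.provider = provider
        self.speak = speak
        self.state = "waiting_for_cup"

    def tick(self):
        for obs in self.provider.poll():
            is_cup = obs.kind == "object_present" and obs.value == "cup"
            if self.state == "waiting_for_cup" and is_cup and obs.confidence >= 0.6:
                self.state = "cup_found"
                self.speak("Cup detected.")


spoken = []
provider = FakeDetector([[Observation("object_present", "cup", 0.9)]])
controller = AppController(provider, spoken.append)
controller.tick()
```

Because the controller depends only on `poll()`, swapping `FakeDetector` for a VLM-backed or OCR-backed provider would not require touching the workflow logic.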
Perception Loop Options
The perception loop can use any provider that turns live context into normalized observations:
VLM Inference
Continuous VLM inference via a service like Overshoot over a live WebRTC stream. Returns structured results at configurable intervals.
Object Detection
Fine-tuned or open-vocabulary detector running on the backend. Returns class labels, bounding boxes, and confidence scores.
OCR / Markers
Optical character recognition, barcode, or fiducial marker detection for text and code reading.
Audio Events
Keyword detection, sound classification, or backend audio analysis triggered by the microphone stream.
Periodic Image Turns
Scheduled image turns sent to a realtime model for description or classification, without a full live stream.
Hand / Pose / Gesture
Hand, body pose, or gesture detection for spatial interaction signals.
Observation Contract
Normalize provider output before any app logic sees it. Avoid letting raw VLM text, detector envelopes, transcripts, or provider-specific schemas directly drive behavior. Prefer small app-owned observation events.
Include a generation, active task_id, prompt id, detector id, or equivalent field when the perception request changes over time. Add a timestamp only when time-based stabilization needs it.
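As a minimal sketch of such an event, the snippet below converts a detector-style envelope into an app-owned observation. The field names (`kind`, `label`, `confidence`, `generation`, `timestamp_ms`) and the shape of the raw envelope are assumptions made for illustration; a real provider's schema will differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObservationEvent:
    kind: str                           # app-level meaning, not a provider type
    label: str
    confidence: float
    generation: int                     # which perception request produced this
    timestamp_ms: Optional[int] = None  # only set when time-based rules need it


def normalize_detection(raw: dict, generation: int) -> list[ObservationEvent]:
    """Convert a hypothetical detector envelope into app-owned events.

    Provider-specific fields such as bounding boxes are dropped here so
    that app logic never depends on them.
    """
    events = []
    for det in raw.get("detections", []):
        events.append(
            ObservationEvent(
                kind="object_present",
                label=str(det["class"]).lower(),
                confidence=float(det["score"]),
                generation=generation,
            )
        )
    return events


raw_envelope = {
    "detections": [{"class": "Lime", "score": 0.91, "box": [10, 20, 50, 60]}]
}
events = normalize_detection(raw_envelope, generation=3)
```

The app then handles only `ObservationEvent` values, so replacing the detector with a VLM means writing a second normalizer rather than changing workflow code.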
Stabilization Rules
Continuous inference is noisy. The stabilization layer sits between raw observations and app behavior and decides when an observation is trustworthy enough to act on. Useful rules include:

| Rule | When to use |
|---|---|
| N consecutive observations | Simple debounce for object presence or scene state |
| Condition holds for M milliseconds | Time-based confirmation before important transitions |
| Rising edge count | Repeated physical actions (e.g. pour, tap) |
| Confidence threshold | Filters low-quality model outputs |
| Multi-provider agreement | High-stakes decisions requiring corroboration |
| Explicit user confirmation | Actions that are hard to undo or have side effects |
| Generation / task match | Discard stale results after the backend changes the active task |
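Several of the rules in the table can be sketched in a few lines each. The classes below are illustrative assumptions, not part of GlassKit: `ConsecutiveGate` combines N consecutive observations, a confidence threshold, and a generation match; `HoldTimer` confirms that a condition has held for a given duration; `RisingEdgeCounter` counts repeated actions.

```python
class ConsecutiveGate:
    """Fires once after `n` consecutive confident observations."""

    def __init__(self, n, min_confidence, generation):
        self.n = n
        self.min_confidence = min_confidence
        self.generation = generation
        self.streak = 0
        self.fired = False

    def update(self, present, confidence, generation):
        if generation != self.generation:
            return False  # stale result: discard without touching the streak
        if present and confidence >= self.min_confidence:
            self.streak += 1
        else:
            self.streak = 0
            self.fired = False
        if self.streak >= self.n and not self.fired:
            self.fired = True
            return True
        return False


class HoldTimer:
    """True once a condition has held continuously for `hold_ms`."""

    def __init__(self, hold_ms):
        self.hold_ms = hold_ms
        self.since = None

    def update(self, condition, now_ms):
        if not condition:
            self.since = None
            return False
        if self.since is None:
            self.since = now_ms
        return now_ms - self.since >= self.hold_ms


class RisingEdgeCounter:
    """Counts False -> True transitions, e.g. repeated pours or taps."""

    def __init__(self):
        self.prev = False
        self.count = 0

    def update(self, active):
        if active and not self.prev:
            self.count += 1
        self.prev = active
        return self.count


gate = ConsecutiveGate(n=3, min_confidence=0.5, generation=2)
samples = [
    (True, 0.9, 2),  # counts toward the streak
    (True, 0.9, 1),  # stale generation, ignored
    (True, 0.4, 2),  # below threshold, resets the streak
    (True, 0.9, 2),
    (True, 0.9, 2),
    (True, 0.9, 2),  # third in a row: fires
    (True, 0.9, 2),  # already fired, stays quiet
]
gate_results = [gate.update(*s) for s in samples]
```

Passing timestamps into `HoldTimer.update` explicitly, rather than reading a clock inside it, keeps the rule deterministic and easy to test.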
Workflow Authority
Workflow authority means one controller owns the current app state, active perception request, and the effect of each observation. For Rokid-class glasses apps, this should usually be the backend. Serialize per-session workflow changes through that controller before mutating state. For each workflow state, define:
- Active perception query: what the perception loop is looking for in this state.
- Observation schema: the normalized structure accepted in this state.
- Trigger or stabilization rule: how many observations, over what time, before acting.
- Transition, feedback, or action: what happens when a valid observation arrives.
- Invalidation rule: how old perception results are discarded when the state changes.
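The per-state definitions above can be written as a small table that a single controller interprets. The sketch below is one possible shape, with hypothetical state names, queries, and feedback strings; it uses a consecutive-observation count as the stabilization rule and a generation counter as the invalidation rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateSpec:
    query: str            # active perception query for this state
    accepts: str          # observation kind accepted in this state
    required_streak: int  # stabilization rule: N consecutive observations
    next_state: str       # transition taken when the rule is satisfied
    feedback: str         # what the wearer is told on transition


# Hypothetical two-step workflow; "done" is terminal and has no spec.
WORKFLOW = {
    "find_glass": StateSpec(
        "Is an empty glass visible?", "glass_visible", 2,
        "add_ice", "Glass found. Add ice.",
    ),
    "add_ice": StateSpec(
        "Is there ice in the glass?", "ice_in_glass", 2,
        "done", "All set.",
    ),
}


class Session:
    """Single owner of state, generation, and effects for one session."""

    def __init__(self):
        self.state = "find_glass"
        self.generation = 0
        self.streak = 0
        self.log = []

    def handle(self, kind, generation):
        spec = WORKFLOW.get(self.state)
        if spec is None:
            return  # terminal state: nothing left to observe
        if generation != self.generation:
            return  # invalidation rule: result belongs to an earlier query
        if kind != spec.accepts:
            self.streak = 0
            return
        self.streak += 1
        if self.streak >= spec.required_streak:
            self.log.append(spec.feedback)
            self.state = spec.next_state
            self.streak = 0
            self.generation += 1  # the next query starts a new generation


session = Session()
session.handle("glass_visible", 0)
session.handle("glass_visible", 0)  # second in a row: moves to add_ice
session.handle("ice_in_glass", 0)   # stale generation, discarded
session.handle("ice_in_glass", 1)
session.handle("ice_in_glass", 1)   # second in a row: moves to done
```

Bumping `generation` on every transition means any result still in flight for the previous query is dropped on arrival instead of advancing the new state.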
Example: Drink-Making Coach
A concrete example of this pattern is the proactive drink-making coach in GlassKit. It watches ingredients, picks a recipe based on what it sees, and guides the wearer through each step with spoken audio:
rokid-overshoot-openai-realtime
Proactive drink-making coach combining Overshoot for continuous VLM inference and OpenAI Realtime for spoken guidance. Its backend queues perception, OpenAI, control, and disconnect events through one session loop. Android streams media, sends gestures, and renders HUD and transcripts.
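Queuing every event source through one session loop might look roughly like the sketch below. It is not taken from the example's source code: the `asyncio.Queue` approach, the `(source, payload)` tuple shape, and the payload contents are assumptions chosen to illustrate serialized handling.

```python
import asyncio


async def session_loop(queue, handled):
    """Process events one at a time so state changes never interleave."""
    while True:
        source, payload = await queue.get()
        handled.append((source, payload))
        if source == "disconnect":
            return  # stop the loop and let the session clean up


async def main():
    queue = asyncio.Queue()
    handled = []
    loop_task = asyncio.create_task(session_loop(queue, handled))

    # In a real backend these would arrive from separate tasks or sockets.
    # All payloads here are hypothetical.
    await queue.put(("perception", {"kind": "lime_visible"}))
    await queue.put(("openai", {"type": "speech_finished"}))
    await queue.put(("control", {"gesture": "tap"}))
    await queue.put(("disconnect", None))

    await loop_task
    return handled


handled = asyncio.run(main())
```

With a single consumer, each event is handled to completion before the next begins, so perception results, model events, and gestures never race each other when updating workflow state.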
Related Guides
Object Detection
Detector events, confirmation rules, normalized results, and detection-driven task progression.
OpenAI Realtime
Backend-controlled speech, image injection after user audio turns, and sideband control.