Object detection on Rokid Glasses is most useful when your app needs deterministic visual signals from the outward camera: object presence, class labels, bounding boxes, counters, completion triggers, or annotated frames for realtime model augmentation. This page covers the full pipeline: Android stream setup, backend receiver, frame policy, normalized results, decision logic, event contracts, annotated frames, and model augmentation.
Architecture
The common object-detection shape is a five-step pipeline:
Android streams
Android sends the video to a backend vision endpoint over WebRTC with a data channel for control events.
Backend runs detection
The backend receives video, runs detection on the latest useful frame, and normalizes model output into app-owned structures.
Backend publishes events
The backend publishes normalized app state and domain events to Android over the data channel or a control WebSocket.
Android should not interpret raw model envelopes. It should consume normalized app state — status, detected classes, active task, counters — and let the backend own task progression.
Android Stream Setup
Use a separate camera WebRTC session when detection is not the main realtime media path. Start with the lowest supported capture mode that still supports the detector. On Rokid, capture at 1024×768 @ 15 fps and throttle detection or WebRTC output to a lower rate when needed.
Create any data channel before the offer if detection events need to move over the same peer connection. Send explicit app events such as session.start, run.start, debug.step, or workflow.confirm, and queue client events until the channel or control socket is open.
Backend Receiver
Use `aiortc` for WebRTC termination in Python rather than hand-rolling SDP or media parsing. Keep the receiver thin: accept the media stream, hand frames to a vision processor, and publish normalized app events.
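A thin receiver along these lines might look like the sketch below. It is an illustration under assumptions, not this project's actual server: the function names `pump_frames`, `answer_offer`, and `publish` are hypothetical, signaling (how the offer reaches the backend) is left out, and `aiortc` is imported lazily so the frame loop can be exercised without it.

```python
import asyncio
import json


async def pump_frames(track, on_frame):
    """Read frames from a media track until it ends and hand each one to on_frame."""
    while True:
        try:
            frame = await track.recv()
        except Exception:
            # Simplification: aiortc raises MediaStreamError when a track ends.
            break
        on_frame(frame)


async def answer_offer(offer_sdp, offer_type, on_frame, channels):
    """Terminate one WebRTC session. `channels` collects data channels opened by the client."""
    from aiortc import RTCPeerConnection, RTCSessionDescription  # lazy import

    pc = RTCPeerConnection()
    tasks = set()  # keep references so frame loops are not garbage collected

    @pc.on("track")
    def on_track(track):
        if track.kind == "video":
            tasks.add(asyncio.ensure_future(pump_frames(track, on_frame)))

    @pc.on("datachannel")
    def on_datachannel(channel):
        channels.append(channel)

    await pc.setRemoteDescription(RTCSessionDescription(sdp=offer_sdp, type=offer_type))
    await pc.setLocalDescription(await pc.createAnswer())
    return pc  # pc.localDescription holds the answer to send back


def publish(channels, event):
    """Send one normalized app event as JSON to every open data channel."""
    payload = json.dumps(event)
    for channel in channels:
        if channel.readyState == "open":
            channel.send(payload)
```

Here `on_frame` is expected to be cheap, for example storing the frame for a separate vision worker, so that the receive loop never blocks on inference.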
Frame Policy
Object detection should optimize freshness, not throughput. A camera-glasses app that reacts to stale frames feels wrong even when inference is accurate.
Latest-Frame Buffer
Keep only the newest frame while inference runs. Works well when the model is slower than the camera stream.
Minimum Interval
Skip frames until `now - last_processed >= min_interval_s`. Simple and works well for image augmentation.
One In-Flight
If a frame is being processed, drop incoming frames instead of building a queue. Prevents memory and latency spikes.
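The three policies above can be combined in one small gate. The following is a minimal sketch, assuming a single inference worker; the class name `FramePolicy` and its methods are hypothetical, and the clock is injectable only so the behavior can be checked deterministically.

```python
import time


class FramePolicy:
    """Latest-frame buffer + minimum interval + one in-flight inference."""

    def __init__(self, min_interval_s=0.2, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._latest = None          # newest unprocessed frame, if any
        self._in_flight = False      # True while inference is running
        self._last_processed = None  # clock value when the last frame was taken

    def offer(self, frame):
        """Called for every incoming frame; overwrites any older unprocessed frame."""
        self._latest = frame

    def take(self):
        """Return the frame to process now, or None if the caller should wait."""
        if self._in_flight or self._latest is None:
            return None
        now = self._clock()
        if (self._last_processed is not None
                and now - self._last_processed < self.min_interval_s):
            return None
        frame, self._latest = self._latest, None
        self._in_flight = True
        self._last_processed = now
        return frame

    def done(self):
        """Called when inference finishes, successfully or not."""
        self._in_flight = False
```

A worker loop would call `take()`, run the detector if it gets a frame, and always call `done()` afterwards (for example in a `finally` block) so a failed inference does not stall the pipeline.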
Normalized Results
Normalize every detector into a small app-owned structure before any workflow or client code sees it:
- Map provider labels to domain labels on the backend.
- Include confidence and timestamp if downstream logic needs stability checks.
- Make the bounding-box convention explicit. `box_xyxy` means left, top, right, bottom in source-image pixels; use a `_norm` suffix or a `coordinate_space` field for normalized boxes.
- Prefer a list of detection objects over parallel `labels`, `boxes`, and `confidences` arrays.
- Keep raw predictions available only in logs or debug traces.
- Use stable event types and field names. Android should ignore unknown fields but should not need provider-specific parsing.
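One possible shape for such a structure is sketched below. Only `box_xyxy` comes from the text above; the other field names, the `LABEL_MAP` entries, and the raw prediction keys (`class_name`, `score`, `bbox`) are assumptions standing in for whatever a given detector returns.

```python
from dataclasses import dataclass, field

# Hypothetical provider-label -> domain-label mapping, owned by the backend.
LABEL_MAP = {"screwdriver_phillips": "screwdriver", "hex_key": "allen_key"}


@dataclass(frozen=True)
class Detection:
    label: str        # domain label, never the provider label
    confidence: float
    box_xyxy: tuple   # left, top, right, bottom in source-image pixels


@dataclass
class DetectionResult:
    timestamp_ms: int
    image_width: int
    image_height: int
    detections: list = field(default_factory=list)


def normalize(raw_predictions, timestamp_ms, image_width, image_height,
              min_confidence=0.5):
    """Convert provider output into app-owned detections, dropping unmapped labels."""
    detections = []
    for pred in raw_predictions:
        label = LABEL_MAP.get(pred["class_name"])
        if label is None or pred["score"] < min_confidence:
            continue
        box = tuple(float(v) for v in pred["bbox"])
        detections.append(Detection(label, float(pred["score"]), box))
    return DetectionResult(timestamp_ms, image_width, image_height, detections)
```

Keeping one `Detection` object per box avoids the index-alignment bugs that parallel arrays invite when a filter is applied to only one of them.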
Model Backend Choices
Fine-Tuned Detector
Best for a known set of physical objects, parts, states, or completion markers. RF-DETR is a good concrete example.
Open-Vocabulary Detector
Useful during prototyping. Stabilize labels before wiring completion rules to them.
Hosted Detector Service
Fastest to prototype. Normalize results and hide vendor auth from Android.
Decision Logic
Do not let a single detection immediately mutate important user-visible state unless the workflow truly tolerates false positives. Add a confirmation rule between normalized detections and app state.
Two-Hit Rule
A two-hit rule works well for simple glasses demos.
Other Useful Rules
- Presence over time — require a class to appear for N frames or M milliseconds.
- Rising edge count — count false-to-true transitions, useful for repeated actions.
- Best-confidence match — choose the highest-confidence object among allowed labels.
- Region rule — require the object box to be inside a known image region.
- Generation match — ignore detector results from an old task generation after the backend switches tasks.
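Several of these rules fit in a few lines each. The sketch below assumes that "two-hit" means the target label must appear in two consecutive processed frames before it counts as confirmed; that reading, and the names `HitConfirmer`, `box_inside_region`, and `is_current`, are illustrative assumptions rather than a fixed API.

```python
class HitConfirmer:
    """Confirm a label after N consecutive hits and count false-to-true transitions."""

    def __init__(self, required_hits=2):
        self.required_hits = required_hits
        self._streak = 0         # consecutive frames in which the label was present
        self._confirmed = False  # confirmed state after the previous frame
        self.count = 0           # rising-edge count

    def update(self, present):
        """Feed one frame's presence flag; return True only on a rising edge."""
        self._streak = self._streak + 1 if present else 0
        confirmed = self._streak >= self.required_hits
        rising = confirmed and not self._confirmed
        if rising:
            self.count += 1
        self._confirmed = confirmed
        return rising


def box_inside_region(box_xyxy, region_xyxy):
    """Region rule: True when the box lies fully inside the region (same pixel space)."""
    left, top, right, bottom = box_xyxy
    r_left, r_top, r_right, r_bottom = region_xyxy
    return left >= r_left and top >= r_top and right <= r_right and bottom <= r_bottom


def is_current(result_generation, active_generation):
    """Generation match: drop results produced for a task that is no longer active."""
    return result_generation == active_generation
```

With `required_hits=2`, a single spurious detection never triggers a transition, and a missed frame resets the streak, so repeated actions are counted once per confirmed appearance.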
Event Contracts
Backend → Android
| Event type | Purpose |
|---|---|
| `config` | Detector labels, workflow steps, or task metadata |
| `state` | Normalized status, active task, counters, latest detection summary |
| `detection` | Optional debug-only detection snapshot |
| Domain event (e.g. `task.completed`) | Workflow transitions |
state event:
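As an illustration only, a backend might assemble a `state` event as follows. Apart from `type`, every field name here (`status`, `active_task`, `counters`, `detected_classes`, `generation`) is an assumption chosen to mirror the purposes listed in the table, not a documented schema.

```python
import json


def build_state_event(status, active_task, counters, detected_labels, generation):
    """Build a normalized state event; no provider-specific fields are included."""
    return {
        "type": "state",
        "status": status,
        "active_task": active_task,
        "counters": dict(counters),
        "detected_classes": sorted(set(detected_labels)),
        "generation": generation,
    }


event = build_state_event(
    status="running",
    active_task="attach_panel",
    counters={"completed_steps": 2},
    detected_labels=["screwdriver", "panel", "screwdriver"],
    generation=4,
)
print(json.dumps(event, indent=2))
```

Because the client is expected to ignore unknown fields, new keys can be added to this dictionary later without breaking older Android builds.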
Android → Backend
| Event type | Purpose |
|---|---|
| `session.start` / `run.start` | Begin a detection session or run |
| `debug.step` | Manual step with a direction or target id |
| `workflow.confirm` | Explicit user confirmation |
| `session.stop` | App exit |
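On the backend, these client events can be handled with a small dispatcher that tolerates malformed or unknown input. This is a sketch under assumptions: the `Session` class, its attributes, and the `direction` payload field with values `next` and `previous` are hypothetical.

```python
import json

KNOWN_EVENTS = {"session.start", "run.start", "debug.step",
                "workflow.confirm", "session.stop"}


def parse_client_event(raw):
    """Return (event_type, event_dict), or None for malformed or unknown events."""
    try:
        event = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if not isinstance(event, dict):
        return None
    event_type = event.get("type")
    if not isinstance(event_type, str) or event_type not in KNOWN_EVENTS:
        return None
    return event_type, event


class Session:
    def __init__(self):
        self.running = False
        self.step_index = 0
        self.confirmed = False

    def handle(self, raw):
        """Apply one client event; return False if it was ignored."""
        parsed = parse_client_event(raw)
        if parsed is None:
            return False
        event_type, event = parsed
        if event_type in ("session.start", "run.start"):
            self.running = True
        elif event_type == "debug.step":
            # Hypothetical payload: {"direction": "next" | "previous"}
            delta = 1 if event.get("direction", "next") == "next" else -1
            self.step_index = max(0, self.step_index + delta)
        elif event_type == "workflow.confirm":
            self.confirmed = True
        elif event_type == "session.stop":
            self.running = False
        return True
```

Ignoring unknown event types instead of raising keeps the backend compatible with newer clients that send events it does not yet understand.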
Annotated Frames
Annotated frames are useful for debugging, model tuning, and realtime model augmentation. Use `supervision`, OpenCV, PIL, or the detector library's own helpers to draw boxes and labels.
Save:
- `latest.jpg`, for quick inspection and downstream image augmentation.
- A bounded timestamped history, when debugging regressions.
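A minimal sketch of both outputs, using OpenCV for drawing and the standard library for file handling, could look like this. The `history/` subdirectory, the zero-padded millisecond filenames, and the detection dictionary keys are assumptions; `cv2` is imported lazily so the file helper can be used on its own.

```python
import os
from pathlib import Path


def draw_boxes(image_bgr, detections):
    """Draw boxes and captions in place, then return the frame as JPEG bytes."""
    import cv2  # lazy import: only needed when drawing

    for det in detections:
        left, top, right, bottom = (int(v) for v in det["box_xyxy"])
        cv2.rectangle(image_bgr, (left, top), (right, bottom), (0, 255, 0), 2)
        caption = f'{det["label"]} {det["confidence"]:.2f}'
        cv2.putText(image_bgr, caption, (left, max(top - 6, 12)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    ok, encoded = cv2.imencode(".jpg", image_bgr)
    if not ok:
        raise RuntimeError("JPEG encoding failed")
    return encoded.tobytes()


def save_annotated(jpeg_bytes, out_dir, timestamp_ms, max_history=50):
    """Write latest.jpg atomically and keep at most max_history timestamped copies."""
    out = Path(out_dir)
    history = out / "history"
    history.mkdir(parents=True, exist_ok=True)

    # Zero-padded names sort chronologically as plain strings.
    (history / f"{timestamp_ms:013d}.jpg").write_bytes(jpeg_bytes)

    # Write to a temp file first so readers never see a half-written latest.jpg.
    tmp = out / "latest.jpg.tmp"
    tmp.write_bytes(jpeg_bytes)
    os.replace(tmp, out / "latest.jpg")

    snapshots = sorted(history.glob("*.jpg"))
    for old in snapshots[: max(0, len(snapshots) - max_history)]:
        old.unlink()
    return out / "latest.jpg"
```

The atomic replace matters when another process, such as an image-augmentation step, reads `latest.jpg` while the detector is still writing.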
Realtime Model Augmentation
Object detection can provide either structured context or the latest annotated image to a realtime model:
- Use structured text when labels, counts, or state are enough.
- Use an annotated `input_image` when spatial layout, part appearance, or visual ambiguity matters.
pending_turns and sent_images gates.
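For the structured-text case, the context can be as simple as a one-line summary of labels and counts. The helper below is a hypothetical sketch; the wording of the summary and the `active_task` parameter are assumptions, and it does not implement the gating mentioned above.

```python
from collections import Counter


def detections_to_context(labels, active_task=None):
    """Summarize detected domain labels as short text for a realtime model."""
    counts = Counter(labels)
    if not counts:
        summary = "No tracked objects are visible."
    else:
        parts = [f"{label} x{n}" for label, n in sorted(counts.items())]
        summary = "Visible objects: " + ", ".join(parts) + "."
    if active_task:
        summary += f" Active task: {active_task}."
    return summary
```

Sorting the labels keeps the text stable between frames, which makes it easier to skip sending an update when nothing has changed.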
Training and Tuning Loop
Train and evaluate from the glasses point of view. Record representative Rokid camera footage including bad lighting, hand occlusion, motion blur, partial objects, and the distances users actually work at.
Adjust labels and rules
Adjust labels, thresholds, and confirmation rules before changing client or workflow logic.