IKEA Box Assembly Assistant with OpenAI Realtime API

The rokid-openai-realtime example turns Rokid Glasses into a voice-first assembly assistant. The glasses stream microphone audio and camera video over WebRTC directly to the OpenAI Realtime API, while a lightweight Node.js backend brokers the WebRTC session and handles sideband tool calls. In the default setup, the assistant walks the wearer through assembling an IKEA wooden box, loading step-by-step instructions on demand through two tool functions (list_items and load_item_instructions). Swap the instruction files or rewrite the system prompt to get a domain-specific, hands-free assistant for another physical task.

Features

End-to-end WebRTC — mic and camera stream from the glasses to OpenAI Realtime with no intermediate media relay.
Real-time audio and vision — the assistant observes live video frames and hears the wearer simultaneously.
Assistant speech playback — the Realtime API streams audio back and the glasses play it through their speaker.
Sideband tool calls — a backend WebSocket listens for response.done events and handles list_items / load_item_instructions function calls so the model can dynamically load assembly instructions.

Architecture

Rokid Glasses (Android)
  ├── Mic + Camera → WebRTC → OpenAI Realtime API
  └── POST /session (SDP offer) → Backend (Node.js)
                                      ├── Forwards SDP to OpenAI /v1/realtime/calls
                                      ├── Returns SDP answer to glasses
                                      └── Opens sideband WebSocket → OpenAI Realtime API
                                              (handles list_items / load_item_instructions)

Component	Location	Language
Glasses app	`rokid/`	Kotlin
Backend session broker + tool handler	`backend/`	TypeScript (Node.js 24, ESM)

The Android entry point is MainActivity. It auto-starts streaming after camera and mic permissions are granted, and a temple tap (KEYCODE_DPAD_CENTER / ENTER) toggles start and stop. Media is managed by OpenAIRealtimeClient using the Stream WebRTC library.

The backend entry point is backend/server.ts. Its POST /session endpoint accepts a raw SDP offer body, forwards it to OpenAI together with the session configuration, and returns the SDP answer. It then opens a sideband WebSocket for the same call ID and handles tool calls over that connection.
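The brokering step described above can be sketched as two small helpers. This is a minimal sketch, not the code in backend/server.ts: the names buildCallRequest and sidebandUrl are hypothetical, and it assumes that OpenAI accepts the offer as multipart form fields named sdp and session, returns the call ID as the last path segment of the Location response header, and accepts sideband connections at wss://api.openai.com/v1/realtime?call_id=<id>.

```typescript
// Sketch only: helper names are hypothetical and the wire details are assumptions.
const OPENAI_CALLS_URL = "https://api.openai.com/v1/realtime/calls";

/** Build the multipart request that forwards the glasses' SDP offer to OpenAI. */
function buildCallRequest(offerSdp: string, sessionConfig: object, apiKey: string) {
  const form = new FormData();
  form.set("sdp", offerSdp); // raw SDP offer body received on POST /session
  form.set("session", JSON.stringify(sessionConfig)); // audio, prompt, tools
  return {
    url: OPENAI_CALLS_URL,
    init: {
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}` },
      body: form,
    },
  };
}

/** Derive the sideband WebSocket URL from the Location header of the answer. */
function sidebandUrl(location: string): string {
  // Assumed shape: "/v1/realtime/calls/<call_id>"; take the last path segment.
  const callId = location.split("/").filter(Boolean).pop();
  if (!callId) throw new Error("No call ID found in Location header");
  return `wss://api.openai.com/v1/realtime?call_id=${encodeURIComponent(callId)}`;
}
```

Under these assumptions, the endpoint would pass url and init to fetch, return the response text to the glasses as the SDP answer, and open the sideband WebSocket at sidebandUrl(response.headers.get("Location")).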

Requirements

Rokid Glasses + dev cable
Android Studio with adb
Node.js 24
OpenAI API key (OPENAI_API_KEY)

Configuration

Configure the glasses app

Set SESSION_URL in rokid/local.properties to the URL of your running backend:

SESSION_URL=http://<YOUR_BACKEND>/session

Configure the backend

Copy the example env file and set your API key:

cd backend
cp .env.example .env
# Set OPENAI_API_KEY in .env

Run the Backend

cd backend
npm install
npm run start

The server listens on port 3000 by default. Override with the PORT environment variable.
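One way to read these two settings at startup is sketched below. The loadConfig helper and the BackendConfig type are hypothetical names, not taken from backend/server.ts; the sketch only assumes what is stated above, namely that OPENAI_API_KEY is required and that PORT falls back to 3000.

```typescript
// Sketch only: loadConfig and BackendConfig are hypothetical names.
interface BackendConfig {
  port: number;
  apiKey: string;
}

/** Validate the environment once so misconfiguration fails at startup. */
function loadConfig(env: Record<string, string | undefined>): BackendConfig {
  const apiKey = env.OPENAI_API_KEY?.trim();
  if (!apiKey) {
    throw new Error("OPENAI_API_KEY is not set; add it to backend/.env");
  }
  // An unset or empty PORT falls back to the documented default of 3000.
  const port = env.PORT ? Number(env.PORT) : 3000;
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env.PORT}`);
  }
  return { port, apiKey };
}
```

Calling loadConfig(process.env) before the server starts listening surfaces a missing key immediately instead of on the first /session request.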

Run the Glasses App

Connect Rokid Glasses and enable Wi-Fi

Connect the glasses to your computer using the dev cable, then run:

adb devices                          # confirm device is visible
adb shell cmd wifi status            # check if already connected
adb shell cmd wifi set-wifi-enabled enabled
adb shell 'cmd wifi connect-network "NAME" wpa2 "PASSWORD"'
adb shell cmd wifi status            # confirm connection

Optional: set up wireless ADB

adb shell ip -f inet addr show wlan0  # check glasses IP
ping -c 5 -W 3 <IP>                   # first ping may time out
adb tcpip 5555
adb connect <IP>
adb devices                           # verify remote connection

After wireless ADB is connected you can unplug the dev cable.

Build and run from Android Studio

Open the rokid/ directory in Android Studio, select Rokid Glasses as the target device, and run the app. The app auto-starts streaming once permissions are granted.

To rebuild manually after code changes:

cd rokid && ./gradlew :app:assembleDebug

Session Configuration

The backend sends a sessionConfig object to OpenAI with audio settings, the system prompt, and tool definitions. Key fields from backend/server.ts:

const sessionConfig = {
  type: "realtime",
  model: "gpt-realtime",
  audio: {
    input: {
      noise_reduction: { type: "near_field" },
      transcription: { language: "en", model: "whisper-1" },
      turn_detection: { type: "semantic_vad" },
    },
    output: { voice: "marin" },
  },
  instructions: SESSION_INSTRUCTIONS,
  tools: [
    {
      type: "function",
      name: "list_items",
      description:
        "List all available item names for which assembly instructions exist. Returns an array of strings; each string is a valid `item_name` that can be passed to `load_item_instructions`.",
    },
    {
      type: "function",
      name: "load_item_instructions",
      description:
        "Load the assembly instructions for the given item name. Returns the full text content for that item.",
      parameters: {
        type: "object",
        properties: {
          item_name: {
            type: "string",
            description:
              "An item name chosen from the array returned by `list_items`; must match one of those strings.",
          },
        },
        required: ["item_name"],
      },
    },
  ],
} as const;
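Because the configuration is declared with as const, the tool names are available to TypeScript as string literals. The sketch below shows one possible use of that: deriving a union of valid tool names so the tool handler cannot drift from the configuration. The trimmed tools array, the ToolName type, and the isToolName guard are illustrative assumptions, not code from backend/server.ts.

```typescript
// Trimmed, hypothetical copy of the tools array; only the names matter here.
const tools = [
  { type: "function", name: "list_items" },
  { type: "function", name: "load_item_instructions" },
] as const;

// Union of the literal names: "list_items" | "load_item_instructions".
type ToolName = (typeof tools)[number]["name"];

/** Narrow an arbitrary string received from the model to a known tool name. */
function isToolName(name: string): name is ToolName {
  return tools.some((t) => t.name === name);
}
```

Without as const, name would widen to string and the derived union would accept any value.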

Customize Instructions

To add or replace assembly instructions:

Add an instruction file — place a .txt file in backend/items/. The filename (without the .txt extension) becomes the item name returned by list_items. The current example is backend/items/ikea-wooden-box.txt.
Edit the system prompt — change SESSION_INSTRUCTIONS in backend/server.ts to adjust the assistant’s role, personality, and conversation rules.

The SESSION_INSTRUCTIONS constant in server.ts includes rules for conversation flow (identify → load → guide → end), audio handling, and tool usage ordering. Read it carefully before customizing — the tool call sequence matters for the assistant to pick the right item instructions reliably.
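The mapping from files in backend/items/ to tool results can be pictured with the sketch below. It is an illustration under stated assumptions rather than the project's runTool(): the names itemNamesFromFiles and runToolSketch, the itemsDir parameter, the JSON-encoded array returned for list_items, and the error payloads are all hypothetical.

```typescript
import { readdir, readFile } from "node:fs/promises";
import { join } from "node:path";

/** Map directory entries to item names: "ikea-wooden-box.txt" -> "ikea-wooden-box". */
function itemNamesFromFiles(files: string[]): string[] {
  return files
    .filter((f) => f.endsWith(".txt"))
    .map((f) => f.slice(0, -".txt".length))
    .sort();
}

/** Hypothetical dispatcher; always resolves to a string usable as tool output. */
async function runToolSketch(
  itemsDir: string,
  name: string,
  args: { item_name?: string },
): Promise<string> {
  // Re-read the directory on each call so newly added files appear without a restart.
  const items = itemNamesFromFiles(await readdir(itemsDir));
  switch (name) {
    case "list_items":
      return JSON.stringify(items);
    case "load_item_instructions": {
      const item = args.item_name ?? "";
      // Accept only names list_items would return; this also blocks path traversal.
      if (!items.includes(item)) {
        return JSON.stringify({ error: `Unknown item: ${item}` });
      }
      return readFile(join(itemsDir, `${item}.txt`), "utf8");
    }
    default:
      return JSON.stringify({ error: `Unknown tool: ${name}` });
  }
}
```

Checking item_name against the listed names mirrors the tool description, which requires the model to pass a string returned by list_items.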

How Sideband Tool Calls Work

When the OpenAI Realtime API completes a function call, it emits a response.done event over the sideband WebSocket. The backend parses it, calls runTool() with the function name and arguments, and sends back a conversation.item.create message with the tool output followed by response.create to resume generation.

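ws.on("message", async (raw) => {
  const msg = JSON.parse(raw.toString());

  // Look for the first completed function call in the response output, if any.
  const fnCall = msg.response?.output?.find(
    (o: any) => o.type === "function_call" && o.status === "completed",
  );

  // Act only on finished responses that actually contain a function call.
  if (msg.type === "response.done" && msg.response?.status === "completed" && fnCall) {
    // Arguments arrive as a JSON-encoded string.
    const args = JSON.parse(fnCall.arguments);
    const output = await runTool(fnCall.name, args);
    // Attach the tool output to the conversation under the matching call_id.
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: fnCall.call_id,
        output: output,
      },
    }));
    // Ask the model to resume generation with the tool output in context.
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});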
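The handler above acts on the first completed function call it finds and parses the arguments without a guard. If you want to test the message handling in isolation, or handle a response that carries more than one function call, the logic can be split into pure helpers along the following lines. This is a sketch under assumptions: the helper names and the FunctionCall and ClientEvent types are hypothetical, and sending a single response.create after all outputs is a design choice of the sketch, not something the source specifies.

```typescript
// Sketch only: these helper names and types are hypothetical.
interface FunctionCall {
  type: "function_call";
  status: string;
  name: string;
  call_id: string;
  arguments: string; // JSON-encoded argument object
}

interface ClientEvent {
  type: string;
  item?: { type: "function_call_output"; call_id: string; output: string };
}

/** Every completed function call in a completed response.done event, else []. */
function completedFunctionCalls(msg: any): FunctionCall[] {
  if (msg?.type !== "response.done" || msg.response?.status !== "completed") return [];
  const output: any[] = Array.isArray(msg.response.output) ? msg.response.output : [];
  return output.filter((o) => o?.type === "function_call" && o?.status === "completed");
}

/** Parse tool arguments without throwing; malformed JSON yields an empty object. */
function parseArgs(raw: string): Record<string, unknown> {
  try {
    const value = JSON.parse(raw || "{}");
    return typeof value === "object" && value !== null && !Array.isArray(value) ? value : {};
  } catch {
    return {};
  }
}

/** One function_call_output per result, followed by a single response.create. */
function buildReplies(results: { callId: string; output: string }[]): ClientEvent[] {
  if (results.length === 0) return [];
  const items = results.map((r): ClientEvent => ({
    type: "conversation.item.create",
    item: { type: "function_call_output", call_id: r.callId, output: r.output },
  }));
  return [...items, { type: "response.create" }];
}
```

In the WebSocket handler, each call returned by completedFunctionCalls would be passed to runTool with parseArgs(call.arguments), and every event from buildReplies would be serialized with JSON.stringify and sent over the socket.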

Related Examples

rokid-openai-realtime-rfdetr: an updated version of this project that adds RF-DETR object detection on the backend and injects annotated frames into the Realtime conversation for more accurate spatial understanding. See its README.md in the repository for setup steps.
Speedrun Timer (RF-DETR): a vision-only speedrun HUD with RF-DETR detection and split timing.
Proactive Drink-making Coach: a full-stack example combining Overshoot vision inference, OpenAI Realtime speech, and a server-authoritative workflow.
