OpenClicky ships a local-only HTTP and SSE bridge that lets agents, scripts, and trusted local apps send visual guidance commands directly to the overlay layer — without entering the normal voice, conversation, or agent state machine. The bridge is the foundation for every programmatic interaction with OpenClicky’s on-screen affordances: pointing, captioning, speech, and screenshots all route through it.
What the bridge is and why it exists
When an agent or external tool needs to show the user something on screen (“click this button”, “look at this panel”, “here are three areas to compare”), it should not have to submit a conversation prompt or start a new Clicky session to do so. The bridge provides a direct, lightweight channel for exactly those actions.
The bridge is intentionally non-invasive. It operates as a side channel into OpenClicky’s overlay and TTS subsystems. Nothing sent to it starts a new dictation session, modifies the conversation state, or creates or destroys agent sessions. The main Clicky session is completely unaffected.
Address
The bridge always listens on the loopback interface, 127.0.0.1, so it is not reachable from other machines on the network.
Most bridge endpoints require a bridge token. Configure OPENCLICKY_BRIDGE_TOKEN in your secrets file or in OpenClicky Settings. Pass it as the x-openclicky-token request header or as a standard Authorization: Bearer <token> header. The /health endpoint does not require a token.
What the bridge can do
- Drive the overlay — point the primary OpenClicky cursor at any macOS screen coordinate, place simultaneous secondary markers, or show floating captions.
- Capture screenshots — request JPEG screenshots of all displays or just the focused window, receiving local file paths and AppKit-coordinate display frames back.
- Speak through TTS — send a short spoken instruction through OpenClicky’s TTS without touching push-to-talk or voice-response mode.
- Clear overlay elements — remove any bridge-created cursors and captions in one call.
- Stream bridge events — subscribe to a server-sent event stream to receive ready and command acknowledgements.
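
Most of the capabilities listed above are token-authenticated HTTP calls to the loopback address. The following is a minimal sketch of a request builder, not the bridge's official client. The port number (8765), the use of POST with a JSON body, and the helper names `build_request` and `send` are assumptions made for illustration; only the two header forms and the /clear and /cursor paths come from this page.

```python
import json
import urllib.request

# Assumption: 8765 is a placeholder; use the port your bridge actually reports.
BRIDGE_PORT = 8765
BASE_URL = f"http://127.0.0.1:{BRIDGE_PORT}"


def build_request(path, token, payload=None, use_bearer=False):
    """Build (but do not send) a token-authenticated bridge request."""
    headers = {"Content-Type": "application/json"}
    if use_bearer:
        # Standard Authorization header form.
        headers["Authorization"] = f"Bearer {token}"
    else:
        # Bridge-specific header form.
        headers["x-openclicky-token"] = token
    # Assumption: commands are sent as POST with a JSON object body.
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path, data=body, headers=headers, method="POST"
    )


def send(request, timeout=5.0):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Under these assumptions, `send(build_request("/clear", token))` would remove bridge-created cursors and captions. Keeping construction separate from sending makes the headers easy to inspect without a running bridge.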
What the bridge cannot do
The bridge deliberately has no access to the following:
- Starting or stopping push-to-talk dictation
- Submitting prompts into the active Clicky conversation
- Creating or spawning new agent or worker sessions
- Reading or mutating any part of the conversation state, memory, or context
Primary vs secondary cursor model
Understanding how OpenClicky’s two cursor modes differ is important for building correct interactions. Primary cursor (mode: "primary", the default) uses OpenClicky’s native smooth pointing choreography — the same animation triggered when a user says “show me the Apple menu” in voice mode. The triangular OpenClicky cursor zips from its current resting position to the target coordinate, displays the caption, and then flies back. Critically, it does not warp the real macOS system pointer and does not draw a duplicate cursor icon at the target. Use the primary cursor when you want the same experience that Clicky’s built-in guidance produces.
Secondary cursors (mode: "secondary" on /cursor, or all cursors created by /cursors) are explicit temporary colored markers. They appear at the given coordinates, show their captions, and disappear automatically after durationMs milliseconds or when /clear is called. Use secondary cursors for multi-point explanations, side-by-side comparisons, or screen tour overlays where you want several markers visible simultaneously.
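
To make the two modes concrete, here is a hedged sketch of request bodies for /cursor and /cursors. The `mode` values and the `durationMs` field come from the description above. The `x`, `y`, and `caption` field names, the top-level `cursors` list, and the 4000 ms default are assumptions for illustration and may not match the real schema.

```python
def cursor_payload(x, y, caption, mode="primary", duration_ms=None):
    """Body for /cursor. `x`, `y`, and `caption` are assumed field names."""
    if mode not in ("primary", "secondary"):
        raise ValueError(f"unknown cursor mode: {mode!r}")
    payload = {"x": x, "y": y, "caption": caption, "mode": mode}
    # durationMs is described only for secondary markers, so it is
    # attached only in that mode.
    if mode == "secondary" and duration_ms is not None:
        payload["durationMs"] = int(duration_ms)
    return payload


def cursors_payload(points, duration_ms=4000):
    """Body for /cursors, where every entry becomes a secondary marker.

    Assumption: the batch is wrapped in a top-level "cursors" list.
    `points` is an iterable of (x, y, caption) tuples.
    """
    return {
        "cursors": [
            {"x": x, "y": y, "caption": caption, "durationMs": int(duration_ms)}
            for (x, y, caption) in points
        ]
    }
```

A primary call would use `cursor_payload(40, 12, "Apple menu")`, while a side-by-side comparison would pass several tuples to `cursors_payload` so that all markers appear at once.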
Checking bridge status
Send a GET /health request to confirm the bridge is running. No token is required. The response includes the port number, transport type, and the list of available tool names.
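
The health check can be sketched as follows. This is an illustrative probe, not the project's own tooling: the response is assumed to be JSON, and the key names `port`, `transport`, and `tools` are guesses at how the documented fields are spelled. The helper names are hypothetical.

```python
import json
import urllib.request


def health_request(port):
    """Build the unauthenticated GET /health request for a given port."""
    return urllib.request.Request(f"http://127.0.0.1:{port}/health", method="GET")


def summarize_health(raw_body):
    """Pull the documented facts out of a /health response body.

    Assumption: the key names "port", "transport", and "tools" are guesses.
    """
    data = json.loads(raw_body)
    return {
        "port": data.get("port"),
        "transport": data.get("transport"),
        "tools": list(data.get("tools", [])),
    }


def check_health(port, timeout=2.0):
    """Return a summary if the bridge answers, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(health_request(port), timeout=timeout) as resp:
            return summarize_health(resp.read())
    except OSError:
        # urllib's URLError and HTTPError are both OSError subclasses.
        return None
```

Calling `check_health(port)` before issuing commands lets a script fail fast when OpenClicky is not running.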
scripts/test-external-control-bridge.sh exercises the bridge end-to-end: Swift parse/typecheck checks, health, MCP descriptors, screenshot capture, captions, secondary cursors, SSE events, and primary cursor choreography verification.
SSE event stream
Subscribe to GET /events to receive server-sent events. The connection stays open and delivers:
- An event: ready event immediately on connect, confirming the bridge is live.
- An event: command event after every successful command processed by the bridge, carrying ok, path, and (for batch calls) count.
SSE connections require a valid bridge token like all other authenticated endpoints.
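
A client consuming this stream needs to split it into events. The sketch below is a simplified parser for the standard text/event-stream framing, operating on already-decoded lines. It assumes every event, including ready, carries at least one data line, and that command data is a JSON object holding ok, path, and count; neither detail is confirmed by this page. The function names are hypothetical.

```python
import json


def parse_sse(lines):
    """Yield (event_name, data) pairs from an iterable of decoded SSE lines."""
    event, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates one event; events without data are dropped.
            if data_lines:
                yield event, "\n".join(data_lines)
            event, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            # A single leading space after the colon is not part of the value.
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event = value
            elif field == "data":
                data_lines.append(value)


def command_acks(lines):
    """Yield the decoded payload of every `command` event."""
    for event, data in parse_sse(lines):
        if event == "command":
            # Assumption: the data field is a JSON object.
            yield json.loads(data)
```

In practice the lines would come from an open, token-authenticated GET /events response, decoded from bytes to text as they arrive.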
Bundled skills
Three bundled agent skills in AppResources/OpenClicky/OpenClickyBundledSkills/ use the bridge directly:
| Skill | Purpose |
|---|---|
| openclicky-screen-control | Quick point, caption, screenshot, speak, and clear commands for immediate visual guidance when a user asks “show me where you mean” or “point to it”. |
| openclicky-screen-tour | Recordable visual tours with multiple simultaneous markers, area-focused overlays, primary cursor choreography, screenshots, captions, and TTS narration. |
| google-workspace-gogcli | Local Google Workspace access through gogcli. Not a visual skill, but it uses the bridge for cursor and caption feedback during Workspace interactions. |
Use cases
Visual tutorials — When a user asks how to do something in macOS or an app, an agent can capture a screenshot, identify the relevant control, and call /cursor with a short caption. No new agent session required; the answer is immediate and on-screen.
Multi-point screen tours — An agent or skill can call /cursors with a list of UI points and captions, placing several markers simultaneously. This is ideal for orientation overviews (“here is the menu bar, the editor, the sidebar, and the logs panel”) before zooming into each item with the primary cursor.
External agent coordination — A Cursor extension, Claude Desktop integration, or any local MCP-aware agent runtime can call /mcp/call or /mcp/calls to drive OpenClicky’s overlay as part of a larger workflow, using OpenClicky purely as a display layer while the agent does its own reasoning elsewhere.
Recordable demos — The screen-tour skill uses a compact coordinate region derived from the current display’s visibleFrame, keeps markers small, and uses short captions — all optimized for screen recordings where the overlay should not occlude important content.