Voice Interaction in OpenClicky

OpenClicky is designed from the ground up as a voice-first assistant. Every part of the voice pipeline — activation shortcut, wake-word listening, speech-to-text transcription, AI reasoning, and spoken response — is pluggable, local-first, and built to stay out of your way until you need it. You hold a key, speak naturally, release, and Clicky answers — pointing at your screen when that helps.

How Activation Works

OpenClicky supports three distinct voice activation modes, selectable in Settings → Voice:

Push to Talk

Hold the activation shortcut while speaking. Release to submit. The default and most reliable mode.

Toggle + Wake Word

Press the shortcut once to arm wake-word detection, then say Hey Clicky to start recording.

Always Wake Word

Keeps the wake-word listener armed at all times. Say Hey Clicky from anywhere to start a voice turn.

Push-to-Talk Shortcut

The default shortcut is Control + Option (hold both, speak, release). OpenClicky also supports Shift + Fn, Shift + Control, Control + Option + Space, and Shift + Control + Space, all configurable from Settings.

Activation is handled by GlobalPushToTalkShortcutMonitor, which installs a listen-only CGEvent tap on the session event stream. Because the tap is listen-only, it never suppresses keystrokes, so the shortcut works in any app without stealing input. The tap monitors .flagsChanged, .keyDown, and .keyUp events, re-enables itself automatically if macOS disables it because of a timeout or user input, and publishes transitions to the dictation manager via a PassthroughSubject.

A double-tap Shift gesture is also detected: two rapid standalone Shift presses (within 420 ms) open the OpenClicky notch panel at the current mouse location without entering voice mode.
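The double-tap Shift window can be illustrated with a small timing check. This is a minimal sketch, not the monitor's actual code: the type name, method name, and the choice to reset after a successful double tap are assumptions; only the 420 ms window comes from the description above.

```swift
import Foundation

// Hypothetical helper: decides whether two standalone Shift presses
// form a double tap. Timestamps are in seconds.
struct DoubleTapShiftDetector {
    let maximumInterval: TimeInterval = 0.420
    private var lastStandaloneShiftPress: TimeInterval?

    // Returns true when this press completes a double tap.
    mutating func registerStandaloneShiftPress(at timestamp: TimeInterval) -> Bool {
        if let previous = lastStandaloneShiftPress,
           timestamp - previous <= maximumInterval {
            // Assumption: a completed double tap clears the state, so a
            // third quick press starts a new sequence.
            lastStandaloneShiftPress = nil
            return true
        }
        lastStandaloneShiftPress = timestamp
        return false
    }

    // Call when any other key interrupts the sequence.
    mutating func reset() {
        lastStandaloneShiftPress = nil
    }
}
```

Passing timestamps in explicitly keeps the check independent of the event tap, which makes it easy to exercise without generating real key events.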

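// Simplified from BuddyPushToTalkShortcut: the active shortcut and the
// signature of the transition function (body omitted here).
static let currentShortcutOption: ShortcutOption = .controlOption

// Maps one keyboard event, plus the previous pressed state, to a transition.
static func shortcutTransition(
    for eventType: CGEventType,
    keyCode: UInt16,
    modifierFlagsRawValue: UInt64,
    wasShortcutPreviouslyPressed: Bool
) -> ShortcutTransition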
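As an illustration of what a function with this shape might do for the default Control + Option shortcut, here is a minimal sketch. It is not the project's implementation: the enum cases, the function name, and the exact-match rule (no extra Shift or Command held) are assumptions. The masks are the standard CGEventFlags raw values written as literals so the snippet has no framework dependency.

```swift
// Hypothetical cases; the real ShortcutTransition enum may differ.
enum SketchShortcutTransition {
    case unchanged, pressed, released
}

// Raw values of CGEventFlags.maskShift, .maskControl, .maskAlternate, .maskCommand.
let shiftMask: UInt64 = 0x0002_0000
let controlMask: UInt64 = 0x0004_0000
let optionMask: UInt64 = 0x0008_0000
let commandMask: UInt64 = 0x0010_0000

func controlOptionTransition(
    modifierFlagsRawValue: UInt64,
    wasShortcutPreviouslyPressed: Bool
) -> SketchShortcutTransition {
    // Ignore device-dependent low bits; compare only the four main modifiers.
    let relevant = shiftMask | controlMask | optionMask | commandMask
    let required = controlMask | optionMask
    let isPressedNow = (modifierFlagsRawValue & relevant) == required

    switch (wasShortcutPreviouslyPressed, isPressedNow) {
    case (false, true): return .pressed
    case (true, false): return .released
    default: return .unchanged
    }
}
```

Returning a transition rather than a plain Boolean lets the caller react only to edges (start recording on press, submit on release) and ignore repeated events while the keys are held.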

Wake-Word Detection

When a wake-word mode is active, OpenClickyWakeWordManager runs Apple’s on-device SFSpeechRecognizer in a continuous low-power loop. It listens only locally — no audio is sent anywhere until the wake phrase triggers full dictation. The wake phrase is “Hey Clicky” (case-insensitive, diacritic-insensitive). The detector also recognises common mishearings:

Accepted phrase	Notes
`hey clicky`	Primary phrase
`hay clicky`	Common mishearing
`hey cliquey`	Common mishearing
`hay cliquey`	Common mishearing
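The matching rule for these phrases could look roughly like the sketch below. The function names and the substring-based match are assumptions; the phrase list and the case- and diacritic-insensitive comparison come from the description above.

```swift
import Foundation

// Accepted phrases from the table above, stored in folded (lowercase) form.
let acceptedWakePhrases = ["hey clicky", "hay clicky", "hey cliquey", "hay cliquey"]

// Folds case and diacritics so "Héy Clicky" and "hey clicky" compare equal.
func normalizedForWakeMatching(_ text: String) -> String {
    text.folding(options: [.caseInsensitive, .diacriticInsensitive], locale: nil)
}

// Assumption: the wake phrase may appear anywhere in the partial transcript.
func containsWakePhrase(_ transcript: String) -> Bool {
    let normalized = normalizedForWakeMatching(transcript)
    return acceptedWakePhrases.contains { normalized.contains($0) }
}
```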

Wake-word listening requires SFSpeechRecognizer.supportsOnDeviceRecognition to be true on your Mac. OpenClicky will not fall back to a remote speech gate for always-listening mode, because sending ambient audio to a cloud service would be a privacy problem. If on-device recognition is unavailable, use push-to-talk instead.

Pluggable Transcription Providers

BuddyDictationManager captures microphone audio with AVAudioEngine and routes it through the active provider. Providers are swapped at runtime without restarting the app — changing the provider in Settings takes effect on the next voice press.

Apple Speech

Local, no API key required. Uses SFSpeechRecognizer for streaming on-device recognition. Free and private, but accuracy varies by accent and ambient noise. Requires both Microphone and Speech Recognition permissions.

OpenAI Whisper

Cloud-based streaming. Routes audio to OpenAI’s transcription API. High accuracy and strong support for technical vocabulary. Requires OPENAI_API_KEY.

AssemblyAI

Streaming cloud transcription. Connects over a WebSocket for low-latency partial results. Requires ASSEMBLYAI_API_KEY in Settings or secrets.env.

Deepgram

Streaming cloud transcription. Also WebSocket-based; strong on technical and developer vocabulary. Requires DEEPGRAM_API_KEY.

The Automatic setting (default) picks the best configured provider: Deepgram or AssemblyAI if an API key is present, otherwise Apple Speech. You can always lock a specific provider in Settings → Voice → Transcription.
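The Automatic rule can be sketched as a simple priority check. This is an illustration under stated assumptions: the enum, the function names, and the choice to prefer Deepgram over AssemblyAI when both keys are present are hypothetical; the document only says that either cloud provider is preferred over Apple Speech when its key exists.

```swift
import Foundation

// Hypothetical provider list for the Automatic setting.
enum AutomaticTranscriptionProvider {
    case appleSpeech, assemblyAI, deepgram
}

// A key counts as present only if it has a non-blank value.
func hasKey(_ name: String, in environment: [String: String]) -> Bool {
    guard let value = environment[name] else { return false }
    return !value.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty
}

func resolveAutomaticProvider(environment: [String: String]) -> AutomaticTranscriptionProvider {
    // Assumption: Deepgram wins when both cloud keys are configured.
    if hasKey("DEEPGRAM_API_KEY", in: environment) { return .deepgram }
    if hasKey("ASSEMBLYAI_API_KEY", in: environment) { return .assemblyAI }
    // No cloud key: fall back to local Apple Speech.
    return .appleSpeech
}
```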

How the audio pipeline works

AVAudioEngine taps the microphone input node with a buffer size of 256 frames — a small buffer deliberately chosen to minimise capture-to-provider handoff latency.
Each AVAudioPCMBuffer is forwarded to the active BuddyStreamingTranscriptionSession via appendAudioBuffer(_:).
The session calls back onTranscriptUpdate with partial results, which are rendered live in the input bar as you speak.
When you release the shortcut, requestFinalTranscript() is called. A fallback timer (default 2.4 seconds) submits the best available partial if the final transcript callback hasn’t fired yet.
If the session reports a “no speech detected” error and the transcript buffer is empty, the interaction is quietly discarded rather than submitted as an empty message.

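// Tap installed by BuddyDictationManager
// [weak self] avoids a retain cycle and makes the optional chaining below valid.
inputNode.installTap(onBus: 0, bufferSize: 256, format: inputFormat) { [weak self] buffer, _ in
    // Forward each captured buffer to the active transcription session.
    self?.activeTranscriptionSession?.appendAudioBuffer(buffer)
    // Update the level used by the waveform display.
    self?.updateAudioPowerLevel(from: buffer)
}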
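The release-time decision in the last two pipeline steps (final transcript, fallback to the best partial, or quiet discard) can be sketched as a small state holder. The type and method names are hypothetical, and the sketch simplifies the discard rule to "nothing usable was transcribed" rather than inspecting a provider error; only the 2.4-second default comes from the description above.

```swift
import Foundation

enum FinalizationOutcome: Equatable {
    case keepWaiting        // final not in yet, fallback timer still running
    case submit(String)     // send this text to the assistant
    case discard            // nothing usable was heard
}

struct TranscriptFinalizer {
    let fallbackDelay: TimeInterval = 2.4
    private(set) var latestPartial = ""
    private(set) var finalTranscript: String?

    mutating func receivePartial(_ text: String) { latestPartial = text }
    mutating func receiveFinal(_ text: String) { finalTranscript = text }

    // Decide what to do a given number of seconds after the shortcut was released.
    func outcome(secondsSinceRelease: TimeInterval) -> FinalizationOutcome {
        let candidate: String
        if let final = finalTranscript {
            candidate = final
        } else if secondsSinceRelease >= fallbackDelay {
            candidate = latestPartial   // fallback: best available partial
        } else {
            return .keepWaiting
        }
        let trimmed = candidate.trimmingCharacters(in: .whitespacesAndNewlines)
        return trimmed.isEmpty ? .discard : .submit(trimmed)
    }
}
```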

Context-Aware Key Terms

The dictation manager builds a list of contextual key terms that are forwarded to the transcription provider when it supports hint weighting. The built-in list includes technical terms like SwiftUI, Xcode, Vercel, Next.js, Claude, Anthropic, and Codex. You can extend this list programmatically for project-specific vocabulary.
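The document does not show the extension API, so the following is only a sketch of how project-specific terms might be merged into the built-in list before being passed to a provider. The function name and the case-insensitive de-duplication are assumptions; the built-in terms are the ones listed above.

```swift
import Foundation

let builtInKeyTerms = ["SwiftUI", "Xcode", "Vercel", "Next.js", "Claude", "Anthropic", "Codex"]

// Hypothetical helper: appends project terms, skipping blanks and
// case-insensitive duplicates while preserving the original order.
func mergedKeyTerms(builtIn: [String], projectTerms: [String]) -> [String] {
    var seen = Set<String>()
    var merged: [String] = []
    for term in builtIn + projectTerms {
        let trimmed = term.trimmingCharacters(in: .whitespacesAndNewlines)
        let key = trimmed.lowercased()
        if trimmed.isEmpty || seen.contains(key) { continue }
        seen.insert(key)
        merged.append(trimmed)
    }
    return merged
}
```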

TTS Providers: How Clicky Speaks

After Claude generates a response, OpenClicky reads it aloud through one of five text-to-speech providers, selected in Settings → Voice → Speech:

Provider	Key Required	Notes
GPT Realtime (default)	`OPENAI_API_KEY`	OpenAI’s realtime speech model. Low latency, natural pacing.
ElevenLabs	`ELEVENLABS_API_KEY` + Voice ID	High-quality, expressive voices. Configure voice ID in Settings.
Cartesia	`CARTESIA_API_KEY` + Voice ID	Fast streaming TTS.
Deepgram Aura	`DEEPGRAM_API_KEY`	Reuses the STT key. Defaults to Aura 2 Thalia voice.
Microsoft Edge	None	Free fallback using Edge TTS.

ElevenLabs and Cartesia both support custom voice IDs. Set ELEVENLABS_VOICE_ID or your Cartesia voice ID in ~/.config/openclicky/secrets.env to use a specific voice. Deepgram TTS voice can be overridden with DEEPGRAM_TTS_VOICE.
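The key requirements in the table can be expressed as a lookup, which is handy for checking a configuration before selecting a provider. This is a sketch: the enum and function names are hypothetical, and the Cartesia entry lists only the API key because the document does not name the variable that holds the Cartesia voice ID.

```swift
enum SpeechProvider {
    case gptRealtime, elevenLabs, cartesia, deepgramAura, microsoftEdge
}

// Settings each provider needs, taken from the table above.
func requiredSettings(for provider: SpeechProvider) -> [String] {
    switch provider {
    case .gptRealtime:   return ["OPENAI_API_KEY"]
    case .elevenLabs:    return ["ELEVENLABS_API_KEY", "ELEVENLABS_VOICE_ID"]
    case .cartesia:      return ["CARTESIA_API_KEY"] // a voice ID is also needed; its variable name is not given here
    case .deepgramAura:  return ["DEEPGRAM_API_KEY"]
    case .microsoftEdge: return []
    }
}

// Returns the settings that are absent or empty in the given environment.
func missingSettings(for provider: SpeechProvider, in environment: [String: String]) -> [String] {
    requiredSettings(for: provider).filter { (environment[$0] ?? "").isEmpty }
}
```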

Notch Panel Visual States

The OpenClicky notch panel — the compact surface that appears at the top of your screen — reflects the current voice state through colour and iconography:

Ready

Accent colour (blue by default). Clicky is idle and waiting for input. Icon: bolt.fill.

Listening

Green. Microphone is active and audio is being captured. A live waveform visualisation shows audio power levels. Icon: waveform.

Thinking

Orange. Audio has been submitted and Claude is generating a response. Icon: sparkles.

Speaking

Purple. TTS is playing back the response. Icon: speaker.wave.2.fill.

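These four states correspond to the voiceState cases in CompanionManager: Ready is .idle, Listening is .listening, Thinking is .processing, and Speaking is .responding.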

// From OpenClickyNotchPanelView.swift
private var activeVoiceAccent: Color {
    switch companionManager.voiceState {
    case .idle:       return DS.Colors.accentText  // accent (blue)
    case .listening:  return .green
    case .processing: return .orange
    case .responding: return .purple
    }
}

The waveform during listening mode is driven by recordedAudioPowerHistory — a rolling 44-sample history of RMS audio levels, sampled every 30 ms and smoothed to prevent jitter.
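A rolling, smoothed level history of this kind could be sketched as follows. The type name, the exponential smoothing, and its 0.5 factor are assumptions made for illustration; the 44-sample capacity and the use of RMS come from the description above. The caller is assumed to invoke append roughly every 30 ms.

```swift
struct AudioPowerHistory {
    let capacity = 44
    let smoothing: Float = 0.5          // assumed factor; 1.0 would disable smoothing
    private(set) var levels: [Float] = []

    // Root mean square of one buffer of samples.
    static func rms(of samples: [Float]) -> Float {
        guard !samples.isEmpty else { return 0 }
        let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
        return (sumOfSquares / Float(samples.count)).squareRoot()
    }

    // Adds one smoothed level and drops the oldest entries beyond capacity.
    mutating func append(samples: [Float]) {
        let raw = Self.rms(of: samples)
        let previous = levels.last ?? raw
        // Move part of the way toward the new value to reduce jitter.
        let smoothed = previous + smoothing * (raw - previous)
        levels.append(smoothed)
        if levels.count > capacity {
            levels.removeFirst(levels.count - capacity)
        }
    }
}
```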

Permissions Required

Microphone

Required for all voice modes. OpenClicky requests this the first time you press the shortcut. If denied, open System Settings → Privacy & Security → Microphone.

Speech Recognition

Required only for Apple Speech and Wake Word modes, which use SFSpeechRecognizer. If denied, go to System Settings → Privacy & Security → Speech Recognition.

Accessibility

Required for the global CGEvent tap that powers push-to-talk. Grant in System Settings → Privacy & Security → Accessibility.

OpenClicky debounces permission requests to avoid showing multiple system sheets for rapid shortcut presses. It also keeps a 1-second cooldown after a permission request completes, because macOS can briefly report .notDetermined even right after the user taps Allow.
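The debounce and cooldown behaviour can be sketched as a small gate that callers consult before showing a system prompt. The type and method names are hypothetical; only the 1-second cooldown comes from the description above.

```swift
import Foundation

struct PermissionRequestGate {
    let cooldown: TimeInterval = 1.0
    private(set) var isRequestInFlight = false
    private(set) var lastCompletion: TimeInterval?

    // Returns true if a new permission request may start now.
    mutating func beginRequestIfAllowed(at now: TimeInterval) -> Bool {
        // Debounce: never stack a second prompt on top of a pending one.
        if isRequestInFlight { return false }
        // Cooldown: the reported status may still be stale right after a prompt.
        if let last = lastCompletion, now - last < cooldown { return false }
        isRequestInFlight = true
        return true
    }

    // Call when the system prompt has been answered.
    mutating func completeRequest(at now: TimeInterval) {
        isRequestInFlight = false
        lastCompletion = now
    }
}
```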

Tips for Better Voice Responses

Speak at a consistent pace and finish your sentence before releasing the shortcut. After you release, OpenClicky waits up to 2.4 seconds for the provider's final transcript before falling back to the best partial result, so there is no benefit to releasing early.

If Clicky mishears technical terms, use Deepgram or AssemblyAI — both have stronger developer-vocabulary models than Apple Speech out of the box.

Wake-word detection always runs through Apple Speech on-device, whether or not a cloud key is configured, and it requires on-device recognition support on your Mac. If you find wake-word activation unreliable, switch to push-to-talk.

You can include text in the prompt bar before pressing the shortcut. OpenClicky will append your spoken text to whatever is already typed, giving you a convenient way to add voice context to a partially typed message.
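A minimal sketch of the append behaviour, assuming the typed draft and the spoken transcript are joined with a single space. The function name and the separator are assumptions; the document only states that spoken text is appended to what is already typed.

```swift
import Foundation

// Hypothetical helper: combines a typed draft with a spoken transcript.
func combinedPrompt(typedDraft: String, spokenText: String) -> String {
    let draft = typedDraft.trimmingCharacters(in: .whitespacesAndNewlines)
    let spoken = spokenText.trimmingCharacters(in: .whitespacesAndNewlines)
    if draft.isEmpty { return spoken }
    if spoken.isEmpty { return draft }
    return draft + " " + spoken
}
```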
