Rubber Duck is a voice-first coding companion. You talk out loud, it answers out loud, and you can interrupt at any time to steer the conversation.

Activation Hotkey

The global hotkey Option+D (⌥D) activates the voice interface from anywhere on macOS.
The first time you press Option+D, Rubber Duck will automatically attach to the workspace you have open in the terminal where you ran duck [path].

Hotkey Behavior

The activation hotkey is a toggle:
  • Press when idle: Connects to the OpenAI Realtime API, starts audio streaming, and begins listening for speech
  • Press when active: Immediately disconnects the session and stops all audio streaming
The hotkey does not require push-to-talk. Once activated, the session uses server-side voice activity detection (VAD) to detect when you start and stop speaking.
You can change the activation hotkey in Settings > Hotkeys. The default settings hotkey is Option+Shift+D (⌥⇧D).
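The toggle behavior above can be sketched as a small state machine (the `HotkeyToggle` type and its callbacks are illustrative, not Rubber Duck's actual API):

```swift
/// Sketch of the activation hotkey's toggle semantics.
final class HotkeyToggle {
    enum State { case idle, active }
    private(set) var state: State = .idle

    func handleHotkey(connect: () -> Void, disconnect: () -> Void) {
        switch state {
        case .idle:
            connect()     // open the Realtime API session and start audio streaming
            state = .active
        case .active:
            disconnect()  // tear down the session and stop all audio immediately
            state = .idle
        }
    }
}
```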

Speech Detection

Rubber Duck uses the OpenAI Realtime API’s server-side VAD to detect speech:
  1. Speech started: The server detects you’ve begun speaking (Rubber Duck shows “Listening” state)
  2. Speech stopped: The server detects a natural pause (Rubber Duck transitions to “Thinking”)
  3. Assistant response: The model processes your input and responds with audio and text
The server VAD is tuned for natural conversation. You don’t need to pause artificially — just speak normally and let the model detect turn boundaries.
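The three steps above map onto server events from the Realtime API. A minimal dispatch sketch, assuming a hypothetical `VoiceState` type for Rubber Duck's internal states (the event names are the OpenAI Realtime API's):

```swift
enum VoiceState { case listening, thinking, speaking }

/// Sketch: route server-VAD events to session states.
func handleServerEvent(_ type: String, state: inout VoiceState) {
    switch type {
    case "input_audio_buffer.speech_started":
        state = .listening   // the server heard you begin speaking
    case "input_audio_buffer.speech_stopped":
        state = .thinking    // natural pause detected; the model is processing
    case "response.audio.delta":
        state = .speaking    // assistant audio is streaming back
    default:
        break
    }
}
```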

Noise Gate

Rubber Duck applies a client-side noise gate to prevent ambient noise from triggering false speech detection:
  • Quiet audio below -46 dBFS is replaced with silence
  • Without echo cancellation, the threshold is raised to -38 dBFS to reduce speaker bleed
  • A hold time of 250ms (or 350ms without AEC) keeps the gate open during natural inter-word pauses
This is implemented in AudioManager.swift:471-503.
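The gate logic can be sketched as follows. The thresholds and hold times match the documented values; the surrounding type is illustrative, not the actual `AudioManager` code:

```swift
import Foundation

/// Sketch of a level-based noise gate with a hold window.
struct NoiseGate {
    let thresholdDBFS: Float    // -46 with AEC, -38 without
    let holdTime: TimeInterval  // 0.25s with AEC, 0.35s without
    private var lastOpen: TimeInterval = -.infinity

    mutating func process(_ samples: inout [Float], at time: TimeInterval) {
        // Frame level in dBFS (0 dBFS = full scale).
        let rms = sqrt(samples.reduce(0) { $0 + $1 * $1 } / Float(samples.count))
        let level = 20 * log10(max(rms, .leastNormalMagnitude))
        if level >= thresholdDBFS {
            lastOpen = time  // gate opens; remember when speech was last heard
        } else if time - lastOpen > holdTime {
            // Quiet and past the hold window: replace the frame with silence.
            samples = .init(repeating: 0, count: samples.count)
        }
    }
}
```

The hold window is what keeps short inter-word pauses from chopping up a sentence: the gate only closes once the signal has stayed quiet longer than the hold time.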

Audio Streaming

Rubber Duck streams audio at 24 kHz PCM16 mono to the OpenAI Realtime API. The capture path is:
Microphone → AVAudioInputNode → Noise Gate → Format Conversion → Base64 → WebSocket
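The tail of that pipeline (format conversion through base64) can be sketched with `AVAudioConverter`. The `send` closure stands in for the app's WebSocket layer; the target format is 24 kHz interleaved PCM16 mono as described above:

```swift
import AVFoundation

/// Sketch: convert a tapped buffer to the wire format and base64-encode it.
func streamBuffer(_ buffer: AVAudioPCMBuffer,
                  converter: AVAudioConverter,
                  targetFormat: AVAudioFormat,   // .pcmFormatInt16, 24_000 Hz, 1 channel
                  send: (String) -> Void) {
    let frames = AVAudioFrameCount(targetFormat.sampleRate / 10) // ~100 ms chunks
    guard let out = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: frames) else { return }
    var consumed = false
    converter.convert(to: out, error: nil) { _, outStatus in
        if consumed { outStatus.pointee = .noDataNow; return nil }
        consumed = true
        outStatus.pointee = .haveData
        return buffer  // feed the captured buffer exactly once
    }
    guard let int16 = out.int16ChannelData else { return }
    let data = Data(bytes: int16[0], count: Int(out.frameLength) * MemoryLayout<Int16>.size)
    send(data.base64EncodedString())
}
```
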
The audio engine (AudioManager.swift) supports two modes:

Voice Processing Mode (Preferred)

Enables VoiceProcessingIO on the AVAudioEngine input node, which provides:
  • Acoustic Echo Cancellation (AEC): Hardware-level echo cancellation at the driver level
  • Noise Suppression: Additional background noise reduction
  • AGC: Automatic gain control for consistent input levels
When VoiceProcessingIO is active, the microphone stays open during assistant playback, enabling instant barge-in detection.
On some multi-channel devices (e.g., MacBook Pro with 9-channel mic array), Rubber Duck automatically downmixes to mono for compatibility with VoiceProcessingIO.
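Enabling this mode comes down to one `AVAudioEngine` call that can throw on incompatible hardware; a sketch of the preferred-then-fallback decision:

```swift
import AVFoundation

/// Sketch: try to enable VoiceProcessingIO, report whether it took effect.
func configureInput(on engine: AVAudioEngine) -> Bool {
    do {
        // Hardware AEC, noise suppression, and AGC at the driver level.
        try engine.inputNode.setVoiceProcessingEnabled(true)
        return true
    } catch {
        // Incompatible device or routing: fall back to standard mode
        // (software AEC plus input muting during playback).
        return false
    }
}
```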

Standard Mode (Fallback)

If VoiceProcessingIO fails to initialize (e.g., due to incompatible hardware or audio routing), Rubber Duck falls back to standard audio mode:
  • No hardware AEC
  • Software AEC is applied: the playback reference signal is subtracted from the captured microphone signal to cancel echo
  • Input muting during assistant playback (unmuted after audio drains) to prevent residual echo
Software AEC uses a ring buffer (PlaybackReferenceBuffer) populated by the playback manager, with adaptive delay estimation and gain calibration.
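The core idea can be sketched as a ring buffer plus a delayed, gain-scaled subtraction. The real implementation's adaptive delay estimation and gain calibration are elided here; the names mirror the description above but the code is illustrative:

```swift
/// Sketch of the playback reference ring buffer used by software AEC.
struct PlaybackReferenceBuffer {
    private var storage: [Float]
    private var writeIndex = 0
    init(capacity: Int) { storage = .init(repeating: 0, count: capacity) }

    /// The playback manager writes every rendered sample here.
    mutating func write(_ samples: [Float]) {
        for s in samples {
            storage[writeIndex] = s
            writeIndex = (writeIndex + 1) % storage.count
        }
    }

    /// Reads `count` samples ending `delay` samples behind the write head.
    func read(count: Int, delay: Int) -> [Float] {
        (0..<count).map { i in
            var idx = (writeIndex - delay - count + i) % storage.count
            if idx < 0 { idx += storage.count }
            return storage[idx]
        }
    }
}

/// Echo-cancelled frame: mic minus the gain-scaled reference signal.
func cancelEcho(mic: [Float], reference: [Float], gain: Float) -> [Float] {
    zip(mic, reference).map { $0 - gain * $1 }
}
```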

Text-to-Speech

Assistant responses are streamed as audio deltas from the OpenAI Realtime API and played through the system speaker. The playback path is:
WebSocket → Base64 Decode → PCM16 Buffer → AVAudioPlayerNode → AVAudioEngine output
Implemented in AudioPlaybackManager.swift.
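The decode-and-schedule step can be sketched as follows (error handling elided; this assumes the player node was connected with a matching PCM16 mono format):

```swift
import AVFoundation

/// Sketch: decode a base64 audio delta and queue it on the player node.
func playDelta(_ base64: String, on player: AVAudioPlayerNode, format: AVAudioFormat) {
    guard let data = Data(base64Encoded: base64),
          let buffer = AVAudioPCMBuffer(pcmFormat: format,
                                        frameCapacity: AVAudioFrameCount(data.count / 2))
    else { return }
    buffer.frameLength = buffer.frameCapacity
    data.withUnsafeBytes { raw in
        if let src = raw.bindMemory(to: Int16.self).baseAddress,
           let dst = buffer.int16ChannelData?[0] {
            dst.update(from: src, count: Int(buffer.frameLength)) // copy PCM16 samples
        }
    }
    player.scheduleBuffer(buffer) // queued behind any buffers already playing
}
```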

Smart Truncation for Voice

Rubber Duck filters assistant responses before speaking to avoid reading long code blocks or diffs verbatim:
  • Short responses: Spoken verbatim
  • Responses with long code blocks: A summary is spoken, and the assistant says “details are in the terminal”
This keeps the voice conversation natural while preserving full detail in the CLI output.
The terminal always shows the full assistant response with syntax highlighting and diffs, even if the spoken version is summarized.
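One way to make the verbatim-vs-summary decision is to count lines inside fenced code blocks; the threshold and helper below are illustrative, not Rubber Duck's actual heuristic:

```swift
/// Sketch: speak short responses verbatim; for long code, speak only
/// the prose and defer the detail to the terminal.
func spokenText(for response: String, maxCodeLines: Int = 10) -> String {
    let lines = response.split(separator: "\n", omittingEmptySubsequences: false)
    var inFence = false
    var codeLines = 0
    for line in lines {
        if line.hasPrefix("```") { inFence.toggle() }
        else if inFence { codeLines += 1 }
    }
    guard codeLines > maxCodeLines else { return response } // short: verbatim
    var keep: [Substring] = []
    inFence = false
    for line in lines {
        if line.hasPrefix("```") { inFence.toggle(); continue }
        if !inFence { keep.append(line) }
    }
    return keep.joined(separator: "\n") + "\nThe details are in the terminal."
}
```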

Barge-In (Interruptions)

Rubber Duck supports immediate barge-in: you can interrupt the assistant at any time while it’s speaking.

How Barge-In Works

  1. Speech detection during playback: If the server detects speech_started while the assistant is speaking, Rubber Duck schedules a confirmation delay (default: 350ms)
  2. Confirmation window: If speech continues for the full delay, barge-in is confirmed
  3. Stop playback immediately: AVAudioPlayerNode.stop() is called to halt audio output
  4. Truncate response: Rubber Duck sends conversation.item.truncate to the server with the exact playback position (in milliseconds)
  5. Resume listening: The session transitions back to “Listening” state
Implemented in VoiceSessionCoordinator.swift:286-389.
The barge-in confirmation delay prevents false positives from echo (your speakers being picked up by your microphone). On devices without hardware AEC, this delay is increased to 550ms.
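The confirmation window (steps 1–2 above) can be sketched with a cancellable work item; the coordinator callbacks are hypothetical stand-ins:

```swift
import Foundation

/// Sketch: arm a timer on speech_started during playback; confirm the
/// barge-in only if speech outlasts the delay, otherwise treat it as echo.
final class BargeInDetector {
    let confirmationDelay: TimeInterval  // 0.35s with hardware AEC, up to 0.55s without
    private var pending: DispatchWorkItem?
    init(confirmationDelay: TimeInterval) { self.confirmationDelay = confirmationDelay }

    func speechStartedDuringPlayback(confirm: @escaping () -> Void) {
        let work = DispatchWorkItem(block: confirm)
        pending = work
        DispatchQueue.main.asyncAfter(deadline: .now() + confirmationDelay, execute: work)
    }

    func speechStopped() {
        pending?.cancel()  // speech ended inside the window: likely echo
        pending = nil
    }
}
```

On confirmation, the `confirm` closure would stop the player node and send `conversation.item.truncate` with the current playback position, as in steps 3–4.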

Abort vs. Steer Modes

Rubber Duck offers two interruption modes (configured in Settings):

Auto-Abort (Default)

When you interrupt, Rubber Duck:
  1. Stops playback and truncates the assistant’s response at the exact playback position
  2. The server discards any planned tool calls that haven’t started yet
  3. Your new speech is treated as a fresh user turn
This is best for course corrections (“wait, that’s wrong”) or topic changes (“actually, let’s do something else”).

Steer Mode (Auto-Abort Disabled)

When you interrupt in steer mode, Rubber Duck:
  1. Stops playback immediately
  2. Sends your new speech as a steering message to the server
  3. The server delivers the steer message after the current tool completes
  4. Any remaining planned tools are skipped
This is best for refinements (“add error handling to that function”) where you want the current operation to finish before applying your feedback.
The CLI always prints a line indicating whether the run was aborted or steered so you have full visibility into what happened.

Echo Cancellation and Barge-In

Barge-in reliability depends on echo cancellation:
| Mode | Microphone During Playback | Confirmation Delay | Echo Risk |
| --- | --- | --- | --- |
| VoiceProcessingIO (Hardware AEC) | Always open | 350ms (default) | Very low |
| Software AEC | Always open | 450ms minimum | Low |
| No AEC | Muted, then unmuted after playback | 550ms minimum | Medium (echo during unmute transition) |
The confirmation delay is automatically tuned based on the active echo cancellation mode.
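That tuning can be sketched as a straight mapping from AEC mode to delay, using the values in the table above (the type names are illustrative):

```swift
enum EchoCancellationMode { case hardware, software, none }

/// Sketch: pick the barge-in confirmation delay from the active AEC mode.
func confirmationDelayMs(for mode: EchoCancellationMode) -> Int {
    switch mode {
    case .hardware: return 350  // VoiceProcessingIO: echo risk very low
    case .software: return 450  // reference-subtraction AEC
    case .none:     return 550  // muted mic; guard the unmute transition
    }
}
```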

Session State Visualization

The menu bar icon and overlay reflect the current voice session state:
| State | Icon | Overlay |
| --- | --- | --- |
| Idle | Default | Hidden |
| Connecting | Animated | “Thinking…” |
| Listening | Microphone | “Listening” |
| Thinking | Brain | “Thinking…” |
| Speaking | Speaker | “Speaking” |
| Tool Running | Gear | Tool name (e.g., “bash”) |
Implemented in VoiceSessionCoordinator.swift:6-13 and OverlayPresenter.swift.
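A sketch of the state type driving that table (names are illustrative, not the actual enum at the cited lines):

```swift
/// Sketch: the session states rendered by the menu bar icon and overlay.
enum SessionState {
    case idle                       // default icon, overlay hidden
    case connecting                 // animated icon, “Thinking…”
    case listening                  // microphone icon, “Listening”
    case thinking                   // brain icon, “Thinking…”
    case speaking                   // speaker icon, “Speaking”
    case toolRunning(name: String)  // gear icon, overlay shows the tool name
}
```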

Microphone Permissions

Rubber Duck requires microphone access to capture audio. On first launch, macOS will prompt for permission. If you deny permission or revoke it later:
  • The voice interface will not activate
  • duck doctor will show a warning
  • You can grant permission in System Settings > Privacy & Security > Microphone
Rubber Duck provides a helper link in Settings to open the macOS microphone permissions pane directly.
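A sketch of the permission flow, including the deep link to the privacy pane (the URL scheme works on recent macOS releases but is not a stable public contract):

```swift
import AVFoundation
import AppKit

/// Sketch: request microphone access, routing denials to a handler.
func ensureMicrophoneAccess(onDenied: @escaping () -> Void) {
    switch AVCaptureDevice.authorizationStatus(for: .audio) {
    case .authorized:
        break
    case .notDetermined:
        AVCaptureDevice.requestAccess(for: .audio) { granted in
            if !granted { onDenied() }
        }
    default:
        onDenied()  // denied or restricted: point the user at System Settings
    }
}

/// Sketch: open System Settings at the microphone privacy pane.
func openMicrophoneSettings() {
    let url = URL(string: "x-apple.systempreferences:com.apple.preference.security?Privacy_Microphone")!
    NSWorkspace.shared.open(url)
}
```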

API Key Management

Rubber Duck stores your OpenAI API key securely in the macOS Keychain. You can set or update the key in Settings > API Keys.
Without a valid API key, the voice interface will not connect. Rubber Duck will open Settings automatically if no key is found.
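Keychain storage boils down to a `SecItemAdd` call; a minimal sketch, where the service and account names are illustrative rather than Rubber Duck's actual identifiers:

```swift
import Foundation
import Security

/// Sketch: store (or replace) the API key as a generic password item.
func saveAPIKey(_ key: String) -> Bool {
    let query: [String: Any] = [
        kSecClass as String: kSecClassGenericPassword,
        kSecAttrService as String: "RubberDuck",        // illustrative service name
        kSecAttrAccount as String: "openai-api-key",    // illustrative account name
    ]
    SecItemDelete(query as CFDictionary)  // remove any existing item first
    var attributes = query
    attributes[kSecValueData as String] = Data(key.utf8)
    return SecItemAdd(attributes as CFDictionary, nil) == errSecSuccess
}
```
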

See Also

  • Interruptions - Deep dive into barge-in behavior and abort vs. steer modes
  • Sessions - Session management and multi-workspace support
  • CLI Commands - Using the duck CLI to attach workspaces and send messages
