Pipeline Overview
Audio Capture (AudioManager)
Configuration
Format:
- Sample rate: 24,000 Hz (OpenAI Realtime API requirement)
- Format: PCM16 (16-bit linear PCM)
- Channels: Mono
- Encoding: Base64 for WebSocket transport
Implementation reference: VoiceSessionCoordinator.swift:546-575.
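The format above can be illustrated with a minimal sketch. It is not the project's implementation: the function name `encodePCM16Base64` is hypothetical, and it assumes the capture callback delivers mono Float samples in the range [-1, 1] that are already at 24 kHz.

```swift
import Foundation

/// Converts mono Float samples in [-1, 1] to little-endian PCM16
/// and Base64-encodes the bytes for WebSocket transport.
/// Hypothetical helper, shown only to illustrate the format.
func encodePCM16Base64(_ samples: [Float]) -> String {
    var data = Data(capacity: samples.count * 2)
    for sample in samples {
        // Clamp first so out-of-range input cannot overflow Int16.
        let clamped = max(-1.0, min(1.0, sample))
        let value = Int16((clamped * Float(Int16.max)).rounded())
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    return data.base64EncodedString()
}
```

Each sample occupies two bytes before Base64 encoding, so three samples produce six bytes of PCM16.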
Echo Cancellation
Rubber Duck supports three echo cancellation modes:
- Hardware AEC (VoiceProcessingIO): Best quality, enabled by default on supported devices
  - Real-time echo cancellation in hardware
  - Allows mic to stay open during TTS playback for instant barge-in
  - Detected via audioManager.isEchoCancellationActive
- Software AEC: Fallback for devices without hardware AEC
  - Signal processing to reduce echo
  - Requires longer confirmation delay before barge-in
  - Detected via audioManager.isSoftwareAECActive
- No AEC: Fallback when neither is available
  - Input is muted during TTS playback
  - Unmuted after playback queue drains
  - Longer speech suppression windows to avoid false triggers
Implementation reference: VoiceSessionCoordinator.swift:414-428.
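The mode selection described above can be sketched as follows. This is an illustration under assumptions, not the coordinator's code: the enum, its case names, and `resolveEchoCancellationMode` are hypothetical, and the two Boolean inputs stand in for audioManager.isEchoCancellationActive and audioManager.isSoftwareAECActive.

```swift
/// Hypothetical model of the three echo cancellation modes.
enum EchoCancellationMode {
    case hardware, software, noAEC

    /// Only the no-AEC fallback mutes the mic while TTS is playing.
    var mutesInputDuringPlayback: Bool { self == .noAEC }
}

/// Picks the best available mode, preferring hardware over software.
func resolveEchoCancellationMode(hardwareActive: Bool,
                                 softwareActive: Bool) -> EchoCancellationMode {
    if hardwareActive { return .hardware }
    if softwareActive { return .software }
    return .noAEC
}
```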
Voice Activity Detection (VAD)
Rubber Duck uses server-side VAD from the OpenAI Realtime API for turn detection.
Server Events:
- input_audio_buffer.speech_started: User started speaking
- input_audio_buffer.speech_stopped: User stopped speaking (triggers response)
Implementation reference: VoiceSessionCoordinator.swift:824-876.
To prevent echo-induced false positives, the app applies temporal guards on top of the following server-side turn detection settings:
- turn_detection.type: server_vad
- turn_detection.threshold: Default (0.5)
- turn_detection.prefix_padding_ms: 300
- turn_detection.silence_duration_ms: 500
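As a sketch of how these settings could be sent, the block below encodes them as a Realtime API session.update event. The struct names and `makeSessionUpdateJSON` are hypothetical, and it is an assumption that the app configures turn detection this way rather than relying on server defaults.

```swift
import Foundation

/// Turn detection values listed above; property names are converted
/// to snake_case (for example prefix_padding_ms) when encoded.
struct TurnDetection: Encodable {
    var type = "server_vad"
    var threshold = 0.5
    var prefixPaddingMs = 300
    var silenceDurationMs = 500
}

/// Hypothetical wrapper for a session.update client event.
struct SessionUpdate: Encodable {
    struct Session: Encodable {
        var turnDetection = TurnDetection()
    }
    var type = "session.update"
    var session = Session()
}

func makeSessionUpdateJSON() throws -> Data {
    let encoder = JSONEncoder()
    encoder.keyEncodingStrategy = .convertToSnakeCase
    return try encoder.encode(SessionUpdate())
}
```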
Speech-to-Text (STT)
The OpenAI Realtime API provides streaming transcription as the user speaks.
Events:
- conversation.item.input_audio_transcription.completed: Final transcript of user speech
- Transcript is automatically added to conversation context
Implementation reference: VoiceSessionCoordinator.swift:1051-1053.
- Optimized for natural speech (“um”, “uh” filtered by server)
- Displayed in CLI as [user] event
- Stored in conversation history for context
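A minimal sketch of handling the final transcript event follows. The struct, the function name `cliLine(forEventJSON:)`, and the exact "[user] ..." line layout are assumptions for illustration; only the event type and its transcript field come from the Realtime API.

```swift
import Foundation

/// Subset of the transcription-completed server event.
struct TranscriptionCompleted: Decodable {
    let type: String
    let itemId: String
    let transcript: String
}

/// Returns a CLI line for a final user transcript, or nil if the
/// JSON is some other event. The "[user]" prefix format is assumed.
func cliLine(forEventJSON json: Data) -> String? {
    let decoder = JSONDecoder()
    decoder.keyDecodingStrategy = .convertFromSnakeCase
    guard let event = try? decoder.decode(TranscriptionCompleted.self, from: json),
          event.type == "conversation.item.input_audio_transcription.completed"
    else { return nil }
    return "[user] " + event.transcript.trimmingCharacters(in: .whitespacesAndNewlines)
}
```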
Text-to-Speech (TTS)
Assistant responses are synthesized by the OpenAI Realtime API and streamed as audio chunks.
Events:
- response.audio.delta: Incremental audio chunks (base64 PCM16)
- response.audio.done: Audio generation complete for this response
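To illustrate what arrives in a response.audio.delta event, the sketch below decodes a Base64 payload into PCM16 samples. The function names are hypothetical, and the code assumes little-endian mono samples at 24 kHz; it is not the app's playback path.

```swift
import Foundation

/// Decodes a Base64 audio delta into little-endian PCM16 samples.
/// Returns nil for invalid Base64 or an odd number of bytes.
func decodePCM16(base64 delta: String) -> [Int16]? {
    guard let data = Data(base64Encoded: delta), data.count % 2 == 0 else { return nil }
    let bytes = [UInt8](data)
    var samples: [Int16] = []
    samples.reserveCapacity(bytes.count / 2)
    for i in stride(from: 0, to: bytes.count, by: 2) {
        // Low byte first, then high byte.
        let value = UInt16(bytes[i]) | (UInt16(bytes[i + 1]) << 8)
        samples.append(Int16(bitPattern: value))
    }
    return samples
}

/// Duration in seconds of a decoded chunk at the given sample rate.
func chunkDuration(sampleCount: Int, sampleRate: Double = 24_000) -> Double {
    Double(sampleCount) / sampleRate
}
```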
Implementation reference: VoiceSessionCoordinator.swift:614-618.
- Short, conversational responses
- Avoids reading long code blocks (says “details are in the terminal” instead)
- Summarizes tool output rather than speaking raw data
Implementation reference: VoiceSessionCoordinator.swift:911-931.
Barge-In (Interruption Handling)
Barge-in allows the user to interrupt the assistant mid-sentence by speaking.
Detection Flow
Barge-In Implementation (VoiceSessionCoordinator.swift:286-389)
Confirmation Delay:
- Hardware AEC: 0.35s (configurable, default)
- Software AEC: 0.45s (minimum)
- No AEC: 0.55s (minimum)
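One way to read the delays above is that the configured value is used as-is with hardware AEC and raised to a floor otherwise. The sketch below encodes that interpretation; the function name and parameters are hypothetical, and the flooring behavior is an assumption drawn from the word "minimum".

```swift
import Foundation

/// Returns the barge-in confirmation delay in seconds.
/// Hardware AEC uses the configured value (default 0.35 s);
/// software AEC and no AEC enforce minimums of 0.45 s and 0.55 s.
func bargeInConfirmationDelay(hardwareAEC: Bool,
                              softwareAEC: Bool,
                              configured: TimeInterval = 0.35) -> TimeInterval {
    if hardwareAEC { return configured }
    let minimum: TimeInterval = softwareAEC ? 0.45 : 0.55
    return max(configured, minimum)
}
```

A longer floor without hardware AEC gives residual echo more time to die down before speech is treated as a real interruption.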
When auto-abort is disabled (autoAbortOnBargeIn = false):
- Stops playback but does NOT truncate the response
- Server continues processing current response
- User speech is queued as next turn
Response Truncation
When auto-abort is enabled, Rubber Duck sends a precise truncation command (a conversation.item.truncate message) to the Realtime API.
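A sketch of building such a message is shown below. The field names follow the Realtime API's conversation.item.truncate event; the function name, the `playedSamples` bookkeeping, and the fixed content index of 0 are assumptions rather than the app's actual code.

```swift
import Foundation

/// Builds a conversation.item.truncate message that tells the server
/// how much of the assistant audio was actually played.
func makeTruncateMessage(itemID: String,
                         playedSamples: Int,
                         sampleRate: Int = 24_000) throws -> Data {
    // Convert the played sample count to whole milliseconds.
    let audioEndMs = playedSamples * 1000 / sampleRate
    let message: [String: Any] = [
        "type": "conversation.item.truncate",
        "item_id": itemID,
        "content_index": 0,
        "audio_end_ms": audioEndMs
    ]
    return try JSONSerialization.data(withJSONObject: message)
}
```

For example, 36,000 played samples at 24 kHz correspond to an audio_end_ms of 1500.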
State Machine
The VoiceSessionCoordinator manages voice session state (VoiceSessionCoordinator.swift:391-430). Transitions:
- idle → connecting: User presses hotkey
- connecting → listening: Session ready
- listening → thinking: User stops speaking
- thinking → speaking: Audio delta received
- speaking → listening: Playback complete or barge-in
- * → toolRunning: Function call detected
- toolRunning → thinking: Tool complete, request next response
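The transitions listed above can be modeled as a small validation function. This is a sketch: the enum and `canTransition` are hypothetical names, and only the transitions named in the list are accepted, so paths the list does not mention (such as returning to idle on disconnect) are rejected here even though the real coordinator presumably allows them.

```swift
/// Hypothetical model of the voice session states.
enum VoiceState {
    case idle, connecting, listening, thinking, speaking, toolRunning
}

/// Returns true only for transitions named in the list above.
func canTransition(from: VoiceState, to: VoiceState) -> Bool {
    // "* → toolRunning": a function call can arrive in any state.
    if to == .toolRunning { return true }
    switch (from, to) {
    case (.idle, .connecting),
         (.connecting, .listening),
         (.listening, .thinking),
         (.thinking, .speaking),
         (.speaking, .listening),
         (.toolRunning, .thinking):
        return true
    default:
        return false
    }
}
```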
Tool Execution During Voice
When the assistant requests a tool call (e.g., read_file), the voice pipeline pauses and delegates to the daemon:
- Function Call Detected (VoiceSessionCoordinator.swift:1002-1016)
- Daemon Execution (VoiceSessionCoordinator.swift:1077-1134)
- Resume Voice
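For the resume step, the Realtime API expects the tool result to be added to the conversation as a function_call_output item, followed by a request for the next response. The sketch below builds those two messages; `makeToolResultMessages` is a hypothetical helper, and it is an assumption that the app sends exactly this pair after the daemon returns.

```swift
import Foundation

/// Builds the two client events sent after a tool finishes:
/// the function_call_output item, then response.create.
func makeToolResultMessages(callID: String, output: String) throws -> [Data] {
    let item: [String: Any] = [
        "type": "conversation.item.create",
        "item": [
            "type": "function_call_output",
            "call_id": callID,   // must match the call_id of the function call
            "output": output     // tool result, or an error message string
        ]
    ]
    let next: [String: Any] = ["type": "response.create"]
    return try [item, next].map { try JSONSerialization.data(withJSONObject: $0) }
}
```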
Error Handling
Microphone Errors
- Permission Denied: Display settings prompt, offer to open System Settings
- Hardware Unavailable: Show overlay error, disconnect session
- Audio Startup Failure: Log error, set sticky disconnect message
API Errors
Retryable (connection issues, rate limits):
- Transition to connecting state
- Realtime client auto-reconnects with exponential backoff
Non-retryable:
- Set stickyDisconnectErrorMessage
- Disconnect session
- Show overlay error with message
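Exponential backoff for the retryable case can be sketched as a capped doubling delay. The base of 0.5 s and cap of 30 s are illustrative values chosen for this example, not the Realtime client's actual settings, and `reconnectDelay` is a hypothetical name.

```swift
import Foundation

/// Delay before reconnect attempt `attempt` (0-based): base * 2^attempt,
/// capped at `cap` seconds. Base and cap are illustrative values.
func reconnectDelay(attempt: Int,
                    base: TimeInterval = 0.5,
                    cap: TimeInterval = 30) -> TimeInterval {
    // Bound the exponent so very large attempt counts stay finite.
    let exponent = Double(max(0, min(attempt, 16)))
    return min(cap, base * pow(2.0, exponent))
}
```

Production clients often add random jitter to each delay so that many clients do not reconnect at the same moment; that is omitted here for brevity.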
Certain errors are benign (e.g., truncating a response that already ended). Implementation reference: VoiceSessionCoordinator.swift:1146-1165.
Daemon Connection Loss
- During Voice Session: Tools return error message, voice continues
- On Reconnect: App re-registers with daemon via voice_connect
- Permanent Loss: App continues voice-only (no workspace tools)
Performance Characteristics
Latency Budget
- User Speech → STT Transcript: ~500ms (server VAD silence duration)
- Response Start → First Audio Delta: ~300-800ms (model + TTS generation)
- Audio Delta → Playback: <50ms (local decode + enqueue)
- Barge-In Detection → Playback Stop: <100ms (hardware), ~400ms (software AEC)
Audio Buffering
- Capture Buffer: 100ms chunks (2400 samples @ 24kHz)
- Playback Queue: Adaptive (tracks unplayed duration for smooth transitions)
- WebSocket Send: Non-blocking, queued writes
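The capture buffer figures follow from simple arithmetic, worked through below. The constant names are illustrative and not taken from AudioConstants.swift.

```swift
let sampleRate = 24_000   // Hz
let chunkMs = 100         // capture chunk length in milliseconds
let bytesPerSample = 2    // PCM16, mono

// 24,000 samples/s * 0.1 s = 2,400 samples per chunk.
let samplesPerChunk = sampleRate * chunkMs / 1000

// 2,400 samples * 2 bytes = 4,800 bytes of raw PCM16.
let bytesPerChunk = samplesPerChunk * bytesPerSample

// Base64 emits 4 characters per 3 input bytes (rounded up): 6,400 characters.
let base64CharsPerChunk = (bytesPerChunk + 2) / 3 * 4
```

At ten chunks per second this comes to roughly 64,000 Base64 characters of audio per second on the WebSocket, before JSON framing.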
State Synchronization
- Voice State → Daemon: Push on connect, tool calls, and disconnect
- Daemon → Voice: Push on workspace/session change from CLI
- Polling Fallback: 2s interval if daemon unavailable (workspace sync only)
Configuration
Runtime settings are loaded from UserDefaults (VoiceSessionCoordinator.swift:613-618). Audio constants are defined in AudioConstants.swift.
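Loading runtime settings from UserDefaults might look like the sketch below. The struct and the key strings are hypothetical; only the 0.35 s default delay and the autoAbortOnBargeIn name appear in this document, and the actual keys and defaults in the app may differ.

```swift
import Foundation

/// Hypothetical container for voice settings read from UserDefaults.
struct VoiceRuntimeSettings {
    var bargeInConfirmationDelay: TimeInterval
    var autoAbortOnBargeIn: Bool

    init(defaults: UserDefaults = .standard) {
        // Fall back to 0.35 s when the key has never been set.
        bargeInConfirmationDelay =
            defaults.object(forKey: "bargeInConfirmationDelay") as? Double ?? 0.35
        // bool(forKey:) returns false for a missing key.
        autoAbortOnBargeIn = defaults.bool(forKey: "autoAbortOnBargeIn")
    }
}
```

Passing a UserDefaults instance into the initializer keeps the sketch testable with an isolated suite instead of the shared standard defaults.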
Testing
Rubber Duck includes an E2E test for the full voice pipeline:
- Connects to Realtime API
- Sends pre-recorded audio (“What is 2+2?”)
- Waits for response audio
- Validates transcript and audio playback
- Disconnects cleanly
See RubberDuckTests/RealtimeE2ETests.swift for implementation.
Next Steps
- Pi Integration — How tools are executed via the daemon
- Session Model — Multi-session concurrency and persistence