What is Barge-In?
Barge-in (also called “interruption” or “cut-in”) is the ability to interrupt the assistant mid-response by speaking. When you interrupt:
- Audio playback stops immediately (within ~50ms)
- The server is notified via conversation.item.truncate with the exact playback position
- Your new speech is processed as either a fresh user turn (abort mode) or a steering message (steer mode)
- The session returns to listening and you can continue the conversation
How Barge-In Works
1. Speech Detection During Playback
While the assistant is speaking (sessionState == .speaking), Rubber Duck continues to monitor for incoming speech_started events from the OpenAI Realtime API’s server-side VAD.
The microphone remains open during playback when hardware AEC (VoiceProcessingIO) is active. Without hardware AEC, the microphone is muted during playback and re-enabled after the audio queue drains (see Echo Cancellation).
Implemented in VoiceSessionCoordinator.swift:824-876.
2. Confirmation Delay
When speech_started is detected during playback, Rubber Duck schedules a confirmation delay before triggering barge-in.
The delay prevents false positives from:
- Echo: The assistant’s voice being picked up by the microphone (“echo bleed”)
- Noise: Brief background sounds triggering VAD
| Echo Cancellation | Minimum Confirmation Delay |
|---|---|
| Hardware AEC (VoiceProcessingIO) | 350ms (default) |
| Software AEC | 450ms |
| No AEC | 550ms |
Implemented in VoiceSessionCoordinator.swift:315-324.
You can adjust the base confirmation delay in Settings, but the minimum is enforced based on your AEC mode to prevent echo-triggered interruptions.
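The minimum-delay rule can be sketched as a simple clamp. This is an illustrative sketch only: the type and function names below are hypothetical and are not taken from VoiceSessionCoordinator.swift; only the millisecond values come from the table above.

```swift
// Hypothetical names; only the millisecond floors come from the table above.
enum EchoCancellationMode {
    case hardware   // VoiceProcessingIO
    case software
    case off        // no AEC

    /// Minimum confirmation delay enforced for this mode, in milliseconds.
    var minimumConfirmationDelayMs: Int {
        switch self {
        case .hardware: return 350
        case .software: return 450
        case .off:      return 550
        }
    }
}

/// Returns the delay that would actually apply: the user's setting,
/// raised to the mode's floor when the setting is lower.
func effectiveConfirmationDelayMs(userSettingMs: Int, mode: EchoCancellationMode) -> Int {
    return max(userSettingMs, mode.minimumConfirmationDelayMs)
}

print(effectiveConfirmationDelayMs(userSettingMs: 200, mode: .software)) // 450
print(effectiveConfirmationDelayMs(userSettingMs: 600, mode: .hardware)) // 600
```

Under this sketch, a Settings value below the floor is silently raised, while a larger value is used as-is.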
3. Playback Stop
If speech continues for the full confirmation delay, barge-in is confirmed. Rubber Duck:
- Calls AVAudioPlayerNode.stopImmediately() to halt audio playback
- Captures the exact playback position (samples played out of total scheduled)
- Calculates audioEnd in milliseconds (relative to the start of the current audio item)
Implemented in VoiceSessionCoordinator.swift:326-389.
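Converting a playback position in samples to milliseconds might look like the following. This is a minimal sketch, not the app's code: the function name is hypothetical, and the 24 kHz sample rate in the usage line is an assumption about the audio format rather than something this page states.

```swift
/// Converts a playback position in samples to whole milliseconds,
/// measured from the start of the current audio item.
/// `sampleRate` is in samples per second.
func audioEndMs(samplesPlayed: Int, sampleRate: Int) -> Int {
    precondition(sampleRate > 0, "sampleRate must be positive")
    // Integer math: multiply first so sub-second positions are not lost.
    return max(0, samplesPlayed) * 1000 / sampleRate
}

// Assumed format: 24 kHz mono PCM. 36,000 samples is 1.5 seconds.
let positionMs = audioEndMs(samplesPlayed: 36_000, sampleRate: 24_000)
print(positionMs) // 1500
```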
4. Response Truncation
Rubber Duck sends conversation.item.truncate to the server with:
- item_id: The ID of the audio item being interrupted
- content_index: The index of the audio content part (usually 0)
- audio_end_ms: The playback position in milliseconds (clamped to the item duration)
The server acknowledges with response.cancelled, and Rubber Duck transitions back to the listening state.
Truncation is precise: the conversation tree reflects exactly what you heard, not what the model generated. This prevents the model from assuming you heard content that was cut off.
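As an illustration, the truncate event could be built and serialized like this. The JSON field names match the list above; the Swift type and helper names are hypothetical, and this is a sketch under those assumptions rather than Rubber Duck's actual implementation.

```swift
import Foundation

/// Client event payload. The Swift names are hypothetical; the JSON keys
/// (type, item_id, content_index, audio_end_ms) follow the list above.
struct TruncateEvent: Encodable {
    let type = "conversation.item.truncate"
    let itemId: String
    let contentIndex: Int
    let audioEndMs: Int

    enum CodingKeys: String, CodingKey {
        case type
        case itemId = "item_id"
        case contentIndex = "content_index"
        case audioEndMs = "audio_end_ms"
    }
}

/// Builds the event, clamping the playback position to [0, itemDurationMs].
func makeTruncateEvent(itemId: String, playedMs: Int, itemDurationMs: Int,
                       contentIndex: Int = 0) -> TruncateEvent {
    let clamped = min(max(playedMs, 0), max(itemDurationMs, 0))
    return TruncateEvent(itemId: itemId, contentIndex: contentIndex, audioEndMs: clamped)
}

// A reported position past the end of the item is clamped to the item duration.
let event = makeTruncateEvent(itemId: "item_123", playedMs: 1_800, itemDurationMs: 1_500)
let encoder = JSONEncoder()
encoder.outputFormatting = [.sortedKeys]
let json = String(data: try! encoder.encode(event), encoding: .utf8)!
print(json)
```

Clamping on the client is one way to avoid asking the server to truncate past the end of the item.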
5. New User Turn
Your new speech is processed as a fresh user turn. The server transcribes it and appends it to the conversation, then generates a new response. The interrupted response is discarded (in abort mode) or paused (in steer mode; see below).
Abort vs. Steer Modes
Rubber Duck offers two interruption strategies, controlled by the “Auto-abort on barge-in” toggle in Settings.
Abort Mode (Default)
When enabled (Auto-abort ON):
- Playback stops immediately
- The assistant’s response is truncated at the playback position
- Planned tool calls are discarded (if the response included tool calls that haven’t started yet)
- Your new speech is treated as a fresh user turn
Use abort mode for:
- Course corrections: “Wait, that’s wrong — try X instead”
- Topic changes: “Actually, let’s do something else”
- Canceling long-running operations: “Stop — I don’t need that anymore”
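Example: the assistant is saying “I’ll refactor the login function, run tests, and update the documentation” and you interrupt with “Wait, don’t touch the documentation.” Rubber Duck:
- Stops playback after “update the”
- Truncates the response (discards “documentation” and any planned edit_file tool calls)
- Processes “Wait, don’t touch the documentation” as a new user turn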
Steer Mode (Auto-Abort Disabled)
When disabled (Auto-abort OFF):
- Playback stops immediately
- Your new speech is sent as a “steer” message to the server
- The server delivers the steer message after the current tool completes
- Remaining planned tool calls are skipped
Use steer mode for:
- Refinements: “Add error handling to that function”
- Additional constraints: “Make sure it’s backward compatible”
- Follow-up instructions: “Also log the result”
The steer behavior (from Pi RPC) queues your message to be injected after the current operation, allowing the assistant to apply your feedback mid-turn.
Example:
Assistant: “I’ll refactor the login function, run tests, and update the documentation—”
You (interrupt): “Add error handling too.”
Rubber Duck:
- Stops playback after “update the”
- Sends “Add error handling too” as a steer message
- The server finishes the current tool call (if running), then reads the steer message
- The assistant responds: “Got it, I’ll also add error handling.”
Choosing Between Modes
| Scenario | Recommended Mode |
|---|---|
| “Stop, that’s completely wrong” | Abort |
| “Actually, do X instead of Y” | Abort |
| “Also add Z to that” | Steer |
| “Make sure it’s thread-safe” | Steer |
| “Cancel the current operation” | Abort |
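The toggle's effect on a confirmed interruption can be summarized as a small dispatch. All names here are hypothetical and serve only to illustrate the two paths described above; they are not the app's types.

```swift
/// What happens to the user's interrupting utterance (hypothetical model).
enum InterruptionAction: Equatable {
    /// Abort mode: truncate the response, drop planned tool calls, start a new turn.
    case startFreshTurn(text: String)
    /// Steer mode: queue the text for delivery after the current tool completes.
    case queueSteer(text: String)
}

/// Maps the "Auto-abort on barge-in" setting to the action taken.
func interruptionAction(for utterance: String, autoAbortEnabled: Bool) -> InterruptionAction {
    return autoAbortEnabled ? .startFreshTurn(text: utterance) : .queueSteer(text: utterance)
}

print(interruptionAction(for: "Stop, that's completely wrong", autoAbortEnabled: true))
print(interruptionAction(for: "Also add Z to that", autoAbortEnabled: false))
```

In both branches playback has already stopped; the modes differ only in how the new speech is delivered.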
Echo Cancellation and Barge-In
Barge-in reliability depends on echo cancellation: preventing the assistant’s voice from being picked up by your microphone and mistaken for user speech.
Hardware AEC (VoiceProcessingIO)
When VoiceProcessingIO is active (default on supported devices), macOS applies hardware-level acoustic echo cancellation:
- The microphone and speaker share a reference signal at the audio driver level
- Echo is cancelled in real time before reaching the capture buffer
- The microphone can stay open during playback without risk of echo feedback
Supported devices:
- MacBook Pro (2016+)
- MacBook Air (2018+)
- iMac (2017+)
- Mac Studio, Mac mini (2020+)
Implemented in AudioManager.swift:333-354.
Software AEC
If VoiceProcessingIO fails to initialize (e.g., due to external audio interfaces or Bluetooth devices), Rubber Duck falls back to software AEC:
- The playback manager writes every PCM chunk to a ring buffer (PlaybackReferenceBuffer)
- The capture tap reads the ring buffer with an estimated delay (measured in samples)
- The reference signal is subtracted from the captured microphone signal using SIMD (Accelerate framework)
Implemented in AudioManager.swift:425-453 and PlaybackReferenceBuffer.swift.
Software AEC uses adaptive gain calibration: the subtraction gain is continuously tuned based on the ratio of capture RMS to reference RMS during playback-only windows.
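The subtraction and gain calibration can be sketched in plain Swift. This is a simplified illustration, not the code in AudioManager.swift: it uses scalar loops instead of Accelerate, assumes the reference frame is already delay-aligned with the capture frame, and the function names and the smoothing factor are assumptions.

```swift
import Foundation

/// Root-mean-square level of a frame of samples.
func rms(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

/// Subtracts the gain-scaled playback reference from the captured frame.
/// Assumes `reference` is already aligned to `capture` by the estimated delay.
func cancelEcho(capture: [Float], reference: [Float], gain: Float) -> [Float] {
    var output = capture
    for i in 0..<min(capture.count, reference.count) {
        output[i] = capture[i] - gain * reference[i]
    }
    return output
}

/// Moves the gain toward captureRMS / referenceRMS. Intended to be called only
/// during playback-only windows (assistant audible, user silent).
func updatedGain(current: Float, capture: [Float], reference: [Float],
                 smoothing: Float = 0.1) -> Float {
    let referenceLevel = rms(reference)
    guard referenceLevel > 1e-6 else { return current } // nothing playing; keep gain
    let target = rms(capture) / referenceLevel
    return current + smoothing * (target - current)
}

// Synthetic check: the microphone hears the reference at half amplitude.
let reference: [Float] = [0.5, -0.5, 0.25, -0.25]
let capture = reference.map { $0 * 0.5 }
let gain = updatedGain(current: 0, capture: capture, reference: reference, smoothing: 1.0)
let residual = cancelEcho(capture: capture, reference: reference, gain: gain)
print(gain, rms(residual))
```

With a smoothing factor below 1, the gain converges gradually, which makes it less sensitive to a single noisy frame.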
No AEC (Fallback)
If neither hardware nor software AEC is available (e.g., no playback reference buffer), Rubber Duck uses input muting:
- The microphone is muted (muteInput = true) when the assistant starts speaking
- Capture continues (silence is sent to the server), so VAD stays active
- After playback finishes and drains, the microphone is unmuted with a delay (400ms + poll for queue drain)
Implemented in VoiceSessionCoordinator.swift:432-459.
Suppression Windows
To further reduce false positives, Rubber Duck applies VAD suppression windows after the assistant stops speaking:
- Post-playback suppression (no AEC only): 900ms after playback ends, any speech_started events are ignored
- Post-audio-delta guard (all modes): For 220ms (hardware AEC) or 450ms (no AEC) after the last audio delta, speech_started is ignored
Implemented in VoiceSessionCoordinator.swift:832-863.
These suppression windows are adaptive: they’re automatically tuned based on your echo cancellation mode.
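The two windows can be expressed as a single check. This is a sketch with hypothetical names; it covers only the hardware-AEC and no-AEC values stated above, since this page does not give the guard value for software AEC.

```swift
/// Suppression settings for one echo cancellation mode (hypothetical type).
struct SuppressionWindows {
    let postAudioDeltaGuardMs: Int
    let postPlaybackMs: Int   // 0 means the window is not applied

    // Values from the list above.
    static let hardwareAEC = SuppressionWindows(postAudioDeltaGuardMs: 220, postPlaybackMs: 0)
    static let noAEC = SuppressionWindows(postAudioDeltaGuardMs: 450, postPlaybackMs: 900)
}

/// Returns true when a speech_started event at `nowMs` falls inside either window.
/// Timestamps are milliseconds on any shared monotonic clock; nil means "never".
func shouldIgnoreSpeechStarted(nowMs: Int, lastAudioDeltaMs: Int?, playbackEndedMs: Int?,
                               windows: SuppressionWindows) -> Bool {
    if let last = lastAudioDeltaMs, nowMs - last < windows.postAudioDeltaGuardMs {
        return true
    }
    if let ended = playbackEndedMs, nowMs - ended < windows.postPlaybackMs {
        return true
    }
    return false
}

// Hardware AEC: 100ms after the last audio delta is still inside the 220ms guard.
print(shouldIgnoreSpeechStarted(nowMs: 1_100, lastAudioDeltaMs: 1_000,
                                playbackEndedMs: nil, windows: .hardwareAEC)) // true
```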
Interruption Race Conditions
Barge-in involves precise timing between client playback state and server conversation state. Race conditions can occur:
1. Truncate-After-Completion
You interrupt just as the assistant finishes speaking. Your truncate request arrives after the server has already marked the response as done.
Server error: item_truncate_invalid_item_id or already shorter than
Rubber Duck behavior: Ignores the error (classified as benign), transitions to listening.
Implemented in VoiceSessionCoordinator.swift:1146-1165.
2. Double-Response Race
You interrupt, but the server has already started generating a new response due to server-side VAD.
Server error: conversation_already_has_active_response
Rubber Duck behavior: Ignores the error (the server will handle the conflict), transitions to listening.
3. Cancel-After-Abort
You send a cancel request, but the server has already aborted the response due to a previous truncate.
Server error: response_cancel_not_active
Rubber Duck behavior: Ignores the error, transitions to listening.
All benign race errors are logged as logInfo (not logError) to reduce noise.
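A classifier for these benign errors might look like the following. The error strings are the ones listed in this section; the function name and the split between an error code and a message are assumptions made for illustration.

```swift
import Foundation

/// Returns true for server errors that this section treats as harmless
/// barge-in races. `code` and `message` mirror a generic error payload
/// (hypothetical shape).
func isBenignRaceError(code: String?, message: String) -> Bool {
    let benignCodes: Set<String> = [
        "item_truncate_invalid_item_id",
        "conversation_already_has_active_response",
        "response_cancel_not_active",
    ]
    if let code = code, benignCodes.contains(code) {
        return true
    }
    // Truncate-after-completion can also surface as a message fragment.
    return message.lowercased().contains("already shorter than")
}

print(isBenignRaceError(code: "response_cancel_not_active", message: ""))
print(isBenignRaceError(code: "rate_limit_exceeded", message: "Too many requests"))
```

A caller could route benign errors to an info-level log and everything else to the error log.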
Debugging Barge-In Issues
Barge-In Not Triggering
Symptoms: You speak during playback, but the assistant doesn’t stop.
Possible causes:
- Echo: Your voice is being masked by echo. Check if AEC is active (duck doctor).
- Confirmation delay too long: Reduce the confirmation delay in Settings.
- Microphone muted: Check if software muting is active (without AEC, input is muted during playback).
- VAD suppression window: You spoke too soon after the last audio delta (< 220ms). Wait slightly longer.
False Barge-Ins (Echo Triggering Interruption)
Symptoms: The assistant stops speaking even though you didn’t say anything.
Possible causes:
- Echo bleed: The assistant’s voice is reaching the microphone and triggering VAD.
- Confirmation delay too short: Increase the confirmation delay in Settings.
- Background noise: Ambient sound is triggering VAD during playback.
Solutions:
- Use hardware AEC: Check duck doctor; if hardware AEC is not active, try disconnecting external audio devices.
- Increase confirmation delay: Go to Settings > Voice and increase the barge-in confirmation delay to 500ms or more.
- Use a directional microphone: Reduce room echo with acoustic treatment or a cardioid mic.
Barge-In Position Incorrect
Symptoms: The conversation tree shows text you didn’t hear (the truncation point is too late).
Possible causes:
- Playback buffer lag: The playback position is ahead of what you actually heard.
- Audio device latency: External speakers/headphones introduce output delay.
CLI Visibility
When barge-in occurs, the CLI prints a line indicating the action taken. Search for “Barge-in” in the CLI output to see all interruptions in a session.
Related
- Voice Interface - How voice activation and speech detection work
- Sessions - Managing multiple sessions and background runs
- CLI Commands - Using the CLI alongside voice