LocalVoiceAI is a push-to-talk voice input tool that runs entirely on your Mac: no cloud, no API calls, no subscription. When you hold F10, a C-level event tap detects the keypress and signals a Go process to begin capturing audio through PortAudio. On release, the captured PCM samples are encoded into a WAV file and handed off to whisper-cli, which runs the Whisper small model on the Apple Metal GPU in roughly one to two seconds. The resulting text is filtered for non-speech noise, piped to pbcopy, and then injected into whatever window is focused via a synthesized Cmd+V keystroke. Every step happens locally on your machine.
Full Pipeline
Key Monitoring
A C-level CGEventTap runs on a dedicated C pthread, not a goroutine. It is created with three flags: kCGSessionEventTap (session-level scope), kCGHeadInsertEventTap (inserted at the head of the event tap list), and kCGEventTapOptionListenOnly (passive observation only). It registers a callback for kCGEventKeyDown and kCGEventKeyUp events and watches for the configured push-to-talk keycode (default: F10, keycode 109).
When the key is pressed, the callback checks CGEventGetIntegerValueField(event, kCGKeyboardEventAutorepeat). If the value is non-zero, the event is a key repeat generated by the OS and is ignored. Only the first physical keydown writes a D byte to the write end of a Unix pipe. On release, the callback writes U. If CGEventTapCreate fails due to missing permissions, the C thread writes E and the Go process exits with a human-readable error message pointing to System Settings.
Go reads from the read end of this pipe in a syscall.Read loop on the main goroutine.
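The pipe-reading side described above can be sketched in a few lines of Go. This is a minimal illustration, not the project's actual source: keyHandlers, dispatch, readSignals, and errTapFailed are hypothetical names, and the loop reads through an io.Reader rather than calling syscall.Read directly so that it can be exercised without a real pipe.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// keyHandlers holds the callbacks invoked for each signal byte.
// The struct is hypothetical; only the onKeyDown/onKeyUp names come from the text.
type keyHandlers struct {
	onKeyDown func()
	onKeyUp   func()
}

// errTapFailed is returned when the C side reports that the event tap
// could not be created.
var errTapFailed = errors.New("event tap unavailable: check permissions in System Settings")

// dispatch maps one signal byte to its handler. Unknown bytes are ignored.
func dispatch(b byte, h keyHandlers) error {
	switch b {
	case 'D':
		h.onKeyDown()
	case 'U':
		h.onKeyUp()
	case 'E':
		return errTapFailed
	}
	return nil
}

// readSignals blocks on r and dispatches one byte at a time until EOF or error.
func readSignals(r io.Reader, h keyHandlers) error {
	buf := make([]byte, 1)
	for {
		n, err := r.Read(buf)
		if n == 1 {
			if derr := dispatch(buf[0], h); derr != nil {
				return derr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	h := keyHandlers{
		onKeyDown: func() { fmt.Println("start recording") },
		onKeyUp:   func() { fmt.Println("stop recording") },
	}
	// A strings.Reader stands in for the pipe's read end.
	if err := readSignals(strings.NewReader("DU"), h); err != nil {
		fmt.Println("error:", err)
	}
}
```

In a real program the reader would be the pipe's read end, for example an *os.File created with os.NewFile from the file descriptor shared with the C thread.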
CGEventTap is created with kCGEventTapOptionListenOnly, so LocalVoiceAI only observes key events and never blocks or swallows them. Your normal keyboard input is completely unaffected while recording.
Audio Capture
onKeyDown() calls portaudio.DefaultInputDevice() to read the microphone’s native sample rate (typically 48000 Hz or 44100 Hz), then opens a mono float32 stream with 1024 frames per buffer using portaudio.OpenDefaultStream. A goroutine continuously calls stream.Read() and appends each buffer of []float32 samples to a shared slice guarded by a sync.Mutex.
WAV Encoding
onKeyUp() stops and closes the PortAudio stream, then copies the captured samples under the mutex. If fewer than actualRate / 4 samples were collected (less than ~250 ms of audio; for example, under 12,000 samples at 48000 Hz), the recording is silently discarded as too short.
Otherwise, transcribeAndPaste is launched as a goroutine so that the main pipe loop stays responsive. It creates a temporary file (os.CreateTemp, pattern whisper-*.wav) and calls writeWAV, which writes a standard RIFF/WAV header followed by 16-bit PCM samples converted from float32. The resulting file is then handed off to whisper-cli.
Transcription
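The WAV encoding step described above amounts to a 44-byte RIFF header followed by little-endian 16-bit samples. Here is a minimal sketch, assuming mono audio and clamping of out-of-range input. The names wavBytes and floatToPCM16 are hypothetical, and this is not necessarily how the project's writeWAV is implemented.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// floatToPCM16 clamps a sample to [-1, 1] and scales it to a signed 16-bit value.
func floatToPCM16(s float32) int16 {
	if s > 1 {
		s = 1
	}
	if s < -1 {
		s = -1
	}
	return int16(s * 32767)
}

// wavBytes returns a complete mono, 16-bit PCM WAV file image for the samples.
func wavBytes(samples []float32, sampleRate uint32) []byte {
	const headerSize = 44
	dataSize := uint32(len(samples) * 2) // 2 bytes per mono sample
	out := make([]byte, headerSize+len(samples)*2)
	le := binary.LittleEndian

	copy(out[0:], "RIFF")
	le.PutUint32(out[4:], 36+dataSize) // file size minus the first 8 bytes
	copy(out[8:], "WAVE")
	copy(out[12:], "fmt ")
	le.PutUint32(out[16:], 16)           // size of the fmt chunk
	le.PutUint16(out[20:], 1)            // audio format 1 = PCM
	le.PutUint16(out[22:], 1)            // channels: mono
	le.PutUint32(out[24:], sampleRate)   // sample rate in Hz
	le.PutUint32(out[28:], sampleRate*2) // byte rate = rate * block align
	le.PutUint16(out[32:], 2)            // block align: bytes per frame
	le.PutUint16(out[34:], 16)           // bits per sample
	copy(out[36:], "data")
	le.PutUint32(out[40:], dataSize)

	for i, s := range samples {
		le.PutUint16(out[headerSize+2*i:], uint16(floatToPCM16(s)))
	}
	return out
}

func main() {
	// Four samples at 48 kHz: a 44-byte header plus 8 bytes of audio.
	data := wavBytes([]float32{0, 0.5, -0.5, 1}, 48000)
	fmt.Println("total bytes:", len(data))
}
```

The returned bytes could be written to the temporary file created with os.CreateTemp before whisper-cli is invoked.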
whisper-cli is invoked as a subprocess with the following flags:
| Flag | Purpose |
|---|---|
| -m ~/.cache/localvoice/ggml-small.bin | Path to the ggml-small Whisper model (244M parameters, roughly 466 MB on disk) |
| -f <wav> | Input audio file |
| --no-timestamps | Omit [00:00.000] timestamp prefixes from output |
| -otxt | Write result to a .txt sidecar file |
| --output-file <wav> | Output base path (sidecar becomes <wav>.txt) |
| --no-speech-thold 0.8 | Silence probability threshold (see step 5) |
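Assembled in Go, the flags in the table could be passed to os/exec along these lines. This is a sketch only: whisperArgs and whisperCommand are hypothetical helper names, the paths in main are placeholders, and the binary is assumed to be on PATH as whisper-cli.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// whisperArgs returns the argument list from the flag table.
// Note that os/exec does not expand "~"; the caller must pass an absolute
// model path (for example, one built from os.UserHomeDir).
func whisperArgs(modelPath, wavPath string) []string {
	return []string{
		"-m", modelPath,
		"-f", wavPath,
		"--no-timestamps",
		"-otxt",
		"--output-file", wavPath, // sidecar becomes wavPath + ".txt"
		"--no-speech-thold", "0.8",
	}
}

// whisperCommand builds the subprocess without starting it.
func whisperCommand(modelPath, wavPath string) *exec.Cmd {
	return exec.Command("whisper-cli", whisperArgs(modelPath, wavPath)...)
}

func main() {
	cmd := whisperCommand("/path/to/ggml-small.bin", "/tmp/whisper-example.wav")
	// Print the command line instead of running it.
	fmt.Println(strings.Join(cmd.Args, " "))
}
```

Calling cmd.Run() would block until transcription finishes, after which the transcript can be read from the .txt sidecar next to the WAV file.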
The model is downloaded (from ggerganov/whisper.cpp) if it is not already present in ~/.cache/localvoice/. Inference runs on the Apple Metal GPU via the ggml-metal backend, taking roughly one to two seconds for a short phrase.
Non-Speech Filtering
After whisper-cli writes its output, the .txt sidecar is read and trimmed. Before pasting anything, the text is checked against a compiled regular expression that matches Whisper’s noise/non-speech annotations. This pattern catches outputs like (music), (phone buzzing), (applause), ♪, and similar strings that contain no real Latin alphabet words. Matches are silently discarded with a [SKIP] log entry.
The --no-speech-thold 0.8 flag passed to whisper-cli provides a complementary layer: it raises Whisper’s internal no-speech probability threshold above the default, causing the model itself to suppress more background-noise segments before any text is produced. Together, the two filters significantly reduce spurious clipboard writes when the microphone picks up ambient sound.
Paste
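The non-speech check described above can be approximated with Go's regexp package. The pattern below is an assumption made for illustration, not the expression LocalVoiceAI actually compiles: it treats a transcript as non-speech when it consists only of parenthesized or bracketed annotations and music-note symbols. The names nonSpeech and isNonSpeech are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonSpeech matches text made up entirely of annotations such as "(music)",
// "[BLANK_AUDIO]", or note symbols, optionally separated by whitespace.
// Illustrative pattern only; the project's real expression is not shown in the text.
var nonSpeech = regexp.MustCompile(`^(?:\s*(?:\([^)]*\)|\[[^\]]*\]|[♪♫*]+)\s*)+$`)

// isNonSpeech reports whether a transcript should be skipped rather than pasted.
// Empty or whitespace-only output is also treated as non-speech.
func isNonSpeech(text string) bool {
	t := strings.TrimSpace(text)
	return t == "" || nonSpeech.MatchString(t)
}

func main() {
	for _, s := range []string{"(music)", "♪", "hello world"} {
		fmt.Printf("%q non-speech: %v\n", s, isNonSpeech(s))
	}
}
```

A transcript that mixes an annotation with real words, such as "(laughs) hello", does not match and would still be pasted under this sketch.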
Valid transcript text is piped to pbcopy to place it on the macOS clipboard. Then CGEventPost synthesizes a Cmd+V keystroke sequence into the currently focused window using four discrete keyboard events, in order: Command key down, V key down (with kCGEventFlagMaskCommand), V key up, Command key up.
The text appears at the cursor in whatever app was last focused: Claude, Cursor, VS Code, a browser text field, a terminal, anything.
Architecture Component Overview
| Component | Technology | Purpose |
|---|---|---|
| Key monitoring | CGEventTap (C pthread) | Detect F10 press/release system-wide |
| Audio capture | PortAudio (Go + CGo) | Record microphone at native sample rate |
| Transcription | whisper-cli (whisper.cpp) | On-device speech-to-text |
| GPU acceleration | Apple Metal (ggml-metal) | Fast model inference (~1–2 s) |
| Clipboard | pbcopy | Copy transcript text to clipboard |
| Paste simulation | CGEventPost | Synthesize Cmd+V into the focused window |
| Service | macOS LaunchAgent | Terminal-independent, persistent operation |
Why CGo + a C Pthread for Key Monitoring?
CGEventTap requires a CFRunLoop to deliver events. CFRunLoop must be driven on the same OS thread that created it, and it must call CFRunLoopRun() — which blocks that thread indefinitely.
Go’s scheduler regularly migrates goroutines between OS threads, so a goroutine is not guaranteed to stay on the thread that created the run loop. Calling CFRunLoopRun() inside a goroutine would also block one of the runtime’s OS threads indefinitely, which can cause subtle, hard-to-reproduce deadlocks as other goroutines try to run.
The solution is to create the CGEventTap and run CFRunLoopRun() entirely inside a C-owned pthread, allocated once and then detached.
That thread runs its CFRunLoop forever. Go never touches it. The only communication between the two sides is through the Unix pipe: C writes single-byte signals, Go reads them. This is a safe, minimal IPC boundary with no shared memory.
Concurrency Model
LocalVoiceAI uses three concurrent actors:
C pthread (CFRunLoop): Runs CFRunLoopRun() forever. Writes D, U, or E bytes to the pipe write-end on key events. Never calls into Go.
Main goroutine (pipe reader): Loops on syscall.Read from the pipe read-end. Calls onKeyDown() and onKeyUp() synchronously. Owns the sync.Mutex-protected samples slice and stream pointer.
Recording goroutine: Spawned by onKeyDown(). Calls stream.Read() in a tight loop, appending frames to samples under the mutex. Exits when stream is set to nil by onKeyUp().
A separate short-lived goroutine is spawned by onKeyUp() to run transcribeAndPaste; this keeps the pipe-reading loop unblocked during the ~1–2 second whisper-cli subprocess call.
All access to samples, stream, actualRate, and recordStart is protected by a single sync.Mutex. The recording goroutine checks whether stream == nil on each iteration to know when to stop. onKeyUp sets it to nil under the mutex, so the goroutine sees the change on its next check and exits without a separate channel or WaitGroup.
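The stop-by-nil pattern described above can be sketched without PortAudio. In this illustration the frameSource interface and constSource type stand in for the audio stream, and recorder, start, stop, and count are hypothetical names. The real stream API differs, so treat this as a model of the locking discipline rather than the project's code.

```go
package main

import (
	"fmt"
	"sync"
)

// frameSource stands in for the audio stream (hypothetical interface).
type frameSource interface {
	Read() ([]float32, error)
}

// constSource is a fake source that returns four silent frames per call.
type constSource struct{}

func (constSource) Read() ([]float32, error) { return make([]float32, 4), nil }

// recorder owns the shared state; stream and samples are guarded by mu.
type recorder struct {
	mu      sync.Mutex
	stream  frameSource
	samples []float32
}

// start plays the role of onKeyDown: install the stream, spawn the reader.
func (r *recorder) start(src frameSource) {
	r.mu.Lock()
	r.stream = src
	r.samples = nil
	r.mu.Unlock()
	go r.loop()
}

// loop re-checks stream on every iteration and exits once it is nil.
func (r *recorder) loop() {
	for {
		r.mu.Lock()
		s := r.stream
		r.mu.Unlock()
		if s == nil {
			return
		}
		buf, err := s.Read() // read outside the lock so stop() is never blocked
		if err != nil {
			return
		}
		r.mu.Lock()
		if r.stream != nil { // drop frames that arrive after stop()
			r.samples = append(r.samples, buf...)
		}
		r.mu.Unlock()
	}
}

// stop plays the role of onKeyUp: clear the stream and copy out the samples.
func (r *recorder) stop() []float32 {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.stream = nil
	out := make([]float32, len(r.samples))
	copy(out, r.samples)
	return out
}

// count reports how many samples have been captured so far.
func (r *recorder) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.samples)
}

func main() {
	r := &recorder{}
	r.start(constSource{})
	for r.count() == 0 { // wait until at least one buffer has been captured
	}
	fmt.Println("captured samples:", len(r.stop()))
}
```

Because the goroutine only appends while stream is still non-nil, the slice returned by stop() is a stable snapshot even if a Read call was in flight when the key was released.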