LocalVoiceAI is a push-to-talk voice input tool that runs entirely on your Mac — no cloud, no API calls, no subscription. When you hold F10, a C-level event tap detects the keypress and signals a Go process to begin capturing audio through PortAudio. On release, the captured PCM samples are encoded into a WAV file and handed off to whisper-cli, which runs the Whisper small model on Apple Metal GPU in roughly one to two seconds. The resulting text is filtered for non-speech noise, piped to pbcopy, and then injected into whatever window is focused via a synthesized Cmd+V keystroke. Every step happens locally on your machine.

Full Pipeline

1. Key Monitoring

A C-level CGEventTap runs on a dedicated C pthread, not a goroutine. It is created with three flags: kCGSessionEventTap (session-level scope), kCGHeadInsertEventTap (inserted at the head of the event tap list), and kCGEventTapOptionListenOnly (passive observation only). It registers a callback for kCGEventKeyDown and kCGEventKeyUp events and watches for the configured push-to-talk keycode (default: F10, keycode 109).

When the key is pressed, the callback checks CGEventGetIntegerValueField(event, kCGKeyboardEventAutorepeat). A non-zero value means the event is an OS-generated key repeat, and it is ignored. Only the first physical keydown writes a D byte to the write end of a Unix pipe; on release the callback writes U. If CGEventTapCreate fails because of missing permissions, no callback is ever installed, so the C thread itself writes E and the Go process exits with a human-readable error message pointing to System Settings.

Go reads from the read end of this pipe in a syscall.Read loop on the main goroutine:
// 'D' = key down  → start recording
// 'U' = key up    → stop recording, transcribe
// 'E' = tap failed → permissions missing
switch buf[0] {
case 'D':
    if !optionDown { optionDown = true; onKeyDown() }
case 'U':
    if optionDown { optionDown = false; onKeyUp() }
case 'E':
    // log error, os.Exit(1)
}
The CGEventTap is created with kCGEventTapOptionListenOnly, meaning LocalVoiceAI only observes key events — it never blocks or swallows them. Your normal keyboard input is completely unaffected while recording.
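The byte protocol above can be exercised without a real event tap. The following is a minimal sketch, not the project's code: it assumes the signals arrive on any io.Reader (the real program reads a pipe file descriptor), and the names dispatch, trace, and errTapFailed are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// errTapFailed is a hypothetical sentinel for the 'E' signal.
var errTapFailed = errors.New("event tap could not be created; check permissions")

// dispatch reads one signal byte at a time from r and fires onDown/onUp
// only on state changes, so a repeated 'D' or a stray 'U' is ignored.
// It returns nil when r is exhausted and errTapFailed on 'E'.
func dispatch(r io.Reader, onDown, onUp func()) error {
	buf := make([]byte, 1)
	keyDown := false
	for {
		n, err := r.Read(buf)
		if n == 1 {
			switch buf[0] {
			case 'D':
				if !keyDown {
					keyDown = true
					onDown()
				}
			case 'U':
				if keyDown {
					keyDown = false
					onUp()
				}
			case 'E':
				return errTapFailed
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

// trace feeds a string of signal bytes through dispatch and records
// which callbacks fired, purely for illustration.
func trace(signals string) (string, error) {
	var events []string
	err := dispatch(strings.NewReader(signals),
		func() { events = append(events, "down") },
		func() { events = append(events, "up") })
	return strings.Join(events, ","), err
}

func main() {
	got, err := trace("DDUUD")
	fmt.Println(got, err)
}
```

Keeping the down/up state in the reader as well as in the C callback means a duplicated byte on the pipe cannot start two recordings.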
2. Audio Capture

onKeyDown() calls portaudio.DefaultInputDevice() to read the microphone’s native sample rate (typically 48000 Hz or 44100 Hz), then opens a mono float32 stream with 1024 frames per buffer using portaudio.OpenDefaultStream. A goroutine continuously calls stream.Read() and appends each buffer of []float32 samples into a shared slice guarded by a sync.Mutex.
go func() {
    for {
        if err := stream.Read(); err != nil { return }
        mu.Lock()
        samples = append(samples, buf...)
        mu.Unlock()
    }
}()
Using the device’s native sample rate is intentional — it avoids PortAudio’s internal software resampler, which can introduce artifacts that noticeably degrade Whisper’s transcription accuracy on short phrases.
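The capture loop can be tried without a microphone. The sketch below assumes a hypothetical frameSource interface in place of the PortAudio stream and a fakeMic that yields a fixed number of blocks; captureBuffer, record, and captureBlocks are illustrative names, not taken from the project.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// frameSource is a hypothetical stand-in for an open input stream:
// each Read call fills the caller's buffer with one block of frames.
type frameSource interface {
	Read(buf []float32) error
}

// fakeMic yields a fixed number of constant-valued blocks, then an error.
type fakeMic struct{ blocksLeft int }

func (m *fakeMic) Read(buf []float32) error {
	if m.blocksLeft == 0 {
		return errors.New("stream closed")
	}
	m.blocksLeft--
	for i := range buf {
		buf[i] = 0.25
	}
	return nil
}

// captureBuffer accumulates samples behind a mutex so that another
// goroutine can take a consistent snapshot at any time.
type captureBuffer struct {
	mu      sync.Mutex
	samples []float32
}

func (c *captureBuffer) Append(block []float32) {
	c.mu.Lock()
	c.samples = append(c.samples, block...) // copies values, so block can be reused
	c.mu.Unlock()
}

func (c *captureBuffer) Snapshot() []float32 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make([]float32, len(c.samples))
	copy(out, c.samples)
	return out
}

// record reads blocks until src reports an error, as the loop above does.
func record(src frameSource, framesPerBuffer int, dst *captureBuffer) {
	buf := make([]float32, framesPerBuffer)
	for {
		if err := src.Read(buf); err != nil {
			return
		}
		dst.Append(buf)
	}
}

// captureBlocks records n blocks of 1024 frames on a separate goroutine
// and returns how many samples were collected.
func captureBlocks(n int) int {
	var dst captureBuffer
	done := make(chan struct{})
	go func() {
		record(&fakeMic{blocksLeft: n}, 1024, &dst)
		close(done)
	}()
	<-done
	return len(dst.Snapshot())
}

func main() {
	fmt.Println(captureBlocks(3), "samples captured")
}
```

Because append copies the block's values into the shared slice, the read buffer can be reused on every iteration without corrupting earlier samples.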
3. WAV Encoding

onKeyUp() stops and closes the PortAudio stream, then copies the captured samples under the mutex. If fewer than actualRate / 4 samples were collected (less than ~250 ms of audio), the recording is silently discarded as too short.

Otherwise, transcribeAndPaste is launched as a goroutine (so the main pipe loop stays responsive). It creates a temporary file (os.CreateTemp, pattern whisper-*.wav) and calls writeWAV, which writes a standard RIFF/WAV header followed by 16-bit PCM samples converted from float32:
v := int16(s * math.MaxInt16)
f.Write([]byte{byte(v), byte(v >> 8)})
The result is a 16-bit, mono, little-endian WAV at the mic’s native sample rate, ready for whisper-cli.
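For readers who want to see the whole file layout, here is a rough sketch of a 16-bit mono WAV writer together with the quarter-second cutoff. It is not the project's writeWAV: the names encodeWAV, wavBytes, and tooShort are hypothetical, and the clamp to [-1, 1] is an addition in this sketch that the source does not show.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"math"
)

// tooShort mirrors the quarter-second cutoff described above.
func tooShort(numSamples, sampleRate int) bool {
	return numSamples < sampleRate/4
}

// encodeWAV writes a 44-byte RIFF/WAV header followed by 16-bit
// little-endian mono PCM converted from float32 samples.
func encodeWAV(w io.Writer, samples []float32, sampleRate int) error {
	const (
		channels      = 1
		bitsPerSample = 16
		blockAlign    = channels * bitsPerSample / 8
	)
	dataLen := uint32(len(samples) * blockAlign)

	hdr := make([]byte, 44)
	copy(hdr[0:], "RIFF")
	binary.LittleEndian.PutUint32(hdr[4:], 36+dataLen) // file size minus 8
	copy(hdr[8:], "WAVEfmt ")
	binary.LittleEndian.PutUint32(hdr[16:], 16) // fmt chunk size
	binary.LittleEndian.PutUint16(hdr[20:], 1)  // format tag 1 = PCM
	binary.LittleEndian.PutUint16(hdr[22:], channels)
	binary.LittleEndian.PutUint32(hdr[24:], uint32(sampleRate))
	binary.LittleEndian.PutUint32(hdr[28:], uint32(sampleRate*blockAlign)) // byte rate
	binary.LittleEndian.PutUint16(hdr[32:], blockAlign)
	binary.LittleEndian.PutUint16(hdr[34:], bitsPerSample)
	copy(hdr[36:], "data")
	binary.LittleEndian.PutUint32(hdr[40:], dataLen)

	pcm := make([]byte, 0, len(samples)*blockAlign)
	for _, s := range samples {
		// Clamp so out-of-range input cannot overflow int16 (sketch-only addition).
		if s > 1 {
			s = 1
		} else if s < -1 {
			s = -1
		}
		v := int16(s * math.MaxInt16)
		pcm = append(pcm, byte(v), byte(v>>8)) // low byte first
	}

	if _, err := w.Write(hdr); err != nil {
		return err
	}
	_, err := w.Write(pcm)
	return err
}

// wavBytes is a convenience wrapper that encodes into memory.
func wavBytes(samples []float32, sampleRate int) []byte {
	var buf bytes.Buffer
	if err := encodeWAV(&buf, samples, sampleRate); err != nil {
		return nil
	}
	return buf.Bytes()
}

func main() {
	out := wavBytes([]float32{0, 0.5, -1, 2}, 48000)
	fmt.Println(len(out), "bytes;", "too short:", tooShort(4, 48000))
}
```

Writing to an io.Writer rather than a file keeps the encoder easy to test; the caller can pass the temporary file directly.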
4. Transcription

whisper-cli is invoked as a subprocess with the following flags:
whisper-cli \
  -m ~/.cache/localvoice/ggml-small.bin \
  -f /tmp/whisper-XXXXXX.wav \
  --no-timestamps \
  -otxt \
  --output-file /tmp/whisper-XXXXXX.wav \
  --no-speech-thold 0.8
| Flag | Purpose |
| --- | --- |
| -m ~/.cache/localvoice/ggml-small.bin | Path to the ggml-small Whisper model (244 M parameters, about 466 MiB on disk) |
| -f <wav> | Input audio file |
| --no-timestamps | Omit [00:00.000] timestamp prefixes from output |
| -otxt | Write result to a .txt sidecar file |
| --output-file <wav> | Output base path (sidecar becomes <wav>.txt) |
| --no-speech-thold 0.8 | Silence probability threshold (see step 5) |
The model is downloaded automatically on first run from HuggingFace (ggerganov/whisper.cpp) if it is not already present in ~/.cache/localvoice/. Inference runs on the Apple Metal GPU via the ggml-metal backend, taking roughly one to two seconds for a short phrase.
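As an illustration of how the invocation above might be assembled from Go, the sketch below builds the command without running it. The helper names whisperCommand and sidecarPath and the example paths are hypothetical; only the flags come from the listing above.

```go
package main

import (
	"fmt"
	"os/exec"
)

// whisperCommand builds (but does not run) a whisper-cli invocation
// with the flags listed above. modelPath must already be an absolute
// path: exec.Command does not go through a shell, so "~" is not expanded.
func whisperCommand(modelPath, wavPath string) *exec.Cmd {
	return exec.Command("whisper-cli",
		"-m", modelPath,
		"-f", wavPath,
		"--no-timestamps",
		"-otxt",
		"--output-file", wavPath, // base path; whisper-cli appends ".txt"
		"--no-speech-thold", "0.8",
	)
}

// sidecarPath returns where the transcript is expected to appear.
func sidecarPath(wavPath string) string {
	return wavPath + ".txt"
}

func main() {
	cmd := whisperCommand("/path/to/ggml-small.bin", "/tmp/whisper-example.wav")
	fmt.Println(cmd.Args)
	fmt.Println(sidecarPath("/tmp/whisper-example.wav"))
}
```

A caller would typically run the command with cmd.Run() or cmd.CombinedOutput() and then read the sidecar file, removing both temporary files afterwards.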
5. Non-Speech Filtering

After whisper-cli writes its output, the .txt sidecar is read and trimmed. Before pasting anything, the text is checked against a compiled regular expression that matches Whisper’s noise/non-speech annotations:
var whisperAnnotation = regexp.MustCompile(
    `^[\s\(\[♪\*]*[^a-zA-Z]*[\)\]♪\*]*$|^\s*\([^)]*\)\s*$`,
)
This pattern catches parenthesized annotations such as (music), (phone buzzing), and (applause), along with strings that contain no Latin letters at all, such as a run of ♪ symbols. Matches are silently discarded with a [SKIP] log entry.

The --no-speech-thold 0.8 flag passed to whisper-cli provides a complementary layer: it sets Whisper's internal no-speech probability threshold to 0.8 (the whisper-cli default is 0.6), and the model itself can drop segments whose no-speech probability exceeds that threshold before any text is produced. Together, both filters significantly reduce spurious clipboard writes when the microphone picks up ambient sound.
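To see what the expression accepts and rejects, it can be run against a few sample strings. This is a small sketch: the pattern is copied from above, while the isNonSpeech wrapper and the sample inputs are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Same pattern as above: either a string with no Latin letters
// (optionally wrapped in brackets, notes, or asterisks), or a single
// parenthesized annotation.
var whisperAnnotation = regexp.MustCompile(
	`^[\s\(\[♪\*]*[^a-zA-Z]*[\)\]♪\*]*$|^\s*\([^)]*\)\s*$`,
)

// isNonSpeech reports whether a trimmed transcript should be skipped.
func isNonSpeech(raw string) bool {
	return whisperAnnotation.MatchString(strings.TrimSpace(raw))
}

func main() {
	for _, s := range []string{
		"(music)",
		" (phone buzzing)\n",
		"♪♪♪",
		"",
		"Hello there.",
		"Open the (settings) page",
	} {
		fmt.Printf("%q skip=%v\n", s, isNonSpeech(s))
	}
}
```

Note that the first alternative rejects any string containing a Latin letter, so a bracketed tag with letters inside square brackets is not matched by this pattern; only the parenthesized form is.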
6. Paste

Valid transcript text is piped to pbcopy to place it on the macOS clipboard:
pb := exec.Command("pbcopy")
pb.Stdin = strings.NewReader(text)
pb.Run()
Then CGEventPost synthesizes a Cmd+V keystroke sequence into the currently focused window using four discrete keyboard events — in order: Command key down, V key down (with kCGEventFlagMaskCommand), V key up, Command key up:
CGEventRef cmdDown = CGEventCreateKeyboardEvent(NULL, kVK_Command, true);
CGEventRef vDown   = CGEventCreateKeyboardEvent(NULL, kVK_ANSI_V, true);
CGEventRef vUp     = CGEventCreateKeyboardEvent(NULL, kVK_ANSI_V, false);
CGEventRef cmdUp   = CGEventCreateKeyboardEvent(NULL, kVK_Command, false);
CGEventSetFlags(vDown, kCGEventFlagMaskCommand);
CGEventSetFlags(vUp,   kCGEventFlagMaskCommand);
CGEventPost(kCGAnnotatedSessionEventTap, cmdDown);
CGEventPost(kCGAnnotatedSessionEventTap, vDown);
CGEventPost(kCGAnnotatedSessionEventTap, vUp);
CGEventPost(kCGAnnotatedSessionEventTap, cmdUp);
The text appears at the cursor in whatever app was last focused — Claude, Cursor, VS Code, a browser text field, a terminal, anything.

Architecture Component Overview

| Component | Technology | Purpose |
| --- | --- | --- |
| Key monitoring | CGEventTap (C pthread) | Detect F10 press/release system-wide |
| Audio capture | PortAudio (Go + CGo) | Record microphone at native sample rate |
| Transcription | whisper-cli (whisper.cpp) | On-device speech-to-text |
| GPU acceleration | Apple Metal (ggml-metal) | Fast model inference (~1–2 s) |
| Clipboard | pbcopy | Copy transcript text to clipboard |
| Paste simulation | CGEventPost | Synthesize Cmd+V into the focused window |
| Service | macOS LaunchAgent | Terminal-independent, persistent operation |

Why CGo + a C Pthread for Key Monitoring?

CGEventTap requires a CFRunLoop to deliver events. Each run loop belongs to one OS thread, so the tap's run loop source must be added and CFRunLoopRun() must be called on that same thread, and CFRunLoopRun() then blocks the thread indefinitely. Go's scheduler is free to move goroutines between OS threads, so Go code has no thread affinity unless it calls runtime.LockOSThread(), and a never-returning cgo call would tie up an OS thread for the life of the process while relying on the runtime to schedule around it. Rather than manage those constraints from Go, LocalVoiceAI creates the CGEventTap and runs CFRunLoopRun() entirely inside a C-owned pthread, created once and then detached:
static void* eventThread(void *arg) {
    // ... create tap, add to run loop source ...
    CFRunLoopRun();  // owns this thread forever
    return NULL;
}

static int startMonitoring() {
    int fds[2];
    pipe(fds);
    g_write_fd = fds[1];
    pthread_t t;
    pthread_create(&t, NULL, eventThread, NULL);
    pthread_detach(t);
    return fds[0];   // read end returned to Go
}
The C thread owns its CFRunLoop forever. Go never touches that thread. The only communication between the two sides is through the Unix pipe: C writes single-byte signals, Go reads them. This is a safe, minimal IPC boundary with no shared memory.
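The single-byte pipe boundary can be tried in pure Go. In the sketch below a goroutine stands in for the C thread, and os.Pipe is used instead of the raw pipe(2) call and syscall.Read from the source; the relay helper is a hypothetical name.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// relay sends each signal byte through an OS pipe from a writer
// goroutine (standing in for the C thread) and collects what the
// reader side sees, one byte per read.
func relay(signals []byte) ([]byte, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return nil, err
	}
	defer r.Close()

	go func() {
		defer w.Close() // closing the write end lets the reader see EOF
		for _, b := range signals {
			if _, err := w.Write([]byte{b}); err != nil {
				return
			}
		}
	}()

	var got []byte
	buf := make([]byte, 1)
	for {
		n, err := r.Read(buf)
		if n == 1 {
			got = append(got, buf[0])
		}
		if err == io.EOF {
			return got, nil
		}
		if err != nil {
			return got, err
		}
	}
}

func main() {
	got, err := relay([]byte("DUDU"))
	fmt.Printf("%s %v\n", got, err)
}
```

The pipe preserves byte order, so the reader sees key-down and key-up signals in the order they were written. In the real program the write end is never closed, so the read loop simply blocks until the next key event.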

Concurrency Model

LocalVoiceAI uses three concurrent actors:

C pthread (CFRunLoop)

Runs CFRunLoopRun() forever. Writes D, U, or E bytes to the pipe write-end on key events. Never calls into Go.

Main goroutine (pipe reader)

Loops on syscall.Read from the pipe read-end. Calls onKeyDown() and onKeyUp() synchronously. Owns the sync.Mutex-protected samples slice and stream pointer.

Recording goroutine

Spawned by onKeyDown(). Calls stream.Read() in a tight loop, appending frames to samples under the mutex. Exits when stream is set to nil by onKeyUp().
A fourth, short-lived goroutine is spawned by onKeyUp() to run transcribeAndPaste — this keeps the pipe-reading loop unblocked during the ~1–2 second whisper-cli subprocess call.
All access to samples, stream, actualRate, and recordStart is protected by a single sync.Mutex. The recording goroutine checks whether stream == nil on each iteration to know when to stop. onKeyUp sets stream to nil while holding the mutex, so the goroutine observes the change on its next iteration and exits, giving a clean shutdown without a separate channel or WaitGroup.
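The shutdown pattern can be sketched with a fake stream. This is an illustration under stated assumptions, not the project's code: fakeStream, session, and demo are hypothetical names, the fake read sleeps for a millisecond to imitate a blocking device read, and the goroutine compares the shared pointer against its own stream (rather than only against nil) so that a quick restart cannot revive an old goroutine.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const framesPerBuffer = 1024

// fakeStream is a hypothetical stand-in for an input stream.
type fakeStream struct{ delay time.Duration }

// read blocks briefly, like a device read, and returns one buffer of silence.
func (f *fakeStream) read() []float32 {
	time.Sleep(f.delay)
	return make([]float32, framesPerBuffer)
}

// session holds the state shared between the key handlers and the
// recording goroutine; every field is guarded by mu.
type session struct {
	mu      sync.Mutex
	stream  *fakeStream
	samples []float32
}

func (s *session) start() {
	st := &fakeStream{delay: time.Millisecond}
	s.mu.Lock()
	s.stream = st
	s.samples = nil
	s.mu.Unlock()

	go func() {
		for {
			buf := st.read() // blocking read happens outside the lock
			s.mu.Lock()
			if s.stream != st { // cleared or replaced: stop without appending
				s.mu.Unlock()
				return
			}
			s.samples = append(s.samples, buf...)
			s.mu.Unlock()
		}
	}()
}

// stop clears the stream pointer and returns a copy of what was captured.
func (s *session) stop() []float32 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.stream = nil
	out := make([]float32, len(s.samples))
	copy(out, s.samples)
	return out
}

func (s *session) sampleCount() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.samples)
}

// demo records briefly, stops, waits, and reports the sample count at
// stop time and again later.
func demo() (atStop, later int) {
	var s session
	s.start()
	time.Sleep(20 * time.Millisecond)
	atStop = len(s.stop())
	time.Sleep(20 * time.Millisecond)
	return atStop, s.sampleCount()
}

func main() {
	atStop, later := demo()
	fmt.Println("samples at stop:", atStop, "later:", later)
}
```

Because the pointer check and the append happen under the same lock that stop holds, nothing can be appended after stop returns, which is why no channel or WaitGroup is needed for correctness.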
