Voice Input (Whisper)

SuperCmd integrates OpenAI’s Whisper API to provide accurate speech-to-text transcription, allowing you to dictate commands, write text, and control your computer with your voice.

Overview

Voice input in SuperCmd records your speech and sends it to the Whisper API for transcription. The transcribed text is then inserted at your cursor position or used as a command in SuperCmd.

Whisper voice input requires an OpenAI API key. You’ll be prompted to configure this on first use.

Getting Started

Configure API Key

Open Settings (Cmd+,) > AI tab and enter your OpenAI API key.

Activate Voice Input

Press and hold the Fn key (or your configured voice input hotkey) to open the Whisper overlay.

Start Speaking

When the overlay appears, start speaking. Your audio is captured in real-time.

Finish Recording

Release the Fn key or click Stop to end recording and transcribe your speech.

How It Works

The Whisper integration is managed by the useWhisperManager hook (src/renderer/src/hooks/useWhisperManager.ts):

Architecture

import type React from "react";

// Voice input state management
export interface UseWhisperManagerReturn {
  whisperOnboardingPracticeText: string;               // Accumulated transcription
  whisperSpeakToggleLabel: string;                     // Toggle button label (fn/speak)
  whisperSessionRef: React.MutableRefObject<boolean>;  // Active session tracker (true while recording)
  whisperPortalTarget: HTMLElement | null;             // Detached overlay window
}

Recording Flow

Activation: Fn key press opens detached overlay window (620x88px, bottom-center)
Audio Capture: Browser MediaRecorder API captures audio from microphone
Transcription: Audio buffer sent to OpenAI Whisper API (src/main/ai-provider.ts:539)
Text Insertion: Transcribed text inserted at cursor or into SuperCmd

Whisper API Integration

From ai-provider.ts:539, the transcription process:

export function transcribeAudio(opts: TranscribeOptions): Promise<string> {
  // Multipart form upload to OpenAI Whisper API
  // Supports: wav, mp3, m4a, ogg, flac, webm
  // Returns plain text transcription
}

Audio is sent directly to OpenAI and is not stored locally. Recording sessions are ephemeral.
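
To make the upload more concrete, here is a minimal sketch of the non-file fields such a request might carry. The `TranscribeOptions` shape and the `buildTranscriptionRequest` helper are assumptions for illustration only; the actual `transcribeAudio` in ai-provider.ts is not shown on this page and may differ.

```typescript
// Hypothetical option shape; the real TranscribeOptions is not shown on this page.
interface TranscribeOptions {
  model?: string;    // falls back to "whisper-1"
  language?: string; // optional ISO 639-1 hint, e.g. "en"
}

interface TranscriptionRequest {
  url: string;
  fields: Record<string, string>;
}

// Assembles the endpoint and the text fields that accompany the audio file.
function buildTranscriptionRequest(opts: TranscribeOptions): TranscriptionRequest {
  const fields: Record<string, string> = {
    model: opts.model ?? "whisper-1",
    response_format: "text", // ask for plain text instead of JSON
  };
  if (opts.language) {
    fields.language = opts.language;
  }
  return { url: "https://api.openai.com/v1/audio/transcriptions", fields };
}
```

In a real upload these fields would be added to a multipart form together with the audio as the `file` field, and sent with an `Authorization: Bearer <API key>` header.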

Using Voice Input

Dictation Mode

Insert text anywhere:

Position your cursor in any text field
Press and hold Fn
Speak your text
Release Fn to transcribe and insert

Command Mode

Use voice to run SuperCmd commands:

Press your SuperCmd hotkey to open the launcher
Press Fn to activate voice input
Speak a command name (e.g., “Open Spotify”)
SuperCmd will search and execute the matching command

Continuous Recording

For longer dictation:

Press Fn to start
Click the overlay to toggle hold mode
Speak continuously
Click Stop when finished
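
The difference between hold-to-talk and continuous recording can be sketched as a small state reducer. This is an illustrative model, not SuperCmd's code: the `RecorderState` shape, the event names, and the assumption that clicking the overlay latches recording so that releasing Fn no longer stops it are all hypothetical.

```typescript
interface RecorderState {
  recording: boolean;
  holdMode: boolean; // when true, releasing the key does not stop recording
}

type RecorderEvent = "keyDown" | "keyUp" | "overlayClick" | "stopClick";

function reduceRecorder(state: RecorderState, event: RecorderEvent): RecorderState {
  switch (event) {
    case "keyDown":
      return { ...state, recording: true };
    case "keyUp":
      // Dictation mode stops on release; hold mode keeps going.
      return state.holdMode ? state : { ...state, recording: false };
    case "overlayClick":
      // Toggling only makes sense while a recording is in progress.
      return state.recording ? { ...state, holdMode: !state.holdMode } : state;
    case "stopClick":
      return { recording: false, holdMode: false };
  }
}
```

With this model, the sequence keyDown, overlayClick, keyUp leaves the recorder running until stopClick arrives, which matches the four steps listed above.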

Settings

Voice Input Hotkey

Customize the activation key:

Open Settings > General
Set Voice Input Hotkey (default: Fn)
Options include: Fn, Right Cmd, Right Option, Right Shift

Whisper Model

Configure the transcription model:

// Available Whisper models
whisper-1 // Default: Fast, accurate for most languages

The Whisper API supports automatic language detection, so you don’t need to specify your language.

Language Settings

Optionally specify a language for better accuracy:

Open Settings > AI
Set Whisper Language (optional)
Use ISO 639-1 codes (e.g., en, es, fr, de)
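
If you script or validate this setting yourself, a shape check along the following lines may help. `normalizeLanguageHint` is a hypothetical helper, not part of SuperCmd, and it only checks that the value looks like a two-letter code, not that the code is actually assigned.

```typescript
// Returns a lowercase two-letter code, or undefined to fall back to auto-detection.
function normalizeLanguageHint(input: string | undefined): string | undefined {
  if (!input) return undefined;
  const code = input.trim().toLowerCase();
  // ISO 639-1 codes are exactly two ASCII letters.
  return /^[a-z]{2}$/.test(code) ? code : undefined;
}
```

Returning `undefined` for anything that is not two letters means a typo such as "english" falls back to automatic detection instead of being sent as an invalid hint.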

Overlay Window

The Whisper overlay is a detached window managed by useDetachedPortalWindow (src/renderer/src/useDetachedPortalWindow.ts):

Window Specifications

Position: Bottom-center of screen
Size: 620×88 pixels
Style: Transparent, frameless
Behavior: Auto-closes on blur or Escape

Visual States

Listening
Processing
Complete
Error

Animated waveform indicates active recording
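
One way to think about the four visual states is as a small state machine. The sketch below is an assumption about how the transitions could be modeled; the event names and the `nextOverlayState` function are hypothetical and do not come from the SuperCmd source.

```typescript
type OverlayState = "listening" | "processing" | "complete" | "error";
type OverlayEvent = "recordingStopped" | "transcriptReady" | "failed";

function nextOverlayState(state: OverlayState, event: OverlayEvent): OverlayState {
  const inProgress = state === "listening" || state === "processing";
  if (event === "failed" && inProgress) return "error";
  if (state === "listening" && event === "recordingStopped") return "processing";
  if (state === "processing" && event === "transcriptReady") return "complete";
  return state; // ignore events that do not apply to the current state
}
```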

Audio Format Support

Whisper accepts multiple audio formats (src/main/ai-provider.ts:529):

function resolveUploadMeta(mimeType?: string) {
  // Supported formats:
  // - audio/wav
  // - audio/mpeg (mp3)
  // - audio/mp4 (m4a)
  // - audio/ogg
  // - audio/flac
  // - audio/webm (default browser recording)
}

SuperCmd automatically detects and converts your browser’s recording format.
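
As an illustration of the format detection, here is a hedged sketch of a MIME-to-upload-metadata mapping. It is not the actual `resolveUploadMeta`: the `UploadMeta` shape, the `audio.<ext>` filename pattern, and the name `resolveUploadMetaSketch` are assumptions. The fallback to WebM follows the note above that it is the default browser recording format.

```typescript
interface UploadMeta {
  filename: string;
  contentType: string;
}

const EXTENSION_BY_MIME = new Map<string, string>([
  ["audio/wav", "wav"],
  ["audio/mpeg", "mp3"],
  ["audio/mp4", "m4a"],
  ["audio/ogg", "ogg"],
  ["audio/flac", "flac"],
  ["audio/webm", "webm"],
]);

function resolveUploadMetaSketch(mimeType?: string): UploadMeta {
  // Recorders often report parameters, e.g. "audio/webm;codecs=opus"; keep only the base type.
  const base = ((mimeType ?? "").split(";")[0] ?? "").trim().toLowerCase();
  const extension = EXTENSION_BY_MIME.get(base);
  if (extension === undefined) {
    // Unknown or missing type: assume the browser default.
    return { filename: "audio.webm", contentType: "audio/webm" };
  }
  return { filename: `audio.${extension}`, contentType: base };
}
```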

Best Practices

Speak Clearly

Enunciate words clearly for better transcription accuracy

Use Quiet Environment

Reduce background noise for cleaner audio

Short Segments

Keep recordings under 30 seconds for faster transcription

Review First

Check transcribed text before sending or saving

Keyboard Shortcuts

Action	Shortcut
Start/Stop Recording	`Fn` (hold)
Cancel Recording	`Escape`
Toggle Hold Mode	Click overlay

Onboarding Practice

First-time users are guided through an onboarding flow:

Grant Microphone Permission

Browser will request microphone access

Practice Speaking

Try a test phrase to verify setup

Review Transcription

See your practice text transcribed in real-time (src/renderer/src/hooks/useWhisperManager.ts:91)

Start Using

Complete onboarding and start using voice input

Troubleshooting

No microphone detected

Grant microphone permission in System Settings > Privacy & Security > Microphone
Ensure SuperCmd is checked in the list
Restart SuperCmd after granting permission

Poor transcription quality

Check microphone input level in System Settings > Sound
Reduce background noise
Speak more slowly and clearly
Try adjusting microphone position

Transcription is slow

API response time varies based on audio length
Check your internet connection
Verify OpenAI API key is valid

API errors

Verify OpenAI API key in Settings > AI
Check API quota and billing status
Ensure API key has Whisper API access

Privacy & Security

Audio recordings are sent to OpenAI’s Whisper API for transcription. Audio is not stored by SuperCmd locally, but OpenAI may retain data according to their privacy policy.

What’s Sent

Raw audio recording (duration varies)
Language hint (if configured)
Model selection (whisper-1)

What’s Not Sent

No personal identifiers
No app context or metadata
No previous recordings

Data Retention

According to OpenAI’s policy:

API requests may be retained for abuse monitoring
Audio is not used for model training (as of March 2024)
See OpenAI Privacy Policy for details

You can disable voice input entirely in Settings > General if you prefer not to use this feature.

Advanced Usage

Text Accumulation

The appendWhisperOnboardingPracticeText function (src/renderer/src/hooks/useWhisperManager.ts:91) intelligently concatenates transcription chunks:

appendWhisperOnboardingPracticeText((chunk: string) => {
  // Adds smart spacing between words
  // Prevents double spaces
  // Handles punctuation properly
});
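
The spacing rules listed in those comments can be sketched as a pure function. `appendChunk` below is a hypothetical stand-in written for illustration; the real logic in useWhisperManager.ts may handle more cases.

```typescript
// Joins a new transcription chunk onto the accumulated text.
function appendChunk(current: string, chunk: string): string {
  const next = chunk.trim();
  if (next === "") return current; // nothing to add
  const base = current.replace(/\s+$/, ""); // drop trailing whitespace to avoid double spaces
  if (base === "") return next;
  // No space before punctuation that attaches to the previous word.
  const separator = /^[,.;:!?]/.test(next) ? "" : " ";
  return base + separator + next;
}
```

For example, appending " world" to "Hello " gives "Hello world", and appending ", world" to "Hello" gives "Hello, world".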

Session Management

Voice sessions are tracked to prevent launcher interference:

whisperSessionRef.current = true; // Active session
// Suppresses launcher reset logic during recording
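
A minimal sketch of this guard pattern follows. The `BooleanRef` interface stands in for a React ref so the example runs without React, and both `withWhisperSession` and `shouldResetLauncher` are hypothetical names, not functions from the SuperCmd codebase.

```typescript
// Stand-in for React.MutableRefObject<boolean>.
interface BooleanRef {
  current: boolean;
}

// Marks the session active while `run` executes, then clears the flag.
function withWhisperSession<T>(sessionRef: BooleanRef, run: () => T): T {
  sessionRef.current = true;
  try {
    return run();
  } finally {
    sessionRef.current = false; // cleared even if `run` throws
  }
}

// Launcher reset logic would consult the flag before doing anything.
function shouldResetLauncher(sessionRef: BooleanRef): boolean {
  return !sessionRef.current;
}
```

Clearing the flag in `finally` means a failed recording cannot leave the launcher permanently suppressed.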

Cost Considerations

Whisper API pricing (as of 2024):

$0.006 per minute of audio
Average 10-second recording: ~$0.001
Monthly heavy usage (500 ten-second recordings): ~$0.50

Monitor your OpenAI usage dashboard to track Whisper API costs.
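
To estimate costs for your own usage pattern, the per-minute rate can be applied directly. This is a rough sketch using the $0.006 per minute figure quoted above; `estimateCostUsd` is a hypothetical helper, and actual billing may differ.

```typescript
const USD_PER_MINUTE = 0.006; // rate quoted above, as of 2024

// Estimated cost in USD for a total recording duration, rounded to 4 decimal places.
function estimateCostUsd(totalSeconds: number): number {
  const cost = (totalSeconds / 60) * USD_PER_MINUTE;
  return Math.round(cost * 10000) / 10000;
}

const singleRecording = estimateCostUsd(10);    // 0.001
const fiveHundredClips = estimateCostUsd(500 * 10); // 0.5
```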
