Overview
OpenWhispr integrates with multiple AI providers for intelligent text processing when you address your named agent (e.g., “Hey Jarvis, summarize this”).
Provider Types
Cloud Providers
OpenAI, Anthropic, Google Gemini, Groq - API-based models
Local Providers
Qwen, Llama, Mistral, Gemma, GPT-OSS - Privacy-first GGUF models via llama.cpp
Cloud Providers
OpenAI
Provider ID: openai
Endpoint: https://api.openai.com/v1/responses (Responses API) or /chat/completions (fallback)
Available Models
GPT-5 Series (Latest)
GPT-5.2 (gpt-5.2) - Latest flagship reasoning model
- Best for complex tasks requiring deep reasoning
GPT-5 Mini (gpt-5-mini) - Fast and cost-efficient
- Good balance for most use cases
GPT-5 Nano (gpt-5-nano) - Ultra-fast, low latency
- Best for real-time processing
GPT-4.1 Series
GPT-4.1 (gpt-4.1) - Strong baseline model
- 1M token context window
GPT-4.1 Mini (gpt-4.1-mini) - Smaller, faster version
- Good for shorter tasks
GPT-4.1 Nano (gpt-4.1-nano) - Lowest latency GPT-4.1 variant
GPT-5 and o-series models use the new Responses API (September 2025). The system automatically falls back to Chat Completions API for older models or if Responses API is unavailable.
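This routing can be sketched as a predicate on the model ID. The heuristic below is an assumption for illustration, not OpenWhispr's actual check:

```typescript
// Assumed heuristic: GPT-5 and o-series model IDs prefer the Responses API;
// everything else (and any Responses API failure) uses Chat Completions.
function prefersResponsesApi(modelId: string): boolean {
  return modelId.startsWith("gpt-5") || /^o\d/.test(modelId);
}

// Pick the endpoint path for the first attempt.
function endpointFor(modelId: string): string {
  return prefersResponsesApi(modelId) ? "/v1/responses" : "/v1/chat/completions";
}
```

A failed `/v1/responses` call would then be retried against `/v1/chat/completions`.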
Anthropic
Provider ID: anthropic
Endpoint: https://api.anthropic.com/v1/messages (via IPC bridge to avoid CORS)
Available Models
Claude Opus 4.6 (claude-opus-4-6)
- Most capable Claude model
- Best for complex reasoning tasks
Claude Sonnet 4.6 (claude-sonnet-4-6)
- Balanced performance and speed
- Recommended for general use
Claude Haiku 4.5 (claude-haiku-4-5)
- Fast with near-frontier intelligence
- Best for quick tasks
Anthropic API calls are routed through the main process via IPC to avoid CORS restrictions in the renderer process.
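A minimal sketch of this pattern, using a pure request builder plus a main-process handler; the IPC channel name and builder are assumptions, not the actual codebase:

```typescript
// Build the Messages API request in a testable, pure function.
function buildAnthropicRequest(model: string, userText: string, apiKey: string) {
  return {
    url: "https://api.anthropic.com/v1/messages",
    method: "POST" as const,
    headers: {
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model,
      max_tokens: 1024,
      messages: [{ role: "user", content: userText }],
    }),
  };
}

// In the Electron main process (sketch; channel name is an assumption):
//   ipcMain.handle("anthropic:complete", async (_e, args) => {
//     const req = buildAnthropicRequest(args.model, args.text, args.apiKey);
//     const res = await fetch(req.url, req); // Node fetch: no CORS here
//     return res.json();
//   });
```

Because the request is made from Node rather than the renderer's browser context, the browser's same-origin policy never applies.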
Google Gemini
Provider ID: gemini
Endpoint: https://generativelanguage.googleapis.com/v1beta
Available Models
Gemini 3.1 Pro (gemini-3.1-pro-preview)
- Next-gen flagship model for complex reasoning
- Largest context window
- 2000+ token minimum output
Gemini 3 Flash (gemini-3-flash-preview)
- Ultra-fast, high-capability next-gen model
- Good balance of speed and intelligence
Gemini 2.5 Flash Lite (gemini-2.5-flash-lite)
- Lowest latency and cost
- Best for simple cleanup tasks
Groq
Provider ID: groq
Endpoint: https://api.groq.com/openai/v1/chat/completions
Available Models
Qwen Models
Qwen3 32B (qwen/qwen3-32b) - Powerful reasoning model
- 131K context window
- Thinking mode disabled for speed
OpenAI OSS Models (via Groq)
GPT-OSS 120B (openai/gpt-oss-120b) - OpenAI’s open-source flagship
- 500 tokens/sec throughput
GPT-OSS 20B (openai/gpt-oss-20b) - Fast open-source model
- 1000 tokens/sec throughput
Meta Llama Models
LLaMA 3.3 70B (llama-3.3-70b-versatile) - Meta’s versatile model
- 280 tokens/sec
LLaMA 3.1 8B Instant (llama-3.1-8b-instant) - Ultra-fast: 560 tokens/sec
- 131K context window
Llama 4 Scout 17B (meta-llama/llama-4-scout-17b-16e-instruct) - Meta’s efficient multimodal model
- 750 tokens/sec
Groq Compound Models
Compound (groq/compound) - Groq’s compound system
- 450 tokens/sec
Compound Mini (groq/compound-mini) - Fast compound system
- 3x lower latency
Moonshot AI
Kimi K2 0905 (moonshotai/kimi-k2-instruct-0905) - Moonshot AI’s 1T MoE model
- 256K context window
Local Providers
Local models run entirely on your device using llama.cpp for maximum privacy. All models are in GGUF format.
Qwen (Alibaba)
Provider ID: qwen
Download source: https://huggingface.co
ChatML format: <|im_start|>system\n{system}<|im_end|>\n<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n
Qwen3 Series (Latest - Thinking Mode Support)
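The ChatML template above can be applied with a small helper (the function name is illustrative, not OpenWhispr's actual code):

```typescript
// Fill the ChatML template with system and user text; the trailing
// "<|im_start|>assistant\n" cues the model to generate its reply.
function buildChatMlPrompt(system: string, user: string): string {
  return (
    `<|im_start|>system\n${system}<|im_end|>\n` +
    `<|im_start|>user\n${user}<|im_end|>\n` +
    `<|im_start|>assistant\n`
  );
}
```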
Qwen3 8B (Recommended)
Model ID: qwen3-8b-q4_k_m
Size: 5.0GB
Quantization: Q4_K_M (4-bit medium quantization)
Context: 131,072 tokens
Repo: Qwen/Qwen3-8B-GGUF
File: Qwen3-8B-Q4_K_M.gguf
Description: Latest Qwen3 with thinking mode support. Best for general reasoning tasks.
Thinking mode: true
Other Qwen3 Models
- Qwen3 8B (Q5) (qwen3-8b-q5_k_m): 5.9GB, higher quality
- Qwen3 4B (qwen3-4b-q4_k_m): 2.5GB, compact with reasoning
- Qwen3 1.7B (qwen3-1.7b-q8_0): 1.8GB, small but capable
- Qwen3 0.6B (qwen3-0.6b-q8_0): 0.6GB, for edge devices
- Qwen3 32B (qwen3-32b-q4_k_m): 19.8GB, most powerful local model
Qwen2.5 Series (Legacy)
- Qwen2.5 7B (qwen2.5-7b-instruct-q4_k_m): 4.7GB, 128K context
- Qwen2.5 7B (Q5) (qwen2.5-7b-instruct-q5_k_m): 5.4GB, higher quality
- Qwen2.5 3B (qwen2.5-3b-instruct-q5_k_m): 2.4GB, balanced
- Qwen2.5 1.5B (qwen2.5-1.5b-instruct-q5_k_m): 1.3GB, basic tasks
- Qwen2.5 0.5B (qwen2.5-0.5b-instruct-q5_k_m): 0.5GB, fastest
Mistral AI
Provider ID: mistral
Mistral format: [INST] {system}\n\n{user} [/INST]
Mistral 7B Instruct v0.3 (mistral-7b-instruct-v0.3-q4_k_m) — Recommended
- Size: 4.4GB
- Context: 32,768 tokens
- Fast and efficient instruction model
Mistral 7B Instruct v0.3 (Q5) (mistral-7b-instruct-v0.3-q5_k_m)
- Size: 5.1GB
- Higher quality version
Meta Llama
Provider ID: llama
Llama 3 format: <|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
Llama 3.2 3B Instruct (llama-3.2-3b-instruct-q4_k_m) — Recommended
- Size: 2.0GB
- Context: 131,072 tokens
- Small but capable multilingual model
Llama 3.2 1B Instruct (llama-3.2-1b-instruct-q4_k_m)
- Size: 0.8GB
- Tiny model for edge devices
Llama 3.1 8B Instruct (llama-3.1-8b-instruct-q4_k_m)
- Size: 4.9GB
- Powerful model with great performance
OpenAI OSS
Provider ID: openai-oss
Format: ChatML (same as Qwen)
GPT-OSS 20B (gpt-oss-20b-mxfp4) — Recommended
- Size: 12.1GB
- Quantization: MXFP4 (4-bit microscaling float)
- Context: 128,000 tokens
- OpenAI’s open-weight model for consumer hardware
Gemma (Google)
Provider ID: gemma
Gemma format: <bos><start_of_turn>user\n{system}\n\n{user}<end_of_turn>\n<start_of_turn>model\n
Gemma 3 4B Instruct (gemma-3-4b-it-q4_k_m) — Recommended
- Size: 2.49GB
- Context: 131,072 tokens
- Great balance of speed and quality
Gemma 3 1B Instruct (gemma-3-1b-it-q4_k_m)
- Size: 0.81GB
- Ultra-fast, best for short dictation cleanup
Using AI Models
Via ReasoningService
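A usage sketch, assuming a minimal interface standing in for the service (the real ReasoningService API may differ):

```typescript
// Assumed minimal surface for the service; the real class likely has more.
interface ReasoningOptions {
  provider: "openai" | "anthropic" | "gemini" | "groq" | "local";
  model: string;
}

interface Reasoner {
  processText(text: string, options: ReasoningOptions): Promise<string>;
}

// e.g. clean up a dictation with a fast cloud model.
async function cleanDictation(service: Reasoner, raw: string): Promise<string> {
  return service.processText(`Clean up this dictation: ${raw}`, {
    provider: "openai",
    model: "gpt-5-mini",
  });
}
```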
API Call Example (OpenAI)
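An illustrative direct call to the Chat Completions endpoint. The request shape is the standard OpenAI format; the temperature value and helper names are arbitrary choices for this sketch:

```typescript
// Pure builder for the request body (testable without network access).
function buildChatBody(model: string, system: string, user: string) {
  return {
    model,
    messages: [
      { role: "system", content: system },
      { role: "user", content: user },
    ],
    temperature: 0.3, // arbitrary illustrative value
  };
}

async function callOpenAI(apiKey: string, model: string, system: string, user: string) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(buildChatBody(model, system, user)),
  });
  if (!res.ok) throw new Error(`OpenAI error: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content as string;
}
```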
Local Model Download
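GGUF files are hosted on Hugging Face, so a download URL follows the standard `resolve` pattern (helper name is illustrative; the repo/file values come from the Qwen3 8B entry above):

```typescript
// Build the direct-download URL for a GGUF file hosted on Hugging Face,
// using the standard /resolve/<revision>/ path.
function ggufDownloadUrl(repo: string, file: string, revision = "main"): string {
  return `https://huggingface.co/${repo}/resolve/${revision}/${file}`;
}

// Example: the recommended Qwen3 8B model.
const url = ggufDownloadUrl("Qwen/Qwen3-8B-GGUF", "Qwen3-8B-Q4_K_M.gguf");
```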
Check Local Model Availability
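A simple availability check is a file-existence test on the downloaded GGUF. The models directory below is an assumption for illustration, not OpenWhispr's actual path:

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

// Assumed storage location for downloaded models.
function localModelPath(file: string): string {
  return join(homedir(), ".openwhispr", "models", file);
}

function isModelAvailable(file: string): boolean {
  return existsSync(localModelPath(file));
}
```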
Model Registry
All models are defined in src/models/modelRegistryData.json as a single source of truth:
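A registry entry might look like the following; the field names here are illustrative guesses based on the model details documented above, so consult the actual file for the real schema:

```json
{
  "id": "qwen3-8b-q4_k_m",
  "name": "Qwen3 8B",
  "provider": "qwen",
  "size": "5.0GB",
  "quantization": "Q4_K_M",
  "contextLength": 131072,
  "repo": "Qwen/Qwen3-8B-GGUF",
  "file": "Qwen3-8B-Q4_K_M.gguf",
  "supportsThinking": true
}
```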
Model Provider Detection
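Detection can be sketched as prefix matching on the model ID. The prefixes below are inferred from the model lists above; the real detection logic may differ:

```typescript
type Provider = "openai" | "anthropic" | "gemini" | "groq" | "local";

// Assumed prefix heuristics based on the model IDs documented above.
function detectProvider(modelId: string): Provider {
  if (modelId.startsWith("gpt-5") || modelId.startsWith("gpt-4")) return "openai";
  if (modelId.startsWith("claude-")) return "anthropic";
  if (modelId.startsWith("gemini-")) return "gemini";
  if (modelId.includes("/")) return "groq"; // e.g. "qwen/qwen3-32b", "groq/compound"
  return "local"; // GGUF IDs like "qwen3-8b-q4_k_m" or "gpt-oss-20b-mxfp4"
}
```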
API Key Management
API keys are stored in environment variables and automatically reloaded on app start. Keys are cached in memory during runtime for better performance.
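The described pattern (read the environment once, then serve from memory) can be sketched as follows; the env-var naming convention is an assumption:

```typescript
// In-memory cache so process.env is only consulted once per provider.
const keyCache = new Map<string, string>();

function getApiKey(provider: string): string | undefined {
  if (!keyCache.has(provider)) {
    // Assumed convention: OPENAI_API_KEY, ANTHROPIC_API_KEY, ...
    const value = process.env[`${provider.toUpperCase()}_API_KEY`];
    if (value !== undefined) keyCache.set(provider, value);
  }
  return keyCache.get(provider);
}
```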
Custom Reasoning Endpoint
For self-hosted or custom OpenAI-compatible APIs, you can point OpenWhispr at a custom endpoint. Custom endpoints must use HTTPS (HTTP is only allowed for local network addresses: localhost, 127.0.0.1, 192.168.*, 10.*).
Thinking Mode
Models with Thinking/Reasoning Support
Cloud Models:
- GPT-5 series (via Responses API)
- Claude Opus/Sonnet 4.6 (extended thinking)
- Gemini 3.1 Pro (reasoning mode)
Local Models:
- Qwen3 series (thinking mode in ChatML format)
- GPT-OSS 20B (reasoning capabilities)
Groq:
- Qwen models (set reasoning_effort: "none" for faster inference)
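For Groq-hosted Qwen models, disabling reasoning looks like this in the request body (the other fields are minimal illustrations):

```typescript
// Chat Completions body for Groq with thinking disabled on a Qwen model.
const groqBody = {
  model: "qwen/qwen3-32b",
  messages: [{ role: "user", content: "Fix punctuation in this dictation." }],
  reasoning_effort: "none", // skip thinking tokens for faster inference
};
```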
Token Limits
The system scales max_tokens based on input length.
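One way to sketch such scaling; the characters-per-token estimate, ratio, and bounds here are invented for illustration, not OpenWhispr's actual numbers:

```typescript
// Estimate input tokens (~4 chars/token) and budget output proportionally,
// clamped to illustrative lower and upper bounds.
function computeMaxTokens(inputChars: number): number {
  const inputTokens = Math.ceil(inputChars / 4);
  return Math.min(4096, Math.max(256, inputTokens * 2));
}
```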