
Overview

OpenWhispr integrates with multiple AI providers for intelligent text processing when you address your named agent (e.g., “Hey Jarvis, summarize this”).

Provider Types

Cloud Providers

OpenAI, Anthropic, Google Gemini, and Groq: API-based models accessed over the network

Local Providers

Qwen, Llama, Mistral, Gemma, and GPT-OSS: privacy-first GGUF models run locally via llama.cpp

Cloud Providers

OpenAI

id (string): openai
endpoint (string): https://api.openai.com/v1/responses (Responses API) or /chat/completions (fallback)

Available Models

GPT-5.2 (gpt-5.2)
  • Latest flagship reasoning model
  • Best for complex tasks requiring deep reasoning
GPT-5 Mini (gpt-5-mini)
  • Fast and cost-efficient
  • Good balance for most use cases
GPT-5 Nano (gpt-5-nano)
  • Ultra-fast, low latency
  • Best for real-time processing
GPT-4.1 (gpt-4.1)
  • Strong baseline model
  • 1M token context window
GPT-4.1 Mini (gpt-4.1-mini)
  • Smaller, faster version
  • Good for shorter tasks
GPT-4.1 Nano (gpt-4.1-nano)
  • Lowest latency GPT-4.1 variant
GPT-5 and o-series models use the new Responses API (September 2025). The system automatically falls back to Chat Completions API for older models or if Responses API is unavailable.
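The endpoint selection described above can be sketched as a simple model-name check. This is an illustrative sketch, not OpenWhispr's actual routing code; the detection rule (prefix match on `gpt-5` and o-series names) is an assumption.

```typescript
// Hypothetical sketch: route GPT-5/o-series models to the Responses API,
// everything else to Chat Completions. OpenWhispr's real detection logic may differ.
function pickOpenAIEndpoint(modelId: string): string {
  const usesResponsesApi =
    modelId.startsWith('gpt-5') || /^o\d/.test(modelId);
  return usesResponsesApi
    ? 'https://api.openai.com/v1/responses'
    : 'https://api.openai.com/v1/chat/completions';
}

console.log(pickOpenAIEndpoint('gpt-5-mini')); // Responses API URL
console.log(pickOpenAIEndpoint('gpt-4.1'));    // Chat Completions URL
```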

Anthropic

id (string): anthropic
endpoint (string): https://api.anthropic.com/v1/messages (via IPC bridge to avoid CORS)

Available Models

Claude Opus 4.6 (claude-opus-4-6)
  • Most capable Claude model
  • Best for complex reasoning tasks
Claude Sonnet 4.6 (claude-sonnet-4-6)
  • Balanced performance and speed
  • Recommended for general use
Claude Haiku 4.5 (claude-haiku-4-5)
  • Fast with near-frontier intelligence
  • Best for quick tasks
Anthropic API calls are routed through the main process via IPC to avoid CORS restrictions in the renderer process.

Google Gemini

id (string): gemini
endpoint (string): https://generativelanguage.googleapis.com/v1beta

Available Models

Gemini 3.1 Pro (gemini-3.1-pro-preview)
  • Next-gen flagship model for complex reasoning
  • Largest context window
  • Requires a minimum of 2,000 output tokens
Gemini 3 Flash (gemini-3-flash-preview)
  • Ultra-fast, high-capability next-gen model
  • Good balance of speed and intelligence
Gemini 2.5 Flash Lite (gemini-2.5-flash-lite)
  • Lowest latency and cost
  • Best for simple cleanup tasks

Groq

id (string): groq
endpoint (string): https://api.groq.com/openai/v1/chat/completions

Available Models

Qwen3 32B (qwen/qwen3-32b)
  • Powerful reasoning model
  • 131K context window
  • Thinking mode disabled for speed
GPT-OSS 120B (openai/gpt-oss-120b)
  • OpenAI’s open-weight flagship
  • 500 tokens/sec throughput
GPT-OSS 20B (openai/gpt-oss-20b)
  • Fast open-weight model
  • 1000 tokens/sec throughput
LLaMA 3.3 70B (llama-3.3-70b-versatile)
  • Meta’s versatile model
  • 280 tokens/sec
LLaMA 3.1 8B (llama-3.1-8b-instant)
  • Ultra-fast: 560 tokens/sec
  • 131K context window
Llama 4 Scout (meta-llama/llama-4-scout-17b-16e-instruct)
  • Meta’s efficient multimodal model
  • 750 tokens/sec
Compound (groq/compound)
  • Groq’s compound system
  • 450 tokens/sec
Compound Mini (groq/compound-mini)
  • Fast compound system
  • 3x lower latency
Kimi K2 0905 (moonshotai/kimi-k2-instruct-0905)
  • Moonshot AI’s 1T MoE model
  • 256K context window

Local Providers

Local models run entirely on your device using llama.cpp for maximum privacy. All models are in GGUF format.

Qwen (Alibaba)

id (string): qwen
baseUrl (string): https://huggingface.co
promptTemplate (string): ChatML format: <|im_start|>system\n{system}<|im_end|>\n<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n
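Filling this template is plain string substitution. A minimal sketch (the function name is illustrative, not OpenWhispr's API):

```typescript
// Minimal sketch of filling the ChatML template above by string substitution.
const CHATML_TEMPLATE =
  '<|im_start|>system\n{system}<|im_end|>\n' +
  '<|im_start|>user\n{user}<|im_end|>\n' +
  '<|im_start|>assistant\n';

function buildChatMLPrompt(system: string, user: string): string {
  return CHATML_TEMPLATE
    .replace('{system}', system)
    .replace('{user}', user);
}

// The llama.cpp completion endpoint receives this single prompt string.
const prompt = buildChatMLPrompt('You clean up dictation.', 'fix this text');
```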

Qwen3 Series (Latest - Thinking Mode Support)

  • Qwen3 8B (qwen3-8b-q4_k_m): 5.0GB, recommended
  • Qwen3 8B (Q5) (qwen3-8b-q5_k_m): 5.9GB, higher quality
  • Qwen3 4B (qwen3-4b-q4_k_m): 2.5GB, compact with reasoning
  • Qwen3 1.7B (qwen3-1.7b-q8_0): 1.8GB, small but capable
  • Qwen3 0.6B (qwen3-0.6b-q8_0): 0.6GB, for edge devices
  • Qwen3 32B (qwen3-32b-q4_k_m): 19.8GB, most powerful local model
Qwen2.5 Series

  • Qwen2.5 7B (qwen2.5-7b-instruct-q4_k_m): 4.7GB, 128K context
  • Qwen2.5 7B (Q5) (qwen2.5-7b-instruct-q5_k_m): 5.4GB, higher quality
  • Qwen2.5 3B (qwen2.5-3b-instruct-q5_k_m): 2.4GB, balanced
  • Qwen2.5 1.5B (qwen2.5-1.5b-instruct-q5_k_m): 1.3GB, basic tasks
  • Qwen2.5 0.5B (qwen2.5-0.5b-instruct-q5_k_m): 0.5GB, fastest

Mistral AI

id (string): mistral
promptTemplate (string): Mistral format: [INST] {system}\n\n{user} [/INST]
Mistral 7B Instruct v0.3 (mistral-7b-instruct-v0.3-q4_k_m) — Recommended
  • Size: 4.4GB
  • Context: 32,768 tokens
  • Fast and efficient instruction model
Mistral 7B Instruct v0.3 (Q5) (mistral-7b-instruct-v0.3-q5_k_m)
  • Size: 5.1GB
  • Higher quality version

Meta Llama

id (string): llama
promptTemplate (string): Llama 3 format: <|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
Llama 3.2 3B (llama-3.2-3b-instruct-q4_k_m) — Recommended
  • Size: 2.0GB
  • Context: 131,072 tokens
  • Small but capable multilingual model
Llama 3.2 1B (llama-3.2-1b-instruct-q4_k_m)
  • Size: 0.8GB
  • Tiny model for edge devices
Llama 3.1 8B (llama-3.1-8b-instruct-q4_k_m)
  • Size: 4.9GB
  • Powerful model with great performance

OpenAI OSS

id (string): openai-oss
promptTemplate (string): ChatML format (same as Qwen)
GPT-OSS 20B (gpt-oss-20b-mxfp4) — Recommended
  • Size: 12.1GB
  • Quantization: MXFP4 (4-bit microscaling float)
  • Context: 128,000 tokens
  • OpenAI’s open-weight model for consumer hardware

Gemma (Google)

id (string): gemma
promptTemplate (string): Gemma format: <bos><start_of_turn>user\n{system}\n\n{user}<end_of_turn>\n<start_of_turn>model\n
Gemma 3 4B (gemma-3-4b-it-q4_k_m) — Recommended
  • Size: 2.49GB
  • Context: 131,072 tokens
  • Great balance of speed and quality
Gemma 3 1B (gemma-3-1b-it-q4_k_m)
  • Size: 0.81GB
  • Ultra-fast, best for short dictation cleanup

Using AI Models

Via ReasoningService

import reasoningService from '@/services/ReasoningService';

const result = await reasoningService.processText(
  'Transcribed text here',
  'gpt-5-mini', // model ID
  'Jarvis', // agent name
  {
    systemPrompt: 'Custom system prompt (optional)',
    temperature: 0.3,
    maxTokens: 4096
  }
);

console.log(result); // Processed text

API Call Example (OpenAI)

// Automatically uses Responses API for GPT-5/o-series
const apiKey = await window.electronAPI.getOpenAIKey();

const response = await fetch('https://api.openai.com/v1/responses', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${apiKey}`
  },
  body: JSON.stringify({
    model: 'gpt-5-mini',
    input: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'Clean up this text: ...' }
    ],
    store: false
  })
});

const data = await response.json();
const text = data.output.find(item => 
  item.type === 'message'
)?.content?.find(c => 
  c.type === 'output_text'
)?.text;

Local Model Download

const result = await window.electronAPI.downloadLocalModel(
  'qwen3-8b-q4_k_m',
  (progress) => {
    console.log(`Downloading: ${progress.percentage}%`);
  }
);

if (result.success) {
  console.log(`Model downloaded to: ${result.path}`);
}

Check Local Model Availability

const available = await window.electronAPI.checkLocalReasoningAvailable();

if (available) {
  console.log('llama.cpp server is ready');
} else {
  console.log('Local reasoning unavailable');
}

Model Registry

All models are defined in src/models/modelRegistryData.json as a single source of truth:
{
  "cloudProviders": [
    {
      "id": "openai",
      "name": "OpenAI",
      "models": [
        {
          "id": "gpt-5.2",
          "name": "GPT-5.2",
          "description": "Latest flagship reasoning model"
        }
      ]
    }
  ],
  "localProviders": [
    {
      "id": "qwen",
      "name": "Qwen",
      "models": [
        {
          "id": "qwen3-8b-q4_k_m",
          "name": "Qwen3 8B",
          "size": "5.0GB",
          "hfRepo": "Qwen/Qwen3-8B-GGUF",
          "fileName": "Qwen3-8B-Q4_K_M.gguf",
          "recommended": true
        }
      ]
    }
  ]
}
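Because the registry is plain JSON, it can be consumed directly. The sketch below flattens it into a model-id → provider-id map; the interfaces are assumptions based on the excerpt above, not the app's real types.

```typescript
// Illustrative sketch: flattening modelRegistryData.json into an id -> provider map.
// Types are inferred from the JSON excerpt above, not OpenWhispr's actual definitions.
interface RegistryModel { id: string; name: string; }
interface RegistryProvider { id: string; name: string; models: RegistryModel[]; }
interface Registry { cloudProviders: RegistryProvider[]; localProviders: RegistryProvider[]; }

function buildProviderIndex(registry: Registry): Map<string, string> {
  const index = new Map<string, string>();
  for (const provider of [...registry.cloudProviders, ...registry.localProviders]) {
    for (const model of provider.models) index.set(model.id, provider.id);
  }
  return index;
}

const registry: Registry = {
  cloudProviders: [{ id: 'openai', name: 'OpenAI', models: [{ id: 'gpt-5.2', name: 'GPT-5.2' }] }],
  localProviders: [{ id: 'qwen', name: 'Qwen', models: [{ id: 'qwen3-8b-q4_k_m', name: 'Qwen3 8B' }] }],
};
const index = buildProviderIndex(registry);
```

A single flat index like this is one way a lookup such as getModelProvider can answer in O(1) regardless of provider count.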

Model Provider Detection

import { getModelProvider, getCloudModel } from '@/models/ModelRegistry';

const provider = getModelProvider('gpt-5-mini');
console.log(provider); // 'openai'

const model = getCloudModel('claude-sonnet-4-6');
console.log(model); // { id: 'claude-sonnet-4-6', name: 'Claude Sonnet 4.6', ... }

API Key Management

// Get API keys (cached for performance)
const openaiKey = await window.electronAPI.getOpenAIKey();
const anthropicKey = await window.electronAPI.getAnthropicKey();
const geminiKey = await window.electronAPI.getGeminiKey();
const groqKey = await window.electronAPI.getGroqKey();

// Save API keys (automatically persists to .env)
await window.electronAPI.saveOpenAIKey('sk-...');
await window.electronAPI.saveAnthropicKey('sk-ant-...');
await window.electronAPI.saveGeminiKey('AIza...');
await window.electronAPI.saveGroqKey('gsk_...');

// Clear API key cache after updating
reasoningService.clearApiKeyCache('openai');
API keys are stored in environment variables and automatically reloaded on app start. Keys are cached in memory during runtime for better performance.
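The caching behavior described above could be implemented along these lines. This is a minimal sketch, not the actual ReasoningService implementation; the class and method names are hypothetical.

```typescript
// Minimal sketch of a per-provider API key cache with explicit invalidation,
// mirroring the behavior described above (hypothetical, not the real code).
type KeyFetcher = () => Promise<string>;

class ApiKeyCache {
  private cache = new Map<string, string>();

  async get(provider: string, fetchKey: KeyFetcher): Promise<string> {
    const cached = this.cache.get(provider);
    if (cached !== undefined) return cached; // hit: skip the IPC round trip
    const key = await fetchKey();            // miss: load from the main process
    this.cache.set(provider, key);
    return key;
  }

  clear(provider?: string): void {
    if (provider !== undefined) this.cache.delete(provider);
    else this.cache.clear();
  }
}
```

Calling clear('openai') after saving a new key forces the next get to re-fetch, which is why the docs recommend clearing the cache after updates.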

Custom Reasoning Endpoint

For self-hosted or custom OpenAI-compatible APIs:
import { saveSettings } from '@/stores/settingsStore';

await saveSettings({
  reasoningProvider: 'custom',
  cloudReasoningBaseUrl: 'https://your-api.com/v1',
  customReasoningApiKey: 'your-api-key'
});
Custom endpoints must use HTTPS (HTTP only allowed for local network: localhost, 127.0.0.1, 192.168.*, 10.*).
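The HTTPS rule above could be checked with a small validator like the following. The function is hypothetical and covers only the hosts listed above; the app's actual validation may be stricter.

```typescript
// Hypothetical helper enforcing the rule above: HTTPS required, plain HTTP
// allowed only for local-network hosts. Not the app's actual validator.
function isAllowedEndpoint(url: string): boolean {
  const parsed = new URL(url); // throws on malformed URLs
  if (parsed.protocol === 'https:') return true;
  if (parsed.protocol !== 'http:') return false;
  const host = parsed.hostname;
  return (
    host === 'localhost' ||
    host === '127.0.0.1' ||
    host.startsWith('192.168.') ||
    host.startsWith('10.')
  );
}
```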

Thinking Mode

Thinking (extended reasoning) support varies by model:

Cloud Models:
  • GPT-5 series (via Responses API)
  • Claude Opus/Sonnet 4.6 (extended thinking)
  • Gemini 3.1 Pro (reasoning mode)
Local Models:
  • Qwen3 series (thinking mode in ChatML format)
  • GPT-OSS 20B (reasoning capabilities)
Disabled for Speed:
  • Groq Qwen models (set reasoning_effort: "none" for faster inference)
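For the Groq case, disabling thinking is a matter of adding reasoning_effort to the request body. A sketch with illustrative message content:

```typescript
// Sketch of a Groq Chat Completions request body with thinking disabled
// via reasoning_effort, as described above. Message content is illustrative.
const groqRequestBody = {
  model: 'qwen/qwen3-32b',
  messages: [
    { role: 'system', content: 'You are a dictation cleanup assistant.' },
    { role: 'user', content: 'clean this up please' },
  ],
  reasoning_effort: 'none', // skip the thinking phase for faster inference
  temperature: 0.3,
};
```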

Token Limits

// From src/config/constants.ts
const TOKEN_LIMITS = {
  MIN_TOKENS: 512,
  MAX_TOKENS: 8192,
  MIN_TOKENS_GEMINI: 2000, // Gemini 3.1 Pro requires higher minimum
  MAX_TOKENS_GEMINI: 8192,
  TOKEN_MULTIPLIER: 2 // Output tokens = input length * 2
};
The system automatically calculates appropriate max_tokens based on input length.
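The calculation implied by these constants can be sketched as scale-then-clamp. This assumes "input length" is a token count; the real code may measure differently.

```typescript
// Sketch of the max_tokens calculation implied by the constants above:
// scale input length by TOKEN_MULTIPLIER, then clamp to the provider's
// min/max. Assumes input length is a token count (an assumption).
const TOKEN_LIMITS = {
  MIN_TOKENS: 512,
  MAX_TOKENS: 8192,
  MIN_TOKENS_GEMINI: 2000,
  MAX_TOKENS_GEMINI: 8192,
  TOKEN_MULTIPLIER: 2,
};

function calcMaxTokens(inputTokens: number, isGemini = false): number {
  const min = isGemini ? TOKEN_LIMITS.MIN_TOKENS_GEMINI : TOKEN_LIMITS.MIN_TOKENS;
  const max = isGemini ? TOKEN_LIMITS.MAX_TOKENS_GEMINI : TOKEN_LIMITS.MAX_TOKENS;
  return Math.min(max, Math.max(min, inputTokens * TOKEN_LIMITS.TOKEN_MULTIPLIER));
}
```

The Gemini branch reflects the 2,000-token minimum noted for Gemini 3.1 Pro above.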
