Overview
The /api/voice-agent endpoint initializes real-time voice AI sessions using OpenAI’s Realtime API. It provides a conversational voice assistant with access to memory storage, web search, and Windows MCP tools for desktop automation. The endpoint returns session credentials for establishing a WebSocket connection.
This endpoint creates a session token for OpenAI’s Realtime API. You’ll need to connect to the WebSocket using the returned credentials.
Endpoint
Authentication
Requires authentication via cookies or Authorization header.
Requirements
OpenAI API key must be configured in environment variables. Returns 500 Internal Server Error if not set.
Request Body
OpenAI Realtime model to use. Defaults to "gpt-realtime-mini" if not specified.
Voice personality for the assistant. Defaults to "ash" if not specified.
Response
Returns a JSON object containing the OpenAI Realtime session credentials:
Unique session ID for the Realtime API connection.
The model being used for this session.
Unix timestamp when the session expires.
WebSocket connection credentials.
Available Tools
The voice agent has access to all tools in the system:
Memory Tools
Web Search
Windows MCP Tools
Full desktop automation capabilities:
- State-Tool: Capture desktop state and interactive elements
- Click-Tool: Click at coordinates
- Type-Tool: Type text at coordinates
- Move-Tool: Move mouse cursor
- Drag-Tool: Drag and drop
- Scroll-Tool: Scroll windows
- Shortcut-Tool: Execute keyboard shortcuts
- App-Tool: Launch/resize/switch applications
- Powershell-Tool: Execute PowerShell commands (preferred for opening apps/files/URLs)
- Wait-Tool: Pause for UI loading
- Scrape-Tool: Fetch content from URLs or browser tabs
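To illustrate how a tool from the list above might be declared to a Realtime session, here is a minimal sketch using the function-tool shape that OpenAI's Realtime API accepts. The `function_tool` helper and the `x`/`y` parameters are hypothetical; the actual schemas used by the Windows MCP tools are not shown on this page.

```python
def function_tool(name, description, properties, required):
    """Build a function-tool declaration in the Realtime API's tool format."""
    return {
        "type": "function",
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


# Hypothetical schema: the real Click-Tool parameters may differ.
click_tool = function_tool(
    name="Click-Tool",
    description="Click at coordinates",
    properties={"x": {"type": "integer"}, "y": {"type": "integer"}},
    required=["x", "y"],
)
```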
Voice-Specific Tools
Additional tools defined in DEFAULT_VOICE_TOOLS for voice interaction features.
Example Request
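A minimal sketch of building such a request in Python. Several details are assumptions not confirmed by this page: the base URL, the POST method, the `model` and `voice` field names, and the Bearer scheme for the Authorization header. The defaults mirror the ones documented under Request Body.

```python
import json
import urllib.request

# Assumption: a locally running backend; replace with your deployment's URL.
BASE_URL = "http://localhost:3000"


def build_voice_agent_request(token, model="gpt-realtime-mini", voice="ash"):
    """Build (but do not send) a request for /api/voice-agent."""
    # Assumption: the body fields are named "model" and "voice".
    body = json.dumps({"model": model, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/voice-agent",
        data=body,
        method="POST",  # assumption: the endpoint accepts POST
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumption: Bearer scheme
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends it. Both body fields are optional according to the Request Body section, so an empty JSON object should also be acceptable.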
Example Response
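The sketch below shows one way a client might parse the response and check whether the session has expired. The key names (`id`, `model`, `expires_at`, `client_secret.value`) are assumptions modeled on OpenAI's Realtime session object, and the sample values are placeholders rather than real output.

```python
import json
import time
from dataclasses import dataclass


@dataclass
class VoiceSession:
    session_id: str
    model: str
    expires_at: int     # Unix timestamp in seconds
    client_secret: str  # credential for the WebSocket connection


def parse_session(raw):
    """Parse the endpoint's JSON response (key names are assumed)."""
    data = json.loads(raw)
    return VoiceSession(
        session_id=data["id"],
        model=data["model"],
        expires_at=int(data["expires_at"]),
        client_secret=data["client_secret"]["value"],
    )


def is_expired(session, now=None):
    """Return True once the session's expiry timestamp has passed."""
    now = time.time() if now is None else now
    return now >= session.expires_at


# Placeholder payload for illustration only.
SAMPLE = json.dumps({
    "id": "sess_example",
    "model": "gpt-realtime-mini",
    "expires_at": 1700000000,
    "client_secret": {"value": "ek_example"},
})
session = parse_session(SAMPLE)
```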
Connecting to the WebSocket
Once you have the session credentials, connect to the Realtime API:
Sending Audio Input
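Both steps, connecting and sending audio, can be sketched without committing to a particular WebSocket library. The helpers below only build the connection parameters and the audio event. They assume the session's client secret is passed as a Bearer token and that the beta `OpenAI-Beta: realtime=v1` header is still expected; newer API versions may not need it. Audio is assumed to be raw PCM16.

```python
import base64
import json
from urllib.parse import urlencode

REALTIME_WS_URL = "wss://api.openai.com/v1/realtime"


def realtime_connection(model, client_secret):
    """Return (url, headers) for opening the Realtime WebSocket."""
    url = f"{REALTIME_WS_URL}?{urlencode({'model': model})}"
    headers = {
        # Assumption: the client secret from the session response goes here.
        "Authorization": f"Bearer {client_secret}",
        "OpenAI-Beta": "realtime=v1",
    }
    return url, headers


def audio_append_event(pcm16_chunk):
    """Wrap a chunk of raw audio bytes in an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
    })
```

With server-side VAD enabled, appended audio is segmented into turns by the server, so the client typically only needs to keep sending `input_audio_buffer.append` events over the open socket.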
Error Responses
Returned when authentication fails.
Returned when the OpenAI API key is not configured or the OpenAI API request fails.
Features
- Real-time Voice Interaction: Natural conversational voice AI with low latency
- Automatic Transcription: Uses Whisper-1 for input audio transcription
- Server VAD: Server-side Voice Activity Detection for natural turn-taking
- Memory-Powered: Remembers user context and personalizes responses
- Desktop Automation: Full control over Windows desktop via voice commands
- Web Search Integration: Access to current information via Tavily
- Tool Calling: Can invoke multiple tools during conversation
- Conversational Style: Speaks naturally, avoiding markdown and code blocks in speech
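The transcription and VAD features above correspond to fields in an OpenAI Realtime session configuration. The following is a sketch of what such a configuration could look like. The field names follow OpenAI's Realtime session object, but whether this endpoint sends exactly this shape is an assumption, and `build_session_config` is a hypothetical helper.

```python
def build_session_config(model="gpt-realtime-mini", voice="ash"):
    """Assemble a Realtime session configuration (illustrative only)."""
    return {
        "model": model,
        "voice": voice,
        # Transcribe the user's speech with Whisper-1.
        "input_audio_transcription": {"model": "whisper-1"},
        # Let the server detect when the user starts and stops speaking.
        "turn_detection": {"type": "server_vad"},
    }


config = build_session_config()
```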
Voice Assistant Behavior
The voice agent is optimized for natural conversation:
- Searches user memories immediately at conversation start
- Addresses users by name if known
- Stores new personal information aggressively
- Uses conversational language (avoids markdown/bullets in speech)
- Proactively helps with workflow automation
- Remembers daily routines and can execute them on command
Use Cases
- Hands-free Desktop Control: Control Windows applications by voice
- Voice-Activated Workflows: “Prep my workspace” to open all work apps
- Conversational Memory: “What did I work on yesterday?”
- Real-time Information: “What’s in the news about AI today?”
- Personalized Assistant: Learns your preferences and habits over time
- Accessibility: Voice control for users who prefer or require hands-free interaction