Build your first voice-enabled AI agent with streaming text generation, real-time speech synthesis, and tool calling capabilities.
What You’ll Build
By the end of this guide, you’ll have a working voice agent that:
Processes text input and streams responses
Calls tools (like weather lookup) during conversations
Generates streaming audio responses
Supports WebSocket connections for real-time voice interaction
Install the SDK
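Install voice-agent-ai-sdk and the AI SDK with your preferred provider:

npm install voice-agent-ai-sdk ai @ai-sdk/openai

The examples in this guide also import zod and dotenv. If they are not already present in your project, add them with npm install zod dotenv.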
The SDK is built on the Vercel AI SDK, giving you access to multiple LLM providers, tools, and streaming capabilities.
Set up environment variables
Create a .env file in your project root with your API keys:

OPENAI_API_KEY=your_openai_api_key
VOICE_WS_ENDPOINT=ws://localhost:8080 # Optional for WebSocket mode
The VOICE_WS_ENDPOINT is only needed if you want real-time voice interaction over WebSocket. For text-only usage, you can skip it.
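To catch a missing key before the first request, the two variables can be checked at startup. The following is a minimal sketch; `readVoiceEnv` and `VoiceEnv` are hypothetical names and not part of voice-agent-ai-sdk.

```typescript
// Hypothetical helper: not part of voice-agent-ai-sdk.
interface VoiceEnv {
  openaiApiKey: string;
  wsEndpoint: string | undefined; // undefined means text-only mode
}

function readVoiceEnv(env: Record<string, string | undefined>): VoiceEnv {
  const openaiApiKey = env.OPENAI_API_KEY?.trim();
  if (!openaiApiKey) {
    throw new Error("OPENAI_API_KEY is not set; add it to your .env file");
  }

  // An empty or missing endpoint is fine: the agent is then used in text-only mode.
  const wsEndpoint = env.VOICE_WS_ENDPOINT?.trim() || undefined;
  if (wsEndpoint && !/^wss?:\/\//.test(wsEndpoint)) {
    throw new Error(`VOICE_WS_ENDPOINT must start with ws:// or wss://, got "${wsEndpoint}"`);
  }

  return { openaiApiKey, wsEndpoint };
}
```

After `import "dotenv/config"`, a call such as `readVoiceEnv(process.env)` would fail fast with a readable message instead of surfacing later as a provider authentication error.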
Define tools for your agent
Tools allow your agent to fetch real-time data or perform actions. Define them using the AI SDK’s tool function:

import { tool } from "ai";
import { z } from "zod";

const weatherTool = tool({
  description: "Get the weather in a location",
  inputSchema: z.object({
    location: z.string().describe("The location to get the weather for"),
  }),
  execute: async ({ location }) => ({
    location,
    temperature: 72,
    conditions: "sunny",
  }),
});

const timeTool = tool({
  description: "Get the current time",
  inputSchema: z.object({}),
  execute: async () => ({
    time: new Date().toLocaleTimeString(),
    timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
  }),
});
These tools will be automatically called when the LLM determines they’re needed to answer the user’s query.
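Because each `execute` handler is an ordinary async function, its logic can be written and checked on its own before it is wrapped in `tool()`. The sketch below mirrors the two stubbed handlers above; the names `lookUpWeather`, `currentTime`, and `WeatherReport` are illustrative and not SDK APIs.

```typescript
// Plain async functions mirroring the `execute` handlers above.
// The names are illustrative; nothing here comes from the SDK.
interface WeatherReport {
  location: string;
  temperature: number; // stubbed value; replace with a real weather API call
  conditions: string;
}

async function lookUpWeather({ location }: { location: string }): Promise<WeatherReport> {
  return { location, temperature: 72, conditions: "sunny" };
}

async function currentTime(): Promise<{ time: string; timezone: string }> {
  return {
    time: new Date().toLocaleTimeString(),
    timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
  };
}

// Either function can then be passed as the handler, for example:
//   tool({ description, inputSchema, execute: lookUpWeather })
```

Keeping the handler separate makes it straightforward to swap the stub for a real data source later without touching the tool's description or schema.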
Initialize the VoiceAgent
Create a new VoiceAgent instance with your desired configuration:

import "dotenv/config";
import { VoiceAgent } from "voice-agent-ai-sdk";
import { openai } from "@ai-sdk/openai";

const agent = new VoiceAgent({
  // Core models
  model: openai("gpt-4o"),
  transcriptionModel: openai.transcription("whisper-1"),
  speechModel: openai.speech("gpt-4o-mini-tts"),

  // System instructions
  instructions: `You are a helpful voice assistant.
Keep responses concise and conversational since they will be spoken aloud.
Use tools when needed to provide accurate information.`,

  // Voice settings
  voice: "alloy", // Options: alloy, echo, fable, onyx, nova, shimmer
  speechInstructions: "Speak in a friendly, natural conversational tone.",
  outputFormat: "mp3",

  // Streaming speech optimization
  streamingSpeech: {
    minChunkSize: 40,
    maxChunkSize: 180,
    parallelGeneration: true,
    maxParallelRequests: 2,
  },

  // WebSocket endpoint (optional)
  endpoint: process.env.VOICE_WS_ENDPOINT,

  // Register your tools
  tools: {
    getWeather: weatherTool,
    getTime: timeTool,
  },
});
The agent handles the entire voice interaction lifecycle: text streaming, tool calling, and speech synthesis.
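The `minChunkSize` and `maxChunkSize` options suggest that response text is split into bounded pieces before each piece is sent for synthesis. This guide does not show how the SDK does that, so the following is only a sketch of one plausible approach. It assumes the sizes are character counts and that splitting prefers sentence boundaries; `chunkForSpeech` is a hypothetical name.

```typescript
// Illustrative only: the SDK's real chunking logic may differ.
// Assumes minChunkSize and maxChunkSize are measured in characters.
function chunkForSpeech(text: string, minChunkSize = 40, maxChunkSize = 180): string[] {
  // Break after sentence-ending punctuation, keeping the punctuation attached.
  const sentences = text
    .trim()
    .replace(/([.!?])\s+/g, "$1\n")
    .split("\n")
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let buffer = "";
  const flush = () => {
    if (buffer) chunks.push(buffer);
    buffer = "";
  };

  for (let sentence of sentences) {
    // If adding this sentence would overflow the buffer, emit the buffer first.
    if (buffer && buffer.length + 1 + sentence.length > maxChunkSize) flush();

    // Hard-wrap a single over-long sentence at the last space that fits.
    while (sentence.length > maxChunkSize) {
      const cut = sentence.lastIndexOf(" ", maxChunkSize);
      const end = cut > 0 ? cut : maxChunkSize;
      chunks.push(sentence.slice(0, end));
      sentence = sentence.slice(end).trimStart();
    }

    buffer = buffer ? `${buffer} ${sentence}` : sentence;
    // Emit as soon as the buffer is long enough to be worth a request.
    if (buffer.length >= minChunkSize) flush();
  }

  flush(); // trailing text shorter than minChunkSize still has to be spoken
  return chunks;
}
```

Under these assumptions, a smaller `minChunkSize` lets the first audio request start sooner, while a larger `maxChunkSize` means fewer requests but a longer wait for each one.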
Set up event listeners
Listen to events to track the agent’s activity and handle responses:

// User input and assistant responses
agent.on("text", ({ role, text }) => {
  const prefix = role === "user" ? "👤 User" : "🤖 Assistant";
  console.log(`${prefix}: ${text}`);
});

// Real-time streaming text tokens
agent.on("chunk:text_delta", ({ text }) => {
  process.stdout.write(text);
});

// Tool execution events
agent.on("chunk:tool_call", ({ toolName, input }) => {
  console.log(`\n[Tool] Calling ${toolName}...`, JSON.stringify(input));
});
agent.on("tool_result", ({ name, result }) => {
  console.log(`[Tool] ${name} result:`, JSON.stringify(result));
});

// Speech generation events
agent.on("speech_start", ({ streaming }) => {
  console.log(`[TTS] Speech started (streaming=${streaming})`);
});
agent.on("audio_chunk", ({ chunkId, format, uint8Array }) => {
  console.log(`[Audio] Chunk #${chunkId} (${uint8Array.length} bytes, ${format})`);
  // Save or stream the audio chunk
});
agent.on("speech_complete", () => {
  console.log("[TTS] Speech generation complete");
});

// Error handling
agent.on("error", (error) => {
  console.error("[Error]", error.message);
});
The SDK emits events at every stage of processing, giving you full visibility into the agent’s behavior.
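The `audio_chunk` listener above only logs each chunk. One way to act on the "save or stream" comment is to buffer chunks and join them when `speech_complete` fires. The sketch below makes two assumptions that this guide does not confirm: that `chunkId` is a number reflecting playback order, and that concatenating the chunk bytes yields usable audio for the chosen format. `collectAudio`, `AudioChunk`, and `AgentLike` are hypothetical names.

```typescript
// Payload fields taken from the audio_chunk listener above. Treating chunkId
// as a number that reflects playback order is an assumption.
interface AudioChunk {
  chunkId: number;
  format: string;
  uint8Array: Uint8Array;
}

// Minimal structural type so the helper accepts any emitter with an `on` method.
interface AgentLike {
  on(event: string, listener: (payload: any) => void): unknown;
}

// Hypothetical helper: buffers chunks, then hands back one contiguous byte array.
function collectAudio(
  agent: AgentLike,
  onDone: (audio: Uint8Array, format: string) => void,
): void {
  let pending: AudioChunk[] = [];

  agent.on("audio_chunk", (chunk: AudioChunk) => {
    pending.push(chunk);
  });

  agent.on("speech_complete", () => {
    // Sort defensively in case chunks were delivered out of order.
    const ordered = [...pending].sort((a, b) => a.chunkId - b.chunkId);
    const total = ordered.reduce((sum, c) => sum + c.uint8Array.length, 0);
    const audio = new Uint8Array(total);
    let offset = 0;
    for (const c of ordered) {
      audio.set(c.uint8Array, offset);
      offset += c.uint8Array.length;
    }
    const format = ordered[0]?.format ?? "mp3";
    pending = []; // reset for the next response
    onDone(audio, format);
  });
}
```

In a Node script, the callback could write the result to disk, for example with `fs.writeFileSync` and a file name built from `format`.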
Send your first message
Send a text message to the agent and get a streaming response:

try {
  const response = await agent.sendText("What's the weather in San Francisco?");
  console.log("Full response:", response);
} catch (error) {
  console.error("Error:", error);
}
The agent will:
Add your message to the conversation history
Stream text tokens in real-time via chunk:text_delta events
Detect that it needs weather data and call the getWeather tool
Generate a response incorporating the tool result
Convert the response to speech in parallel chunks
Emit audio chunks as they’re generated
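The last two steps, combined with `parallelGeneration` and `maxParallelRequests: 2` from the configuration, imply that several synthesis requests can be in flight while audio is still emitted in order. The sketch below shows one general way to get that behavior. It is not the SDK's implementation; `mapInOrder` is a hypothetical name, and `worker` stands in for whatever call produces audio for a text chunk.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight and invokes
// `emit` strictly in input order as soon as each result can be released.
async function mapInOrder<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
  emit: (result: R, index: number) => void,
): Promise<void> {
  const finished = new Map<number, R>();
  let nextToStart = 0;
  let nextToEmit = 0;

  async function runner(): Promise<void> {
    while (nextToStart < items.length) {
      const i = nextToStart++;
      finished.set(i, await worker(items[i], i));
      // Release every finished result that is next in line.
      while (finished.has(nextToEmit)) {
        emit(finished.get(nextToEmit) as R, nextToEmit);
        finished.delete(nextToEmit);
        nextToEmit++;
      }
    }
  }

  // Start up to `limit` runners; each pulls the next unstarted item when free.
  const count = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: count }, () => runner()));
}
```

With a limit of 2, the second chunk can be synthesized while the first is still pending, but it is held back until the first has been emitted, so playback order is preserved.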
Optional: Connect to WebSocket for real-time voice
For real-time voice interaction, connect to a WebSocket server:

if (process.env.VOICE_WS_ENDPOINT) {
  await agent.connect(process.env.VOICE_WS_ENDPOINT);
  console.log("Agent connected and listening for audio input");
  // The agent will now listen for WebSocket messages like:
  // { type: "transcript", text: "user speech text" }
  // { type: "audio", data: "base64AudioData", format: "mp3" }
}
The WebSocket protocol supports:
Text transcripts from browser speech recognition
Audio data for server-side transcription with Whisper
Interruptions to cancel ongoing responses (barge-in)
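The two inbound message shapes shown in the comments of the connection snippet can be described as a discriminated union, which helps when writing a client or a test server. This is a sketch: the type names and helpers are hypothetical, and the interruption message is left out because its shape is not shown in this guide.

```typescript
// Shapes taken from the comments in the connect() snippet above.
type TranscriptMessage = { type: "transcript"; text: string };
type AudioMessage = { type: "audio"; data: string; format: string }; // data is base64
type InboundMessage = TranscriptMessage | AudioMessage;

// Hypothetical parser: returns null for anything that is not one of the two shapes.
function parseInbound(raw: string): InboundMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof value !== "object" || value === null) return null;
  const msg = value as Record<string, unknown>;

  if (msg.type === "transcript" && typeof msg.text === "string") {
    return { type: "transcript", text: msg.text };
  }
  if (msg.type === "audio" && typeof msg.data === "string" && typeof msg.format === "string") {
    return { type: "audio", data: msg.data, format: msg.format };
  }
  return null;
}

// Client side: build the JSON a browser might send after speech recognition.
function transcriptMessage(text: string): string {
  const msg: TranscriptMessage = { type: "transcript", text };
  return JSON.stringify(msg);
}
```

A browser client could pass the output of `transcriptMessage` to `WebSocket.send`, and a test server could run incoming frames through `parseInbound` before acting on them.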
Expected Output
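When you run the code above, you’ll see output similar to this. The banner lines and the "[TTS] Queued chunk" line are not printed by the listeners shown in this guide, so your output may omit them: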
=== Voice Agent Demo ===
Testing text-only mode (no WebSocket required)
--- Test 1: Text Query ---
👤 User: What's the weather in San Francisco?
[Tool] Calling getWeather... {"location":"San Francisco"}
[Tool] getWeather result: {"location":"San Francisco","temperature":72,"conditions":"sunny"}
🤖 Assistant: The weather in San Francisco is currently sunny with a temperature of 72°F.
[TTS] Speech started (streaming=true)
[TTS] Queued chunk #0: The weather in San Francisco is currently...
[Audio] Chunk #0 (24576 bytes, mp3)
[TTS] Speech generation complete
Complete Example
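Here’s the full working example you can copy and run. It uses top-level await, so run it as an ES module (for example, with "type": "module" in package.json):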
import "dotenv/config";
import { VoiceAgent } from "voice-agent-ai-sdk";
import { tool } from "ai";
import { z } from "zod";
import { openai } from "@ai-sdk/openai";

// Define tools
const weatherTool = tool({
  description: "Get the weather in a location",
  inputSchema: z.object({
    location: z.string().describe("The location to get the weather for"),
  }),
  execute: async ({ location }) => ({
    location,
    temperature: 72,
    conditions: "sunny",
  }),
});

// Initialize agent
const agent = new VoiceAgent({
  model: openai("gpt-4o"),
  transcriptionModel: openai.transcription("whisper-1"),
  speechModel: openai.speech("gpt-4o-mini-tts"),
  instructions: "You are a helpful voice assistant.",
  voice: "alloy",
  speechInstructions: "Speak in a friendly, natural conversational tone.",
  outputFormat: "mp3",
  streamingSpeech: {
    minChunkSize: 40,
    maxChunkSize: 180,
    parallelGeneration: true,
    maxParallelRequests: 2,
  },
  endpoint: process.env.VOICE_WS_ENDPOINT,
  tools: { getWeather: weatherTool },
});

// Set up event listeners
agent.on("text", ({ role, text }) => {
  const prefix = role === "user" ? "👤" : "🤖";
  console.log(prefix, text);
});
agent.on("chunk:text_delta", ({ text }) => process.stdout.write(text));
agent.on("audio_chunk", ({ chunkId, format, uint8Array }) => {
  console.log(`Audio chunk ${chunkId}: ${uint8Array.length} bytes`);
});

// Send message
await agent.sendText("What's the weather in San Francisco?");

// Optional: connect to WebSocket
if (process.env.VOICE_WS_ENDPOINT) {
  await agent.connect(process.env.VOICE_WS_ENDPOINT);
}
Next Steps
Now that you have a working voice agent, explore more advanced features:
Configuration Guide: Fine-tune streaming speech, memory limits, and audio settings
Events Reference: Complete list of all events and their payloads
VoiceAgent API: Full API reference for methods and properties
Examples: More examples including WebSocket servers and browser clients