Voice agents combine Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS) to create natural conversational experiences.
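Conceptually, each conversational turn flows STT → LLM → TTS. The sketch below illustrates that loop with hypothetical stub functions (`transcribe`, `generate_reply`, `synthesize` are stand-ins, not part of the vision_agents API):

```python
# Minimal, illustrative STT -> LLM -> TTS turn loop.
# All three stage functions are hypothetical stubs.

def transcribe(audio: bytes) -> str:
    # STT: convert caller audio into text (stubbed).
    return audio.decode("utf-8")

def generate_reply(text: str) -> str:
    # LLM: produce a short conversational answer (stubbed).
    return f"You said: {text}"

def synthesize(text: str) -> bytes:
    # TTS: convert the reply back into audio (stubbed).
    return text.encode("utf-8")

def handle_turn(audio_in: bytes) -> bytes:
    transcript = transcribe(audio_in)
    reply = generate_reply(transcript)
    return synthesize(reply)
```

In a real agent each stage is a streaming network call, which is why the component choices below focus on latency.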
## Basic Voice Agent
Here’s a complete example of a voice agent optimized for fast response times:
```python
from vision_agents.core import Agent, User
from vision_agents.plugins import deepgram, elevenlabs, gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="My AI Assistant", id="agent"),
    instructions="You're a voice AI assistant. Keep responses short and conversational.",
    llm=gemini.LLM("gemini-2.5-flash-lite"),
    tts=elevenlabs.TTS(model_id="eleven_flash_v2_5"),
    stt=deepgram.STT(eager_turn_detection=True),
)
```
## Component Selection

### Speech-to-Text (STT)

Choose your STT provider based on latency and accuracy needs. Deepgram (recommended) and Assembly AI are both supported.
With Deepgram:

```python
stt = deepgram.STT(
    eager_turn_detection=True  # Lower latency, higher token usage
)
```
Eager turn detection reduces latency by detecting turns faster, but increases STT token usage. Recommended for conversational agents.
### Large Language Model (LLM)

Select an LLM optimized for speed. Options include Gemini Flash, OpenAI GPT-4o Mini, and Anthropic Claude. For example, with Gemini Flash:

```python
llm = gemini.LLM("gemini-2.5-flash-lite")
```
### Text-to-Speech (TTS)

Choose a TTS provider for natural voice output:

```python
tts = elevenlabs.TTS(
    voice_id="FGY2WhTYpPnrIDTdsKH5",
    model_id="eleven_flash_v2_5",  # Fast, low-latency
)
```
## Function Calling

Add tools for the agent to use:

```python
from typing import Any, Dict

from vision_agents.core import Agent, User
from vision_agents.plugins import deepgram, elevenlabs, gemini, getstream

llm = gemini.LLM("gemini-2.5-flash-lite")

@llm.register_function(description="Get current weather for a location")
async def get_weather(location: str) -> Dict[str, Any]:
    # Your weather API logic here
    return {
        "location": location,
        "temperature": 72,
        "condition": "sunny",
    }

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Weather Assistant", id="agent"),
    instructions="Help users check weather. Always use the get_weather function.",
    llm=llm,
    tts=elevenlabs.TTS(),
    stt=deepgram.STT(eager_turn_detection=True),
)
```
## Phone Integration

Build voice agents that answer phone calls via Twilio:

```python
import uuid

from fastapi import Depends, FastAPI, WebSocket

from vision_agents.core import User
from vision_agents.plugins import twilio

app = FastAPI()
call_registry = twilio.TwilioCallRegistry()

@app.post("/twilio/voice")
async def twilio_voice_webhook(
    data: twilio.CallWebhookInput = Depends(twilio.CallWebhookInput.as_form),
):
    call_id = str(uuid.uuid4())

    async def prepare_call():
        # create_agent() builds your Agent; define it elsewhere in your app.
        agent = await create_agent()
        phone_user = User(
            name=f"Call from {data.from_number}",
            id=f"phone-{data.from_number}",
        )
        stream_call = await agent.create_call("default", call_id=call_id)
        return agent, phone_user, stream_call

    twilio_call = call_registry.create(call_id, data, prepare=prepare_call)
    # NGROK_URL is your publicly reachable host for the media WebSocket.
    url = f"wss://{NGROK_URL}/twilio/media/{call_id}/{twilio_call.token}"
    return twilio.create_media_stream_response(url)

@app.websocket("/twilio/media/{call_id}/{token}")
async def media_stream(websocket: WebSocket, call_id: str, token: str):
    twilio_call = call_registry.validate(call_id, token)
    twilio_stream = twilio.TwilioMediaStream(websocket)
    await twilio_stream.accept()

    agent, phone_user, stream_call = await twilio_call.await_prepare()
    await twilio.attach_phone_to_call(stream_call, twilio_stream, phone_user.id)

    async with agent.join(stream_call, participant_wait_timeout=0):
        await agent.llm.simple_response(
            "Greet the caller warmly and ask how you can help."
        )
        await twilio_stream.run()
```
Twilio uses mulaw audio encoding at 8kHz. The framework handles conversion automatically.
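As an illustration of what that conversion involves, a single G.711 μ-law byte expands to a 16-bit linear PCM sample roughly like this (a minimal sketch of the standard decode algorithm; the framework does this for you):

```python
def mulaw_to_pcm16(mu_byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    mu = ~mu_byte & 0xFF            # mu-law bytes are stored inverted
    sign = mu & 0x80                # top bit: sample polarity
    exponent = (mu >> 4) & 0x07     # 3-bit segment (logarithmic range)
    mantissa = mu & 0x0F            # 4 bits of precision within the segment
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample
```

Decoding 0xFF yields silence (0), while 0x00 and 0x80 map to the extreme negative and positive samples (±32124), which is why 8-bit μ-law covers nearly the full 16-bit dynamic range.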
## Joining a Call

Create and join a call with your agent:

```python
async def join_call(agent: Agent, call_type: str, call_id: str) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        # Start the conversation
        await agent.simple_response("Hello! How can I help you today?")
        # Run until the call ends
        await agent.finish()
```
## Turn Detection

Control when the agent decides that the user has finished speaking. Deepgram's built-in turn detection and Vogent turn detection are both supported. With Deepgram:

```python
stt = deepgram.STT(
    eager_turn_detection=True  # Uses Deepgram's turn detection
)
```
## Production Best Practices

### Optimize for Latency

- Use `eager_turn_detection=True` for STT
- Choose fast models: `gemini-2.5-flash-lite`, `gpt-4o-mini`
- Select low-latency TTS: ElevenLabs Flash, Cartesia Sonic
- Deploy edge nodes close to users
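When tuning these choices, it helps to sum a rough per-turn latency budget. The figures below are illustrative placeholders for your own measurements, not vendor benchmarks:

```python
# Hypothetical per-turn latency budget in milliseconds.
# Replace each entry with numbers measured in your own deployment.
budget_ms = {
    "stt_final_transcript": 300,   # end of speech -> final transcript
    "llm_first_token": 350,        # prompt -> first generated token
    "tts_first_audio": 150,        # first token -> first audio chunk
    "network_round_trips": 100,    # edge + provider network overhead
}

total = sum(budget_ms.values())
print(f"Estimated time to first audio: {total} ms")
```

Keeping the total under roughly one second is a common target for a conversation to feel responsive.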
### Handle Interruptions

The framework handles barge-in automatically: queued audio is flushed when the user interrupts.
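Conceptually, barge-in handling amounts to discarding any queued-but-unplayed agent audio the moment user speech is detected. A minimal sketch of that idea (not the framework's internal implementation):

```python
from collections import deque
from typing import Optional

class PlaybackQueue:
    """Buffers outbound TTS audio chunks; flushed on barge-in."""

    def __init__(self) -> None:
        self._chunks = deque()

    def enqueue(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def next_chunk(self) -> Optional[bytes]:
        # Pop the next chunk to play, or None if the queue is empty.
        return self._chunks.popleft() if self._chunks else None

    def flush(self) -> int:
        """Drop all pending audio; returns how many chunks were discarded."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped

q = PlaybackQueue()
q.enqueue(b"chunk-1")
q.enqueue(b"chunk-2")
q.next_chunk()       # chunk-1 is played
dropped = q.flush()  # user started speaking: discard the rest
```

Flushing rather than pausing is what makes the agent sound like it stopped talking instantly instead of resuming a stale sentence.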
### Error Handling

```python
@agent.llm.events.subscribe
async def on_llm_error(event: LLMErrorEvent):
    logger.error(f"LLM error: {event.error}")
    await agent.simple_response(
        "I'm having trouble right now. Could you repeat that?"
    )
```
## Complete Example

See `examples/01_simple_agent_example/simple_agent_example.py` in the source code for a complete working example.
## Next Steps