Voice agents combine Speech-to-Text (STT), Large Language Models (LLM), and Text-to-Speech (TTS) to create natural conversational experiences.

Basic Voice Agent

Here’s a complete example of a voice agent optimized for fast response times:
from vision_agents.core import Agent, User
from vision_agents.plugins import deepgram, elevenlabs, gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="My AI Assistant", id="agent"),
    instructions="You're a voice AI assistant. Keep responses short and conversational.",
    llm=gemini.LLM("gemini-2.5-flash-lite"),
    tts=elevenlabs.TTS(model_id="eleven_flash_v2_5"),
    stt=deepgram.STT(eager_turn_detection=True),
)

Component Selection

Speech-to-Text (STT)

Choose your STT provider based on latency and accuracy needs:
stt = deepgram.STT(
    eager_turn_detection=True  # Lower latency, higher token usage
)
Eager turn detection reduces latency by committing to end-of-turn sooner, at the cost of higher STT token usage. Recommended for conversational agents.
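To make the tradeoff concrete, here is a toy silence-threshold detector — purely illustrative, not Deepgram's model — where "eager" simply requires a shorter run of quiet frames before declaring the turn over:

```python
# Illustration only: a toy silence-based end-of-turn detector.
# "Eager" uses a shorter silence threshold, trading a small risk of
# cutting the user off for noticeably lower response latency.

def detect_turn_end(frame_energies, silence_level=0.1, eager=False):
    """Return the frame index where the turn is considered finished,
    or None if the speaker never goes quiet long enough."""
    required_silence = 3 if eager else 8  # quiet frames needed to end a turn
    quiet_run = 0
    for i, energy in enumerate(frame_energies):
        quiet_run = quiet_run + 1 if energy < silence_level else 0
        if quiet_run >= required_silence:
            return i
    return None

# Speech, a short pause, then more speech: the eager detector fires
# during the pause; the conservative one keeps waiting.
frames = [0.8, 0.7, 0.9, 0.05, 0.04, 0.03, 0.02, 0.6, 0.7]
```

The real provider models are far more sophisticated (they weigh semantics, not just silence), but the latency/accuracy tension is the same.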

Large Language Model (LLM)

Select an LLM optimized for speed:
llm = gemini.LLM("gemini-2.5-flash-lite")

Text-to-Speech (TTS)

Choose TTS for natural voice output:
tts = elevenlabs.TTS(
    voice_id="FGY2WhTYpPnrIDTdsKH5",
    model_id="eleven_flash_v2_5"  # Fast, low-latency
)

Function Calling

Add tools for the agent to use:
from typing import Any, Dict

from vision_agents.core import Agent, User
from vision_agents.plugins import deepgram, elevenlabs, gemini, getstream

llm = gemini.LLM("gemini-2.5-flash-lite")

@llm.register_function(description="Get current weather for a location")
async def get_weather(location: str) -> Dict[str, Any]:
    # Your weather API logic here
    return {
        "location": location,
        "temperature": 72,
        "condition": "sunny"
    }

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Weather Assistant", id="agent"),
    instructions="Help users check weather. Always use the get_weather function.",
    llm=llm,
    tts=elevenlabs.TTS(),
    stt=deepgram.STT(eager_turn_detection=True),
)
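Under the hood, function calling works by the LLM emitting the function's name plus JSON-encoded arguments, which the framework decodes and dispatches to your registered coroutine. A minimal sketch of that dispatch loop (the registry and tool-call shape here are illustrative, not the framework's internals):

```python
import asyncio
import json

# Hypothetical mini-registry mirroring what @llm.register_function does:
# map a function name to a coroutine, then dispatch the LLM's tool call.
registry = {}

def register_function(description):
    def wrap(fn):
        registry[fn.__name__] = fn
        return fn
    return wrap

@register_function(description="Get current weather for a location")
async def get_weather(location):
    return {"location": location, "temperature": 72, "condition": "sunny"}

async def dispatch(tool_call):
    """Execute a tool call of the form an LLM emits: name + JSON args."""
    fn = registry[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return await fn(**args)

result = asyncio.run(dispatch({"name": "get_weather",
                               "arguments": '{"location": "Boulder"}'}))
```

The decorator's `description` is what the LLM sees when deciding whether to call the tool, which is why clear descriptions matter as much as clear instructions.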

Phone Integration

Build voice agents that answer phone calls via Twilio:
import uuid

from fastapi import Depends, FastAPI, WebSocket
from vision_agents.core import User
from vision_agents.plugins import twilio

NGROK_URL = "<your-public-hostname>"  # e.g. an ngrok tunnel to this server

app = FastAPI()
call_registry = twilio.TwilioCallRegistry()

@app.post("/twilio/voice")
async def twilio_voice_webhook(
    data: twilio.CallWebhookInput = Depends(twilio.CallWebhookInput.as_form),
):
    call_id = str(uuid.uuid4())
    
    async def prepare_call():
        agent = await create_agent()
        phone_user = User(
            name=f"Call from {data.from_number}",
            id=f"phone-{data.from_number}"
        )
        stream_call = await agent.create_call("default", call_id=call_id)
        return agent, phone_user, stream_call
    
    twilio_call = call_registry.create(call_id, data, prepare=prepare_call)
    url = f"wss://{NGROK_URL}/twilio/media/{call_id}/{twilio_call.token}"
    return twilio.create_media_stream_response(url)

@app.websocket("/twilio/media/{call_id}/{token}")
async def media_stream(websocket: WebSocket, call_id: str, token: str):
    twilio_call = call_registry.validate(call_id, token)
    twilio_stream = twilio.TwilioMediaStream(websocket)
    await twilio_stream.accept()
    
    agent, phone_user, stream_call = await twilio_call.await_prepare()
    await twilio.attach_phone_to_call(stream_call, twilio_stream, phone_user.id)
    
    async with agent.join(stream_call, participant_wait_timeout=0):
        await agent.llm.simple_response(
            "Greet the caller warmly and ask how you can help."
        )
        await twilio_stream.run()
Twilio streams 8 kHz μ-law (mulaw) encoded audio. The framework handles the conversion automatically.
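You never need to do this conversion yourself, but for reference, the G.711 μ-law encoding step the framework performs on each PCM sample looks roughly like this (a standard textbook implementation, not the framework's actual code):

```python
def linear_to_mulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as an 8-bit G.711 mu-law byte."""
    BIAS, CLIP = 0x84, 32635
    sign = 0x80 if sample < 0 else 0x00
    if sample < 0:
        sample = -sample
    sample = min(sample, CLIP) + BIAS  # clip, then bias toward the knee

    # Find the segment (exponent): position of the highest set bit.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (sample & mask):
        mask >>= 1
        exponent -= 1

    mantissa = (sample >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF
```

Silence (sample 0) encodes to `0xFF` and full-scale positive to `0x80`, which is why a dropped stream sounds like a buzz rather than silence if the encoding is mismatched.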

Joining a Call

Create and join a call with your agent:
async def join_call(agent: Agent, call_type: str, call_id: str) -> None:
    call = await agent.create_call(call_type, call_id)
    
    async with agent.join(call):
        # Start the conversation
        await agent.simple_response(
            "Hello! How can I help you today?"
        )
        
        # Run until the call ends
        await agent.finish()

Turn Detection

Control when the agent detects the user has finished speaking:
stt = deepgram.STT(
    eager_turn_detection=True  # Uses Deepgram's turn detection
)

Production Best Practices

1. Optimize for Latency

  • Use eager_turn_detection=True for STT
  • Choose fast models: gemini-2.5-flash-lite, gpt-4o-mini
  • Select low-latency TTS: ElevenLabs Flash, Cartesia Sonic
  • Deploy edge nodes close to users
2. Handle Interruptions

# The framework handles barge-in automatically
# Audio queues are flushed on interruption
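The pattern behind that automatic handling is worth understanding: when the user barges in, any TTS audio still queued for playback is discarded so the agent stops mid-sentence instead of finishing a stale reply. A self-contained sketch of the flush (illustrative only; the framework does this for you):

```python
import asyncio

async def flush_on_interrupt(audio_queue: asyncio.Queue) -> int:
    """Drop all pending audio chunks; return how many were discarded."""
    dropped = 0
    while not audio_queue.empty():
        audio_queue.get_nowait()
        dropped += 1
    return dropped

async def demo() -> int:
    q = asyncio.Queue()
    for chunk in (b"hel", b"lo ", b"wor", b"ld"):
        q.put_nowait(chunk)  # TTS audio waiting to be played
    # User starts speaking: flush everything still in the queue.
    return await flush_on_interrupt(q)

dropped = asyncio.run(demo())
```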
3. Error Handling

import logging

logger = logging.getLogger(__name__)

# LLMErrorEvent is emitted by the LLM plugin; import it from your
# provider's plugin module.
@agent.llm.events.subscribe
async def on_llm_error(event: LLMErrorEvent):
    logger.error(f"LLM error: {event.error}")
    await agent.simple_response(
        "I'm having trouble right now. Could you repeat that?"
    )
4. Monitor Performance

See Observability for metrics collection.

Complete Example

See examples/01_simple_agent_example/simple_agent_example.py in the source code for a complete working example.
