This guide walks you through building a functional voice AI agent with function calling capabilities. You’ll create an agent that can have natural conversations and fetch real-time weather data.
Install Vision Agents
Install Vision Agents with the required plugins using uv:

```shell
uv add "vision-agents[getstream, gemini, deepgram, elevenlabs]"
```

Or, if you prefer pip:

```shell
pip install "vision-agents[getstream, gemini, deepgram, elevenlabs]"
```
Set up environment variables
Create a .env file with your API credentials:

```
# Stream credentials (get a free API key at https://getstream.io)
STREAM_API_KEY=your_stream_api_key
STREAM_API_SECRET=your_stream_api_secret

# LLM provider (choose one)
GOOGLE_API_KEY=your_google_api_key

# Speech-to-text
DEEPGRAM_API_KEY=your_deepgram_api_key

# Text-to-speech
ELEVENLABS_API_KEY=your_elevenlabs_api_key
```
Stream offers 333,000 free participant minutes per month. Get your API key at getstream.io.
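A missing key usually surfaces later as an opaque authentication error, so it can help to fail fast at startup. A minimal sketch, assuming the variable names above (the `missing_credentials` helper is not part of Vision Agents, just a convenience):

```python
import os

# Required variable names from the .env file above
REQUIRED_VARS = [
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
    "GOOGLE_API_KEY",
    "DEEPGRAM_API_KEY",
    "ELEVENLABS_API_KEY",
]


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Call it right after `load_dotenv()` and raise if the returned list is non-empty.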
Create your agent
Create a new file my_agent.py:

```python
import logging
from typing import Any, Dict

from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, Runner, User
from vision_agents.plugins import deepgram, elevenlabs, gemini, getstream

load_dotenv()

INSTRUCTIONS = """You're a voice AI assistant. Keep responses short and
conversational. Don't use special characters or formatting. Be friendly
and helpful."""


def setup_llm(model: str = "gemini-3-flash-preview") -> gemini.LLM:
    llm = gemini.LLM(model)

    @llm.register_function(description="Get current weather for a location")
    async def get_weather(location: str) -> Dict[str, Any]:
        # In a real app, call a weather API here
        return {
            "location": location,
            "temperature": "72°F",
            "condition": "Sunny",
            "humidity": "45%",
        }

    return llm


async def create_agent(**kwargs) -> Agent:
    llm = setup_llm()
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Weather Assistant", id="agent"),
        instructions=INSTRUCTIONS,
        llm=llm,
        tts=elevenlabs.TTS(model_id="eleven_flash_v2_5"),
        stt=deepgram.STT(eager_turn_detection=True),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        # Have the agent greet the user
        await agent.simple_response("say hello to the user")
        # Run until the call ends
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
Run your agent
Start your agent with `python my_agent.py`. This will:

- Create a new call session
- Print a join URL you can use to connect from a browser
- Start the agent listening for voice input

The console will display the call URL; open it in your browser to start talking with your agent!
What Just Happened?
Your agent combines four key components:
- Edge Network (`getstream.Edge()`) - Handles ultra-low-latency WebRTC connections
- Speech-to-Text (`deepgram.STT()`) - Converts your voice to text
- Language Model (`gemini.LLM()`) - Processes the conversation and decides when to call functions
- Text-to-Speech (`elevenlabs.TTS()`) - Converts the agent's responses back to voice
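Conceptually, one conversational turn in this cascade is three stages composed in order. A stub sketch with stand-in functions (not the real plugin APIs) makes the data flow explicit:

```python
def transcribe(audio: bytes) -> str:
    # Stand-in for deepgram.STT: audio in, text out
    return "what's the weather in new york"


def respond(text: str) -> str:
    # Stand-in for gemini.LLM: text in, reply (and tool calls) out
    return "It's sunny and 72 degrees in New York."


def synthesize(text: str) -> bytes:
    # Stand-in for elevenlabs.TTS: text in, audio out
    return text.encode("utf-8")


def voice_turn(audio: bytes) -> bytes:
    # STT -> LLM -> TTS, the same order the Agent wires up internally
    return synthesize(respond(transcribe(audio)))
```

The real pipeline is streaming and asynchronous, but the stage order is the same.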
Function Calling
The agent can call the get_weather() function automatically when users ask about the weather:

```python
@llm.register_function(description="Get current weather for a location")
async def get_weather(location: str) -> Dict[str, Any]:
    return {"location": location, "temperature": "72°F", "condition": "Sunny"}
```
When someone says "What's the weather in New York?", the LLM:

1. Recognizes it needs weather data
2. Calls `get_weather("New York")`
3. Incorporates the result into its response
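Under the hood, those steps amount to mapping the model's structured tool call back onto your Python coroutine by name. This toy registry is not Vision Agents' actual implementation, only an illustration of the dispatch mechanics behind `register_function`:

```python
import asyncio
from typing import Any, Callable, Dict


class ToyToolRegistry:
    """Illustrative only: maps tool-call names to registered coroutines."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable] = {}

    def register_function(self, description: str):
        def decorator(fn):
            fn.description = description  # sent to the model as the tool description
            self._tools[fn.__name__] = fn
            return fn
        return decorator

    async def dispatch(self, name: str, arguments: Dict[str, Any]) -> Any:
        # The model emits {"name": ..., "arguments": {...}}; call the matching function
        return await self._tools[name](**arguments)


registry = ToyToolRegistry()


@registry.register_function(description="Get current weather for a location")
async def get_weather(location: str) -> Dict[str, Any]:
    return {"location": location, "temperature": "72°F", "condition": "Sunny"}


result = asyncio.run(registry.dispatch("get_weather", {"location": "New York"}))
```

The function's signature and description become the tool schema the model sees; the dispatch step feeds the returned dict back into the conversation.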
Alternative: Using Realtime LLMs
Instead of separate STT, LLM, and TTS components, you can use a realtime LLM that handles everything:
Gemini Realtime:

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You're a helpful assistant",
    llm=gemini.Realtime(fps=10),  # Process 10 video frames per second
)
```

OpenAI Realtime:

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import openai, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You're a helpful assistant",
    llm=openai.Realtime(fps=1),  # Video optional, can be audio-only
)
```
Realtime LLMs with video can be expensive. Start with low FPS (1-10) and monitor your usage.
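A back-of-the-envelope frame budget helps before running up a bill. The token figure below is a placeholder, not a real provider price; substitute your provider's actual per-frame cost:

```python
def frames_sent(fps: float, minutes: float) -> int:
    """Total video frames streamed to the model over a call."""
    return int(fps * 60 * minutes)


def rough_token_budget(fps: float, minutes: float, tokens_per_frame: int = 250) -> int:
    """Placeholder estimate: tokens_per_frame is NOT a real provider figure."""
    return frames_sent(fps, minutes) * tokens_per_frame
```

At 10 fps, a 30-minute call streams 18,000 frames; at 1 fps, only 1,800, which is why starting low matters.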
Adding Video Understanding
To process video in addition to audio, add a video processor:

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream, ultralytics

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Assistant", id="agent"),
    instructions="You analyze video and provide insights",
    llm=gemini.Realtime(fps=10),
    processors=[
        ultralytics.YOLOPoseProcessor(
            model_path="yolo11n-pose.pt",
            device="cuda",  # or "cpu"
        )
    ],
)
```
The processor runs YOLO pose detection on video frames and provides the results to the LLM automatically.
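What the LLM ultimately receives is a compact summary of each frame's detections rather than raw tensors. A toy reduction of that idea, using made-up detection dicts rather than YOLO's real output format:

```python
from typing import Dict, List


def summarize_detections(detections: List[Dict], min_conf: float = 0.5) -> str:
    """Drop low-confidence detections and describe the rest for the LLM."""
    kept = [d for d in detections if d["confidence"] >= min_conf]
    if not kept:
        return "No people detected."
    return f"{len(kept)} person(s) detected."
```

Filtering by confidence before summarizing keeps noisy detections from reaching the model.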
Common Issues
ModuleNotFoundError: No module named 'vision_agents'
Make sure you installed Vision Agents in the environment you are running from, e.g. `pip install "vision-agents[getstream, gemini, deepgram, elevenlabs]"`.
Missing or invalid API keys

Check that your .env file is in the same directory as your script and contains the correct keys. Load it with:

```python
from dotenv import load_dotenv

load_dotenv()
```
Connection timeout or network errors
Try different TTS providers or models. ElevenLabs' `eleven_flash_v2_5` offers good quality with low latency.
Next Steps
- Voice Agents Guide - Learn advanced voice agent patterns and optimizations
- Video Agents Guide - Build agents that understand and react to video
- Integrations - Explore all 37+ LLM, STT, TTS, and vision integrations
- Examples - See complete example applications