This guide walks you through building a functional voice AI agent with function calling capabilities. You’ll create an agent that can have natural conversations and fetch real-time weather data.
Step 1: Install Vision Agents

Install Vision Agents with the required plugins using uv:
uv add "vision-agents[getstream, gemini, deepgram, elevenlabs]"
Or if you prefer pip:
pip install "vision-agents[getstream, gemini, deepgram, elevenlabs]"
Step 2: Set up environment variables

Create a .env file with your API credentials:
# Stream credentials (get free API key at https://getstream.io)
STREAM_API_KEY=your_stream_api_key
STREAM_API_SECRET=your_stream_api_secret

# LLM provider (choose one)
GOOGLE_API_KEY=your_google_api_key

# Speech-to-text
DEEPGRAM_API_KEY=your_deepgram_api_key

# Text-to-speech
ELEVENLABS_API_KEY=your_elevenlabs_api_key
Stream offers 333,000 free participant minutes per month. Get your API key at getstream.io.
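Before moving on, it can help to verify that every key is actually set. The helper below is not part of Vision Agents; it's a minimal sketch you could drop into your script after loading the .env file:

```python
import os

# Credentials this guide's agent expects (adjust to match your providers).
REQUIRED_KEYS = [
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
    "GOOGLE_API_KEY",
    "DEEPGRAM_API_KEY",
    "ELEVENLABS_API_KEY",
]

def missing_keys(env=os.environ):
    """Return the names of required credentials that are unset or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

Calling missing_keys() right after load_dotenv() and exiting early when it returns anything gives a clearer error than a failed API call mid-startup.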
Step 3: Create your agent

Create a new file my_agent.py:
import logging
from typing import Any, Dict

from dotenv import load_dotenv
from vision_agents.core import Agent, AgentLauncher, Runner, User
from vision_agents.plugins import deepgram, elevenlabs, gemini, getstream

load_dotenv()

INSTRUCTIONS = """You're a voice AI assistant. Keep responses short and 
conversational. Don't use special characters or formatting. Be friendly 
and helpful."""

def setup_llm(model: str = "gemini-3-flash-preview") -> gemini.LLM:
    llm = gemini.LLM(model)

    @llm.register_function(description="Get current weather for a location")
    async def get_weather(location: str) -> Dict[str, Any]:
        # In a real app, call a weather API here
        return {
            "location": location,
            "temperature": "72°F",
            "condition": "Sunny",
            "humidity": "45%"
        }

    return llm

async def create_agent(**kwargs) -> Agent:
    llm = setup_llm()

    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Weather Assistant", id="agent"),
        instructions=INSTRUCTIONS,
        llm=llm,
        tts=elevenlabs.TTS(model_id="eleven_flash_v2_5"),
        stt=deepgram.STT(eager_turn_detection=True),
    )

    return agent

async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)

    async with agent.join(call):
        # Have the agent greet the user
        await agent.simple_response("say hello to the user")
        
        # Run until the call ends
        await agent.finish()

if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
Step 4: Run your agent

Start your agent with:
uv run my_agent.py run
This will:
  • Create a new call session
  • Print a join URL you can use to connect from a browser
  • Start the agent listening for voice input
The console displays the call URL; open it in your browser to start talking with your agent!

What Just Happened?

Your agent combines four key components:
  1. Edge Network (getstream.Edge()) - Handles ultra-low latency WebRTC connections
  2. Speech-to-Text (deepgram.STT()) - Converts your voice to text
  3. Language Model (gemini.LLM()) - Processes the conversation and decides when to call functions
  4. Text-to-Speech (elevenlabs.TTS()) - Converts the agent’s responses back to voice

Function Calling

The agent can call the get_weather() function automatically when users ask about weather:
@llm.register_function(description="Get current weather for a location")
async def get_weather(location: str) -> Dict[str, Any]:
    return {"location": location, "temperature": "72°F", "condition": "Sunny"}
When someone says “What’s the weather in New York?”, the LLM:
  1. Recognizes it needs weather data
  2. Calls get_weather("New York")
  3. Incorporates the result into its response
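To make the stub real, you could call a weather service inside get_weather(). The sketch below assumes Open-Meteo's free forecast endpoint (no API key required); the query parameters and the weather-code mapping are assumptions to verify against that API's documentation, and you would still need to geocode the spoken location into coordinates:

```python
import json
import urllib.parse
import urllib.request

# Assumed subset of Open-Meteo's WMO weather codes.
WEATHER_CODES = {0: "Clear", 2: "Partly cloudy", 3: "Overcast",
                 61: "Light rain", 63: "Rain", 95: "Thunderstorm"}

def describe_current(payload):
    """Convert an Open-Meteo 'current' payload into the dict the agent returns."""
    current = payload["current"]
    return {
        "temperature": f"{current['temperature_2m']}°F",
        "condition": WEATHER_CODES.get(current["weather_code"], "Unknown"),
        "humidity": f"{current['relative_humidity_2m']}%",
    }

def fetch_weather(latitude: float, longitude: float) -> dict:
    """Blocking HTTP call; inside the async get_weather, run it via asyncio.to_thread()."""
    query = urllib.parse.urlencode({
        "latitude": latitude,
        "longitude": longitude,
        "current": "temperature_2m,relative_humidity_2m,weather_code",
        "temperature_unit": "fahrenheit",
    })
    url = f"https://api.open-meteo.com/v1/forecast?{query}"
    with urllib.request.urlopen(url) as response:
        return describe_current(json.load(response))
```

Keeping describe_current() pure means the interesting part of the tool stays testable without network access.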

Alternative: Using Realtime LLMs

Instead of separate STT, LLM, and TTS components, you can use a realtime LLM that handles everything:
from vision_agents.plugins import gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You're a helpful assistant",
    llm=gemini.Realtime(fps=10),  # Process 10 video frames per second
)
Realtime LLMs with video can be expensive. Start with low FPS (1-10) and monitor your usage.
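Cost scales linearly with fps and call length, so it is worth doing the arithmetic before raising the frame rate. A trivial sketch (illustrative only, not provider pricing):

```python
def frames_sent(fps: int, minutes: int) -> int:
    """Total video frames streamed to the realtime LLM over a call."""
    return fps * 60 * minutes

# A 10-minute call at 10 fps streams 6,000 frames; dropping to 1 fps cuts that to 600.
```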

Adding Video Understanding

To process video in addition to audio, add a video processor:
from vision_agents.plugins import ultralytics

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Assistant", id="agent"),
    instructions="You analyze video and provide insights",
    llm=gemini.Realtime(fps=10),
    processors=[
        ultralytics.YOLOPoseProcessor(
            model_path="yolo11n-pose.pt",
            device="cuda"  # or "cpu"
        )
    ],
)
The processor runs YOLO pose detection on video frames and provides the results to the LLM automatically.
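The exact payload a processor hands the LLM depends on the plugin, but conceptually it converts per-frame detections into context the model can reason about. As a purely hypothetical illustration (summarize_pose and its input format are not part of the ultralytics plugin), that conversion might look like:

```python
def summarize_pose(detections: list[dict]) -> str:
    """Turn pose detections ({'label', 'confidence'} dicts, hypothetical format)
    into one line of text usable as LLM context."""
    if not detections:
        return "No people detected in the current frame."
    parts = [f"{d['label']} ({d['confidence']:.0%})" for d in detections]
    return f"Detected {len(detections)} person(s): " + ", ".join(parts)
```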

Common Issues

  • Import errors: make sure Vision Agents and its plugins are installed in your current environment: uv add "vision-agents[getstream, gemini, deepgram, elevenlabs]"
  • Missing credentials: check that your .env file is in the same directory as your script, contains the correct keys, and is loaded with load_dotenv() before the agent starts.
  • Connection failures: ensure your Stream API credentials are correct. Test them at getstream.io/dashboard.
  • Poor audio quality: try different TTS providers or models. ElevenLabs eleven_flash_v2_5 offers good quality with low latency.

Next Steps

  • Voice Agents Guide: learn advanced voice agent patterns and optimizations
  • Video Agents Guide: build agents that understand and react to video
  • Integrations: explore all 37+ LLM, STT, TTS, and vision integrations
  • Examples: see complete example applications
