This guide walks you through building a functional voice AI agent with function calling capabilities. You’ll create an agent that can have natural conversations and fetch real-time weather data.
Install Vision Agents
Install Vision Agents with the required plugins using uv:

```shell
uv add "vision-agents[getstream, gemini, deepgram, elevenlabs]"
```

Or, if you prefer pip:

```shell
pip install "vision-agents[getstream, gemini, deepgram, elevenlabs]"
```
Set up environment variables
Create a .env file with your API credentials:

```
# Stream credentials (get a free API key at https://getstream.io)
STREAM_API_KEY=your_stream_api_key
STREAM_API_SECRET=your_stream_api_secret

# LLM provider (choose one)
GOOGLE_API_KEY=your_google_api_key

# Speech-to-text
DEEPGRAM_API_KEY=your_deepgram_api_key

# Text-to-speech
ELEVENLABS_API_KEY=your_elevenlabs_api_key
```
Stream offers 333,000 free participant minutes per month. Get your API key at getstream.io.
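A missing key usually surfaces later as an opaque authentication error, so it can help to fail fast at startup. A minimal sketch, assuming the variable names above (the `missing_credentials` helper is not part of Vision Agents, just a convenience):

```python
import os

# Required variable names from the .env file above
REQUIRED_VARS = [
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
    "GOOGLE_API_KEY",
    "DEEPGRAM_API_KEY",
    "ELEVENLABS_API_KEY",
]


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Call it right after `load_dotenv()` and raise if the returned list is non-empty.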
Create your agent
Create a new file my_agent.py:

```python
import logging
from typing import Any, Dict

from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, Runner, User
from vision_agents.plugins import deepgram, elevenlabs, gemini, getstream

load_dotenv()

INSTRUCTIONS = """You're a voice AI assistant. Keep responses short and
conversational. Don't use special characters or formatting. Be friendly
and helpful."""


def setup_llm(model: str = "gemini-3-flash-preview") -> gemini.LLM:
    llm = gemini.LLM(model)

    @llm.register_function(description="Get current weather for a location")
    async def get_weather(location: str) -> Dict[str, Any]:
        # In a real app, call a weather API here
        return {
            "location": location,
            "temperature": "72°F",
            "condition": "Sunny",
            "humidity": "45%",
        }

    return llm


async def create_agent(**kwargs) -> Agent:
    llm = setup_llm()
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Weather Assistant", id="agent"),
        instructions=INSTRUCTIONS,
        llm=llm,
        tts=elevenlabs.TTS(model_id="eleven_flash_v2_5"),
        stt=deepgram.STT(eager_turn_detection=True),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        # Have the agent greet the user
        await agent.simple_response("say hello to the user")
        # Run until the call ends
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
Run your agent
Start your agent with `python my_agent.py`. This will:

- Create a new call session
- Print a join URL you can use to connect from a browser
- Start the agent listening for voice input

The console will display the call URL; open it in your browser to start talking with your agent!
What Just Happened?
Your agent combines four key components:
- Edge Network (`getstream.Edge()`) - Handles ultra-low-latency WebRTC connections
- Speech-to-Text (`deepgram.STT()`) - Converts your voice to text
- Language Model (`gemini.LLM()`) - Processes the conversation and decides when to call functions
- Text-to-Speech (`elevenlabs.TTS()`) - Converts the agent's responses back to voice
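Conceptually, one conversational turn in this cascade is three stages composed in order. A stub sketch with stand-in functions (not the real plugin APIs) makes the data flow explicit:

```python
def transcribe(audio: bytes) -> str:
    # Stand-in for deepgram.STT: audio in, text out
    return "what's the weather in new york"


def respond(text: str) -> str:
    # Stand-in for gemini.LLM: text in, reply (and tool calls) out
    return "It's sunny and 72 degrees in New York."


def synthesize(text: str) -> bytes:
    # Stand-in for elevenlabs.TTS: text in, audio out
    return text.encode("utf-8")


def voice_turn(audio: bytes) -> bytes:
    # STT -> LLM -> TTS, the same order the Agent wires up internally
    return synthesize(respond(transcribe(audio)))
```

The real pipeline is streaming and asynchronous, but the stage order is the same.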
Function Calling
The agent can call the get_weather() function automatically when users ask about the weather:

```python
@llm.register_function(description="Get current weather for a location")
async def get_weather(location: str) -> Dict[str, Any]:
    return {"location": location, "temperature": "72°F", "condition": "Sunny"}
```
When someone says "What's the weather in New York?", the LLM:

1. Recognizes it needs weather data
2. Calls `get_weather("New York")`
3. Incorporates the result into its response
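Under the hood, those steps amount to mapping the model's structured tool call back onto your Python coroutine by name. This toy registry is not Vision Agents' actual implementation, only an illustration of the dispatch mechanics behind `register_function`:

```python
import asyncio
from typing import Any, Callable, Dict


class ToyToolRegistry:
    """Illustrative only: maps tool-call names to registered coroutines."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable] = {}

    def register_function(self, description: str):
        def decorator(fn):
            fn.description = description  # sent to the model as the tool description
            self._tools[fn.__name__] = fn
            return fn
        return decorator

    async def dispatch(self, name: str, arguments: Dict[str, Any]) -> Any:
        # The model emits {"name": ..., "arguments": {...}}; call the matching function
        return await self._tools[name](**arguments)


registry = ToyToolRegistry()


@registry.register_function(description="Get current weather for a location")
async def get_weather(location: str) -> Dict[str, Any]:
    return {"location": location, "temperature": "72°F", "condition": "Sunny"}


result = asyncio.run(registry.dispatch("get_weather", {"location": "New York"}))
```

The function's signature and description become the tool schema the model sees; the dispatch step feeds the returned dict back into the conversation.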
Alternative: Using Realtime LLMs
Instead of separate STT, LLM, and TTS components, you can use a realtime LLM that handles everything:
Gemini Realtime:

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You're a helpful assistant",
    llm=gemini.Realtime(fps=10),  # Process 10 video frames per second
)
```

OpenAI Realtime:

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import openai, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You're a helpful assistant",
    llm=openai.Realtime(fps=1),  # Video optional, can be audio-only
)
```
Realtime LLMs with video can be expensive. Start with low FPS (1-10) and monitor your usage.
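A back-of-the-envelope frame budget helps before running up a bill. The token figure below is a placeholder, not a real provider price; substitute your provider's actual per-frame cost:

```python
def frames_sent(fps: float, minutes: float) -> int:
    """Total video frames streamed to the model over a call."""
    return int(fps * 60 * minutes)


def rough_token_budget(fps: float, minutes: float, tokens_per_frame: int = 250) -> int:
    """Placeholder estimate: tokens_per_frame is NOT a real provider figure."""
    return frames_sent(fps, minutes) * tokens_per_frame
```

At 10 fps, a 30-minute call streams 18,000 frames; at 1 fps, only 1,800, which is why starting low matters.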
Adding Video Understanding
To process video in addition to audio, add a video processor:

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream, ultralytics

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Assistant", id="agent"),
    instructions="You analyze video and provide insights",
    llm=gemini.Realtime(fps=10),
    processors=[
        ultralytics.YOLOPoseProcessor(
            model_path="yolo11n-pose.pt",
            device="cuda",  # or "cpu"
        )
    ],
)
```
The processor runs YOLO pose detection on video frames and provides the results to the LLM automatically.
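What the LLM ultimately receives is a compact summary of each frame's detections rather than raw tensors. A toy reduction of that idea, using made-up detection dicts rather than YOLO's real output format:

```python
from typing import Dict, List


def summarize_detections(detections: List[Dict], min_conf: float = 0.5) -> str:
    """Drop low-confidence detections and describe the rest for the LLM."""
    kept = [d for d in detections if d["confidence"] >= min_conf]
    if not kept:
        return "No people detected."
    return f"{len(kept)} person(s) detected."
```

Filtering by confidence before summarizing keeps noisy detections from reaching the model.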
Common Issues
ModuleNotFoundError: No module named 'vision_agents'
Make sure you installed Vision Agents in the environment you are running from, e.g. `pip install "vision-agents[getstream, gemini, deepgram, elevenlabs]"`.
Missing or invalid API keys

Check that your .env file is in the same directory as your script and contains the correct keys. Load it with:

```python
from dotenv import load_dotenv

load_dotenv()
```
Connection timeout or network errors
Try different TTS providers or models. ElevenLabs' `eleven_flash_v2_5` offers good quality with low latency.
Next Steps
- Voice Agents Guide - Learn advanced voice agent patterns and optimizations
- Video Agents Guide - Build agents that understand and react to video
- Integrations - Explore all 37+ LLM, STT, TTS, and vision integrations
- Examples - See complete example applications