Build Real-Time Vision AI Agents

Vision Agents gives you the building blocks to create intelligent, low-latency video experiences powered by your models, your infrastructure, and your use cases.

Multi-modal AI agents that watch, listen, and understand video

Combine YOLO, Roboflow, and other vision models with Gemini, OpenAI, and Claude in real time to build the next generation of AI applications.

Quick Start

Get your first agent running in minutes with a complete example

Installation

Install Vision Agents with uv and optional plugin integrations

Voice Agents

Build conversational AI with speech-to-text and text-to-speech

Video Agents

Create vision AI that processes and understands video in real-time

Key Highlights

Video AI

Built for real-time video AI. Combine YOLO, Roboflow, and other vision models with Gemini or OpenAI in real time.

Low Latency

Calls join in about 500 ms, and audio/video latency stays under 30 ms on Stream’s edge network.

Open Platform

Built by Stream, but works with any video edge network.

Native APIs

Native SDK methods from OpenAI, Gemini, and Claude — always access the latest LLM capabilities.

Core Features

  • True real-time via WebRTC - Stream directly to model providers that support it for instant visual understanding
  • Interval/processor pipeline - For providers without WebRTC, process frames with pluggable video processors (e.g., YOLO, Roboflow, or custom PyTorch/ONNX) before or after model calls
  • Turn detection & diarization - Keep conversations natural; know when the agent should speak or stay quiet, and who’s talking
  • Voice activity detection (VAD) - Trigger actions intelligently and use resources efficiently
  • Speech↔Text↔Speech - Enable low-latency loops for smooth, conversational voice UX
  • Tool/function calling - Execute arbitrary code and APIs mid-conversation: create Linear issues, query weather, trigger telephony, or hit internal services
  • Built-in memory via Stream Chat - Agents recall context naturally across turns and sessions
  • Text back-channel - Message the agent silently during a call
  • Phone and RAG - Interact with the agent via inbound or outbound phone calls using Twilio, with retrieval-augmented generation via Turbopuffer
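Tool/function calling boils down to exposing plain Python callables to the LLM. Here is a minimal sketch of such a tool; the create_linear_issue helper is purely illustrative, and the registration hook shown in the comment is an assumption, not the library’s confirmed API:

```python
# Hypothetical tool: a plain Python callable with typed parameters and a
# docstring, the kind of function an agent can invoke mid-conversation.
def create_linear_issue(title: str, description: str = "") -> dict:
    """Create a Linear issue and return its metadata."""
    # Illustrative stub: a real tool would call the Linear API here.
    return {"title": title, "description": description, "status": "created"}

# Registering the tool with an agent might look like this (assumed API):
# agent.llm.register_function(create_linear_issue)
```

The agent decides when to call the tool based on the conversation; your code only supplies the callable and its signature.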

Quick Example

Here’s a simple voice AI agent that can have conversations and call functions:
from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, gemini, elevenlabs, deepgram

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful voice AI assistant. Keep responses short and conversational.",
    llm=gemini.LLM(model="gemini-3-flash-preview"),
    tts=elevenlabs.TTS(model_id="eleven_flash_v2_5"),
    stt=deepgram.STT(eager_turn_detection=True),
)
This example uses separate STT, LLM, and TTS components. You can simplify this by using realtime LLMs like gemini.Realtime() or openai.Realtime() that handle speech natively.
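For comparison, here is a minimal sketch of the realtime variant mentioned above, with gemini.Realtime() replacing the separate STT, LLM, and TTS components. The constructor arguments mirror the example; the imports are deferred into the builder so the sketch stays self-contained, and the exact Realtime() signature is an assumption:

```python
# A sketch of the simplified agent using a realtime, speech-native LLM.
def build_realtime_agent():
    from vision_agents.core import Agent, User
    from vision_agents.plugins import getstream, gemini

    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="AI Assistant", id="agent"),
        instructions="You're a helpful voice AI assistant.",
        llm=gemini.Realtime(),  # handles speech natively: no stt/tts needed
    )
```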

Use Cases

Vision Agents is perfect for building:
  • Sports Coaching - Analyze golf swings, basketball shots, or yoga poses with YOLO and provide real-time feedback
  • Security Systems - Detect faces, track packages, and respond to theft with automated alerts
  • Healthcare - Monitor physical therapy sessions, track patient movements, or provide workout guidance
  • Education - Interactive tutoring with visual understanding and voice interaction
  • Customer Support - Video-enabled support agents that can see and help with user issues
  • Gaming - Just Dance-style games or interactive experiences with pose detection

Getting Started

1. Install Vision Agents

Install the core package with uv:
uv add vision-agents

2. Add Integrations

Install the plugins you need:
uv add "vision-agents[getstream,openai,elevenlabs,deepgram]"

3. Get API Keys

Sign up for Stream to get free API credentials. Developers receive 333,000 participant minutes per month, plus extra credits via the Maker Program.

4. Build Your First Agent

Follow the quickstart guide to create your first voice or video AI agent.
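Provider credentials are typically supplied via environment variables. The variable names below are assumptions for illustration; verify the exact names in each plugin’s documentation:

```shell
# Assumed environment variable names; check each provider's plugin docs.
export STREAM_API_KEY="..."
export STREAM_API_SECRET="..."
export GOOGLE_API_KEY="..."        # Gemini
export ELEVENLABS_API_KEY="..."
export DEEPGRAM_API_KEY="..."
```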

Available Integrations

Vision Agents comes with 35+ out-of-the-box integrations:
  • OpenAI - Realtime API with video support, LLMs, and TTS
  • Gemini - Realtime API, Gemini Live, and VLM interface
  • Anthropic Claude - Advanced reasoning with vision capabilities
  • AWS Bedrock - Amazon Nova models with realtime speech-to-speech
  • Qwen - Alibaba’s Qwen3 with native audio output
  • xAI Grok - Advanced reasoning and real-time knowledge
  • OpenRouter - Access multiple providers through unified API
  • Hugging Face - Open-source models via Cerebras, Together, Groq
  • Deepgram - Fast, accurate transcription with speaker diarization
  • Fast-Whisper - OpenAI’s Whisper with CTranslate2 acceleration
  • Fish Audio - Automatic language detection
  • Wizper - Real-time translation with Whisper v3
  • Mistral Voxtral - Real-time transcription with diarization
  • ElevenLabs - Highly realistic and expressive voices
  • Cartesia - Realistic voice synthesis for real-time apps
  • AWS Polly - Natural-sounding voices with neural engine
  • Fish Audio - Voice cloning capabilities
  • Inworld - High-quality streaming voices
  • Kokoro - Local TTS for offline synthesis
  • Ultralytics YOLO - Real-time pose and object detection
  • Roboflow - Object detection with hosted API or local models
  • Moondream - Lightweight VLM for detection, caption, and VQA
  • NVIDIA Cosmos 2 - Video understanding with frame buffering
  • Decart - Real-time video transformation and styling
  • HeyGen - Real-time interactive avatars
  • Twilio - Voice call integration with Media Streams
  • TurboPuffer - RAG with hybrid search (vector + BM25)
  • Smart Turn - Advanced turn detection with Silero VAD
  • Vogent - Neural turn detection for conversations
Explore all integrations in the Integrations section.

Next Steps

Quickstart Tutorial

Build your first agent in 5 minutes

Core Concepts

Understand agents, processors, and the architecture

Browse Examples

Explore real-world examples and use cases

API Reference

Dive into the complete API documentation

Community & Support

Join our community to get help, share ideas, and stay updated.
Vision AI Limitations

Video AI is at the frontier of AI. Keep these limitations in mind:
  • Struggles with small text (e.g., reading game scores)
  • Can lose context with longer videos (30+ seconds)
  • Most applications need a combination of specialized models (YOLO, Roboflow) and larger models (Gemini, OpenAI)
  • Image size and FPS must stay relatively low for performance
  • Video doesn’t trigger responses in realtime models; you need audio or text input
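The FPS constraint above can be enforced with a simple throttle that drops frames before they reach a model. This is a framework-agnostic sketch of the idea, not Vision Agents’ own pipeline, which may handle rate limiting differently:

```python
import time


class FrameThrottle:
    """Let at most `max_fps` frames per second through to model inference."""

    def __init__(self, max_fps: float, clock=time.monotonic):
        self.min_interval = 1.0 / max_fps  # minimum seconds between frames
        self.clock = clock                 # injectable for testing
        self._last = float("-inf")

    def should_process(self) -> bool:
        """Return True if enough time has passed to process this frame."""
        now = self.clock()
        if now - self._last >= self.min_interval:
            self._last = now
            return True
        return False
```

Call should_process() for each incoming frame and skip inference when it returns False; pairing a throttle like this with a resize step keeps both FPS and image size within budget.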
