Build Real-Time Vision AI Agents

Vision Agents gives you the building blocks to create intelligent, low-latency video experiences powered by your models, your infrastructure, and your use cases.

Multi-modal AI agents that watch, listen, and understand video

Combine YOLO, Roboflow, and other vision models with Gemini, OpenAI, and Claude in real time to build the next generation of AI applications.

Quick Start

Get your first agent running in minutes with a complete example

Installation

Install Vision Agents with uv and optional plugin integrations

Voice Agents

Build conversational AI with speech-to-text and text-to-speech

Video Agents

Create vision AI that processes and understands video in real-time

Key Highlights

Video AI

Built for real-time video AI. Combine YOLO, Roboflow, and other vision models with Gemini or OpenAI in real time.

Low Latency

Calls join in about 500 ms, and audio/video latency stays under 30 ms on Stream’s edge network.

Open Platform

Built by Stream, but works with any video edge network.

Native APIs

Native SDK methods from OpenAI, Gemini, and Claude — always access the latest LLM capabilities.

Core Features

  • True real-time via WebRTC - Stream directly to model providers that support it for instant visual understanding
  • Interval/processor pipeline - For providers without WebRTC, process frames with pluggable video processors (e.g., YOLO, Roboflow, or custom PyTorch/ONNX) before or after model calls
  • Turn detection & diarization - Keep conversations natural; know when the agent should speak or stay quiet, and who’s talking
  • Voice activity detection (VAD) - Trigger actions intelligently and use resources efficiently
  • Speech↔Text↔Speech - Enable low-latency loops for smooth, conversational voice UX
  • Tool/function calling - Execute arbitrary code and APIs mid-conversation: create Linear issues, query weather, trigger telephony, or hit internal services
  • Built-in memory via Stream Chat - Agents recall context naturally across turns and sessions
  • Text back-channel - Message the agent silently during a call
  • Phone and RAG - Interact with the agent via inbound or outbound phone calls using Twilio, with retrieval-augmented generation via Turbopuffer
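Tool/function calling boils down to exposing plain Python callables to the LLM. Here is a minimal sketch of such a tool; the create_linear_issue helper is purely illustrative, and the registration hook shown in the comment is an assumption, not the library’s confirmed API:

```python
# Hypothetical tool: a plain Python callable with typed parameters and a
# docstring, the kind of function an agent can invoke mid-conversation.
def create_linear_issue(title: str, description: str = "") -> dict:
    """Create a Linear issue and return its metadata."""
    # Illustrative stub: a real tool would call the Linear API here.
    return {"title": title, "description": description, "status": "created"}

# Registering the tool with an agent might look like this (assumed API):
# agent.llm.register_function(create_linear_issue)
```

The agent decides when to call the tool based on the conversation; your code only supplies the callable and its signature.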

Quick Example

Here’s a simple voice AI agent that can have conversations and call functions:
from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, gemini, elevenlabs, deepgram

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful voice AI assistant. Keep responses short and conversational.",
    llm=gemini.LLM(model="gemini-3-flash-preview"),
    tts=elevenlabs.TTS(model_id="eleven_flash_v2_5"),
    stt=deepgram.STT(eager_turn_detection=True),
)
This example uses separate STT, LLM, and TTS components. You can simplify this by using realtime LLMs like gemini.Realtime() or openai.Realtime() that handle speech natively.
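For comparison, here is a minimal sketch of the realtime variant mentioned above, with gemini.Realtime() replacing the separate STT, LLM, and TTS components. The constructor arguments mirror the example; the imports are deferred into the builder so the sketch stays self-contained, and the exact Realtime() signature is an assumption:

```python
# A sketch of the simplified agent using a realtime, speech-native LLM.
def build_realtime_agent():
    from vision_agents.core import Agent, User
    from vision_agents.plugins import getstream, gemini

    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="AI Assistant", id="agent"),
        instructions="You're a helpful voice AI assistant.",
        llm=gemini.Realtime(),  # handles speech natively: no stt/tts needed
    )
```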

Use Cases

Vision Agents is perfect for building:
  • Sports Coaching - Analyze golf swings, basketball shots, or yoga poses with YOLO and provide real-time feedback
  • Security Systems - Detect faces, track packages, and respond to theft with automated alerts
  • Healthcare - Monitor physical therapy sessions, track patient movements, or provide workout guidance
  • Education - Interactive tutoring with visual understanding and voice interaction
  • Customer Support - Video-enabled support agents that can see and help with user issues
  • Gaming - Just Dance-style games or interactive experiences with pose detection

Getting Started

1. Install Vision Agents

Install the core package with uv:
uv add vision-agents

2. Add Integrations

Install the plugins you need:
uv add "vision-agents[getstream,openai,elevenlabs,deepgram]"

3. Get API Keys

Sign up for Stream to get free API credentials. Developers receive 333,000 participant minutes per month, plus extra credits via the Maker Program.

4. Build Your First Agent

Follow the quickstart guide to create your first voice or video AI agent.
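Provider credentials are typically supplied via environment variables. The variable names below are assumptions for illustration; verify the exact names in each plugin’s documentation:

```shell
# Assumed environment variable names; check each provider's plugin docs.
export STREAM_API_KEY="..."
export STREAM_API_SECRET="..."
export GOOGLE_API_KEY="..."        # Gemini
export ELEVENLABS_API_KEY="..."
export DEEPGRAM_API_KEY="..."
```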

Available Integrations

Vision Agents comes with 35+ out-of-the-box integrations:
  • OpenAI - Realtime API with video support, LLMs, and TTS
  • Gemini - Realtime API, Gemini Live, and VLM interface
  • Anthropic Claude - Advanced reasoning with vision capabilities
  • AWS Bedrock - Amazon Nova models with realtime speech-to-speech
  • Qwen - Alibaba’s Qwen3 with native audio output
  • xAI Grok - Advanced reasoning and real-time knowledge
  • OpenRouter - Access multiple providers through unified API
  • Hugging Face - Open-source models via Cerebras, Together, Groq
  • Deepgram - Fast, accurate transcription with speaker diarization
  • Fast-Whisper - OpenAI’s Whisper with CTranslate2 acceleration
  • Fish Audio - Automatic language detection
  • Wizper - Real-time translation with Whisper v3
  • Mistral Voxtral - Real-time transcription with diarization
  • ElevenLabs - Highly realistic and expressive voices
  • Cartesia - Realistic voice synthesis for real-time apps
  • AWS Polly - Natural-sounding voices with neural engine
  • Fish Audio - Voice cloning capabilities
  • Inworld - High-quality streaming voices
  • Kokoro - Local TTS for offline synthesis
  • Ultralytics YOLO - Real-time pose and object detection
  • Roboflow - Object detection with hosted API or local models
  • Moondream - Lightweight VLM for detection, caption, and VQA
  • NVIDIA Cosmos 2 - Video understanding with frame buffering
  • Decart - Real-time video transformation and styling
  • HeyGen - Real-time interactive avatars
  • Twilio - Voice call integration with Media Streams
  • TurboPuffer - RAG with hybrid search (vector + BM25)
  • Smart Turn - Advanced turn detection with Silero VAD
  • Vogent - Neural turn detection for conversations
Explore all integrations in the Integrations section.

Next Steps

Quickstart Tutorial

Build your first agent in 5 minutes

Core Concepts

Understand agents, processors, and the architecture

Browse Examples

Explore real-world examples and use cases

API Reference

Dive into the complete API documentation

Community & Support

Join our community to get help, share ideas, and stay updated.
Vision AI Limitations

Video AI is at the frontier of AI. Keep these limitations in mind:
  • Struggles with small text (e.g., reading game scores)
  • Can lose context with longer videos (30+ seconds)
  • Most applications need a combination of specialized models (YOLO, Roboflow) and larger models (Gemini, OpenAI)
  • Image size and FPS must stay relatively low for performance
  • Video doesn’t trigger responses in realtime models; you need audio or text input
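The FPS constraint above can be enforced with a simple throttle that drops frames before they reach a model. This is a framework-agnostic sketch of the idea, not Vision Agents’ own pipeline, which may handle rate limiting differently:

```python
import time


class FrameThrottle:
    """Let at most `max_fps` frames per second through to model inference."""

    def __init__(self, max_fps: float, clock=time.monotonic):
        self.min_interval = 1.0 / max_fps  # minimum seconds between frames
        self.clock = clock                 # injectable for testing
        self._last = float("-inf")

    def should_process(self) -> bool:
        """Return True if enough time has passed to process this frame."""
        now = self.clock()
        if now - self._last >= self.min_interval:
            self._last = now
            return True
        return False
```

Call should_process() for each incoming frame and skip inference when it returns False; pairing a throttle like this with a resize step keeps both FPS and image size within budget.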
