Build Real-Time Vision AI Agents
Vision Agents gives you the building blocks to create intelligent, low-latency video experiences powered by your models, your infrastructure, and your use cases.
Multi-modal AI agents that watch, listen, and understand video
Combine YOLO, Roboflow, and other vision models with Gemini, OpenAI, and Claude in real time to build the next generation of AI applications.
Quick Start
Get your first agent running in minutes with a complete example
Installation
Install Vision Agents with uv and optional plugin integrations
Voice Agents
Build conversational AI with speech-to-text and text-to-speech
Video Agents
Create vision AI that processes and understands video in real-time
Key Highlights
Video AI
Built for real-time video AI. Combine YOLO, Roboflow, and other vision models with Gemini or OpenAI in real time.
Low Latency
Calls join in about 500 ms, and audio/video latency stays under 30 ms over Stream’s edge network.
Open Platform
Built by Stream, but works with any video edge network.
Native APIs
Native SDK methods from OpenAI, Gemini, and Claude — always access the latest LLM capabilities.
Core Features
| Feature | Description |
|---|---|
| True real-time via WebRTC | Stream directly to model providers that support it for instant visual understanding |
| Interval/processor pipeline | For providers without WebRTC, process frames with pluggable video processors (e.g., YOLO, Roboflow, or custom PyTorch/ONNX) before/after model calls |
| Turn detection & diarization | Keep conversations natural; know when the agent should speak or stay quiet and who’s talking |
| Voice activity detection (VAD) | Trigger actions intelligently and use resources efficiently |
| Speech↔Text↔Speech | Enable low-latency loops for smooth, conversational voice UX |
| Tool/function calling | Execute arbitrary code and APIs mid-conversation. Create Linear issues, query weather, trigger telephony, or hit internal services |
| Built-in memory via Stream Chat | Agents recall context naturally across turns and sessions |
| Text back-channel | Message the agent silently during a call |
| Phone and RAG | Take inbound or outbound phone calls via Twilio and ground responses with TurboPuffer-backed retrieval |
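To make the interval/processor row above concrete, here is a minimal sketch of the pattern in plain Python: sample every N-th frame and run it through a chain of pluggable processors before a model call. The names (`Frame`, `run_interval_pipeline`, `detect_objects`) are illustrative assumptions, not the Vision Agents API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A "frame" here is just bytes plus metadata; a real pipeline carries video frames.
@dataclass
class Frame:
    index: int
    data: bytes
    annotations: dict = field(default_factory=dict)

# A processor is any callable that enriches a frame (e.g. a YOLO or Roboflow step).
Processor = Callable[[Frame], Frame]

def detect_objects(frame: Frame) -> Frame:
    # Stand-in for a real detector call; tags even-indexed frames with a fake box.
    frame.annotations["objects"] = ["person"] if frame.index % 2 == 0 else []
    return frame

def run_interval_pipeline(frames: List[Frame],
                          processors: List[Processor],
                          interval: int = 3) -> List[Frame]:
    """Keep every `interval`-th frame and pass it through each processor in order."""
    sampled = []
    for frame in frames:
        if frame.index % interval != 0:
            continue
        for proc in processors:
            frame = proc(frame)
        sampled.append(frame)
    return sampled

frames = [Frame(i, b"") for i in range(10)]
out = run_interval_pipeline(frames, [detect_objects], interval=3)
print([f.index for f in out])  # → [0, 3, 6, 9]
```

The same shape lets you swap in a custom PyTorch/ONNX processor without touching the sampling loop.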
Quick Example
Here’s a simple voice AI agent that can have conversations and call functions. This example uses separate STT, LLM, and TTS components; you can simplify it by using realtime LLMs like gemini.Realtime() or openai.Realtime(), which handle speech natively.
Use Cases
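A hedged sketch of what such an agent might look like, pieced together from the plugin names on this page; the module paths, class names, and constructor arguments are assumptions, so follow the quickstart for the actual API:

```python
# Illustrative sketch only: names below are assumptions, not the verified API.
from vision_agents.core import Agent, User
from vision_agents.plugins import deepgram, elevenlabs, getstream, openai

agent = Agent(
    edge=getstream.Edge(),               # Stream's edge network for transport
    agent_user=User(name="Assistant"),
    instructions="You are a friendly voice assistant.",
    stt=deepgram.STT(),                  # speech-to-text
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=elevenlabs.TTS(),                # text-to-speech
)
```

Swapping the separate STT/LLM/TTS trio for a single realtime model (e.g. `llm=openai.Realtime()`) collapses the loop into one speech-native component.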
Vision Agents is perfect for building:
- Sports Coaching - Analyze golf swings, basketball shots, or yoga poses with YOLO and provide real-time feedback
- Security Systems - Detect faces, track packages, and respond to theft with automated alerts
- Healthcare - Monitor physical therapy sessions, track patient movements, or provide workout guidance
- Education - Interactive tutoring with visual understanding and voice interaction
- Customer Support - Video-enabled support agents that can see and help with user issues
- Gaming - Just Dance-style games or interactive experiences with pose detection
Getting Started
Get API Keys
Sign up for Stream to get free API credentials. Developers receive 333,000 participant minutes per month, plus extra credits via the Maker Program.
Build Your First Agent
Follow the quickstart guide to create your first voice or video AI agent.
Available Integrations
Vision Agents comes with 35+ out-of-the-box integrations:
LLM Providers
- OpenAI - Realtime API with video support, LLMs, and TTS
- Gemini - Realtime API, Gemini Live, and VLM interface
- Anthropic Claude - Advanced reasoning with vision capabilities
- AWS Bedrock - Amazon Nova models with realtime speech-to-speech
- Qwen - Alibaba’s Qwen3 with native audio output
- xAI Grok - Advanced reasoning and real-time knowledge
- OpenRouter - Access multiple providers through unified API
- Hugging Face - Open-source models via Cerebras, Together, Groq
Speech-to-Text
- Deepgram - Fast, accurate transcription with speaker diarization
- Fast-Whisper - OpenAI’s Whisper with CTranslate2 acceleration
- Fish Audio - Automatic language detection
- Wizper - Real-time translation with Whisper v3
- Mistral Voxtral - Real-time transcription with diarization
Text-to-Speech
- ElevenLabs - Highly realistic and expressive voices
- Cartesia - Realistic voice synthesis for real-time apps
- AWS Polly - Natural-sounding voices with neural engine
- Fish Audio - Voice cloning capabilities
- Inworld - High-quality streaming voices
- Kokoro - Local TTS for offline synthesis
Vision & Video Processing
- Ultralytics YOLO - Real-time pose and object detection
- Roboflow - Object detection with hosted API or local models
- Moondream - Lightweight VLM for detection, caption, and VQA
- NVIDIA Cosmos 2 - Video understanding with frame buffering
- Decart - Real-time video transformation and styling
Specialized Services
- HeyGen - Real-time interactive avatars
- Twilio - Voice call integration with Media Streams
- TurboPuffer - RAG with hybrid search (vector + BM25)
- Smart Turn - Advanced turn detection with Silero VAD
- Vogent - Neural turn detection for conversations
Explore all integrations in the Integrations section.
Next Steps
Quickstart Tutorial
Build your first agent in 5 minutes
Core Concepts
Understand agents, processors, and the architecture
Browse Examples
Explore real-world examples and use cases
API Reference
Dive into the complete API documentation
Community & Support
Join our community to get help, share ideas, and stay updated:
- Discord - Join the conversation
- GitHub - Star the repo and contribute
- Documentation - Browse guides and tutorials at visionagents.ai