Language Models
Connect to leading LLM providers for natural language understanding and generation:
Gemini
Google’s multimodal AI with vision, speech, and text capabilities
OpenAI
GPT models for chat, vision, and realtime interactions
Anthropic
Claude models for advanced reasoning and conversation
AWS Bedrock
Access to Claude, Qwen, and Nova models on AWS
Speech & Audio
Integrate speech-to-text and text-to-speech capabilities:
Deepgram
High-quality STT and TTS with Flux and Aura models
ElevenLabs
Premium text-to-speech with natural voices
Twilio
Phone call integration with media streaming
Computer Vision
Add object detection and visual understanding:
Ultralytics
YOLO-based pose detection and tracking
Roboflow
Cloud and local object detection with RF-DETR
Moondream
Zero-shot detection and visual question answering
Data & Storage
Enhance your agents with retrieval and memory:
TurboPuffer
Hybrid vector + BM25 search for RAG
All Available Integrations
Vision Agents includes 30+ integrations. Below is the complete list:
Language Model Providers
- Gemini - Google’s multimodal AI with realtime capabilities
- OpenAI - GPT models with realtime video support
- Anthropic - Claude models for advanced reasoning
- AWS Bedrock - Access to Claude, Qwen, and Nova Sonic models
- xAI - Grok models with advanced reasoning
- Hugging Face - Open-source models via Cerebras, Together, Groq
- OpenRouter - Unified API for multiple providers
- Qwen - Alibaba’s realtime audio models
- Mistral - Voxtral for real-time transcription
Speech-to-Text (STT)
- Deepgram - Nova 3 with speaker diarization
- Fast-Whisper - High-performance Whisper with CTranslate2
- Fish Audio - STT with automatic language detection
- Wizper - Real-time translation with Whisper v3
Text-to-Speech (TTS)
- ElevenLabs - Premium voices with emotional expression
- Deepgram - Aura TTS models
- Cartesia - Sonic 3 for realistic voice synthesis
- AWS Polly - Natural-sounding voices with neural engine
- Fish Audio - Voice cloning capabilities
- Inworld - High-quality streaming voices
- Kokoro - Local offline TTS engine
Computer Vision
- Ultralytics - YOLO pose detection and tracking
- Roboflow - Cloud and local object detection (RF-DETR)
- Moondream - Zero-shot detection, caption, and VQA
- NVIDIA Cosmos 2 - Video understanding with frame buffering
- Decart - Real-time AI video transformation (Mirage 2)
Turn Detection
- Smart Turn - Silero VAD + Whisper + neural models
- Vogent - Neural turn detection system
Specialized Services
- Twilio - Phone call integration with media streaming
- HeyGen - Real-time interactive avatars
- TurboPuffer - Hybrid vector + BM25 search for RAG
- GetStream - Ultra-low latency edge network
The integrations highlighted above have detailed documentation pages. All other integrations follow similar patterns; to use one, install the corresponding plugin package.
Installation
Most integrations can be installed with the Vision Agents extras syntax:
Environment Variables
Each integration requires API credentials. Create a .env file:
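A .env file conventionally holds one KEY=value pair per line. The exact variable names each integration expects are not listed here, so the names in the sketch below (EXAMPLE_LLM_API_KEY, EXAMPLE_STT_API_KEY) are hypothetical placeholders. It shows one way to read such a file with the standard library and check for missing credentials before starting an agent; it is an illustration, not how Vision Agents itself loads credentials.

```python
import os
from pathlib import Path


def parse_env_file(path):
    """Parse KEY=value lines from a .env-style file into a dict."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without a KEY=value shape.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key] = value
    return values


def missing_keys(values, required):
    """Return the required keys that are absent or empty, in order."""
    return [key for key in required if not values.get(key)]


def load_into_environment(values):
    """Copy parsed values into os.environ without overwriting existing ones."""
    for key, value in values.items():
        os.environ.setdefault(key, value)


if __name__ == "__main__":
    # Hypothetical variable names; substitute the ones your integrations document.
    required = ["EXAMPLE_LLM_API_KEY", "EXAMPLE_STT_API_KEY"]
    env_path = Path(".env")
    env = parse_env_file(env_path) if env_path.exists() else {}
    absent = missing_keys(env, required)
    if absent:
        print("Missing credentials:", ", ".join(absent))
    else:
        load_into_environment(env)
        print("All required credentials loaded.")
```

Values already set in the process environment take precedence over the file, so deployment secrets are not overwritten by a local .env. In practice, a library such as python-dotenv is commonly used in place of a hand-written parser.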