Vision Agents supports a wide range of integrations to enhance your voice and video AI applications. These integrations are organized into several categories:

Language Models

Connect to leading LLM providers for natural language understanding and generation:

  • Gemini - Google’s multimodal AI with vision, speech, and text capabilities
  • OpenAI - GPT models for chat, vision, and realtime interactions
  • Anthropic - Claude models for advanced reasoning and conversation
  • AWS Bedrock - Access to Claude, Qwen, and Nova models on AWS

Speech & Audio

Integrate speech-to-text and text-to-speech capabilities:

  • Deepgram - High-quality STT and TTS with Flux and Aura models
  • ElevenLabs - Premium text-to-speech with natural voices
  • Twilio - Phone call integration with media streaming

Computer Vision

Add object detection and visual understanding:

  • Ultralytics - YOLO-based pose detection and tracking
  • Roboflow - Cloud and local object detection with RF-DETR
  • Moondream - Zero-shot detection and visual question answering

Data & Storage

Enhance your agents with retrieval and memory:

  • TurboPuffer - Hybrid vector + BM25 search for RAG

All Available Integrations

Vision Agents includes 30+ integrations. Below is the complete list:

Language Model Providers

  • Gemini - Google’s multimodal AI with realtime capabilities
  • OpenAI - GPT models with realtime video support
  • Anthropic - Claude models for advanced reasoning
  • AWS Bedrock - Access to Claude, Qwen, and Nova Sonic models
  • xAI - Grok models with advanced reasoning
  • Hugging Face - Open-source models via Cerebras, Together, Groq
  • OpenRouter - Unified API for multiple providers
  • Qwen - Alibaba’s realtime audio models
  • Mistral - Voxtral for real-time transcription

Speech-to-Text (STT)

  • Deepgram - Nova 3 with speaker diarization
  • Fast-Whisper - High-performance Whisper with CTranslate2
  • Fish Audio - STT with automatic language detection
  • Wizper - Real-time translation with Whisper v3

Text-to-Speech (TTS)

  • ElevenLabs - Premium voices with emotional expression
  • Deepgram - Aura TTS models
  • Cartesia - Sonic 3 for realistic voice synthesis
  • AWS Polly - Natural-sounding voices with neural engine
  • Fish Audio - Voice cloning capabilities
  • Inworld - High-quality streaming voices
  • Kokoro - Local offline TTS engine

Computer Vision

  • Ultralytics - YOLO pose detection and tracking
  • Roboflow - Cloud and local object detection (RF-DETR)
  • Moondream - Zero-shot detection, caption, and VQA
  • NVIDIA Cosmos 2 - Video understanding with frame buffering
  • Decart - Real-time AI video transformation (Mirage 2)

Turn Detection

  • Smart Turn - Silero VAD + Whisper + neural models
  • Vogent - Neural turn detection system

Specialized Services

  • Twilio - Phone call integration with media streaming
  • HeyGen - Real-time interactive avatars
  • TurboPuffer - Hybrid vector + BM25 search for RAG
  • GetStream - Ultra-low latency edge network

The 11 integrations highlighted above have detailed documentation pages. All other integrations follow similar patterns and can be used by installing the corresponding plugin package.

Installation

Most integrations can be installed with the Vision Agents extras syntax. Quote the brackets so your shell (e.g. zsh) does not try to expand them:

uv add "vision-agents[gemini]"
uv add "vision-agents[openai]"
uv add "vision-agents[anthropic]"

Some plugins ship as their own packages:

uv add vision-agents-plugins-deepgram
uv add vision-agents-plugins-elevenlabs
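Because plugins are optional extras, it can help to verify at startup that the ones you need are actually installed. A minimal stdlib-only sketch; the module names in the loop are placeholders, so check each plugin's documentation for its real import path:

```python
import importlib.util


def plugin_available(module_name: str) -> bool:
    """Return True if the named module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Placeholder names -- substitute the import paths your installed plugins use.
for name in ("openai", "deepgram", "elevenlabs"):
    status = "installed" if plugin_available(name) else "missing"
    print(f"{name}: {status}")
```

Checking with `find_spec` avoids importing heavy plugin modules just to see whether they exist.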

Environment Variables

Most integrations require API credentials (local engines such as Kokoro do not). Create a .env file in your project root:
# Language Models
GOOGLE_API_KEY=your_google_api_key
OPENAI_API_KEY=your_openai_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
AWS_ACCESS_KEY_ID=your_aws_key
AWS_SECRET_ACCESS_KEY=your_aws_secret

# Speech & Audio
DEEPGRAM_API_KEY=your_deepgram_key
ELEVENLABS_API_KEY=your_elevenlabs_key
TWILIO_ACCOUNT_SID=your_twilio_sid
TWILIO_AUTH_TOKEN=your_twilio_token

# Vision
ROBOFLOW_API_KEY=your_roboflow_key
MOONDREAM_API_KEY=your_moondream_key

# Data
TURBO_PUFFER_KEY=your_turbopuffer_key
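In practice you would typically load this file with the python-dotenv package, but the idea can be sketched with a minimal stdlib-only loader plus a startup check for missing keys. The `required` list below is illustrative; include only the keys for the integrations you actually use:

```python
import os


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Minimal .env loader: parses KEY=value lines, skipping blanks and # comments."""
    loaded: dict[str, str] = {}
    try:
        with open(path) as f:
            for raw in f:
                line = raw.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                loaded[key.strip()] = value.strip()
    except FileNotFoundError:
        pass  # no .env file is fine if the variables are set another way
    os.environ.update(loaded)
    return loaded


def missing_keys(required: list[str]) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [k for k in required if not os.environ.get(k)]


load_env_file()
# Illustrative check -- fail fast if a credential your agent needs is absent.
print(missing_keys(["OPENAI_API_KEY", "DEEPGRAM_API_KEY"]))
```

Failing fast on missing credentials at startup gives a clearer error than a mid-call authentication failure from a provider SDK.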

Next Steps

Explore the individual integration guides to learn about configuration options, usage examples, and API details.
