Integrations Overview

Vision Agents supports a wide range of integrations to enhance your voice and video AI applications. These integrations are organized into several categories:

Language Models

Connect to leading LLM providers for natural language understanding and generation:

Gemini

Google’s multimodal AI with vision, speech, and text capabilities

OpenAI

GPT models for chat, vision, and realtime interactions

Anthropic

Claude models for advanced reasoning and conversation

AWS Bedrock

Access to Claude, Qwen, and Nova models on AWS

Speech & Audio

Integrate speech-to-text and text-to-speech capabilities:

Deepgram

High-quality STT and TTS with Flux and Aura models

ElevenLabs

Premium text-to-speech with natural voices

Twilio

Phone call integration with media streaming

Computer Vision

Add object detection and visual understanding:

Ultralytics

YOLO-based pose detection and tracking

Roboflow

Cloud and local object detection with RF-DETR

Moondream

Zero-shot detection and visual question answering

Data & Storage

Enhance your agents with retrieval and memory:

TurboPuffer

Hybrid vector + BM25 search for RAG

All Available Integrations

Vision Agents includes 30+ integrations. Below is the complete list:

Language Model Providers

Gemini - Google’s multimodal AI with realtime capabilities
OpenAI - GPT models with realtime video support
Anthropic - Claude models for advanced reasoning
AWS Bedrock - Access to Claude, Qwen, and Nova Sonic models
xAI - Grok models with advanced reasoning
Hugging Face - Open-source models via Cerebras, Together, Groq
OpenRouter - Unified API for multiple providers
Qwen - Alibaba’s realtime audio models
Mistral - Voxtral for real-time transcription

Speech-to-Text (STT)

Deepgram - Nova 3 with speaker diarization
Fast-Whisper - High-performance Whisper with CTranslate2
Fish Audio - STT with automatic language detection
Wizper - Real-time translation with Whisper v3

Text-to-Speech (TTS)

ElevenLabs - Premium voices with emotional expression
Deepgram - Aura TTS models
Cartesia - Sonic 3 for realistic voice synthesis
AWS Polly - Natural-sounding voices with neural engine
Fish Audio - Voice cloning capabilities
Inworld - High-quality streaming voices
Kokoro - Local offline TTS engine

Computer Vision

Ultralytics - YOLO pose detection and tracking
Roboflow - Cloud and local object detection (RF-DETR)
Moondream - Zero-shot detection, caption, and VQA
NVIDIA Cosmos 2 - Video understanding with frame buffering
Decart - Real-time AI video transformation (Mirage 2)

Turn Detection

Smart Turn - Silero VAD + Whisper + neural models
Vogent - Neural turn detection system

Specialized Services

Twilio - Phone call integration with media streaming
HeyGen - Real-time interactive avatars
TurboPuffer - Hybrid vector + BM25 search for RAG
GetStream - Ultra-low latency edge network

The 11 integrations highlighted above have detailed documentation pages. All other integrations follow similar patterns and can be used by installing the corresponding plugin package.

Installation

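Most integrations can be installed with the Vision Agents extras syntax. Quote the argument so that shells such as zsh do not interpret the square brackets as a glob pattern:

uv add "vision-agents[gemini]"
uv add "vision-agents[openai]"
uv add "vision-agents[anthropic]"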

Some plugins have their own packages:

uv add vision-agents-plugins-deepgram
uv add vision-agents-plugins-elevenlabs
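
After installing, it can help to confirm which packages actually landed in your environment. The following is a minimal sketch that uses only the Python standard library. The distribution names are copied from the install commands above, and the helper name `installed_versions` is hypothetical, not part of Vision Agents.

```python
from importlib import metadata

# Distribution names taken from the install commands above.
# Adjust this list to match the plugins you installed.
PACKAGES = [
    "vision-agents",
    "vision-agents-plugins-deepgram",
    "vision-agents-plugins-elevenlabs",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the same interpreter that `uv` manages (for example via `uv run`), so the check looks at the environment your agent will use.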

Environment Variables

Most integrations require API credentials. Create a .env file with the keys for the providers you use:

# Language Models
GOOGLE_API_KEY=your_google_api_key
OPENAI_API_KEY=your_openai_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
AWS_ACCESS_KEY_ID=your_aws_key
AWS_SECRET_ACCESS_KEY=your_aws_secret

# Speech & Audio
DEEPGRAM_API_KEY=your_deepgram_key
ELEVENLABS_API_KEY=your_elevenlabs_key
TWILIO_ACCOUNT_SID=your_twilio_sid
TWILIO_AUTH_TOKEN=your_twilio_token

# Vision
ROBOFLOW_API_KEY=your_roboflow_key
MOONDREAM_API_KEY=your_moondream_key

# Data
TURBO_PUFFER_KEY=your_turbopuffer_key
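
A missing key usually surfaces only when the first request fails, so a quick check at startup can save time. The sketch below is one way to do that with the standard library alone. The variable names come from the .env example above, while the grouping labels (`"gemini"`, `"aws"`, and so on) and the helpers `parse_env_file` and `missing_keys` are hypothetical and not part of Vision Agents. It makes no assumption about how the framework itself loads credentials.

```python
import os
from pathlib import Path

# Variable names copied from the .env example above.
# The grouping labels are illustrative only.
REQUIRED_KEYS = {
    "gemini": ["GOOGLE_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "aws": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "deepgram": ["DEEPGRAM_API_KEY"],
    "elevenlabs": ["ELEVENLABS_API_KEY"],
    "twilio": ["TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN"],
    "roboflow": ["ROBOFLOW_API_KEY"],
    "moondream": ["MOONDREAM_API_KEY"],
    "turbopuffer": ["TURBO_PUFFER_KEY"],
}


def parse_env_file(path):
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(integrations, env):
    """Return {integration: [unset or empty variable names]} for the given labels."""
    missing = {}
    for name in integrations:
        absent = [key for key in REQUIRED_KEYS[name] if not env.get(key)]
        if absent:
            missing[name] = absent
    return missing


if __name__ == "__main__":
    # Values already in the process environment take precedence over the file.
    env = {**parse_env_file(".env"), **os.environ}
    for name, keys in missing_keys(["gemini", "deepgram"], env).items():
        print(f"{name}: missing {', '.join(keys)}")
```

Pass only the integrations your agent uses to `missing_keys`, so that unused providers are not reported as missing.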

Next Steps

Explore the individual integration guides to learn about configuration options, usage examples, and API details.
