The `Agent` class is the heart of Vision Agents. It orchestrates all the components needed to build real-time multimodal AI applications, coordinating edge networks, LLMs, processors, and conversational flow.
## Overview
An Agent manages the complete lifecycle of a real-time AI interaction:

- Joins video/audio calls via edge networks
- Routes audio through STT → LLM → TTS pipeline (or uses realtime LLMs)
- Processes video streams with custom processors
- Handles turn detection and conversational flow
- Manages function calling and tool execution
- Maintains conversation history and chat integration
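The STT → LLM → TTS flow listed above can be sketched as a toy pipeline. All names here are illustrative stand-ins, not the real Vision Agents API:

```python
import asyncio

# Toy stand-ins for the real STT/LLM/TTS components (illustrative only).
async def stt(audio: bytes) -> str:
    return audio.decode()            # pretend transcription

async def llm(transcript: str) -> str:
    return f"Echo: {transcript}"     # pretend completion

async def tts(text: str) -> bytes:
    return text.encode()             # pretend synthesis

async def interval_pipeline(audio: bytes) -> bytes:
    """One conversational turn: audio in -> audio out."""
    transcript = await stt(audio)
    reply = await llm(transcript)
    return await tts(reply)

print(asyncio.run(interval_pipeline(b"hello")))  # b'Echo: hello'
```

Realtime LLMs collapse all three stages into one model call, which is why they skip the separate STT/TTS components entirely.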
## Architecture
## Basic Usage
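The original usage snippet was not preserved here, so the following is a minimal, self-contained sketch of the pattern using stand-in classes (the constructor fields mirror the parameter table later in this page; nothing below is the real `vision_agents` import surface):

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass

log = []  # records lifecycle events so the flow is visible

@dataclass
class Agent:
    """Stand-in for the real Agent; fields mirror the documented constructor."""
    edge: object
    llm: object
    agent_user: str
    instructions: str
    stt: object = None   # not needed for realtime LLMs
    tts: object = None

    @asynccontextmanager
    async def join(self, call: str):
        log.append(f"joined {call}")
        try:
            yield self
        finally:
            log.append("cleaned up")  # cleanup runs on context exit

async def main():
    agent = Agent(edge="edge", llm="realtime-llm",
                  agent_user="ai-agent",
                  instructions="You are a helpful voice assistant.")
    async with agent.join("call-123"):
        log.append("handling audio")  # agent runs until the call ends

asyncio.run(main())
print(log)  # ['joined call-123', 'handling audio', 'cleaned up']
```

The key shape to take away: construction is plain keyword arguments, and joining a call is an `async with` block that guarantees cleanup.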
### Realtime Mode
With realtime LLMs (such as Gemini Realtime), the agent handles audio directly, without separate STT/TTS.

### Interval Mode
For traditional LLMs, you provide separate STT, TTS, and turn detection components.

## Key Components
### Edge Network
Handles real-time audio/video transport. See Edge Networks.

### LLM Integration
The brain of your agent. It can be:

- Standard LLM: Requires STT/TTS (OpenAI, Anthropic, Google)
- Audio LLM: Processes audio directly (OpenAI Realtime)
- Video LLM: Can analyze video streams (Gemini)
- Omni LLM: Handles both audio and video (Gemini Realtime)
### Processors
Extend agent capabilities with custom processing. See Processors.

### MCP Servers
Provide external tool access via the Model Context Protocol.

## Agent Lifecycle
### Initialization
When you create an `Agent`, it:
- Validates configuration (ensures realtime LLMs don’t have STT/TTS)
- Sets up event managers and merges events from all plugins
- Prepares RTC tracks for audio/video
- Attaches processors to the agent
- Initializes the MCP manager if servers are provided
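The first step above, configuration validation, can be illustrated with a toy check. The names here are hypothetical; the real check lives inside `agents.py`:

```python
class AgentConfigError(ValueError):
    """Raised for invalid Agent configurations (illustrative)."""

def validate_config(llm_is_realtime: bool, stt, tts) -> None:
    # Realtime LLMs consume and produce audio themselves, so separate
    # STT/TTS stages would conflict with them.
    if llm_is_realtime and (stt is not None or tts is not None):
        raise AgentConfigError("Realtime LLMs must not be combined with STT/TTS")

validate_config(llm_is_realtime=False, stt="deepgram", tts="elevenlabs")  # ok
try:
    validate_config(llm_is_realtime=True, stt="deepgram", tts=None)
except AgentConfigError as e:
    print(e)  # Realtime LLMs must not be combined with STT/TTS
```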
Don’t reuse agent objects. Create a new `Agent` instance for each call session.

### Joining a Call
The `join()` method is an async context manager that performs the following steps:
- Starts tracing for observability
- Connects to MCP servers
- Connects realtime LLM (if applicable)
- Authenticates agent user with edge network
- Establishes RTC connection
- Publishes audio/video tracks
- Creates conversation for chat integration
- Waits for participants (configurable timeout)
- Starts consuming incoming audio
- Starts metrics broadcast (if enabled)
Reference: `agents.py:615-711`
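The ordering above can be sketched as an async context manager that runs its setup steps on entry and always tears down on exit (toy code, not the actual implementation):

```python
import asyncio
from contextlib import asynccontextmanager

steps = []  # records the order of lifecycle steps

@asynccontextmanager
async def join(call: str):
    for step in ("start_tracing", "connect_mcp", "connect_llm",
                 "authenticate", "connect_rtc", "publish_tracks",
                 "create_conversation", "wait_for_participants",
                 "consume_audio", "start_metrics"):
        steps.append(step)          # each setup step runs in order on entry
    try:
        yield call
    finally:
        steps.append("cleanup")     # always runs, even if the body raises

async def main():
    async with join("call-123"):
        steps.append("in_call")

asyncio.run(main())
print(steps[0], steps[-2], steps[-1])  # start_tracing in_call cleanup
```

Because cleanup sits in the `finally` block, an exception inside the call body still releases the connection and tracks.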
### Event Flow
The agent orchestrates events across all components.

Reference: `agents.py:323-476`
### Cleanup
The agent cleans up automatically when the context manager exits.

Reference: `agents.py:864-917`
## Advanced Features
### Streaming TTS
Reduce latency by streaming LLM output to TTS as sentences complete.

Reference: `agents.py:363-383`
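The core idea, flushing text to TTS at sentence boundaries instead of waiting for the full LLM response, can be sketched like this (toy code; the real logic is in `agents.py:363-383`):

```python
import re
from typing import Iterable, Iterator

def sentences_from_chunks(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in the LLM stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Split on sentence-ending punctuation followed by whitespace.
        while (m := re.search(r"[.!?]\s", buffer)):
            yield buffer[: m.end()].strip()
            buffer = buffer[m.end():]
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence

chunks = ["Hel", "lo there. How ", "are you? I", " am fine."]
print(list(sentences_from_chunks(chunks)))
# ['Hello there.', 'How are you?', 'I am fine.']
```

Each yielded sentence can be handed to TTS immediately, so playback of the first sentence overlaps with generation of the rest.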
### Multi-Speaker Handling
Automatically handle multiple participants with audio filtering.

Reference: `agents.py:194-196`
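As an illustration of what a multi-speaker audio filter can do, here is a toy filter that only passes frames from the dominant speaker. This is hypothetical; the real `AudioFilter` interface may look quite different:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class AudioFrame:
    participant_id: str
    energy: float  # crude loudness proxy

def dominant_speaker_filter(frames):
    """Keep frames only from the participant with the most speech energy."""
    energy = Counter()
    for f in frames:
        energy[f.participant_id] += f.energy
    dominant, _ = energy.most_common(1)[0]
    return [f for f in frames if f.participant_id == dominant]

frames = [AudioFrame("alice", 0.9), AudioFrame("bob", 0.2),
          AudioFrame("alice", 0.8), AudioFrame("bob", 0.1)]
print([f.participant_id for f in dominant_speaker_filter(frames)])
# ['alice', 'alice']
```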
### Metrics Broadcasting
Broadcast agent metrics to call participants.

### Programmatic Agent Speech
Make the agent say something directly.

Reference: `agents.py:1026-1062`
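Conceptually, programmatic speech bypasses the LLM and sends text straight to TTS. A toy sketch (the method name `say` and the `synthesize` call are assumptions based on this section's title, not the documented API):

```python
import asyncio

class TTSStub:
    """Stand-in TTS that 'synthesizes' by recording what it was asked to speak."""
    def __init__(self):
        self.spoken = []
    async def synthesize(self, text: str) -> None:
        self.spoken.append(text)

class AgentStub:
    def __init__(self, tts):
        self.tts = tts
    async def say(self, text: str) -> None:
        # Skip STT/LLM entirely: text goes straight to speech synthesis.
        await self.tts.synthesize(text)

tts = TTSStub()
asyncio.run(AgentStub(tts).say("Welcome to the call!"))
print(tts.spoken)  # ['Welcome to the call!']
```

This is useful for greetings, announcements, or error messages that should not depend on an LLM round trip.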
## Configuration Options
### Constructor Parameters
| Parameter | Type | Description |
|---|---|---|
| `edge` | `EdgeTransport` | Edge network for audio/video transport |
| `llm` | `LLM \| AudioLLM \| VideoLLM` | Language model (with optional audio/video) |
| `agent_user` | `User` | Agent's identity |
| `instructions` | `str` | System prompt (supports `@file.md` references) |
| `stt` | `Optional[STT]` | Speech-to-text (not needed for realtime LLMs) |
| `tts` | `Optional[TTS]` | Text-to-speech (not needed for realtime LLMs) |
| `turn_detection` | `Optional[TurnDetector]` | Conversational turn detection |
| `processors` | `Optional[List[Processor]]` | Custom audio/video processors |
| `mcp_servers` | `Optional[List[MCPBaseServer]]` | MCP servers for tools |
| `options` | `Optional[AgentOptions]` | Advanced configuration |
| `streaming_tts` | `bool` | Stream LLM chunks to TTS (default: `False`) |
| `broadcast_metrics` | `bool` | Broadcast metrics to participants (default: `False`) |
| `multi_speaker_filter` | `Optional[AudioFilter]` | Multi-speaker audio filter |
Reference: `agents.py:109-171`
### Agent Options
Reference: `agent_types.py:15-26`
## Event Subscriptions
Subscribe to agent-wide events.

Reference: `agents.py:600-612`
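A minimal sketch of the subscription pattern, using a toy event manager. The decorator-style registration shown here is an assumption for illustration, not the documented API:

```python
from collections import defaultdict

class EventManager:
    """Toy pub/sub: handlers register per event name and get called on send."""
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_name: str):
        def decorator(fn):
            self._handlers[event_name].append(fn)
            return fn
        return decorator

    def send(self, event_name: str, payload):
        for fn in self._handlers[event_name]:
            fn(payload)

events = EventManager()
seen = []

@events.subscribe("participant_joined")
def on_join(payload):
    seen.append(payload)

events.send("participant_joined", {"user": "alice"})
print(seen)  # [{'user': 'alice'}]
```

Because the agent merges events from all plugins at initialization, one subscription point covers edge, LLM, and processor events alike.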
## Observability
### OpenTelemetry Tracing
The agent automatically creates spans for key operations:

- `join` - full call lifecycle
- `edge.authenticate` - authentication
- `edge.join` - network connection
- `simple_response` - LLM interactions
- conversation sync operations
### Metrics Collection
Agent metrics are collected automatically.

## Code References
All code examples are based on the actual implementation:

- Agent class: `agents-core/vision_agents/core/agents/agents.py:82-917`
- Agent types: `agents-core/vision_agents/core/agents/agent_types.py`
- Event handling: `agents.py:323-577`
- Join flow: `agents.py:615-711`
- Cleanup: `agents.py:838-917`
## Best Practices
- Use context managers: always use `async with agent.join(call)` to ensure proper cleanup
- Handle errors gracefully: wrap agent operations in try/except blocks
- Choose the right LLM mode: use realtime LLMs for the lowest latency
- Configure timeouts: set an appropriate `participant_wait_timeout` for your use case
- Monitor metrics: enable metrics broadcasting for production observability
- Leverage processors: use processors for video analysis, not LLM function calls
- Use MCP for tools: prefer MCP servers over custom function implementations
## Next Steps
- Learn about Edge Networks for real-time transport
- Explore Processors for custom logic
- Understand Turn Detection for natural conversations
- Compare Realtime vs Interval modes
- Implement Function Calling for tool use