What You’ll Learn
- Setting up a basic voice AI agent
- Configuring speech-to-text (STT), text-to-speech (TTS), and LLM components
- Creating and joining video calls
- Using turn detection for natural conversations
- Registering custom functions for the LLM
Features
- Listens to user speech and converts it to text
- Processes conversations using an LLM (Large Language Model)
- Responds with natural-sounding speech
- Runs on Stream’s low-latency edge network
- Includes an example weather lookup function
Prerequisites
Before running this example, you’ll need API keys for:
- Gemini (for the LLM)
- ElevenLabs (for text-to-speech)
- Deepgram (for speech-to-text)
- Stream (for video/audio infrastructure)
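Before starting the agent, it can help to confirm that all four sets of credentials are available. The sketch below is a minimal check that assumes the keys are supplied as environment variables; the variable names are hypothetical placeholders, so confirm the exact names each provider expects.

```python
import os
from typing import List, Mapping

# Hypothetical variable names, one entry per service listed above.
# These are assumptions for illustration, not names confirmed by any provider.
REQUIRED_KEYS = [
    "GOOGLE_API_KEY",      # Gemini (LLM)
    "ELEVENLABS_API_KEY",  # ElevenLabs (text-to-speech)
    "DEEPGRAM_API_KEY",    # Deepgram (speech-to-text)
    "STREAM_API_KEY",      # Stream (video/audio infrastructure)
    "STREAM_API_SECRET",
]


def missing_keys(env: Mapping[str, str], required: List[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


missing = missing_keys(os.environ)
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All API keys found.")
```

Running this before the agent gives a clear message about which key is missing instead of a failure partway through startup.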
Setup
Complete Code
Code Walkthrough
Agent Components
The agent is built with several key components:
- edge: Handles low-latency audio/video transport via Stream’s edge network
- agent_user: Sets the agent’s name and ID
- instructions: Tells the agent how to behave
- llm: The language model that powers the conversation (Gemini)
- tts: Converts agent responses to speech (ElevenLabs)
- stt: Converts user speech to text (Deepgram with eager turn detection)
- processors: Optional video/audio processing pipeline (empty in this example)
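To show how these components relate during a single conversational turn, here is a rough, self-contained sketch. The VoicePipeline class and the stub functions are hypothetical stand-ins written for this illustration; they are not the library's agent class or the real provider plugins.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VoicePipeline:
    """Hypothetical stand-in for an agent: wires stt -> llm -> tts for one turn."""

    instructions: str
    stt: Callable[[bytes], str]       # user audio -> text
    llm: Callable[[str, str], str]    # (instructions, user text) -> reply text
    tts: Callable[[str], bytes]       # reply text -> audio
    processors: List[Callable[[bytes], bytes]] = field(default_factory=list)

    def handle_turn(self, audio: bytes) -> bytes:
        # Optional processors run first; the list is empty in this example.
        for process in self.processors:
            audio = process(audio)
        user_text = self.stt(audio)
        reply_text = self.llm(self.instructions, user_text)
        return self.tts(reply_text)


# Stub components so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(instructions: str, user_text: str) -> str:
    return f"You said: {user_text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(
    instructions="Be brief and friendly.",
    stt=fake_stt,
    llm=fake_llm,
    tts=fake_tts,
)
print(pipeline.handle_turn(b"hello"))
```

In the real agent each stub corresponds to a provider (Deepgram, Gemini, ElevenLabs), and the edge component carries the audio in and out of the call.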
Registering Custom Functions
You can extend the agent’s capabilities by registering custom functions.
Turn Detection
This example uses Deepgram’s eager turn detection (eager_turn_detection=True) for lower latency. This means the agent will respond more quickly when it detects you’ve stopped speaking, though it may use slightly more tokens.
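The latency and token trade-off can be illustrated with a simplified model. It assumes that eager detection lets the LLM start drafting a reply as soon as the speaker probably finished, and that the draft is discarded if the speaker resumes. The event names and the token count per draft are hypothetical, chosen only for this sketch; they are not Deepgram's API.

```python
from typing import Iterable, Tuple


def simulate_turns(events: Iterable[str], draft_tokens: int = 50, eager: bool = True) -> Tuple[int, int]:
    """Return (replies_spoken, tokens_spent) for a stream of turn events.

    Hypothetical events:
      "eager_end" - the speaker has probably finished
      "resumed"   - the speaker kept talking after an eager end
      "end"       - the turn is definitely over
    """
    tokens = 0
    spoken = 0
    draft_ready = False
    for event in events:
        if event == "eager_end" and eager:
            tokens += draft_tokens  # start drafting a reply early
            draft_ready = True
        elif event == "resumed":
            draft_ready = False     # early draft is stale, throw it away
        elif event == "end":
            if not draft_ready:
                tokens += draft_tokens  # no usable draft, generate now
            spoken += 1
            draft_ready = False
    return spoken, tokens


# One turn in which the speaker pauses, continues, then finishes.
events = ["eager_end", "resumed", "eager_end", "end"]
print(simulate_turns(events, eager=True))
print(simulate_turns(events, eager=False))
```

With eager detection the reply is already drafted when the turn ends, which is where the lower latency comes from; the discarded draft is the extra token cost.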
Alternative: Using Realtime LLMs
You can simplify the setup by using a realtime LLM like OpenAI Realtime or Gemini Live. These models handle speech-to-text and text-to-speech internally.
Customization Ideas
Change the Instructions
Edit the INSTRUCTIONS parameter to change how your agent behaves:
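As an illustration, and assuming the instructions are a plain string passed to the agent, a customized prompt might look like the following. The wording is a made-up example, not the text shipped with the original code.

```python
# Hypothetical instructions; adjust persona, tone, and limits to your use case.
INSTRUCTIONS = (
    "You are a friendly voice assistant. "
    "Keep answers under two sentences and use plain language. "
    "If asked about the weather, use the weather lookup function."
)

print(INSTRUCTIONS)
```

Short, concrete instructions tend to suit voice agents, since long replies take longer to speak.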
Use Different Models
Swap out any component.
Add Video Processing
Add processors to analyze video (see the Golf Coach Example for details).
What Happens When You Run It
- The agent creates a video call with a unique ID
- A demo UI opens in your browser automatically
- The agent joins the call and greets you
- You can speak naturally, and the agent will respond
- Try asking about the weather to see the custom function in action
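The weather lookup mentioned in the last step relies on a function the LLM can call by name. The sketch below shows the general pattern with a hypothetical FunctionRegistry class and canned weather data; the registration call in the actual library may look different, and a real function would query a weather service.

```python
from typing import Any, Callable, Dict


class FunctionRegistry:
    """Hypothetical registry mapping function names to callables for an LLM."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable[..., Any]] = {}
        self.descriptions: Dict[str, str] = {}

    def register(self, description: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        """Decorator that stores the function under its own name."""
        def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
            self._functions[func.__name__] = func
            self.descriptions[func.__name__] = description
            return func
        return decorator

    def call(self, name: str, **kwargs: Any) -> Any:
        """Invoke a registered function, as an LLM tool call would."""
        if name not in self._functions:
            raise KeyError(f"Unknown function: {name}")
        return self._functions[name](**kwargs)


registry = FunctionRegistry()


@registry.register(description="Get the current weather for a location")
def get_weather(location: str) -> Dict[str, Any]:
    # Canned placeholder values; not real weather data.
    return {"location": location, "temperature_c": 21, "condition": "sunny"}


# Simulates the LLM asking for the weather in response to a user question.
print(registry.call("get_weather", location="Amsterdam"))
```

The description string matters in practice because the LLM uses it to decide when the function is relevant to what the user asked.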
Next Steps
- Explore the Golf Coach Example to learn about video processing
- Check out the Phone RAG Example for building phone agents
- Read the Building Voice AI Apps guide