What You'll Learn
- Handling inbound phone calls with Twilio webhooks
- Making outbound phone calls programmatically
- Implementing RAG with multiple backends (Gemini File Search or TurboPuffer)
- Processing Twilio media streams with Vision Agents
- Converting between Twilio's mulaw audio and agent audio formats
Features
- Inbound Calls: Answer phone calls and provide information using RAG
- Outbound Calls: Initiate calls programmatically (e.g., restaurant reservations)
- RAG Backend Options:
  - Gemini's built-in File Search (default)
  - TurboPuffer + LangChain with function calling
- Knowledge Base: Load documents from a local directory
- Twilio Integration: Full webhook and media stream handling
Prerequisites
You'll need:
- Stream API credentials
- Gemini API key
- Twilio account with phone number
- Deepgram API key (for STT)
- ElevenLabs API key (for TTS)
- TurboPuffer API key (optional, for TurboPuffer RAG backend)
- ngrok for local development
Setup
Configure Twilio webhook
- Log in to the Twilio Console
- Go to Phone Numbers → Manage → Active numbers
- Buy a number if you don't have one
- Set the "A call comes in" webhook to:
https://abc123.ngrok-free.app/twilio/voice
Running the Inbound Example
The inbound example answers calls and uses RAG to answer questions about Stream's APIs.
RAG Backend Selection
Choose your RAG backend via the RAG_BACKEND environment variable.
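As a rough illustration of environment-based selection, the sketch below reads RAG_BACKEND and falls back to Gemini, which the feature list names as the default. The accepted values ("gemini" and "turbopuffer") and the function name `select_rag_backend` are assumptions made for this sketch, not names confirmed by the example.

```python
import os

# Assumed value names; the example may spell them differently.
SUPPORTED_BACKENDS = {"gemini", "turbopuffer"}


def select_rag_backend(env=None):
    """Return the RAG backend name from RAG_BACKEND, defaulting to Gemini."""
    env = os.environ if env is None else env
    backend = env.get("RAG_BACKEND", "gemini").strip().lower()
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unsupported RAG_BACKEND {backend!r}; "
            f"expected one of {sorted(SUPPORTED_BACKENDS)}"
        )
    return backend
```

Failing fast on an unknown value surfaces a typo at startup rather than partway through a call.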
Running the Outbound Example
The outbound example shows how to programmatically initiate calls (e.g., to make restaurant reservations).
Replace +1234567890 with your Twilio phone number and +0987654321 with the number you're calling.
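To show what initiating a call involves at the HTTP level, here is a minimal sketch that builds (but does not send) a request to Twilio's REST endpoint for creating calls. The helper name `build_call_request` is hypothetical. The example itself most likely uses a client library rather than raw HTTP, so treat this as an illustration of the underlying request only.

```python
import base64
import urllib.parse
import urllib.request


def build_call_request(account_sid, auth_token, from_number, to_number, twiml):
    """Build a POST request that asks Twilio to place an outbound call."""
    url = f"https://api.twilio.com/2010-04-01/Accounts/{account_sid}/Calls.json"
    # Twilio expects form-encoded parameters; "Twiml" carries inline call instructions.
    body = urllib.parse.urlencode(
        {"From": from_number, "To": to_number, "Twiml": twiml}
    ).encode("ascii")
    # HTTP Basic auth with the account SID as username and the auth token as password.
    credentials = base64.b64encode(f"{account_sid}:{auth_token}".encode()).decode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Authorization": f"Basic {credentials}"},
    )
```

Passing the returned object to `urllib.request.urlopen` would place a real, billable call, so the sketch stops at building the request.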
Complete Code (Inbound)
Here's the core implementation for inbound calls:
Understanding the Flow
Inbound Call Flow
- Twilio receives a call and triggers the /twilio/voice webhook
- Webhook validates Twilio signature and starts preparing the call
- Returns TwiML to start a bidirectional media stream to /twilio/media
- Media stream WebSocket connects
- Agent is created and attached to the phone user
- Audio flows: Twilio → Vision Agents → STT/TTS/LLM
- Agent uses RAG to answer questions from the knowledge base
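The signature check in the second step can be sketched with the standard library alone. Twilio signs each webhook POST by appending the form parameters, sorted by name, to the full request URL, computing an HMAC-SHA1 with your auth token, and sending the Base64 result in the X-Twilio-Signature header. The function names below are hypothetical, and the example may rely on a library helper instead of code like this.

```python
import base64
import hashlib
import hmac


def compute_twilio_signature(auth_token, url, params):
    """Compute the signature Twilio would send for a form-encoded POST."""
    # Concatenate the URL with each parameter name and value, sorted by name.
    payload = url + "".join(f"{key}{params[key]}" for key in sorted(params))
    digest = hmac.new(
        auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_twilio_request(auth_token, url, params, header_signature):
    """Compare the expected signature with the header in constant time."""
    expected = compute_twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected, header_signature)
```

The URL must match exactly what Twilio called, including scheme and host. Behind ngrok or another proxy, that is the public URL rather than the local one.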
RAG Initialization
TwiML and WebSockets
Twilio uses TwiML to control phone calls. The create_media_stream_response helper returns TwiML that pipes the call to a WebSocket URL.
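For orientation, TwiML for a bidirectional stream is a small XML document with a `<Stream>` element nested inside `<Connect>`. The sketch below builds one with the standard library. `build_media_stream_twiml` is a hypothetical stand-in, not the actual create_media_stream_response helper, whose signature and output may differ.

```python
import xml.etree.ElementTree as ET


def build_media_stream_twiml(websocket_url):
    """Return TwiML that connects the call to a bidirectional media stream."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    # The url attribute must be a wss:// address that Twilio can reach.
    ET.SubElement(connect, "Stream", url=websocket_url)
    return ET.tostring(response, encoding="unicode")


twiml = build_media_stream_twiml("wss://abc123.ngrok-free.app/twilio/media")
```

The webhook would return this string with an XML content type so that Twilio opens the WebSocket.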
Audio Format Notes
Twilio uses mulaw audio encoding at 8kHz. Vision Agents handles the conversion automatically through TwilioMediaStream.
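To make the conversion concrete, the following sketch decodes one Twilio "media" WebSocket message (a JSON object with a Base64 mulaw payload) into 16-bit little-endian PCM using the standard G.711 expansion. It is not the TwilioMediaStream implementation, and the function names are hypothetical. Resampling from 8 kHz to the agent's sample rate is left out.

```python
import base64
import json
import struct


def mulaw_byte_to_pcm16(value):
    """Expand one G.711 mu-law byte to a signed 16-bit sample."""
    value = ~value & 0xFF  # mu-law bytes are stored bit-inverted
    sign = value & 0x80
    exponent = (value >> 4) & 0x07
    mantissa = value & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


def decode_media_message(message):
    """Return PCM16 bytes for a 'media' event, or b'' for other events."""
    event = json.loads(message)
    if event.get("event") != "media":
        return b""
    mulaw = base64.b64decode(event["media"]["payload"])
    samples = [mulaw_byte_to_pcm16(byte) for byte in mulaw]
    return struct.pack(f"<{len(samples)}h", *samples)
```

Sending audio back to the caller requires the reverse path: compress PCM to mulaw, Base64-encode it, and wrap it in a media message.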
Deployment Notes
For optimal latency:
- Deploy in US-East (closest to Twilio's servers)
- Use a production server instead of ngrok
- Consider using Stream's edge network for global distribution
Knowledge Base
Place your knowledge documents in the knowledge/ directory. The .md files in it are loaded on startup.
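A minimal sketch of the loading step, assuming only the top level of the directory is scanned and files are UTF-8 encoded. `load_markdown_documents` is a hypothetical name, and the example's own loader may chunk or upload the files differently depending on the backend.

```python
from pathlib import Path


def load_markdown_documents(directory):
    """Read every .md file directly inside `directory` into a name -> text map."""
    root = Path(directory)
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(root.glob("*.md"))
    }
```

Calling `load_markdown_documents("knowledge")` would return the documents ready to hand to whichever RAG backend is selected.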
Next Steps
- Explore the RAG Guide for advanced RAG techniques
- Try the Simple Agent Example for a simpler voice agent
- Read about Twilio Integration for more details