This example demonstrates how to build voice AI agents that can handle phone calls via Twilio with RAG (Retrieval Augmented Generation) capabilities. It includes both inbound and outbound calling examples.

What You’ll Learn

  • Handling inbound phone calls with Twilio webhooks
  • Making outbound phone calls programmatically
  • Implementing RAG with multiple backends (Gemini File Search or TurboPuffer)
  • Processing Twilio media streams with Vision Agents
  • Converting between Twilio’s mulaw audio and agent audio formats

Features

  • Inbound Calls: Answer phone calls and provide information using RAG
  • Outbound Calls: Initiate calls programmatically (e.g., restaurant reservations)
  • RAG Backend Options:
    • Gemini’s built-in File Search (default)
    • TurboPuffer + LangChain with function calling
  • Knowledge Base: Load documents from a local directory
  • Twilio Integration: Full webhook and media stream handling

Prerequisites

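You’ll need:

  • API credentials for Stream, Google Gemini, Twilio, Deepgram, and ElevenLabs (plus TurboPuffer if you use that backend)
  • A Twilio phone number
  • ngrok and uv installed locally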

Setup

1. Clone the repository

git clone git@github.com:GetStream/Vision-Agents.git
cd Vision-Agents
2. Configure environment variables

Create a .env file:
STREAM_API_KEY=your_stream_key
STREAM_API_SECRET=your_stream_secret
GOOGLE_API_KEY=your_gemini_key
TWILIO_ACCOUNT_SID=your_twilio_sid
TWILIO_AUTH_TOKEN=your_twilio_token
TURBO_PUFFER_KEY=your_turbopuffer_key  # Optional
DEEPGRAM_API_KEY=your_deepgram_key
ELEVENLABS_API_KEY=your_elevenlabs_key
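
A missing key usually surfaces only when the first call comes in, so it can help to check the environment before starting the server. The sketch below is not part of the example; `missing_env_vars` is a hypothetical helper, and treating TURBO_PUFFER_KEY as required only for the TurboPuffer backend is an assumption based on the "Optional" comment above.

```python
import os

# Names taken from the .env listing above.
REQUIRED_VARS = [
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
    "GOOGLE_API_KEY",
    "TWILIO_ACCOUNT_SID",
    "TWILIO_AUTH_TOKEN",
    "DEEPGRAM_API_KEY",
    "ELEVENLABS_API_KEY",
]


def missing_env_vars(env, rag_backend="gemini"):
    """Return the names of required variables that are unset or empty.

    Assumption: TURBO_PUFFER_KEY only matters when the TurboPuffer
    backend is selected.
    """
    required = list(REQUIRED_VARS)
    if rag_backend == "turbopuffer":
        required.append("TURBO_PUFFER_KEY")
    return [name for name in required if not env.get(name)]
```

After calling load_dotenv(), something like `missing_env_vars(os.environ, os.environ.get("RAG_BACKEND", "gemini").lower())` would return an empty list when everything is configured.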
3. Start ngrok

In a terminal window:
ngrok http 8000
Copy the ngrok forwarding hostname (e.g., abc123.ngrok-free.app). The commands below pass it as NGROK_URL without the https:// prefix.
4. Configure Twilio webhook

  1. Log in to the Twilio Console
  2. Go to Phone Numbers → Manage → Active numbers
  3. Buy a number if you don’t have one
  4. Set the “A call comes in” webhook to: https://abc123.ngrok-free.app/twilio/voice
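
The same ngrok host ends up in two places: the HTTPS webhook configured in Twilio and the wss:// media stream URL the server builds from NGROK_URL. A minimal sketch of deriving both from whatever was copied out of ngrok follows; the helper names are hypothetical and only the two paths come from this example.

```python
from urllib.parse import urlsplit


def ngrok_host(value):
    """Reduce an ngrok URL or bare host to just the host name."""
    value = value.strip()
    # urlsplit only fills in netloc when a scheme separator is present.
    if "://" not in value:
        value = "https://" + value
    return urlsplit(value).netloc


def voice_webhook_url(value):
    """URL to paste into Twilio's "A call comes in" field."""
    return f"https://{ngrok_host(value)}/twilio/voice"


def media_stream_url(value, call_id, token):
    """WebSocket URL in the shape the inbound example builds."""
    return f"wss://{ngrok_host(value)}/twilio/media/{call_id}/{token}"
```

Normalizing this way means a pasted `https://abc123.ngrok-free.app/` and a bare `abc123.ngrok-free.app` produce the same URLs.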

Running the Inbound Example

The inbound example answers calls and uses RAG to answer questions about Stream’s APIs.
1. Navigate to the example directory

cd examples/03_phone_and_rag_example
2. Start the server

RAG_BACKEND=gemini NGROK_URL=abc123.ngrok-free.app uv run inbound_phone_and_rag_example.py
3. Call your Twilio number

Call the number you configured in Twilio. The agent will answer and you can ask questions about Stream’s Chat, Video, and Feeds APIs.

RAG Backend Selection

Choose your RAG backend via the RAG_BACKEND environment variable:
# Use Gemini's built-in File Search (default, simpler)
RAG_BACKEND=gemini NGROK_URL=abc123.ngrok-free.app uv run inbound_phone_and_rag_example.py

# Use TurboPuffer with function calling (more control)
RAG_BACKEND=turbopuffer NGROK_URL=abc123.ngrok-free.app uv run inbound_phone_and_rag_example.py
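
In the inbound code, any value other than turbopuffer falls through to the Gemini branch, so a typo in RAG_BACKEND silently selects the default. A stricter variant might look like the sketch below; `resolve_rag_backend` is a hypothetical helper, not something the example ships.

```python
SUPPORTED_BACKENDS = ("gemini", "turbopuffer")


def resolve_rag_backend(env):
    """Read RAG_BACKEND from a mapping, defaulting to gemini.

    Unlike the example, this rejects unknown values instead of
    treating them as gemini.
    """
    value = env.get("RAG_BACKEND", "gemini").strip().lower()
    if value not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"RAG_BACKEND must be one of {SUPPORTED_BACKENDS}, got {value!r}"
        )
    return value
```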

Running the Outbound Example

The outbound example shows how to programmatically initiate calls (e.g., to make restaurant reservations).
cd examples/03_phone_and_rag_example
NGROK_URL=abc123.ngrok-free.app uv run outbound_phone_example.py --from +1234567890 --to +0987654321
Replace:
  • +1234567890 with your Twilio phone number
  • +0987654321 with the number you’re calling
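
Twilio expects phone numbers in E.164 format (a plus sign, a country code that does not start with 0, and at most 15 digits in total). The sketch below shows one way a script could validate --from and --to before placing a call. It is a hypothetical parser that mirrors the flags above, not the argument handling in outbound_phone_example.py.

```python
import argparse
import re

# E.164: "+", a non-zero leading digit, then up to 14 more digits.
E164_PATTERN = re.compile(r"\+[1-9][0-9]{1,14}")


def e164(value):
    """argparse type that accepts only E.164-formatted numbers."""
    if not E164_PATTERN.fullmatch(value):
        raise argparse.ArgumentTypeError(
            f"{value!r} is not an E.164 phone number (e.g., +14155550100)"
        )
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="Place an outbound call")
    # "from" is a Python keyword, so store it under a different name.
    parser.add_argument("--from", dest="from_number", type=e164, required=True)
    parser.add_argument("--to", dest="to_number", type=e164, required=True)
    return parser
```

Under this pattern a placeholder such as +0987654321 is rejected, which is a reminder to substitute real numbers before running the command.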

Complete Code (Inbound)

Here’s the core implementation for inbound calls:
import asyncio
import logging
import os
import uuid
from pathlib import Path

import uvicorn
from dotenv import load_dotenv
from fastapi import Depends, FastAPI, WebSocket
from uvicorn.middleware.proxy_headers import ProxyHeadersMiddleware

from vision_agents.core import User, Agent
from vision_agents.plugins import (
    getstream,
    gemini,
    twilio,
    elevenlabs,
    deepgram,
    turbopuffer,
)

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

load_dotenv()

NGROK_URL = os.environ["NGROK_URL"]
KNOWLEDGE_DIR = Path(__file__).parent / "knowledge"
RAG_BACKEND = os.environ.get("RAG_BACKEND", "gemini").lower()

file_search_store = None
rag = None

app = FastAPI()
app.add_middleware(ProxyHeadersMiddleware, trusted_hosts=["*"])
call_registry = twilio.TwilioCallRegistry()


@app.post("/twilio/voice")
async def twilio_voice_webhook(
    _: None = Depends(twilio.verify_twilio_signature),
    data: twilio.CallWebhookInput = Depends(twilio.CallWebhookInput.as_form),
):
    logger.info(f"📞 Call from {data.caller} ({data.caller_city or 'unknown location'})")
    call_id = str(uuid.uuid4())

    async def prepare_call():
        agent = await create_agent()
        phone_number = data.from_number or "unknown"
        sanitized_number = phone_number.replace("+", "").replace(" ", "")
        phone_user = User(
            name=f"Call from {phone_number}", id=f"phone-{sanitized_number}"
        )
        await agent.edge.create_users([phone_user])
        stream_call = await agent.create_call("default", call_id=call_id)
        return agent, phone_user, stream_call

    twilio_call = call_registry.create(call_id, data, prepare=prepare_call)
    url = f"wss://{NGROK_URL}/twilio/media/{call_id}/{twilio_call.token}"
    return twilio.create_media_stream_response(url)


@app.websocket("/twilio/media/{call_id}/{token}")
async def media_stream(websocket: WebSocket, call_id: str, token: str):
    twilio_call = call_registry.validate(call_id, token)
    logger.info(f"🔗 Media stream connected for {twilio_call.caller}")

    twilio_stream = twilio.TwilioMediaStream(websocket)
    await twilio_stream.accept()
    twilio_call.twilio_stream = twilio_stream

    try:
        agent, phone_user, stream_call = await twilio_call.await_prepare()
        twilio_call.stream_call = stream_call

        await twilio.attach_phone_to_call(stream_call, twilio_stream, phone_user.id)

        async with agent.join(stream_call, participant_wait_timeout=0):
            await agent.llm.simple_response(
                text="Greet the caller warmly and ask what kind of app they're building. Use your knowledge base to provide relevant product recommendations."
            )
            await twilio_stream.run()
    finally:
        call_registry.remove(call_id)


async def create_agent() -> Agent:
    instructions = """Read the instructions in @instructions.md"""

    if RAG_BACKEND == "turbopuffer":
        llm = gemini.LLM("gemini-2.5-flash-lite")

        @llm.register_function(
            description="Search Stream's product knowledge base for detailed information about Chat, Video, Feeds, and Moderation APIs."
        )
        async def search_knowledge(query: str) -> str:
            return await rag.search(query, top_k=3)
    else:
        llm = gemini.LLM(
            "gemini-2.5-flash-lite",
            tools=[gemini.tools.FileSearch(file_search_store)],
        )

    return Agent(
        edge=getstream.Edge(),
        agent_user=User(id="ai-agent", name="AI"),
        instructions=instructions,
        tts=elevenlabs.TTS(voice_id="FGY2WhTYpPnrIDTdsKH5"),
        stt=deepgram.STT(eager_turn_detection=True),
        llm=llm,
    )


if __name__ == "__main__":
    asyncio.run(create_rag_from_directory())
    logger.info(f"Starting with RAG_BACKEND={RAG_BACKEND}")
    uvicorn.run(app, host="localhost", port=8000)

Understanding the Flow

Inbound Call Flow

  1. Twilio receives a call and sends a request to the /twilio/voice webhook
  2. The webhook validates the Twilio signature, registers the call, and starts preparing it
  3. The webhook returns TwiML that starts a bidirectional media stream to /twilio/media/{call_id}/{token}
  4. Twilio connects to the media stream WebSocket, and the token is validated against the registry
  5. The phone user is attached to the Stream call and the prepared agent joins it
  6. Audio flows: Twilio ↔ Vision Agents ↔ STT/TTS/LLM
  7. The agent uses RAG to answer questions from the knowledge base
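
The hand-off between the webhook and the WebSocket in the steps above relies on a registry that issues a per-call token and starts preparation early. The following is a rough, in-memory illustration of that pattern. It is not the implementation of twilio.TwilioCallRegistry; the class names and internals are hypothetical.

```python
import asyncio
import hmac
import secrets


class PendingCall:
    """One call waiting for its media stream (illustrative only)."""

    def __init__(self, call_id, prepare):
        self.call_id = call_id
        # Unguessable token that the WebSocket URL must echo back.
        self.token = secrets.token_urlsafe(16)
        # Start setup right away so it overlaps with the time Twilio
        # needs to open the media stream. Requires a running event loop.
        self._task = asyncio.create_task(prepare())

    async def await_prepare(self):
        return await self._task


class SketchCallRegistry:
    def __init__(self):
        self._calls = {}

    def create(self, call_id, prepare):
        call = PendingCall(call_id, prepare)
        self._calls[call_id] = call
        return call

    def validate(self, call_id, token):
        call = self._calls.get(call_id)
        # Constant-time comparison avoids leaking token prefixes.
        if call is None or not hmac.compare_digest(
            call.token.encode(), token.encode()
        ):
            raise PermissionError("unknown call or invalid token")
        return call

    def remove(self, call_id):
        self._calls.pop(call_id, None)
```

Putting the token in the WebSocket path matters because the media stream endpoint is a separate connection: without it, anyone who guessed a call ID could attach to the call.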

RAG Initialization

async def create_rag_from_directory():
    global file_search_store, rag

    if not KNOWLEDGE_DIR.exists():
        logger.warning(f"Knowledge directory not found: {KNOWLEDGE_DIR}")
        return

    if RAG_BACKEND == "turbopuffer":
        logger.info(f"📚 Initializing TurboPuffer RAG from {KNOWLEDGE_DIR}")
        rag = await turbopuffer.create_rag(
            namespace="stream-product-knowledge-gemini",
            knowledge_dir=KNOWLEDGE_DIR,
            extensions=[".md"],
        )
    else:
        logger.info(f"📚 Initializing Gemini File Search from {KNOWLEDGE_DIR}")
        file_search_store = await gemini.create_file_search_store(
            name="stream-product-knowledge",
            knowledge_dir=KNOWLEDGE_DIR,
            extensions=[".md"],
        )

TwiML and WebSockets

Twilio uses TwiML to control phone calls. The create_media_stream_response helper returns TwiML that pipes the call to a WebSocket URL:
url = f"wss://{NGROK_URL}/twilio/media/{call_id}/{twilio_call.token}"
return twilio.create_media_stream_response(url)
The WebSocket endpoint receives real-time audio:
@app.websocket("/twilio/media/{call_id}/{token}")
async def media_stream(websocket: WebSocket, call_id: str, token: str):
    twilio_stream = twilio.TwilioMediaStream(websocket)
    await twilio_stream.accept()
    # ... connect to agent
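
The output of create_media_stream_response is not shown here. In Twilio's TwiML, a bidirectional stream is requested with a Stream element inside Connect, so the response presumably resembles what this sketch builds. `media_stream_twiml` is a hypothetical stand-in that uses only the standard library, and the exact markup the helper emits is an assumption.

```python
import xml.etree.ElementTree as ET


def media_stream_twiml(ws_url):
    """Build TwiML that connects the call audio to a WebSocket URL."""
    response = ET.Element("Response")
    # <Connect><Stream> is Twilio's verb/noun pair for bidirectional audio.
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", {"url": ws_url})
    return ET.tostring(response, encoding="unicode")
```

The webhook would return this string with an XML content type; Twilio then opens the WebSocket named in the url attribute.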

Audio Format Notes

Twilio media streams carry μ-law (mulaw) encoded audio sampled at 8 kHz. Vision Agents handles the conversion automatically through TwilioMediaStream.
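
To make the conversion concrete, here is a standalone sketch of decoding one Twilio media message into 16-bit PCM using the standard G.711 μ-law expansion. It is independent of TwilioMediaStream, whose internals are not shown here. The message shape (an "event" of "media" with a base64 "payload" under "media") follows Twilio's media stream protocol, and the function names are hypothetical.

```python
import base64
import json
import struct


def mulaw_to_pcm16(data):
    """Expand G.711 mu-law bytes into signed 16-bit sample values."""
    samples = []
    for byte in data:
        byte = ~byte & 0xFF           # mu-law bytes are stored inverted
        sign = byte & 0x80
        exponent = (byte >> 4) & 0x07
        mantissa = byte & 0x0F
        # 0x84 is the bias added during encoding.
        magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
        samples.append(-magnitude if sign else magnitude)
    return samples


def decode_media_message(raw):
    """Return little-endian PCM16 bytes for a media event, else b''."""
    message = json.loads(raw)
    if message.get("event") != "media":
        return b""
    mulaw = base64.b64decode(message["media"]["payload"])
    pcm = mulaw_to_pcm16(mulaw)
    return struct.pack(f"<{len(pcm)}h", *pcm)
```

The result is still 8 kHz mono; resampling to whatever rate the STT or TTS provider expects is a separate step.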

Deployment Notes

For optimal latency:
  • Deploy in US-East (closest to Twilio’s servers)
  • Use a production server instead of ngrok
  • Consider using Stream’s edge network for global distribution

Knowledge Base

Place your knowledge documents in the knowledge/ directory:
03_phone_and_rag_example/
├── knowledge/
│   ├── chat-api.md
│   ├── video-api.md
│   └── feeds-api.md
├── inbound_phone_and_rag_example.py
└── outbound_phone_example.py
The RAG system will index all .md files on startup.
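
To preview which files would be picked up, a small script can list them by extension. This is a sketch only: `find_knowledge_files` is a hypothetical helper, and whether the real create_rag and create_file_search_store calls recurse into subdirectories is an assumption.

```python
from pathlib import Path


def find_knowledge_files(knowledge_dir, extensions=(".md",)):
    """List files under knowledge_dir whose suffix is in extensions.

    Assumption: subdirectories are searched recursively.
    """
    root = Path(knowledge_dir)
    if not root.is_dir():
        return []
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )
```

Calling `find_knowledge_files("knowledge")` from the example directory before starting the server is a quick way to confirm the documents are where the code expects them.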
