Phone + RAG Example

This example demonstrates how to build voice AI agents that can handle phone calls via Twilio with RAG (Retrieval Augmented Generation) capabilities. It includes both inbound and outbound calling examples.

What You’ll Learn

Handling inbound phone calls with Twilio webhooks
Making outbound phone calls programmatically
Implementing RAG with multiple backends (Gemini File Search or TurboPuffer)
Processing Twilio media streams with Vision Agents
Converting between Twilio’s mulaw audio and agent audio formats

Features

Inbound Calls: Answer phone calls and provide information using RAG
Outbound Calls: Initiate calls programmatically (e.g., restaurant reservations)
RAG Backend Options:
- Gemini’s built-in File Search (default)
- TurboPuffer + LangChain with function calling
Knowledge Base: Load documents from a local directory
Twilio Integration: Full webhook and media stream handling

Prerequisites

You’ll need:

Stream API credentials
Gemini API key
Twilio account with phone number
Deepgram API key (for STT)
ElevenLabs API key (for TTS)
TurboPuffer API key (optional, for TurboPuffer RAG backend)
ngrok for local development

Setup

Clone the repository

git clone git@github.com:GetStream/Vision-Agents.git
cd Vision-Agents

Configure environment variables

Create a .env file:

STREAM_API_KEY=your_stream_key
STREAM_API_SECRET=your_stream_secret
GOOGLE_API_KEY=your_gemini_key
TWILIO_ACCOUNT_SID=your_twilio_sid
TWILIO_AUTH_TOKEN=your_twilio_token
TURBO_PUFFER_KEY=your_turbopuffer_key  # Optional
DEEPGRAM_API_KEY=your_deepgram_key
ELEVENLABS_API_KEY=your_elevenlabs_key

Start ngrok

In a terminal window:

ngrok http 8000

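Copy the ngrok hostname without the https:// prefix (e.g., abc123.ngrok-free.app). The examples read it from the NGROK_URL environment variable and build the wss:// URL from it.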

Configure Twilio webhook

Log in to the Twilio Console
Go to Phone Numbers → Manage → Active numbers
Buy a number if you don’t have one
Set “A call comes in” webhook to: https://abc123.ngrok-free.app/twilio/voice

Running the Inbound Example

The inbound example answers calls and uses RAG to answer questions about Stream’s APIs.

Navigate to the example directory

cd examples/03_phone_and_rag_example

Start the server

RAG_BACKEND=gemini NGROK_URL=abc123.ngrok-free.app uv run inbound_phone_and_rag_example.py

Call your Twilio number

Call the number you configured in Twilio. The agent will answer and you can ask questions about Stream’s Chat, Video, and Feeds APIs.

RAG Backend Selection

Choose your RAG backend via the RAG_BACKEND environment variable:

# Use Gemini's built-in File Search (default, simpler)
RAG_BACKEND=gemini NGROK_URL=abc123.ngrok-free.app uv run inbound_phone_and_rag_example.py

# Use TurboPuffer with function calling (more control)
RAG_BACKEND=turbopuffer NGROK_URL=abc123.ngrok-free.app uv run inbound_phone_and_rag_example.py
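
The example reads RAG_BACKEND with a default of gemini, and create_agent treats every value other than turbopuffer as the Gemini backend, so a typo falls through silently. A minimal sketch of a stricter startup check follows. The helper name resolve_rag_backend and the rule that TURBO_PUFFER_KEY is required only for the TurboPuffer backend are assumptions made for illustration, not part of the example.

```python
import os

SUPPORTED_BACKENDS = {"gemini", "turbopuffer"}


def resolve_rag_backend(env=None):
    """Return the validated backend name, or raise ValueError."""
    env = os.environ if env is None else env
    backend = env.get("RAG_BACKEND", "gemini").strip().lower()
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unsupported RAG_BACKEND {backend!r}; "
            f"expected one of {sorted(SUPPORTED_BACKENDS)}"
        )
    # Assumption: the TurboPuffer key matters only when that backend is selected.
    if backend == "turbopuffer" and not env.get("TURBO_PUFFER_KEY"):
        raise ValueError("TURBO_PUFFER_KEY must be set when RAG_BACKEND=turbopuffer")
    return backend
```

Calling this once before the server starts would turn a misspelled backend name into an immediate error instead of an unexpected Gemini run.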

Running the Outbound Example

The outbound example shows how to programmatically initiate calls (e.g., to make restaurant reservations).

cd examples/03_phone_and_rag_example
NGROK_URL=abc123.ngrok-free.app uv run outbound_phone_example.py --from +1234567890 --to +0987654321

Replace:

+1234567890 with your Twilio phone number
+0987654321 with the number you’re calling
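
Because from is a reserved word in Python, a --from flag is easier to read back when it is given an explicit dest. The sketch below shows one way the two flags could be parsed and loosely checked. The function name parse_call_args and the validation pattern are assumptions; the actual outbound_phone_example.py may handle its arguments differently.

```python
import argparse
import re

# Loose shape check only: a leading "+" followed by 7 to 15 digits.
PHONE_PATTERN = re.compile(r"\+\d{7,15}")


def phone_number(value):
    """argparse type that rejects values not shaped like +<digits>."""
    if not PHONE_PATTERN.fullmatch(value):
        raise argparse.ArgumentTypeError(f"not a valid phone number: {value!r}")
    return value


def parse_call_args(argv=None):
    parser = argparse.ArgumentParser(description="Place an outbound call")
    # An explicit dest avoids writing args.from, which is a syntax error.
    parser.add_argument("--from", dest="from_number", type=phone_number, required=True)
    parser.add_argument("--to", dest="to_number", type=phone_number, required=True)
    return parser.parse_args(argv)
```

With this in place, the numbers are available as args.from_number and args.to_number, and a malformed value stops the script before any call is placed.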

Complete Code (Inbound)

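Here’s the core implementation for inbound calls. The create_rag_from_directory function called at startup is listed separately under RAG Initialization.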

import asyncio
import logging
import os
import uuid
from pathlib import Path

import uvicorn
from dotenv import load_dotenv
from fastapi import Depends, FastAPI, WebSocket
from uvicorn.middleware.proxy_headers import ProxyHeadersMiddleware

from vision_agents.core import User, Agent
from vision_agents.plugins import (
    getstream,
    gemini,
    twilio,
    elevenlabs,
    deepgram,
    turbopuffer,
)

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

load_dotenv()

NGROK_URL = os.environ["NGROK_URL"]
KNOWLEDGE_DIR = Path(__file__).parent / "knowledge"
RAG_BACKEND = os.environ.get("RAG_BACKEND", "gemini").lower()

file_search_store = None
rag = None

app = FastAPI()
app.add_middleware(ProxyHeadersMiddleware, trusted_hosts=["*"])
call_registry = twilio.TwilioCallRegistry()


@app.post("/twilio/voice")
async def twilio_voice_webhook(
    _: None = Depends(twilio.verify_twilio_signature),
    data: twilio.CallWebhookInput = Depends(twilio.CallWebhookInput.as_form),
):
    logger.info(f"📞 Call from {data.caller} ({data.caller_city or 'unknown location'})")
    call_id = str(uuid.uuid4())

    async def prepare_call():
        agent = await create_agent()
        phone_number = data.from_number or "unknown"
        sanitized_number = phone_number.replace("+", "").replace(" ", "")
        phone_user = User(
            name=f"Call from {phone_number}", id=f"phone-{sanitized_number}"
        )
        await agent.edge.create_users([phone_user])
        stream_call = await agent.create_call("default", call_id=call_id)
        return agent, phone_user, stream_call

    twilio_call = call_registry.create(call_id, data, prepare=prepare_call)
    url = f"wss://{NGROK_URL}/twilio/media/{call_id}/{twilio_call.token}"
    return twilio.create_media_stream_response(url)


@app.websocket("/twilio/media/{call_id}/{token}")
async def media_stream(websocket: WebSocket, call_id: str, token: str):
    twilio_call = call_registry.validate(call_id, token)
    logger.info(f"🔗 Media stream connected for {twilio_call.caller}")

    twilio_stream = twilio.TwilioMediaStream(websocket)
    await twilio_stream.accept()
    twilio_call.twilio_stream = twilio_stream

    try:
        agent, phone_user, stream_call = await twilio_call.await_prepare()
        twilio_call.stream_call = stream_call

        await twilio.attach_phone_to_call(stream_call, twilio_stream, phone_user.id)

        async with agent.join(stream_call, participant_wait_timeout=0):
            await agent.llm.simple_response(
                text="Greet the caller warmly and ask what kind of app they're building. Use your knowledge base to provide relevant product recommendations."
            )
            await twilio_stream.run()
    finally:
        call_registry.remove(call_id)


async def create_agent() -> Agent:
    instructions = """Read the instructions in @instructions.md"""

    if RAG_BACKEND == "turbopuffer":
        llm = gemini.LLM("gemini-2.5-flash-lite")

        @llm.register_function(
            description="Search Stream's product knowledge base for detailed information about Chat, Video, Feeds, and Moderation APIs."
        )
        async def search_knowledge(query: str) -> str:
            return await rag.search(query, top_k=3)
    else:
        llm = gemini.LLM(
            "gemini-2.5-flash-lite",
            tools=[gemini.tools.FileSearch(file_search_store)],
        )

    return Agent(
        edge=getstream.Edge(),
        agent_user=User(id="ai-agent", name="AI"),
        instructions=instructions,
        tts=elevenlabs.TTS(voice_id="FGY2WhTYpPnrIDTdsKH5"),
        stt=deepgram.STT(eager_turn_detection=True),
        llm=llm,
    )


if __name__ == "__main__":
    asyncio.run(create_rag_from_directory())
    logger.info(f"Starting with RAG_BACKEND={RAG_BACKEND}")
    uvicorn.run(app, host="localhost", port=8000)
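
The webhook embeds a per-call token in the WebSocket URL, and the media endpoint checks that token before it accepts audio. The internals of twilio.TwilioCallRegistry are not shown here. The following is a hypothetical, simplified registry (the class name SimpleCallRegistry and its methods are invented for illustration) that only demonstrates the general idea of issuing a random token and comparing it in constant time.

```python
import hmac
import secrets


class SimpleCallRegistry:
    """Hypothetical stand-in: maps each call ID to one random token."""

    def __init__(self):
        self._tokens = {}

    def create(self, call_id):
        # 32 random bytes, URL-safe so the token can sit in a path segment.
        token = secrets.token_urlsafe(32)
        self._tokens[call_id] = token
        return token

    def validate(self, call_id, token):
        expected = self._tokens.get(call_id)
        if expected is None:
            return False
        # compare_digest avoids leaking the mismatch position through timing.
        return hmac.compare_digest(expected.encode(), token.encode())

    def remove(self, call_id):
        self._tokens.pop(call_id, None)
```

The real registry also stores the webhook data and the pending prepare task, which this sketch leaves out.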

Understanding the Flow

Inbound Call Flow

Twilio receives a call and triggers the /twilio/voice webhook
The webhook validates the Twilio signature and starts preparing the call
The webhook returns TwiML that starts a bidirectional media stream to /twilio/media/{call_id}/{token}
The media stream WebSocket connects and its token is validated against the call registry
The endpoint waits for the prepared agent and attaches the phone audio to the Stream call as the phone user
Audio flows: Twilio ↔ Vision Agents ↔ STT/TTS/LLM
The agent uses RAG to answer questions from the knowledge base
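
The signature check in the webhook relies on twilio.verify_twilio_signature, whose implementation is not shown. Twilio’s documented scheme for form-encoded webhooks is to append every POST parameter (sorted by name, key followed by value) to the full request URL, sign the result with HMAC-SHA1 using the account auth token, and compare the base64 digest with the X-Twilio-Signature header. The sketch below implements that scheme with the standard library. The function names are illustrative, and production code would normally use the validator shipped with Twilio’s SDK.

```python
import base64
import hashlib
import hmac


def compute_twilio_signature(auth_token, url, params):
    """Sign url + sorted form params with HMAC-SHA1 and base64-encode it."""
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(
        auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_twilio_request(auth_token, url, params, header_signature):
    """Compare the expected signature with the header in constant time."""
    expected = compute_twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected.encode(), header_signature.encode())
```

The URL has to be the public one Twilio actually requested, which is presumably why the example installs ProxyHeadersMiddleware when running behind ngrok.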

RAG Initialization

async def create_rag_from_directory():
    global file_search_store, rag

    if not KNOWLEDGE_DIR.exists():
        logger.warning(f"Knowledge directory not found: {KNOWLEDGE_DIR}")
        return

    if RAG_BACKEND == "turbopuffer":
        logger.info(f"📚 Initializing TurboPuffer RAG from {KNOWLEDGE_DIR}")
        rag = await turbopuffer.create_rag(
            namespace="stream-product-knowledge-gemini",
            knowledge_dir=KNOWLEDGE_DIR,
            extensions=[".md"],
        )
    else:
        logger.info(f"📚 Initializing Gemini File Search from {KNOWLEDGE_DIR}")
        file_search_store = await gemini.create_file_search_store(
            name="stream-product-knowledge",
            knowledge_dir=KNOWLEDGE_DIR,
            extensions=[".md"],
        )

TwiML and WebSockets

Twilio uses TwiML to control phone calls. The create_media_stream_response helper returns TwiML that pipes the call to a WebSocket URL:

url = f"wss://{NGROK_URL}/twilio/media/{call_id}/{twilio_call.token}"
return twilio.create_media_stream_response(url)
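
For reference, a bidirectional stream in TwiML is a Stream element nested inside Connect. The exact document that create_media_stream_response produces is not shown, so the following is only a sketch of what such a response could look like, built with the standard library; build_stream_twiml is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def build_stream_twiml(websocket_url):
    """Return TwiML that connects the call to a bidirectional media stream."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    # Attribute values are escaped by ElementTree when serialized.
    ET.SubElement(connect, "Stream", url=websocket_url)
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body
```

A webhook would return this string with an XML content type so that Twilio opens the WebSocket named in the url attribute.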

The WebSocket endpoint receives real-time audio:

@app.websocket("/twilio/media/{call_id}/{token}")
async def media_stream(websocket: WebSocket, call_id: str, token: str):
    twilio_stream = twilio.TwilioMediaStream(websocket)
    await twilio_stream.accept()
    # ... connect to agent

Audio Format Notes

Twilio media streams carry 8-bit mulaw (μ-law) audio sampled at 8 kHz. Vision Agents handles the conversion automatically through TwilioMediaStream.
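
To make the conversion concrete, here is a standalone sketch of the decoding step. Twilio sends each audio frame as a JSON message with event set to media and a base64 payload of mulaw bytes; the code expands those bytes to signed 16-bit PCM with the standard G.711 formula. This is not the TwilioMediaStream implementation, the function names are illustrative, and resampling to the agent’s sample rate is left out.

```python
import base64
import json
import struct


def mulaw_to_pcm16(byte):
    """Decode one G.711 mu-law byte to a signed 16-bit sample."""
    value = ~byte & 0xFF              # mu-law bytes are stored inverted
    sign = value & 0x80
    exponent = (value >> 4) & 0x07
    mantissa = value & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


def decode_media_message(raw_message):
    """Return little-endian PCM16 bytes for a "media" event, else None."""
    message = json.loads(raw_message)
    if message.get("event") != "media":
        return None
    mulaw = base64.b64decode(message["media"]["payload"])
    samples = [mulaw_to_pcm16(b) for b in mulaw]
    return struct.pack(f"<{len(samples)}h", *samples)
```

Sending audio back to the caller requires the reverse path: PCM is compressed to mulaw, base64-encoded, and wrapped in a media message.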

Deployment Notes

For optimal latency:

Deploy in a US East region, close to Twilio’s default US1 region in Virginia
Use a production server instead of ngrok
Consider using Stream’s edge network for global distribution

Knowledge Base

Place your knowledge documents in the knowledge/ directory:

03_phone_and_rag_example/
├── knowledge/
│   ├── chat-api.md
│   ├── video-api.md
│   └── feeds-api.md
├── inbound_phone_and_rag_example.py
└── outbound_phone_example.py

The RAG system will index all .md files on startup.
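
How turbopuffer.create_rag and gemini.create_file_search_store read the directory is not shown. As a rough illustration of the first stage of indexing, this sketch collects files with the configured extensions and splits them into paragraph-aligned chunks. The function names, the chunk size, and the chunking strategy are all assumptions, and embedding and upload are omitted.

```python
from pathlib import Path


def load_documents(knowledge_dir, extensions=(".md",)):
    """Return {relative path: text} for every matching file under knowledge_dir."""
    root = Path(knowledge_dir)
    return {
        path.relative_to(root).as_posix(): path.read_text(encoding="utf-8")
        for path in sorted(root.rglob("*"))
        if path.is_file() and path.suffix.lower() in extensions
    }


def chunk_text(text, max_chars=1000):
    """Group blank-line-separated paragraphs into chunks of about max_chars."""
    chunks, current = [], ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        if current and len(current) + len(paragraph) + 2 > max_chars:
            # Adding this paragraph would overflow, so close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks
```

Keeping paragraphs intact tends to give a retriever more coherent passages than cutting at fixed character offsets, though a single oversized paragraph still becomes one large chunk here.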

Next Steps

Explore the RAG Guide for advanced RAG techniques
Try the Simple Agent Example for a simpler voice agent
Read about Twilio Integration for more details
