This example shows how to build a real-time golf coaching AI using Vision Agents. The agent uses video processing to watch golf swings and provide feedback through voice conversation, combining YOLO pose detection with realtime LLMs.

What You’ll Learn

  • Using video processors to analyze real-time video
  • Integrating YOLO pose detection with realtime LLMs
  • Loading instructions from markdown files
  • Configuring frame rates for video processing
  • Switching between different realtime LLM providers

Features

  • Watches video of the user’s golf swing in real-time
  • Uses YOLO pose detection to analyze body position and movement
  • Processes video with an LLM (Gemini or OpenAI Realtime)
  • Provides voice feedback on swing technique
  • Runs on Stream’s low-latency edge network

Use Cases

This pattern can be applied to:
  • Sports coaching (golf, tennis, baseball)
  • Physical therapy and rehabilitation
  • Workout form coaching
  • Dance instruction
  • Any application requiring real-time pose or movement analysis

Prerequisites

Before running this example, you’ll need API keys for:
  • Gemini (for realtime LLM with vision)
  • Stream (for video/audio infrastructure)
  • Alternatively: OpenAI (if using OpenAI Realtime instead)

Setup

1. Navigate to the example directory

cd examples/02_golf_coach_example

2. Install dependencies

uv sync

3. Configure environment variables

Create a .env file with your API keys:
GEMINI_API_KEY=your_gemini_key
STREAM_API_KEY=your_stream_key
STREAM_API_SECRET=your_stream_secret

If using OpenAI instead of Gemini:
OPENAI_API_KEY=your_openai_key

4. Run the example

uv run golf_coach_example.py run
The agent will:
  1. Create a video call
  2. Open a demo UI in your browser
  3. Join the call and start watching
  4. Ask you to do a golf swing
  5. Analyze your swing and provide feedback

Complete Code

import logging

from dotenv import load_dotenv
from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import gemini, getstream, ultralytics

logger = logging.getLogger(__name__)

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="AI golf coach"),
        instructions="Read @golf_coach.md",
        llm=gemini.Realtime(fps=3),
        processors=[
            ultralytics.YOLOPoseProcessor(model_path="yolo26n-pose.pt")
        ],
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)

    async with agent.join(call):
        await agent.llm.simple_response(
            text="Say hi. After the user does their golf swing offer helpful feedback."
        )
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()

Code Walkthrough

Understanding Processors

Processors enable the agent to analyze video in real-time. The YOLOPoseProcessor detects human poses and body positions in each video frame:
processors=[
    ultralytics.YOLOPoseProcessor(model_path="yolo26n-pose.pt")
]
This information is automatically sent to the LLM so it can understand the user’s body movement during the golf swing.
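Pose detection produces per-frame keypoint coordinates (shoulders, elbows, hips, and so on). As a rough illustration of the kind of body-position reasoning this enables, here is a small sketch that derives a joint angle from three keypoints; the coordinates and keypoint choice are invented for the example and are not the processor's actual output format:

```python
import math

def joint_angle(a, b, c):
    """Angle at point b (in degrees) formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    return math.degrees(math.acos(dot / (n1 * n2)))

# Hypothetical (x, y) pixel coordinates for shoulder, elbow, wrist
shoulder, elbow, wrist = (320, 180), (360, 240), (330, 300)
print(round(joint_angle(shoulder, elbow, wrist)))  # elbow angle, ~120 degrees
```

An angle like this (lead-arm bend at the top of the backswing, for instance) is exactly the sort of signal the LLM can reference when critiquing a swing.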

Frame Rate Configuration

The fps=3 parameter controls how many frames per second the LLM processes:
llm=gemini.Realtime(fps=3)
  • Lower FPS (1-3): Less expensive, suitable for slower movements
  • Higher FPS (5-15): More detailed analysis, better for fast actions, more costly
For golf swings, 3 FPS provides a good balance.
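To see why 3 FPS is a reasonable middle ground, a back-of-envelope count of how many frames the LLM ingests at each rate (the 2-second swing duration is an assumption for illustration):

```python
# Frame volume for a ~2 s golf swing and a full 60 s session
SWING_SECONDS = 2
SESSION_SECONDS = 60

for fps in (1, 3, 10):
    print(f"fps={fps}: {fps * SWING_SECONDS} frames per swing, "
          f"{fps * SESSION_SECONDS} frames per minute")
```

At 1 FPS a fast swing may land between frames entirely; at 10 FPS you pay for roughly three times the frames of 3 FPS with little extra benefit for a movement this size.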

Loading Instructions from Files

Instead of inline instructions, this example loads coaching guidelines from a markdown file:
instructions="Read @golf_coach.md"
The golf_coach.md file contains:
  • Coaching personality and tone (Scottish accent, snarky)
  • Golf swing fundamentals to analyze
  • How to provide feedback
  • Common faults and fixes
This keeps your code clean and makes it easy to iterate on instructions.
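Under the hood, the @-file syntax amounts to reading the markdown file and using its contents as the system instructions. A plain-Python sketch of that equivalent (using a throwaway temp file so the snippet is self-contained):

```python
import tempfile
from pathlib import Path

# Write a stand-in golf_coach.md, then read it back as instructions.
# The framework's "Read @golf_coach.md" shortcut does this loading for you.
with tempfile.TemporaryDirectory() as d:
    md = Path(d) / "golf_coach.md"
    md.write_text("# Golf Swing Coaching Guide\nBe snarky.", encoding="utf-8")
    instructions = md.read_text(encoding="utf-8")

print(instructions.splitlines()[0])
```

Because the instructions live in a separate file, you can tweak the coaching persona without touching the agent code.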

How It Works

  1. Video Capture: The user’s camera feeds video to the agent
  2. Pose Detection: YOLO analyzes each frame and extracts body position data
  3. LLM Processing: The realtime LLM receives both the video and pose data
  4. Analysis: The LLM watches the swing and evaluates technique based on the coaching instructions
  5. Feedback: The agent speaks feedback using the personality defined in the instructions
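The five steps above can be sketched as a per-frame pipeline. The stage functions below are stubs invented for illustration; in the real framework, the Agent wires pose detection, the LLM, and voice output together for you:

```python
def detect_pose(frame):
    # Stand-in for YOLO pose detection: return named keypoints per frame.
    return {"lead_shoulder": (320, 180), "lead_hip": (310, 300)}

def analyze(frame, pose):
    # Stand-in for the realtime LLM: it receives both the raw frame
    # and the extracted pose data, and evaluates technique.
    return "Get a fuller shoulder turn." if pose else None

def speak(text):
    # Stand-in for voice output using the configured personality.
    return f"coach says: {text}"

frame = object()                # 1. video capture delivers a frame
pose = detect_pose(frame)       # 2. pose detection extracts body positions
result = analyze(frame, pose)   # 3-4. LLM processing and analysis
print(speak(result))            # 5. spoken feedback
```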

Customization

Adjust Frame Rate

llm = gemini.Realtime(fps=1)   # Lower FPS = less expensive
llm = gemini.Realtime(fps=10)  # Higher FPS = more detailed analysis

Switch to OpenAI

Simply change the LLM provider:
from vision_agents.plugins import openai

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI golf coach"),
    instructions="Read @golf_coach.md",
    llm=openai.Realtime(fps=3),  # OpenAI instead of Gemini
    processors=[
        ultralytics.YOLOPoseProcessor(model_path="yolo26n-pose.pt")
    ],
)
Both providers support realtime video input and work with the YOLO pose processor.

Modify the Coaching Style

Edit the golf_coach.md file to change:
  • The agent’s personality and voice
  • Coaching focus areas
  • Level of detail in feedback
  • Tone (encouraging vs. critical)

Use Different YOLO Models

# General object detection
ultralytics.YOLOProcessor(model_path="yolo11n.pt")

# Pose detection (current)
ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt")

# Segmentation
ultralytics.YOLOSegmentationProcessor(model_path="yolo11n-seg.pt")

Example Instructions File

Here’s a snippet from golf_coach.md:
You are a voice golf coach. You will watch the user's swing and offer feedback. 
The video clarifies the body position using Yolo's pose analysis, so you'll see their exact movement. 
Speak with a female voice and a heavy Scottish accent. Be a little mean and snarky. 
Do not give feedback if you are not sure or do not see a swing.

# Golf Swing Coaching Guide

## 1. Introduction  
A golf coach's primary responsibility when teaching the swing is to build a foundation 
of solid fundamentals while recognizing and correcting common faults...

## 2. Grip  
The grip is the player's only connection to the club...

Output Example

When you run this example and perform a golf swing, the agent might say:
“Och, that backswing was a wee bit rushed! Slow it down and get a full shoulder turn. Your weight’s all wrong too - shift it to your lead side at impact, not after!”

Performance Notes

  • YOLO pose detection runs efficiently on most hardware
  • Gemini Live adds ~100-200ms latency for video processing
  • OpenAI Realtime is typically faster but may have different accuracy
  • Higher FPS increases both cost and latency
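A quick latency budget shows how frame rate and model latency interact. At 3 FPS a new frame arrives roughly every 333 ms, so the quoted ~100-200 ms of model-side video latency fits comfortably within one frame interval; at 10 FPS the interval shrinks to 100 ms and the high end no longer does:

```python
# Frame interval at each configured frame rate
for fps in (3, 10):
    interval_ms = 1000 / fps
    print(f"fps={fps}: one frame every {interval_ms:.0f} ms")
```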
