This example shows how to build a real-time golf coaching AI using Vision Agents. The agent uses video processing to watch golf swings and provide feedback through voice conversation, combining YOLO pose detection with realtime LLMs.

What You’ll Learn

  • Using video processors to analyze real-time video
  • Integrating YOLO pose detection with realtime LLMs
  • Loading instructions from markdown files
  • Configuring frame rates for video processing
  • Switching between different realtime LLM providers

Features

  • Watches video of the user’s golf swing in real-time
  • Uses YOLO pose detection to analyze body position and movement
  • Processes video with an LLM (Gemini or OpenAI Realtime)
  • Provides voice feedback on swing technique
  • Runs on Stream’s low-latency edge network

Use Cases

This pattern can be applied to:
  • Sports coaching (golf, tennis, baseball)
  • Physical therapy and rehabilitation
  • Workout form coaching
  • Dance instruction
  • Any application requiring real-time pose or movement analysis

Prerequisites

Before running this example, you’ll need API keys for:
  • Gemini (for realtime LLM with vision)
  • Stream (for video/audio infrastructure)
  • Alternatively: OpenAI (if using OpenAI Realtime instead)

Setup

1. Navigate to the example directory

cd examples/02_golf_coach_example

2. Install dependencies

uv sync

3. Configure environment variables

Create a .env file with your API keys:
GEMINI_API_KEY=your_gemini_key
STREAM_API_KEY=your_stream_key
STREAM_API_SECRET=your_stream_secret

If using OpenAI instead of Gemini:
OPENAI_API_KEY=your_openai_key

4. Run the example

uv run golf_coach_example.py run
The agent will:
  1. Create a video call
  2. Open a demo UI in your browser
  3. Join the call and start watching
  4. Ask you to do a golf swing
  5. Analyze your swing and provide feedback

Complete Code

import logging

from dotenv import load_dotenv
from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import gemini, getstream, ultralytics

logger = logging.getLogger(__name__)

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="AI golf coach"),
        instructions="Read @golf_coach.md",
        llm=gemini.Realtime(fps=3),
        processors=[
            ultralytics.YOLOPoseProcessor(model_path="yolo26n-pose.pt")
        ],
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)

    async with agent.join(call):
        await agent.llm.simple_response(
            text="Say hi. After the user does their golf swing offer helpful feedback."
        )
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()

Code Walkthrough

Understanding Processors

Processors enable the agent to analyze video in real-time. The YOLOPoseProcessor detects human poses and body positions in each video frame:
processors=[
    ultralytics.YOLOPoseProcessor(model_path="yolo26n-pose.pt")
]
This information is automatically sent to the LLM so it can understand the user’s body movement during the golf swing.
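Pose detection produces per-frame keypoint coordinates (shoulders, elbows, hips, and so on). As a rough illustration of the kind of body-position reasoning this enables, here is a small sketch that derives a joint angle from three keypoints; the coordinates and keypoint choice are invented for the example and are not the processor's actual output format:

```python
import math

def joint_angle(a, b, c):
    """Angle at point b (in degrees) formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    return math.degrees(math.acos(dot / (n1 * n2)))

# Hypothetical (x, y) pixel coordinates for shoulder, elbow, wrist
shoulder, elbow, wrist = (320, 180), (360, 240), (330, 300)
print(round(joint_angle(shoulder, elbow, wrist)))  # elbow angle, ~120 degrees
```

An angle like this (lead-arm bend at the top of the backswing, for instance) is exactly the sort of signal the LLM can reference when critiquing a swing.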

Frame Rate Configuration

The fps=3 parameter controls how many frames per second the LLM processes:
llm=gemini.Realtime(fps=3)
  • Lower FPS (1-3): Less expensive, suitable for slower movements
  • Higher FPS (5-15): More detailed analysis, better for fast actions, more costly
For golf swings, 3 FPS provides a good balance.
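To see why 3 FPS is a reasonable middle ground, a back-of-envelope count of how many frames the LLM ingests at each rate (the 2-second swing duration is an assumption for illustration):

```python
# Frame volume for a ~2 s golf swing and a full 60 s session
SWING_SECONDS = 2
SESSION_SECONDS = 60

for fps in (1, 3, 10):
    print(f"fps={fps}: {fps * SWING_SECONDS} frames per swing, "
          f"{fps * SESSION_SECONDS} frames per minute")
```

At 1 FPS a fast swing may land between frames entirely; at 10 FPS you pay for roughly three times the frames of 3 FPS with little extra benefit for a movement this size.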

Loading Instructions from Files

Instead of inline instructions, this example loads coaching guidelines from a markdown file:
instructions="Read @golf_coach.md"
The golf_coach.md file contains:
  • Coaching personality and tone (Scottish accent, snarky)
  • Golf swing fundamentals to analyze
  • How to provide feedback
  • Common faults and fixes
This keeps your code clean and makes it easy to iterate on instructions.
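Under the hood, the @-file syntax amounts to reading the markdown file and using its contents as the system instructions. A plain-Python sketch of that equivalent (using a throwaway temp file so the snippet is self-contained):

```python
import tempfile
from pathlib import Path

# Write a stand-in golf_coach.md, then read it back as instructions.
# The framework's "Read @golf_coach.md" shortcut does this loading for you.
with tempfile.TemporaryDirectory() as d:
    md = Path(d) / "golf_coach.md"
    md.write_text("# Golf Swing Coaching Guide\nBe snarky.", encoding="utf-8")
    instructions = md.read_text(encoding="utf-8")

print(instructions.splitlines()[0])
```

Because the instructions live in a separate file, you can tweak the coaching persona without touching the agent code.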

How It Works

  1. Video Capture: The user’s camera feeds video to the agent
  2. Pose Detection: YOLO analyzes each frame and extracts body position data
  3. LLM Processing: The realtime LLM receives both the video and pose data
  4. Analysis: The LLM watches the swing and evaluates technique based on the coaching instructions
  5. Feedback: The agent speaks feedback using the personality defined in the instructions
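The five steps above can be sketched as a per-frame pipeline. The stage functions below are stubs invented for illustration; in the real framework, the Agent wires pose detection, the LLM, and voice output together for you:

```python
def detect_pose(frame):
    # Stand-in for YOLO pose detection: return named keypoints per frame.
    return {"lead_shoulder": (320, 180), "lead_hip": (310, 300)}

def analyze(frame, pose):
    # Stand-in for the realtime LLM: it receives both the raw frame
    # and the extracted pose data, and evaluates technique.
    return "Get a fuller shoulder turn." if pose else None

def speak(text):
    # Stand-in for voice output using the configured personality.
    return f"coach says: {text}"

frame = object()                # 1. video capture delivers a frame
pose = detect_pose(frame)       # 2. pose detection extracts body positions
result = analyze(frame, pose)   # 3-4. LLM processing and analysis
print(speak(result))            # 5. spoken feedback
```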

Customization

Adjust Frame Rate

llm = gemini.Realtime(fps=1)   # Lower FPS = less expensive
llm = gemini.Realtime(fps=10)  # Higher FPS = more detailed analysis

Switch to OpenAI

Simply change the LLM provider:
from vision_agents.plugins import openai

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI golf coach"),
    instructions="Read @golf_coach.md",
    llm=openai.Realtime(fps=3),  # OpenAI instead of Gemini
    processors=[
        ultralytics.YOLOPoseProcessor(model_path="yolo26n-pose.pt")
    ],
)
Both providers support realtime video input and work with the YOLO pose processor.

Modify the Coaching Style

Edit the golf_coach.md file to change:
  • The agent’s personality and voice
  • Coaching focus areas
  • Level of detail in feedback
  • Tone (encouraging vs. critical)

Use Different YOLO Models

# General object detection
ultralytics.YOLOProcessor(model_path="yolo11n.pt")

# Pose detection (current)
ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt")

# Segmentation
ultralytics.YOLOSegmentationProcessor(model_path="yolo11n-seg.pt")

Example Instructions File

Here’s a snippet from golf_coach.md:
You are a voice golf coach. You will watch the user's swing and offer feedback. 
The video clarifies the body position using Yolo's pose analysis, so you'll see their exact movement. 
Speak with a female voice and a heavy Scottish accent. Be a little mean and snarky. 
Do not give feedback if you are not sure or do not see a swing.

# Golf Swing Coaching Guide

## 1. Introduction  
A golf coach's primary responsibility when teaching the swing is to build a foundation 
of solid fundamentals while recognizing and correcting common faults...

## 2. Grip  
The grip is the player's only connection to the club...

Output Example

When you run this example and perform a golf swing, the agent might say:
“Och, that backswing was a wee bit rushed! Slow it down and get a full shoulder turn. Your weight’s all wrong too - shift it to your lead side at impact, not after!”

Performance Notes

  • YOLO pose detection runs efficiently on most hardware
  • Gemini Live adds ~100-200ms latency for video processing
  • OpenAI Realtime is typically faster but may have different accuracy
  • Higher FPS increases both cost and latency
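A quick latency budget shows how frame rate and model latency interact. At 3 FPS a new frame arrives roughly every 333 ms, so the quoted ~100-200 ms of model-side video latency fits comfortably within one frame interval; at 10 FPS the interval shrinks to 100 ms and the high end no longer does:

```python
# Frame interval at each configured frame rate
for fps in (3, 10):
    interval_ms = 1000 / fps
    print(f"fps={fps}: one frame every {interval_ms:.0f} ms")
```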
