What You’ll Learn
- Using video processors to analyze real-time video
- Integrating YOLO pose detection with realtime LLMs
- Loading instructions from markdown files
- Configuring frame rates for video processing
- Switching between different realtime LLM providers
Features
- Watches video of the user’s golf swing in real-time
- Uses YOLO pose detection to analyze body position and movement
- Processes video with an LLM (Gemini or OpenAI Realtime)
- Provides voice feedback on swing technique
- Runs on Stream’s low-latency edge network
Use Cases
This pattern can be applied to:
- Sports coaching (golf, tennis, baseball)
- Physical therapy and rehabilitation
- Workout form coaching
- Dance instruction
- Any application requiring real-time pose or movement analysis
Prerequisites
Before running this example, you’ll need API keys for:
- Gemini (for realtime LLM with vision)
- Stream (for video/audio infrastructure)
- Alternatively: OpenAI (if using OpenAI Realtime instead)
Setup
Configure environment variables
Create a `.env` file with your API keys:
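A minimal `.env` might look like the following. The exact variable names are assumptions based on common conventions for these SDKs; check the example’s repository for the names it actually reads:

```
# Stream credentials (video/audio edge network)
STREAM_API_KEY=your_stream_api_key
STREAM_API_SECRET=your_stream_api_secret

# Gemini (realtime LLM with vision)
GEMINI_API_KEY=your_gemini_api_key

# Only needed if using OpenAI Realtime instead of Gemini
# OPENAI_API_KEY=your_openai_api_key
```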
If using OpenAI instead of Gemini, add your OpenAI key to the same file.
Complete Code
Code Walkthrough
Understanding Processors
Processors enable the agent to analyze video in real-time. The `YOLOPoseProcessor` detects human poses and body positions in each video frame:
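To make the data flow concrete, here is a self-contained sketch of the kind of per-frame output a pose processor produces. The class and field names below are hypothetical stand-ins, not the SDK’s actual `YOLOPoseProcessor` API:

```python
from dataclasses import dataclass


@dataclass
class Keypoint:
    """One detected body landmark (coordinates normalized to 0-1)."""
    name: str
    x: float
    y: float
    confidence: float


@dataclass
class PoseResult:
    """Pose data extracted from a single video frame."""
    keypoints: list

    def get(self, name):
        # Look up a keypoint by name; None if the detector missed it
        return next((k for k in self.keypoints if k.name == name), None)


# A pose detector would emit keypoints like these for every frame it processes
pose = PoseResult(keypoints=[
    Keypoint("left_shoulder", 0.42, 0.31, 0.97),
    Keypoint("right_shoulder", 0.58, 0.30, 0.95),
])
print(pose.get("left_shoulder").confidence)  # 0.97
```

The agent forwards this structured pose data to the LLM alongside the raw frames, so the model can reason about joint positions rather than pixels alone.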
Frame Rate Configuration
The `fps=3` parameter controls how many frames per second the LLM processes:
- Lower FPS (1-3): Less expensive, suitable for slower movements
- Higher FPS (5-15): More detailed analysis, better for fast actions, more costly
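The cost tradeoff is roughly linear in the sampling rate. A quick back-of-envelope calculation (the numbers are illustrative, not measured):

```python
def frames_sent(fps: int, clip_seconds: int) -> int:
    """How many frames the LLM sees for a clip at a given sampling rate."""
    return fps * clip_seconds


# A golf swing takes roughly 3 seconds:
print(frames_sent(3, 3))   # 9 frames at fps=3
print(frames_sent(15, 3))  # 45 frames at fps=15 -- ~5x the tokens and cost
```

For a golf swing, `fps=3` is usually enough to catch the key positions (address, top of backswing, impact, finish) without paying for redundant frames.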
Loading Instructions from Files
Instead of inline instructions, this example loads coaching guidelines from a markdown file. The `golf_coach.md` file contains:
- Coaching personality and tone (Scottish accent, snarky)
- Golf swing fundamentals to analyze
- How to provide feedback
- Common faults and fixes
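Loading the file is plain Python; a minimal sketch, assuming the agent accepts its instructions as a string (the `instructions=` parameter name in the commented usage is hypothetical):

```python
from pathlib import Path


def load_instructions(path: str) -> str:
    """Read coaching guidelines from a markdown file."""
    return Path(path).read_text(encoding="utf-8")


# Hypothetical usage -- pass the file contents where the agent expects
# its system instructions:
# agent = Agent(..., instructions=load_instructions("golf_coach.md"))
```

Keeping instructions in a separate file lets non-developers iterate on the coaching style without touching the agent code.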
How It Works
- Video Capture: The user’s camera feeds video to the agent
- Pose Detection: YOLO analyzes each frame and extracts body position data
- LLM Processing: The realtime LLM receives both the video and pose data
- Analysis: The LLM watches the swing and evaluates technique based on the coaching instructions
- Feedback: The agent speaks feedback using the personality defined in the instructions
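The five steps above can be sketched as a simple pipeline. Every component here is a toy stand-in for the real SDK pieces, kept self-contained to show the data flow only:

```python
def detect_pose(frame):
    # Stand-in for YOLO pose detection: returns body-position data per frame
    return {"frame": frame, "shoulder_turn_deg": 85}


def llm_analyze(frame, pose, instructions):
    # Stand-in for the realtime LLM: receives both the frame and pose data
    # and evaluates it against the coaching instructions
    if pose["shoulder_turn_deg"] < 90:
        return "Get a fuller shoulder turn!"
    return "Nice turn."


def speak(text):
    # Stand-in for text-to-speech output
    return f"[voice] {text}"


instructions = "Coach golf swings. Be snarky."
for frame in ["frame_0", "frame_1"]:       # step 1: video capture
    pose = detect_pose(frame)              # step 2: pose detection
    feedback = llm_analyze(frame, pose, instructions)  # steps 3-4
    print(speak(feedback))                 # step 5: spoken feedback
```

In the real system, steps 1 and 5 run over Stream’s edge network and step 3 streams frames to Gemini or OpenAI rather than calling a local function.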
Customization
Adjust Frame Rate
Switch to OpenAI
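The swap is a one-line change to the `llm=` argument when constructing the agent. The classes below are illustrative stand-ins, not the SDK’s real provider names:

```python
# Stand-in provider classes; the real SDK exposes its own Gemini/OpenAI
# realtime LLM types with equivalent interfaces.
class GeminiRealtime:
    name = "gemini"


class OpenAIRealtime:
    name = "openai"


def build_agent(llm):
    # Stand-in for the example's agent constructor; processors and
    # everything else stay exactly the same
    return {"llm": llm.name, "processors": ["yolo_pose"]}


agent = build_agent(OpenAIRealtime())  # was: build_agent(GeminiRealtime())
print(agent["llm"])  # openai
```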
Simply change the LLM provider; the processors and the rest of the agent configuration stay the same.
Modify the Coaching Style
Edit the `golf_coach.md` file to change:
- The agent’s personality and voice
- Coaching focus areas
- Level of detail in feedback
- Tone (encouraging vs. critical)
Use Different YOLO Models
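Ultralytics publishes pose checkpoints in several sizes that trade inference speed for accuracy. The checkpoint filenames below follow Ultralytics’ release naming; how the model name is passed to the processor is an assumption:

```python
# Ultralytics pose checkpoints, smallest/fastest to largest/most accurate
POSE_MODELS = {
    "yolo11n-pose.pt": "nano: fastest, lowest accuracy",
    "yolo11s-pose.pt": "small: balanced speed and accuracy",
    "yolo11m-pose.pt": "medium: slower, more accurate keypoints",
}

# Hypothetical usage -- pass the checkpoint name to the processor:
# YOLOPoseProcessor(model_path="yolo11s-pose.pt")
```

For real-time coaching the nano or small model is usually the right choice; larger models add per-frame latency that compounds with the LLM round trip.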
Example Instructions File
Here’s a snippet from `golf_coach.md`:
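An illustrative snippet, reconstructed from the file contents described above (not the actual file verbatim):

```markdown
# Golf Coach

You are a snarky Scottish golf coach. Keep feedback short and
conversational -- it will be spoken aloud.

## Fundamentals to analyze
- Grip, stance, and posture at address
- Shoulder turn and tempo in the backswing
- Weight shift to the lead side at impact

## Common faults and fixes
- Rushed backswing: tell the player to slow down and complete the turn
- Hanging back at impact: cue a weight shift to the lead side
```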
Output Example
When you run this example and perform a golf swing, the agent might say:

“Och, that backswing was a wee bit rushed! Slow it down and get a full shoulder turn. Your weight’s all wrong too - shift it to your lead side at impact, not after!”
Performance Notes
- YOLO pose detection runs efficiently on most hardware
- Gemini Live adds ~100-200ms latency for video processing
- OpenAI Realtime is typically faster but may have different accuracy
- Higher FPS increases both cost and latency
Next Steps
- Try the Football Commentator Example for event-driven video analysis
- Explore the Security Camera Example for object tracking and detection
- Read the Building Video AI Apps guide