The naive approach to extracting frames — sample one every N seconds — fails at both extremes. A static screencast with a single talking head generates hundreds of near-identical frames, flooding the model’s context window with redundant images. A fast-cut music video or film trailer sees whole scenes pass between samples, leaving the model blind to significant visual changes.
claude-real-video solves both problems with a single ffmpeg pass that combines scene-change detection with a density floor, then removes any remaining duplicates before the LLM ever sees a frame.
The ffmpeg filter
Frame extraction runs in `extract_frames()` via a single ffmpeg `select` filter expression, `gt(scene,{scene})+not(mod(n,{every_n}))`, whose two conditions are joined by `+` (logical OR):

- `gt(scene,{scene})` fires on any frame where ffmpeg’s built-in scene-change score exceeds the threshold. The score is a normalised value between 0 and 1 representing how much the current frame differs from the previous one. A lower threshold means more sensitive detection and more frames.
- `not(mod(n,{every_n}))` fires on every Nth frame number (N being `every_n`), providing the density floor. `every_n` is computed as `max(1, round(fps × fps_floor))`, so at 25 fps with `--fps-floor 1.0` it keeps one frame per second regardless of how static the footage is. The `max(1, …)` guard stops `every_n` from rounding down to 0 at very low frame rates or very small `--fps-floor` values; it falls back to 1 instead, so every frame passes the floor check.
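As a minimal Python sketch, the floor interval and the filter expression can be derived from the two parameters as follows. `floor_interval` and `select_expr` are hypothetical helper names used for illustration, not functions exported by claude-real-video.

```python
def floor_interval(fps: float, fps_floor: float) -> int:
    """Frames between density-floor picks: max(1, round(fps * fps_floor))."""
    # The max(1, ...) guard stops every_n from rounding down to 0, which would
    # mean a modulo by zero; at worst every frame passes the floor check.
    return max(1, round(fps * fps_floor))


def select_expr(scene: float, fps: float, fps_floor: float) -> str:
    """Build the select expression: scene change OR every Nth frame."""
    every_n = floor_interval(fps, fps_floor)
    # gt() and not() evaluate to 0 or 1, so the sum is non-zero (frame kept)
    # whenever either condition holds, which is how "+" acts as logical OR.
    return f"gt(scene,{scene})+not(mod(n,{every_n}))"


print(select_expr(0.30, 25, 1.0))  # gt(scene,0.3)+not(mod(n,25))
```

Python’s `round()` rounds halves to even, so at 25 fps `--fps-floor 0.5` yields `every_n = 12` rather than 13; whether the tool rounds exactly this way is an assumption.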
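Because selection happens in one pass and the output uses `-vsync vfr`, all selected frames are written in strict chronological order, and ffmpeg passes them through as-is rather than duplicating frames to fill a constant output rate. That ordering is critical for the deduplication stage, which compares true temporal neighbours.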
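This section does not say how duplicates are detected. The sketch below assumes, purely for illustration, that each frame has already been reduced to an integer perceptual hash (a hypothetical input) and drops any frame within `max_distance` bits of the most recently kept frame. It only behaves sensibly because the input is chronological: each frame is judged against what came immediately before it in time.

```python
def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def dedup_neighbours(hashes: list[int], max_distance: int = 5) -> list[int]:
    """Return indices of frames to keep, in order.

    A frame is dropped when its hash is within max_distance bits of the
    most recently kept frame. Assumes the hashes are in chronological order.
    """
    kept: list[int] = []
    for i, h in enumerate(hashes):
        if not kept or hamming(h, hashes[kept[-1]]) > max_distance:
            kept.append(i)
    return kept


# Two static runs separated by a cut: only the first frame of each run survives.
print(dedup_neighbours([0b0000, 0b0001, 0b11110000, 0b11110001], max_distance=1))  # [0, 2]
```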
The command runs against the fetched `source.mp4` and writes the selected frames to `{out_dir}/frames/` as `raw_00001.jpg`, `raw_00002.jpg`, …
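A hedged reconstruction of an equivalent invocation, built as an argument list for `subprocess.run`, might look like the following. `build_ffmpeg_cmd` is a hypothetical name, and the real `extract_frames()` may pass additional options (output quality, logging level, and so on) that are not shown here.

```python
from pathlib import Path


def build_ffmpeg_cmd(source: Path, out_dir: Path, scene: float, every_n: int) -> list[str]:
    """Hypothetical argv for one-pass scene + density-floor frame extraction."""
    expr = f"gt(scene,{scene})+not(mod(n,{every_n}))"
    return [
        "ffmpeg",
        "-i", str(source),
        # Single quotes stop ffmpeg's filtergraph parser from treating the
        # commas inside the expression as filter separators.
        "-vf", f"select='{expr}'",
        # Variable frame rate: selected frames pass through as-is instead of
        # being duplicated to fill a constant output rate.
        "-vsync", "vfr",
        # image2 pattern: raw_00001.jpg, raw_00002.jpg, ... (numbering starts at 1).
        str(out_dir / "frames" / "raw_%05d.jpg"),
    ]


cmd = build_ffmpeg_cmd(Path("source.mp4"), Path("out"), scene=0.30, every_n=25)
print(cmd)
```

Passing the list directly to `subprocess.run(cmd, check=True)` avoids shell quoting altogether. ffmpeg does not create missing output directories, so `{out_dir}/frames/` has to exist before the command runs.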
Tuning the parameters
| Flag | Default | Range / Type | Effect |
|---|---|---|---|
| `--scene` | `0.30` | 0.0–1.0 (float) | Scene-change sensitivity. Lower values trigger more frames: `0.10` catches subtle cuts; `0.50` fires only on hard scene changes. |
| `--fps-floor` | `1.0` | seconds (float) | Minimum guaranteed density: at least one frame every N seconds. `0.5` doubles the floor density; `5.0` guarantees only one frame every 5 seconds, for very slow content. |
| `--max-frames` | `150` | integer | Hard cap on the final frame count after deduplication. If more frames survive, they are thinned uniformly down to this limit. |
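The table does not specify how uniform thinning chooses its survivors. One plausible scheme, sketched here as an assumption, keeps evenly spaced indices that always include the first and last frame; `thin_uniform` is a hypothetical helper, not part of the tool’s documented API.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def thin_uniform(frames: Sequence[T], max_frames: int) -> list[T]:
    """Keep at most max_frames items, evenly spaced, including both ends."""
    n = len(frames)
    if max_frames <= 0:
        return []
    if n <= max_frames:
        return list(frames)
    if max_frames == 1:
        return [frames[0]]
    # Integer floor division gives strictly increasing indices from 0 to n - 1,
    # because the spacing (n - 1) / (max_frames - 1) is greater than 1 here.
    return [frames[i * (n - 1) // (max_frames - 1)] for i in range(max_frames)]


print(thin_uniform(list(range(10)), 4))  # [0, 3, 6, 9]
```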
- Screencasts and slide decks: the picture barely changes between scene triggers. Raise `--scene` (e.g. `0.50`) to reduce noise from marginal changes, and trust the floor to catch genuine slide transitions.
- Fast-cut reels, trailers, action footage: frequent hard cuts, sometimes several per second. Lower `--fps-floor` (e.g. `0.25`) and `--scene` (e.g. `0.15`) so that every cut is captured.
- Long lectures or interviews: a single talking head with rare scene changes. The default `--fps-floor 1.0` produces one frame per second, so raise `--fps-floor` (e.g. `3.0` or `5.0`) and let deduplication collapse the static runs.
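A quick back-of-the-envelope check shows why raising the floor helps with long, static footage. This sketch counts the frames the density floor alone selects (before scene triggers, deduplication, and the `--max-frames` cap), reusing the `every_n` formula from the filter section; `floor_frame_count` is a hypothetical helper.

```python
def floor_frame_count(duration_s: float, fps: float, fps_floor: float) -> int:
    """Frames picked by not(mod(n, every_n)) alone, for n = 0 .. total - 1."""
    every_n = max(1, round(fps * fps_floor))
    total = int(duration_s * fps)  # number of decoded frames
    return (total + every_n - 1) // every_n  # ceil(total / every_n)


# A 60-minute lecture at 25 fps under different --fps-floor values.
for floor in (1.0, 3.0, 5.0):
    print(floor, floor_frame_count(3600, 25, floor))
```

For that lecture the floor contributes 3600 candidates at the default `1.0`, 1200 at `3.0`, and 720 at `5.0`. All of these exceed the default `--max-frames 150`, so deduplication and uniform thinning still do most of the reduction; a higher floor simply hands them far fewer near-identical frames to begin with.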