Frame extraction can produce large numbers of near-identical images. A ten-minute screencast with one static slide will generate hundreds of frames that are pixel-for-pixel identical. An A-B-A edit (cut to a reaction shot and back) will re-introduce a shot the model has already processed. Without deduplication, the model’s context window fills up with redundant images, wasting tokens and diluting the signal.
claude-real-video removes near-duplicates using a sliding-window pixel-difference algorithm before any frame reaches the LLM.
How it works
The deduplication algorithm is implemented in dedup_frames() in core.py:
Step 1 — Signature generation
Each candidate frame (raw_*.jpg, in chronological order) is opened with Pillow, converted to RGB, and downscaled to 16×16 pixels. The resulting 256 RGB tuples form the frame’s signature. RGB is used deliberately rather than grayscale or a perceptual hash:
- Perceptual hashes normalise for brightness and can be blind on flat-colour frames (a pure red background and a pure green background may produce identical hashes).
- Grayscale comparators miss equal-luma hue changes — a red-to-green cut where both colours have similar brightness looks like no change at all.
- Per-pixel RGB difference catches both cases correctly.
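The signature step can be sketched as follows, assuming Pillow is installed. The helper name frame_signature, the synthetic test colours, and the use of Pillow's default resampling filter are illustrative assumptions, not taken from core.py. The sketch also shows the equal-luma case from the list above: two flat colours that differ strongly in RGB but are almost indistinguishable after a grayscale conversion.

```python
from PIL import Image

SIGNATURE_SIZE = (16, 16)  # 16x16 thumbnail, giving 256 pixels per signature


def frame_signature(image, mode="RGB"):
    """Return the thumbnail's pixels as a flat list of tuples (hypothetical helper)."""
    thumb = image.convert(mode).resize(SIGNATURE_SIZE)
    raw = thumb.tobytes()
    channels = len(thumb.getbands())  # 3 for RGB, 1 for L (grayscale)
    return [tuple(raw[i:i + channels]) for i in range(0, len(raw), channels)]


# Two synthetic flat-colour frames chosen to have nearly equal luma.
red = Image.new("RGB", (320, 180), (200, 0, 0))
green = Image.new("RGB", (320, 180), (0, 102, 0))

rgb_red, rgb_green = frame_signature(red), frame_signature(green)
gray_red, gray_green = frame_signature(red, "L"), frame_signature(green, "L")

# Largest per-pixel gap between the two frames under each representation.
rgb_gap = max(
    max(abs(a - b) for a, b in zip(p, q)) for p, q in zip(rgb_red, rgb_green)
)
gray_gap = max(abs(p[0] - q[0]) for p, q in zip(gray_red, gray_green))

print(len(rgb_red))       # number of tuples in one signature
print(rgb_gap, gray_gap)  # the RGB gap is large, the grayscale gap is close to zero
```

In real use the image would come from Image.open() on a raw_*.jpg file rather than Image.new().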
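Step 2 — Sliding-window comparison
Each candidate's signature is compared against the signatures of the last window kept frames (default 4). The comparison function computes the max channel difference per pixel and reports the share of pixels that changed as a percentage.
Step 3 — Threshold decision
A frame is kept only if it differs from every frame in the window by more than threshold% (default 8). If any window frame is within threshold%, the frame is considered a near-duplicate and dropped.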
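A minimal sketch of the comparison and the threshold check, in plain Python. The function names and PIXEL_TOLERANCE (how large a channel difference must be before a pixel counts as changed) are assumptions, since this page does not state them. The sketch treats a frame as a near-duplicate when it is within the threshold of any frame in the window, which is the reading consistent with the A-B-A behaviour described in Step 4.

```python
PIXEL_TOLERANCE = 32  # hypothetical per-channel cutoff; the real value is not given here


def diff_percent(sig_a, sig_b, tolerance=PIXEL_TOLERANCE):
    """Percentage of pixels whose largest channel difference exceeds the tolerance."""
    changed = 0
    for pixel_a, pixel_b in zip(sig_a, sig_b):
        max_channel_diff = max(abs(a - b) for a, b in zip(pixel_a, pixel_b))
        if max_channel_diff > tolerance:
            changed += 1
    return 100.0 * changed / len(sig_a)


def is_near_duplicate(sig, window_sigs, threshold=8.0):
    """True if the frame is within threshold% of any frame in the window."""
    return any(diff_percent(sig, kept) <= threshold for kept in window_sigs)


# Tiny synthetic signatures, 256 pixels each.
slide = [(255, 255, 255)] * 256
slide_with_cursor = [(0, 0, 0)] * 10 + [(255, 255, 255)] * 246  # 10 of 256 pixels differ
new_slide = [(0, 0, 0)] * 128 + [(255, 255, 255)] * 128         # half the pixels differ

print(diff_percent(slide, slide_with_cursor))         # 3.90625
print(is_near_duplicate(slide_with_cursor, [slide]))  # True: below the 8% threshold
print(is_near_duplicate(new_slide, [slide]))          # False: 50% of pixels changed
```

An empty window never matches, so the first frame of a video is always kept.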
Step 4 — Window prevents A-B-A recurrence
Because the window holds the last N kept frames (not just the immediately preceding frame), an A-B-A cutaway is correctly identified. After shot A is kept and the video cuts away to shot B and back, the second appearance of A is compared against the copy of A still held in the window and is dropped. Each distinct visual reaches the model only once, provided it recurs while its earlier appearance is still among the last N kept frames.
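The effect of the window on an A-B-A edit can be illustrated with a small simulation. This is a sketch under assumptions, not the code in core.py: the function names, the per-channel tolerance of 32, and the flat-colour signatures are all made up for the example. Only kept frames enter the window, as described above.

```python
from collections import deque


def diff_percent(sig_a, sig_b, tolerance=32):
    """Percentage of pixels whose max channel difference exceeds a hypothetical tolerance."""
    changed = sum(
        1 for p, q in zip(sig_a, sig_b)
        if max(abs(a - b) for a, b in zip(p, q)) > tolerance
    )
    return 100.0 * changed / len(sig_a)


def dedup(signatures, threshold=8.0, window=4):
    """Return the indices of frames that survive sliding-window deduplication."""
    recent = deque(maxlen=window)  # signatures of the last `window` kept frames
    kept = []
    for index, sig in enumerate(signatures):
        if any(diff_percent(sig, prev) <= threshold for prev in recent):
            continue  # near-duplicate of something still in the window
        kept.append(index)
        recent.append(sig)
    return kept


shot_a = [(200, 30, 30)] * 256
shot_b = [(30, 30, 200)] * 256
timeline = [shot_a, shot_a, shot_b, shot_b, shot_a]  # A-B-A edit

print(dedup(timeline, window=4))  # [0, 2]: the return to A is dropped
print(dedup(timeline, window=1))  # [0, 2, 4]: consecutive-only comparison keeps A twice
```

With window=1 the returning shot is compared only against B, so it looks new and is kept again.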
After deduplication: the frame cap
If the number of surviving frames exceeds --max-frames (default 150), the list is uniformly thinned so the final set stays spread across the entire video timeline. Every step-th survivor is retained; the rest are removed. This preserves temporal coverage (the first and last frames of the video are always represented) rather than simply truncating the tail.
Survivors are then renamed frame_001.jpg, frame_002.jpg, … in chronological order.
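A rough sketch of stride-based thinning and the final renaming. The step formula is not shown on this page, so ceiling division is assumed here; the function names and the raw_NNNN.jpg input names are also illustrative.

```python
import math


def thin_to_cap(frames, max_frames=150):
    """Keep every step-th frame so that at most max_frames remain."""
    if len(frames) <= max_frames:
        return list(frames)
    step = math.ceil(len(frames) / max_frames)  # assumed stride formula
    return list(frames[::step])


def output_names(count):
    """Sequential output names in chronological order."""
    return [f"frame_{number:03d}.jpg" for number in range(1, count + 1)]


survivors = [f"raw_{i:04d}.jpg" for i in range(400)]
capped = thin_to_cap(survivors, max_frames=150)

print(len(capped), capped[0], capped[-1])  # 134 raw_0000.jpg raw_0399.jpg
print(output_names(len(capped))[:2])       # ['frame_001.jpg', 'frame_002.jpg']
```

A plain stride always keeps the first frame, but it lands on the last frame only when the count happens to line up, as it does here. An implementation that guarantees the last frame would need to handle the tail explicitly.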
Tuning
| Flag | Default | Effect |
|---|---|---|
| --dedup-threshold | 8 | Percentage of pixels that must change for a frame to count as new. Higher = fewer frames kept (more aggressive deduplication). Try 15–20 for screencasts; 4–6 for footage where subtle visual changes matter. |
| --dedup-window | 4 | Number of previously-kept frames to compare against. 1 = consecutive-only (classic frame differencing). Higher values catch A-B-A cutaways and cyclically repeated shots. Rarely needs to exceed 6–8. |
Debugging with --report
Pass --report to get a full visualisation of every keep/drop decision. When --report is active:
- Dropped frames are moved to crv-out/dropped/ instead of being deleted, so you can inspect what was removed.
- report.html is a self-contained page showing every extracted frame, kept and dropped, with its pixel-diff percentage. Frames are colour-coded:
  - 🟢 Green outline: kept (diff exceeded threshold)
  - 🔴 Red outline, dimmed: dropped as a near-duplicate
  - 🟠 Orange outline, dimmed: removed by the --max-frames cap after deduplication
Open report.html in any browser and look for patterns: too many orange frames means --max-frames is too tight; too many green frames that look visually identical means --dedup-threshold needs to go up; important visual changes showing as red means the threshold is too high or the window is too wide.