claude-real-video gives any LLM — Claude, ChatGPT, Gemini, or any other — genuine video comprehension. Instead of sampling frames at a fixed interval (which over-samples static screencasts and under-samples fast edits), it detects every scene change, collapses near-duplicate shots with a sliding-window deduplicator, and produces a clean folder of key frames plus a transcript that any model can read.

Installation

Install via pip with optional Whisper transcription support

Quickstart

Run your first video analysis in under two minutes

CLI Reference

Every flag and option for the crv command

Python API

Call process() directly from your own scripts

Why claude-real-video?

Most “let an LLM watch a video” approaches grab frames at a fixed rate, for example one per second, and ignore the audio. At that rate a ten-minute screencast with no cuts sends 600 nearly identical frames, while a fast-cut trailer misses visual changes that fall between samples. claude-real-video is different:

Scene-change detection

Frames are selected at every scene cut plus a configurable density floor — not a fixed quota.

Sliding-window dedup

A-B-A cuts and repeated shots are collapsed so the model only sees each unique shot once.

Smart transcription

Uses existing subtitles (.srt/.vtt or embedded) first; falls back to Whisper only when needed.

Fully local

Runs entirely on your machine. No video is uploaded to any third-party cloud service.
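
The subtitle-first behaviour described under "Smart transcription" could be sketched roughly as follows. This is an illustration only, not the project's code: the function name `pick_transcript_source`, the `has_embedded_subs` parameter, the rule that a sidecar file shares the video's stem, and the priority of sidecar over embedded subtitles are all assumptions made for the example.

```python
from pathlib import Path


def pick_transcript_source(video_path, has_embedded_subs=False):
    """Decide where a transcript should come from (hypothetical helper).

    Returns a (kind, path) tuple where kind is "sidecar", "embedded",
    or "whisper".
    """
    video = Path(video_path)

    # Assumption: a sidecar subtitle sits next to the video and shares
    # its stem, e.g. talk.mp4 -> talk.srt or talk.vtt.
    for extension in (".srt", ".vtt"):
        sidecar = video.with_suffix(extension)
        if sidecar.is_file():
            return ("sidecar", sidecar)

    # Assumption: the caller has already probed the container for a
    # subtitle stream; how that probing is done is out of scope here.
    if has_embedded_subs:
        return ("embedded", video)

    # No existing subtitles were found, so speech-to-text is the fallback.
    return ("whisper", video)
```

Checking the cheap sources first means the speech-to-text step only runs for videos that have no subtitles at all.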

How it works

1. Fetch the video

Point crv at a YouTube, Instagram, or TikTok URL (via yt-dlp) or a local file path. The video stays on your machine.

2. Extract meaningful frames

ffmpeg runs a single chronological pass that captures every scene change plus a density floor, so you get the right frames whether the video is a static slideshow or a rapid-fire reel.

3. Deduplicate with a sliding window

Each candidate frame is compared against the last N kept frames using real pixel difference. Repeated shots, even after a cutaway, are dropped automatically.

4. Transcribe the audio

If the video already has subtitles (sidecar .srt/.vtt or an embedded stream), those are used verbatim. Otherwise Whisper transcribes the audio track.

5. Hand everything to your LLM

Drop the frames/ folder and MANIFEST.txt into Claude, ChatGPT, or any vision-capable model and ask your question.
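
The frame-selection and deduplication steps above can be illustrated with a small pure-Python sketch. It is not the tool's implementation, which relies on ffmpeg. Here "density floor" is interpreted as "never leave a gap longer than `max_gap` seconds", frames are flat lists of grayscale values, and the difference metric is a mean absolute difference. The function names, parameters, and threshold are hypothetical.

```python
from collections import deque


def select_timestamps(scene_cuts, duration, max_gap):
    """Pick timestamps at every scene cut, plus filler picks so that no
    stretch longer than max_gap seconds goes unsampled (assumed meaning
    of "density floor")."""
    cuts = sorted({t for t in scene_cuts if 0 < t < duration})
    picks = [0.0]  # assumption: the opening frame is always kept
    last = 0.0
    for t in cuts + [duration]:
        # Fill long static stretches before the next cut (or the end).
        while t - last > max_gap:
            last += max_gap
            picks.append(last)
        if t < duration:
            picks.append(t)
            last = t
    return picks


def mean_abs_diff(a, b):
    """Average per-pixel absolute difference between two equal-size frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def dedup_sliding_window(frames, window=3, threshold=10.0):
    """Return indices of frames that differ from each of the last
    `window` kept frames by at least `threshold`."""
    recent = deque(maxlen=window)  # most recently kept frames only
    kept = []
    for index, pixels in enumerate(frames):
        if any(mean_abs_diff(pixels, old) < threshold for old in recent):
            continue  # near-duplicate of a recently kept shot
        kept.append(index)
        recent.append(pixels)
    return kept


# Three cuts in a 30-second clip, with at most 5 seconds between picks.
print(select_timestamps([2.0, 3.0, 20.0], duration=30.0, max_gap=5.0))
# → [0.0, 2.0, 3.0, 8.0, 13.0, 18.0, 20.0, 25.0]

shot_a = [0, 0, 0, 0]
shot_a_again = [2, 1, 0, 3]
shot_b = [200, 200, 200, 200]
shot_b_again = [198, 201, 200, 199]
frames = [shot_a, shot_a_again, shot_b, shot_a, shot_b_again]

print(dedup_sliding_window(frames, window=3))  # → [0, 2]
print(dedup_sliding_window(frames, window=1))  # → [0, 2, 3, 4]
```

With a window of 1 the A-B-A pattern slips through, because each frame is compared only with the one kept immediately before it. A wider window remembers shot A across the cutaway to shot B.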

Quick example

# Analyse a YouTube video
crv "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# → crv-out/frames/*.jpg   (key frames, deduplicated)
# → crv-out/transcript.txt (full transcript)
# → crv-out/MANIFEST.txt   (summary for the LLM)
Add --why "find the pricing strategy" to write your analysis intent into MANIFEST.txt so the model focuses on what you actually care about instead of producing a generic summary.
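
If you want to launch the same command from a script, one option is to shell out to the CLI. The sketch below uses only the `crv` command and the `--why` flag shown above. The helper names `build_crv_command` and `run_crv` are hypothetical, and it does not use the package's own Python API, whose signature is not covered on this page.

```python
import subprocess


def build_crv_command(source, why=None):
    """Build the argument list for a crv run (hypothetical helper).

    `source` is a URL or local file path; `why` is the optional
    analysis intent passed via --why.
    """
    command = ["crv", source]
    if why:
        command += ["--why", why]
    return command


def run_crv(source, why=None):
    """Run crv and raise CalledProcessError if it exits non-zero."""
    # Passing a list (not a shell string) avoids quoting problems with
    # URLs and with intent text that contains spaces.
    return subprocess.run(build_crv_command(source, why), check=True)
```

For example, `run_crv("talk.mp4", why="find the pricing strategy")` would invoke `crv talk.mp4 --why "find the pricing strategy"`, assuming `crv` is on your PATH.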