Environment Setup

This guide covers setting up your development environment, installing dependencies, and organizing your data directory structure.

Prerequisites

Before starting, ensure you have:

Python 3.8 or higher
FFmpeg (required for audio extraction)
4GB+ available RAM (8GB+ recommended for GPU acceleration)
Optional: NVIDIA GPU with CUDA support for faster processing

System Dependencies

Install FFmpeg

FFmpeg is required for extracting audio from video files.

macOS:

brew install ffmpeg

Ubuntu/Debian:

sudo apt update
sudo apt install ffmpeg

Windows: Download FFmpeg from ffmpeg.org and add it to your PATH.

Verify installation:

ffmpeg -version
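The same check can be done from Python, which is handy if you want a script to fail early when FFmpeg is missing. This is a minimal sketch using only the standard library; the helper name ffmpeg_version is illustrative and not part of the project.

```python
import shutil
import subprocess


def ffmpeg_version():
    """Return the first line of `ffmpeg -version`, or None if FFmpeg is not on PATH."""
    exe = shutil.which("ffmpeg")  # searches PATH the same way the shell does
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(ffmpeg_version() or "FFmpeg not found on PATH")
```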

Install Python Dependencies

The project uses PyTorch, CLIP, Whisper, and FastAPI. Install all dependencies:

pip install torch torchvision torchaudio
pip install openai-whisper
pip install git+https://github.com/openai/CLIP.git
pip install opencv-python pillow numpy
pip install scikit-learn
pip install fastapi uvicorn
pip install tqdm

For GPU acceleration, install a CUDA-enabled build of PyTorch by following the instructions at pytorch.org.
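Before loading any models, it can help to confirm that every package from the install commands above is importable. The sketch below uses importlib from the standard library; the mapping from pip package names to import names reflects the usual names for these packages, and the helper find_missing is illustrative rather than part of the project.

```python
import importlib.util

# pip package name -> module name used in `import` statements
REQUIRED_MODULES = {
    "torch": "torch",
    "torchvision": "torchvision",
    "torchaudio": "torchaudio",
    "openai-whisper": "whisper",
    "CLIP (from GitHub)": "clip",
    "opencv-python": "cv2",
    "pillow": "PIL",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "fastapi": "fastapi",
    "uvicorn": "uvicorn",
    "tqdm": "tqdm",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the package names whose modules cannot be found, without importing them."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = find_missing()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```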

Verify Installation

Test that all models can be loaded:

import torch
import clip
import whisper

# Check device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load CLIP
clip_model, preprocess = clip.load("ViT-B/32", device=device)
print("CLIP loaded successfully")

# Load Whisper
whisper_model = whisper.load_model("base", device=device)
print("Whisper loaded successfully")

Expected output:

Using device: cuda  # or cpu
CLIP loaded successfully
Whisper loaded successfully

Directory Structure

The project expects the following directory layout:

tiktok-sorter/
├── extract_features.py
├── train.py
├── predict.py
├── server.py
├── index.html
├── data/
│   └── Favorites/
│       └── videos/
│           ├── 123456789.mp4        # Unsorted videos (root level)
│           ├── 987654321.mp4
│           ├── soccer/               # Category folder
│           │   ├── 111111111.mp4
│           │   └── 222222222.mp4
│           ├── cooking/
│           │   └── 333333333.mp4
│           └── funny/
│               └── 444444444.mp4
└── artifacts/                        # Created automatically
    ├── labeled_embeddings.pt
    ├── unlabeled_embeddings.pt
    ├── transcripts.json
    ├── model.pt
    ├── model_config.json
    └── predictions.json

Create Directory Structure

mkdir -p data/Favorites/videos
mkdir -p artifacts
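On systems without mkdir -p (for example a plain Windows command prompt), the same layout can be created from Python. This is a minimal sketch; the function create_layout and its categories parameter are illustrative and not part of the project.

```python
from pathlib import Path


def create_layout(project_root, categories=()):
    """Create data/Favorites/videos, artifacts, and optional category folders."""
    root = Path(project_root)
    videos = root / "data" / "Favorites" / "videos"
    videos.mkdir(parents=True, exist_ok=True)  # equivalent to `mkdir -p`
    (root / "artifacts").mkdir(parents=True, exist_ok=True)
    for name in categories:
        (videos / name).mkdir(exist_ok=True)
    return videos


if __name__ == "__main__":
    # Run from the project root; the category names are examples only.
    create_layout(".", categories=["soccer", "cooking", "funny"])
```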

Organize Your Videos

Place your TikTok videos in the appropriate locations:

Labeled Videos (for training):

Create a subfolder for each category: data/Favorites/videos/[category-name]/
Move videos into their respective category folders
Examples: soccer/, cooking/, funny/, motivational/

Unlabeled Videos (for prediction):

Place directly in data/Favorites/videos/
These will be automatically sorted by the model

Aim for at least 5 labeled videos per category, and preferably 10 or more, for meaningful training results. Categories with fewer examples may not be learned effectively.
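To see whether each category has enough examples, you can count the videos per folder. The sketch below assumes videos use the .mp4 extension, as in the layout shown earlier; the function names and the MIN_PER_CATEGORY constant are illustrative and not part of the project.

```python
from pathlib import Path

MIN_PER_CATEGORY = 5  # lower bound suggested in this guide


def summarize_videos(videos_dir):
    """Return ({category: labeled video count}, [unlabeled file names])."""
    videos_dir = Path(videos_dir)
    unlabeled = sorted(p.name for p in videos_dir.glob("*.mp4"))
    labeled = {
        d.name: len(list(d.glob("*.mp4")))
        for d in sorted(videos_dir.iterdir())
        if d.is_dir()
    }
    return labeled, unlabeled


def undersized(labeled, minimum=MIN_PER_CATEGORY):
    """Return the categories that have fewer than `minimum` labeled videos."""
    return [name for name, count in labeled.items() if count < minimum]


if __name__ == "__main__":
    labeled, unlabeled = summarize_videos("data/Favorites/videos")
    print(f"Unlabeled videos: {len(unlabeled)}")
    for name, count in labeled.items():
        print(f"{name}: {count} labeled")
    print("Needs more examples:", undersized(labeled) or "none")
```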

Verify Data Structure

Check your setup:

ls -R data/Favorites/videos/

Expected output:

data/Favorites/videos/:
123456789.mp4  987654321.mp4  cooking  funny  soccer

data/Favorites/videos/cooking:
333333333.mp4

data/Favorites/videos/funny:
444444444.mp4

data/Favorites/videos/soccer:
111111111.mp4  222222222.mp4

Configuration

The scripts use hardcoded paths relative to the script location. If you need to customize paths, edit these constants:

extract_features.py:

DATA_DIR = Path(__file__).parent / "data" / "Favorites" / "videos"
OUTPUT_DIR = Path(__file__).parent / "artifacts"
N_FRAMES = 5              # Number of frames to sample per video
CLIP_MODEL = "ViT-B/32"   # CLIP model variant
WHISPER_MODEL = "base"    # Whisper model size (tiny/base/small/medium/large)

train.py, predict.py, server.py:

ARTIFACTS_DIR = Path(__file__).parent / "artifacts"
DATA_DIR = Path(__file__).parent / "data" / "Favorites" / "videos"
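If you would rather not edit the constants each time, one option is to read the paths from environment variables and fall back to the defaults above. This is only a sketch of that idea: the scripts do not do this out of the box, and the variable names TIKTOK_SORTER_DATA_DIR and TIKTOK_SORTER_ARTIFACTS_DIR, along with both helper functions, are hypothetical.

```python
import os
from pathlib import Path

# Hypothetical environment variable names; the project does not define these.
DATA_ENV = "TIKTOK_SORTER_DATA_DIR"
ARTIFACTS_ENV = "TIKTOK_SORTER_ARTIFACTS_DIR"


def resolve_dir(env_var, default):
    """Use the directory named by `env_var` if it is set, otherwise `default`."""
    value = os.environ.get(env_var)
    return Path(value).expanduser() if value else Path(default)


def configured_dirs(script_file):
    """Return (data_dir, artifacts_dir), defaulting to paths next to the script."""
    base = Path(script_file).resolve().parent
    data_dir = resolve_dir(DATA_ENV, base / "data" / "Favorites" / "videos")
    artifacts_dir = resolve_dir(ARTIFACTS_ENV, base / "artifacts")
    return data_dir, artifacts_dir
```

Inside a script, the existing constants could then be replaced with DATA_DIR, ARTIFACTS_DIR = configured_dirs(__file__).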

Model Configuration Options

CLIP Models:

ViT-B/32 - Default, balanced speed/accuracy (512-d embeddings)
ViT-B/16 - Higher accuracy, slower
RN50 - ResNet-50 backbone alternative

Whisper Models:

tiny - Fastest, least accurate (~1GB VRAM)
base - Default, good balance (~1GB VRAM)
small - Better transcription (~2GB VRAM)
medium - High accuracy (~5GB VRAM)
large - Best quality (~10GB VRAM)

Larger models improve feature quality but increase extraction time significantly.
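The VRAM figures above can be turned into a simple rule for choosing a Whisper model. The sketch below picks the largest model whose approximate requirement fits a given amount of GPU memory; the function pick_whisper_model is illustrative, and the thresholds are just the approximate figures from the list, which leave no headroom for CLIP or other processes sharing the GPU.

```python
# Approximate VRAM needs (GB) from the list above, largest model first.
WHISPER_VRAM_GB = [
    ("large", 10),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]


def pick_whisper_model(available_vram_gb):
    """Return the largest Whisper model whose approximate VRAM need fits."""
    for name, required_gb in WHISPER_VRAM_GB:
        if available_vram_gb >= required_gb:
            return name
    return "tiny"  # smallest option when nothing fits comfortably


if __name__ == "__main__":
    for gb in (0.5, 1, 4, 8, 12):
        print(f"{gb} GB -> {pick_whisper_model(gb)}")
```

With PyTorch installed and a GPU present, total device memory in GB can be read as torch.cuda.get_device_properties(0).total_memory / 1024**3. That is total rather than free memory, so treat the result as an upper bound.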

GPU Acceleration

The scripts automatically detect and use CUDA if available:

# Check if PyTorch can see your GPU
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import torch; print('GPU:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None')"

Performance comparison (600 videos):

CPU: ~45-60 minutes for feature extraction
GPU (RTX 3060): ~8-12 minutes for feature extraction
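For a rough idea of how long your own collection might take, the figures above can be scaled by video count. This sketch assumes extraction time grows linearly with the number of videos, which the guide does not state; real times also depend on video length, hardware, and model choice. The function estimate_minutes is illustrative.

```python
# Reference timings from the comparison above: minutes for 600 videos.
REFERENCE_VIDEOS = 600
REFERENCE_MINUTES = {"cpu": (45, 60), "gpu": (8, 12)}


def estimate_minutes(n_videos, device):
    """Scale the reference range linearly; returns (low, high) in minutes."""
    low, high = REFERENCE_MINUTES[device]
    scale = n_videos / REFERENCE_VIDEOS
    return low * scale, high * scale


if __name__ == "__main__":
    for device in ("cpu", "gpu"):
        low, high = estimate_minutes(1200, device)
        print(f"1200 videos on {device}: about {low:.0f}-{high:.0f} minutes")
```

The reference figures work out to roughly 4.5-6 seconds per video on CPU and 0.8-1.2 seconds per video on the RTX 3060.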

Training is fast (~5-30 seconds) regardless of device since it only trains on extracted features.

Troubleshooting

ImportError: No module named 'clip'

CLIP must be installed from GitHub, not PyPI:

pip install git+https://github.com/openai/CLIP.git

FFmpeg not found error

Ensure FFmpeg is in your PATH:

which ffmpeg  # macOS/Linux
where ffmpeg  # Windows

If not found, reinstall and restart your terminal.

CUDA out of memory

Reduce GPU memory use with one or more of the following:

Switch Whisper from base to tiny
Use CPU instead: set device = "cpu" in scripts
Process videos in smaller batches
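Processing videos in smaller batches can be sketched as splitting the file list into fixed-size chunks and handling one chunk at a time. The chunked helper below is illustrative and not part of the project; how the scripts actually batch their work is not shown in this guide.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    # Placeholder file names; in practice this would be your list of video paths.
    videos = [f"{i}.mp4" for i in range(1, 8)]
    for batch in chunked(videos, 3):
        # Process one batch here, then release cached GPU memory if PyTorch is
        # in use, for example with torch.cuda.empty_cache().
        print(batch)
```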

ModuleNotFoundError: No module named 'cv2'

Install OpenCV:

pip install opencv-python

Next Steps

With your environment set up, proceed to:

Feature Extraction

Extract multimodal embeddings from your video collection
