This guide covers setting up your development environment, installing dependencies, and organizing your data directory structure.

Prerequisites

Before starting, ensure you have:
  • Python 3.8 or higher
  • FFmpeg (required for audio extraction)
  • 4GB+ available RAM (8GB+ recommended for GPU acceleration)
  • Optional: NVIDIA GPU with CUDA support for faster processing
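The first two prerequisites can be checked from Python before installing anything else. The following is a minimal sketch, not part of the project: the helper name check_prerequisites is hypothetical, and it only tests the Python version and whether an ffmpeg executable is on PATH.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the prerequisites


def check_prerequisites(min_python=MIN_PYTHON):
    """Return a dict mapping prerequisite name -> True if satisfied.

    Hypothetical helper for illustration; not part of the project scripts.
    """
    return {
        "python": sys.version_info[:2] >= tuple(min_python),
        "ffmpeg": shutil.which("ffmpeg") is not None,  # None means not on PATH
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

RAM and GPU availability are not checked here, since doing so portably would need extra packages.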

System Dependencies

1. Install FFmpeg

FFmpeg is required for extracting audio from video files.
macOS:
brew install ffmpeg
Ubuntu/Debian:
sudo apt update
sudo apt install ffmpeg
Windows: Download from ffmpeg.org and add to PATH.
Verify installation:
ffmpeg -version
2. Install Python Dependencies

The project uses PyTorch, CLIP, Whisper, and FastAPI. Install all dependencies:
pip install torch torchvision torchaudio
pip install openai-whisper
pip install git+https://github.com/openai/CLIP.git
pip install opencv-python pillow numpy
pip install scikit-learn
pip install fastapi uvicorn
pip install tqdm
For GPU acceleration, install PyTorch with CUDA support by following the instructions at pytorch.org.
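Before loading any models, it can help to confirm that every package is importable. The sketch below is an illustration under assumptions: missing_modules is a hypothetical helper, and the list holds the import names that correspond to the pip packages above (for example, opencv-python imports as cv2 and scikit-learn as sklearn).

```python
import importlib.util

# Import names for the packages installed above.
# Note that the import name can differ from the pip package name.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio",
    "whisper",   # pip: openai-whisper
    "clip",      # pip: git+https://github.com/openai/CLIP.git
    "cv2",       # pip: opencv-python
    "PIL",       # pip: pillow
    "numpy",
    "sklearn",   # pip: scikit-learn
    "fastapi", "uvicorn", "tqdm",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("All modules found" if not missing else f"Missing: {', '.join(missing)}")
```

Because find_spec only locates a module, this runs quickly and does not load PyTorch or any model weights.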
3. Verify Installation

Test that all models can be loaded:
import torch
import clip
import whisper

# Check device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load CLIP
clip_model, preprocess = clip.load("ViT-B/32", device=device)
print("CLIP loaded successfully")

# Load Whisper
whisper_model = whisper.load_model("base", device=device)
print("Whisper loaded successfully")
Expected output:
Using device: cuda  # or cpu
CLIP loaded successfully
Whisper loaded successfully

Directory Structure

The project expects the following directory layout:
tiktok-sorter/
├── extract_features.py
├── train.py
├── predict.py
├── server.py
├── index.html
├── data/
│   └── Favorites/
│       └── videos/
│           ├── 123456789.mp4        # Unsorted videos (root level)
│           ├── 987654321.mp4
│           ├── soccer/               # Category folder
│           │   ├── 111111111.mp4
│           │   └── 222222222.mp4
│           ├── cooking/
│           │   └── 333333333.mp4
│           └── funny/
│               └── 444444444.mp4
└── artifacts/                        # Created automatically
    ├── labeled_embeddings.pt
    ├── unlabeled_embeddings.pt
    ├── transcripts.json
    ├── model.pt
    ├── model_config.json
    └── predictions.json
1. Create Directory Structure

mkdir -p data/Favorites/videos
mkdir -p artifacts
2. Organize Your Videos

Place your TikTok videos in the appropriate locations:
Labeled Videos (for training):
  • Create a subfolder for each category: data/Favorites/videos/[category-name]/
  • Move videos into their respective category folders
  • Examples: soccer/, cooking/, funny/, motivational/
Unlabeled Videos (for prediction):
  • Place directly in data/Favorites/videos/
  • These will be automatically sorted by the model
You need at least 5-10 labeled videos per category for meaningful training results. Categories with fewer examples may not be learned effectively.
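The per-category minimum can be checked with a short script. This is a sketch rather than project code: count_labeled_videos and underrepresented are hypothetical helpers, the threshold of 5 is the lower bound of the range mentioned above, and only .mp4 files are counted (an assumption based on the example layout).

```python
from pathlib import Path


def count_labeled_videos(videos_dir):
    """Count .mp4 files in each category subfolder of videos_dir."""
    root = Path(videos_dir)
    counts = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[sub.name] = sum(
            1 for f in sub.iterdir() if f.is_file() and f.suffix.lower() == ".mp4"
        )
    return counts


def underrepresented(counts, minimum=5):
    """Return category names with fewer than `minimum` labeled videos."""
    return [name for name, n in counts.items() if n < minimum]


if __name__ == "__main__":
    videos = Path("data/Favorites/videos")
    if videos.is_dir():
        counts = count_labeled_videos(videos)
        for name, n in counts.items():
            print(f"{name}: {n} video(s)")
        print("Need more examples:", underrepresented(counts) or "none")
    else:
        print(f"{videos} does not exist yet")
```

Videos at the root of the videos folder are ignored here, since they are treated as unlabeled.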
3. Verify Data Structure

Check your setup:
ls -R data/Favorites/videos/
Expected output:
data/Favorites/videos/:
123456789.mp4  987654321.mp4  soccer/  cooking/  funny/

data/Favorites/videos/soccer:
111111111.mp4  222222222.mp4

data/Favorites/videos/cooking:
333333333.mp4

data/Favorites/videos/funny:
444444444.mp4

Configuration

The scripts use hardcoded paths relative to the script location. If you need to customize paths, edit these constants:
extract_features.py:
DATA_DIR = Path(__file__).parent / "data" / "Favorites" / "videos"
OUTPUT_DIR = Path(__file__).parent / "artifacts"
N_FRAMES = 5              # Number of frames to sample per video
CLIP_MODEL = "ViT-B/32"   # CLIP model variant
WHISPER_MODEL = "base"    # Whisper model size (tiny/base/small/medium/large)
train.py, predict.py, server.py:
ARTIFACTS_DIR = Path(__file__).parent / "artifacts"
DATA_DIR = Path(__file__).parent / "data" / "Favorites" / "videos"
CLIP Models:
  • ViT-B/32 - Default, balanced speed/accuracy (512-d embeddings)
  • ViT-B/16 - Higher accuracy, slower
  • RN50 - ResNet-50 backbone alternative
Whisper Models:
  • tiny - Fastest, least accurate (~1GB VRAM)
  • base - Default, good balance (~1GB VRAM)
  • small - Better transcription (~2GB VRAM)
  • medium - High accuracy (~5GB VRAM)
  • large - Best quality (~10GB VRAM)
Larger models improve feature quality but increase extraction time significantly.
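As a rough illustration of the trade-off, the approximate VRAM figures listed above can be used to pick a value for WHISPER_MODEL. This is a sketch only: largest_whisper_model is a hypothetical helper, the numbers are the approximate figures from the list, and real memory use also depends on CLIP and on the videos being processed.

```python
# Approximate VRAM needs in GB, taken from the list above (smallest to largest).
WHISPER_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def largest_whisper_model(available_vram_gb):
    """Return the largest listed model whose approximate VRAM need fits.

    Returns None if even the smallest model does not fit.
    """
    choice = None
    for name, needed_gb in WHISPER_VRAM_GB:
        if needed_gb <= available_vram_gb:
            choice = name  # later entries are larger models
    return choice


if __name__ == "__main__":
    for vram in (1, 4, 12):
        print(f"{vram} GB -> {largest_whisper_model(vram)}")
```

It is worth leaving headroom below the reported VRAM, because CLIP shares the same GPU during extraction.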

GPU Acceleration

The scripts automatically detect and use CUDA if available:
# Check if PyTorch can see your GPU
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "import torch; print('GPU:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None')"
Performance comparison (600 videos):
  • CPU: ~45-60 minutes for feature extraction
  • GPU (RTX 3060): ~8-12 minutes for feature extraction
Training is fast (~5-30 seconds) regardless of device since it only trains on extracted features.
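The device check used in the verification step can be wrapped in a small helper that also allows forcing the CPU. This is a minimal sketch: pick_device is a hypothetical name, not a function from the project scripts, and it falls back to "cpu" if PyTorch is not installed.

```python
def pick_device(prefer_gpu=True):
    """Return "cuda" if requested and available, otherwise "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Using device: {pick_device()}")
```

Calling pick_device(prefer_gpu=False) is one way to run on the CPU without editing the rest of a script.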

Troubleshooting

If Python cannot import clip, note that CLIP must be installed from GitHub, not PyPI:
pip install git+https://github.com/openai/CLIP.git
If FFmpeg is not found, ensure it is in your PATH:
which ffmpeg  # macOS/Linux
where ffmpeg  # Windows
If not found, reinstall and restart your terminal.
If you run out of memory, reduce batch size or use smaller models:
  • Switch Whisper from base to tiny
  • Use CPU instead: set device = "cpu" in scripts
  • Process videos in smaller batches
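Processing in smaller batches can be as simple as splitting the list of video paths into fixed-size chunks and handling one chunk at a time. The sketch below shows only the chunking step; chunked is a hypothetical helper, and how the project scripts batch their work is not shown here.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


if __name__ == "__main__":
    paths = ["a.mp4", "b.mp4", "c.mp4", "d.mp4", "e.mp4"]
    for batch in chunked(paths, 2):
        # Each batch would be processed and released before the next one starts.
        print(batch)
```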
If Python cannot import cv2, install OpenCV:
pip install opencv-python

Next Steps

With your environment set up, proceed to:

Feature Extraction: extract multimodal embeddings from your video collection.
