Overview

This guide covers training both DOVER++ and V-JEPA2 models for video quality assessment. Both models support fine-tuning on custom datasets with comprehensive configuration options.

Data Preparation

Before training, ensure your data follows the required structure:
Step 1: Organize Directory Structure

Set up your data directory:
data/
├── train/
│   ├── labels/
│   │   └── train_labels.csv
│   └── videos/
│       ├── video001.mp4
│       ├── video002.mp4
│       └── ...
└── val/
    ├── labels/
    │   └── val_labels.csv
    └── videos/
        └── ...
Step 2: Prepare CSV Files

Ensure your CSV files contain the required columns:
video_name,Prompt,Traditional_MOS,Alignment_MOS,Aesthetic_MOS,Temporal_MOS,Overall_MOS
video001.mp4,"A cat playing piano",3.2,4.1,3.8,3.5,3.65
video002.mp4,"Sunset over mountains",4.5,4.2,4.8,4.1,4.4
All MOS scores should be on a 1-5 scale.
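A quick pre-flight check of the label files can catch schema problems before a long training run. Below is a minimal sketch using only the standard `csv` module; the column names come from the table above, and the validation rules (required columns present, MOS values in 1-5) mirror the requirements stated in this section:

```python
import csv

REQUIRED_COLUMNS = {
    "video_name", "Prompt", "Traditional_MOS", "Alignment_MOS",
    "Aesthetic_MOS", "Temporal_MOS", "Overall_MOS",
}
MOS_COLUMNS = REQUIRED_COLUMNS - {"video_name", "Prompt"}

def validate_labels(csv_path):
    """Check that required columns exist and every MOS value lies in [1, 5]."""
    problems = []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing columns: {sorted(missing)}"]
        for i, row in enumerate(reader, start=2):  # line 1 is the header
            for col in MOS_COLUMNS:
                score = float(row[col])
                if not 1.0 <= score <= 5.0:
                    problems.append(f"line {i}: {col}={score} outside 1-5")
    return problems
```

Running this over `train_labels.csv` and `val_labels.csv` before launching training avoids discovering a malformed row several epochs in.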
Step 3: Verify Video Files

  • Supported formats: .mp4, .avi, .mov, .mkv
  • Videos will be automatically sampled to 64 frames
  • Resolution is automatically adjusted per model
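The 64-frame sampling mentioned above can be pictured as evenly spaced index selection over a clip. This is a sketch of one common strategy, not necessarily the exact one the data loaders use:

```python
def uniform_frame_indices(total_frames, num_samples=64):
    """Pick num_samples evenly spaced frame indices from a clip.

    Clips shorter than num_samples repeat frames, so the output
    length is always num_samples.
    """
    if total_frames <= 0:
        raise ValueError("clip has no frames")
    step = total_frames / num_samples
    return [min(int(i * step), total_frames - 1) for i in range(num_samples)]
```

For a 640-frame clip this selects every tenth frame; for a 10-frame clip each source frame is repeated until 64 indices are produced.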

Training Commands

DOVER++ Model

python scripts/train.py \
    --model dover \
    --data path/to/your/data \
    --epochs 5 \
    --batch-size 4 \
    --lr 1e-4 \
    --output models/
DOVER++ Configuration:
  • Resolution: 640×640
  • Frames: 64 per video
  • Default batch size: 4
  • Default learning rate: 1e-4
  • Gradient accumulation: 8 steps (effective batch size: 32)
  • Text encoder: BAAI/bge-large-en-v1.5

V-JEPA2 Model

python scripts/train.py \
    --model vjepa \
    --data path/to/your/data \
    --epochs 10 \
    --batch-size 6 \
    --lr 2e-4 \
    --output models/
V-JEPA2 Configuration:
  • Resolution: 384×384
  • Frames: 64 per video
  • Default batch size: 6
  • Default learning rate: 2e-4
  • Gradient accumulation: 32 steps (effective batch size: 192)
  • Freeze ratio: 0.85 (85% of layers frozen)
  • Video encoder: facebook/vjepa-vit-giant-p16
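The 0.85 freeze ratio means the earliest encoder blocks stay fixed while the trailing 15% are fine-tuned. A minimal, framework-agnostic sketch of how such a ratio might be applied; the layer names here are illustrative, not the model's actual module names:

```python
def select_frozen_layers(layer_names, freeze_ratio=0.85):
    """Return the leading fraction of layers to freeze.

    Earlier layers hold generic features, so they are frozen first;
    the trailing layers remain trainable for task adaptation.
    """
    n_frozen = int(len(layer_names) * freeze_ratio)
    return layer_names[:n_frozen]

# Illustrative 40-block encoder: 85% of 40 = 34 blocks frozen.
blocks = [f"encoder.block_{i}" for i in range(40)]
frozen = select_frozen_layers(blocks)
```

In a PyTorch model the returned names would map to parameters whose `requires_grad` flag is set to False.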

Configuration Options

Command-Line Arguments

| Argument | Description | Default | Required |
|---|---|---|---|
| `--model` | Model type: `dover` or `vjepa` | - | Yes |
| `--data` | Path to data directory | - | Yes |
| `--epochs` | Number of training epochs | 5 | No |
| `--batch-size` | Batch size per GPU | Model default | No |
| `--lr` | Learning rate | Model default | No |
| `--output` | Output directory for checkpoints | `models` | No |
| `--resume` | Resume from checkpoint path | - | No |
| `--wandb` | Enable Weights & Biases logging | False | No |

Model-Specific Configuration

Configuration is defined in src/config/config.py:21:
DOVER_CONFIG = {
    "video_resolution": (640, 640),
    "num_frames": 64,
    "batch_size": 4,
    "learning_rate": 1e-4,
    "text_encoder": "BAAI/bge-large-en-v1.5",
    "gradient_accumulation_steps": 8,
    "effective_batch_size": 32
}
The V-JEPA2 model uses discriminative learning rates for different components: text encoder (0.1×), video encoder (0.5×), and quality head (2.0×).
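Those component multipliers translate into per-group learning rates derived from the base `--lr`. A small sketch of the arithmetic; in the actual script these groups would be handed to the optimizer as parameter groups, and the multipliers below are the ones stated above:

```python
def discriminative_lrs(base_lr, multipliers=None):
    """Scale a base learning rate per model component."""
    if multipliers is None:
        multipliers = {
            "text_encoder": 0.1,   # nearly frozen: tiny updates
            "video_encoder": 0.5,  # moderate adaptation
            "quality_head": 2.0,   # trained from scratch: larger steps
        }
    return {name: base_lr * m for name, m in multipliers.items()}

lrs = discriminative_lrs(2e-4)  # V-JEPA2 default base learning rate
```

With the 2e-4 default this yields 2e-5 for the text encoder, 1e-4 for the video encoder, and 4e-4 for the quality head.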

Monitoring Training

Weights & Biases Integration

Enable W&B logging with the --wandb flag (scripts/train.py:62):
python scripts/train.py --model dover --data data/ --wandb
Logged Metrics:
  • train_loss: Training loss per epoch
  • val_loss: Validation loss per epoch
  • vquala_score: VQualA challenge score (SROCC + PLCC) / 2
  • best_score: Best validation score achieved
  • Learning rate schedule
  • Gradient norms
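The `vquala_score` averages Spearman (SROCC) and Pearson (PLCC) correlations between predicted and ground-truth MOS. A dependency-free sketch of that computation; in practice `scipy.stats` would be the usual choice, and this simplified rank function does not handle ties:

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    """Rank positions of each element (no tie averaging)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def vquala_score(pred, target):
    """(SROCC + PLCC) / 2; Spearman is Pearson applied to ranks."""
    srocc = pearson(ranks(pred), ranks(target))
    plcc = pearson(pred, target)
    return (srocc + plcc) / 2
```

A model that perfectly preserves both the ordering and a linear relationship scores 1.0; random predictions score near 0.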

Console Output

The training script provides real-time progress:
Training DOVER model
Epochs: 5
Batch size: 4
Learning rate: 0.0001

Epoch 1/5
  Batch 50/500, Loss: 0.3421, Memory: 11.2GB
  Batch 100/500, Loss: 0.2983, Memory: 11.3GB
  
Epoch 1 Results:
  Train Loss: 0.2845
  Val Loss: 0.3012
  VQualA Score: 0.7234
  New best model saved! Score: 0.7234

Loss Components

The hybrid loss function (src/utils/training.py:90) combines three components:
  1. Smooth L1 Loss (β=0.1): Basic regression loss
  2. Ranking Loss (margin=0.2): Preserves relative quality ordering
  3. Scale-Aware Loss: Emphasizes extreme quality values
# Adaptive weighting
alpha = 0.7  # Smooth L1 weight
beta = 0.3   # Ranking weight
gamma = 0.1  # Scale weight

total_loss = alpha * smooth_l1 + beta * ranking + gamma * scale
Weights adapt automatically during training based on loss trends (src/utils/training.py:50).
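Putting the three components together, here is a scalar sketch of the combination. The component functions are simplified stand-ins (the real implementations in src/utils/training.py operate on batched tensors), and the scale-aware weighting shown is one plausible form of "emphasize extreme quality values", not the repository's exact formula:

```python
def smooth_l1(pred, target, beta=0.1):
    """Quadratic near zero, linear beyond beta (Huber-style)."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

def ranking_loss(pred_a, pred_b, target_a, target_b, margin=0.2):
    """Penalize pairs whose predicted ordering contradicts the labels."""
    sign = 1.0 if target_a > target_b else -1.0
    return max(0.0, margin - sign * (pred_a - pred_b))

def hybrid_loss(pred, target, pred2, target2,
                alpha=0.7, beta_w=0.3, gamma=0.1):
    # Illustrative scale-aware term: upweight errors near the
    # extremes of the 1-5 MOS range (assumed form, see lead-in).
    scale_weight = 1.0 + abs(target - 3.0) / 2.0
    scale = scale_weight * smooth_l1(pred, target)
    return (alpha * smooth_l1(pred, target)
            + beta_w * ranking_loss(pred, pred2, target, target2)
            + gamma * scale)
```

The ranking term compares pairs of samples, which is why a second prediction/target pair appears in the signature; in the tensor version pairs come from within the batch.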

Expected Training Times

DOVER++ Model

| GPU | Batch Size | Time per Epoch | Total (5 epochs) |
|---|---|---|---|
| A100 (40GB) | 4 | ~45 min | ~3.75 hours |
| V100 (32GB) | 4 | ~60 min | ~5 hours |
| RTX 3090 (24GB) | 2 | ~90 min | ~7.5 hours |

V-JEPA2 Model

| GPU | Batch Size | Time per Epoch | Total (10 epochs) |
|---|---|---|---|
| A100 (40GB) | 6 | ~75 min | ~12.5 hours |
| V100 (32GB) | 4 | ~90 min | ~15 hours |
| RTX 3090 (24GB) | 2 | ~120 min | ~20 hours |
Use gradient accumulation to maintain effective batch size when reducing --batch-size for memory constraints.
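The relationship between the three quantities is simple arithmetic: effective batch = per-GPU batch × accumulation steps. A small helper illustrating how the step count could be derived when you shrink `--batch-size` (the script's automatic adjustment may differ in detail):

```python
def accumulation_steps(effective_batch, batch_size):
    """Steps to accumulate gradients so the effective batch is preserved."""
    if effective_batch % batch_size != 0:
        raise ValueError("effective batch must be divisible by batch size")
    return effective_batch // batch_size

# V-JEPA2 defaults: 6 * 32 = 192; halving the batch doubles the steps.
```

Dropping V-JEPA2 from batch size 6 to 2 (as on an RTX 3090) raises the step count from 32 to 96 while keeping the effective batch at 192.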

Resource Requirements

GPU Memory

DOVER++:
  • Minimum: 12GB VRAM (batch size 2)
  • Recommended: 24GB VRAM (batch size 4)
  • Parameters: ~120M
V-JEPA2:
  • Minimum: 16GB VRAM (batch size 2)
  • Recommended: 40GB VRAM (batch size 6)
  • Parameters: ~1.1B (only 15% trainable due to freezing)

Storage

  • Model checkpoints: ~500MB per checkpoint
  • Training logs: ~10MB per run
  • Cache files: ~2GB for text embeddings

Checkpoint Management

Checkpoints are automatically saved (scripts/train.py:139):
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'best_score': best_score,
    'config': config
}, f"{args.output}/{args.model}_best.pt")
Checkpoint Location: models/{model}_best.pt

Contents:
  • Model weights
  • Optimizer state
  • Best validation score
  • Model configuration
  • Current epoch

Troubleshooting

Out of Memory

Reduce batch size:
python scripts/train.py --model vjepa --batch-size 2 --data data/
The gradient accumulation steps automatically maintain the effective batch size.

Slow Training

  1. Check data loading: Ensure videos are on fast storage (SSD)
  2. Increase workers: Set num_workers=4 in src/config/config.py:73
  3. Enable mixed precision: Enabled by default (src/config/config.py:62)

NaN Loss

Reduce learning rate:
python scripts/train.py --model dover --lr 5e-5 --data data/

Next Steps

Evaluation

Evaluate your trained models

Memory Optimization

Optimize GPU memory usage
