Overview

This guide covers training both DOVER++ and V-JEPA2 models for video quality assessment. Both models support fine-tuning on custom datasets with comprehensive configuration options.

Data Preparation

Before training, ensure your data follows the required structure:
Step 1: Organize Directory Structure

Set up your data directory:
data/
├── train/
│   ├── labels/
│   │   └── train_labels.csv
│   └── videos/
│       ├── video001.mp4
│       ├── video002.mp4
│       └── ...
└── val/
    ├── labels/
    │   └── val_labels.csv
    └── videos/
        └── ...
Step 2: Prepare CSV Files

Ensure your CSV files contain the required columns:
video_name,Prompt,Traditional_MOS,Alignment_MOS,Aesthetic_MOS,Temporal_MOS,Overall_MOS
video001.mp4,"A cat playing piano",3.2,4.1,3.8,3.5,3.65
video002.mp4,"Sunset over mountains",4.5,4.2,4.8,4.1,4.4
All MOS scores should be on a 1-5 scale.
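A quick pre-flight check of the label files can catch schema problems before a long training run. Below is a minimal sketch using only the standard `csv` module; the column names come from the table above, and the validation rules (required columns present, MOS values in 1-5) mirror the requirements stated in this section:

```python
import csv

REQUIRED_COLUMNS = {
    "video_name", "Prompt", "Traditional_MOS", "Alignment_MOS",
    "Aesthetic_MOS", "Temporal_MOS", "Overall_MOS",
}
MOS_COLUMNS = REQUIRED_COLUMNS - {"video_name", "Prompt"}

def validate_labels(csv_path):
    """Check that required columns exist and every MOS value lies in [1, 5]."""
    problems = []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing columns: {sorted(missing)}"]
        for i, row in enumerate(reader, start=2):  # line 1 is the header
            for col in MOS_COLUMNS:
                score = float(row[col])
                if not 1.0 <= score <= 5.0:
                    problems.append(f"line {i}: {col}={score} outside 1-5")
    return problems
```

Running this over `train_labels.csv` and `val_labels.csv` before launching training avoids discovering a malformed row several epochs in.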
Step 3: Verify Video Files

  • Supported formats: .mp4, .avi, .mov, .mkv
  • Videos will be automatically sampled to 64 frames
  • Resolution is automatically adjusted per model
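The 64-frame sampling mentioned above can be pictured as evenly spaced index selection over a clip. This is a sketch of one common strategy, not necessarily the exact one the data loaders use:

```python
def uniform_frame_indices(total_frames, num_samples=64):
    """Pick num_samples evenly spaced frame indices from a clip.

    Clips shorter than num_samples repeat frames, so the output
    length is always num_samples.
    """
    if total_frames <= 0:
        raise ValueError("clip has no frames")
    step = total_frames / num_samples
    return [min(int(i * step), total_frames - 1) for i in range(num_samples)]
```

For a 640-frame clip this selects every tenth frame; for a 10-frame clip each source frame is repeated until 64 indices are produced.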

Training Commands

DOVER++ Model

python scripts/train.py \
    --model dover \
    --data path/to/your/data \
    --epochs 5 \
    --batch-size 4 \
    --lr 1e-4 \
    --output models/
DOVER++ Configuration:
  • Resolution: 640×640
  • Frames: 64 per video
  • Default batch size: 4
  • Default learning rate: 1e-4
  • Gradient accumulation: 8 steps (effective batch size: 32)
  • Text encoder: BAAI/bge-large-en-v1.5

V-JEPA2 Model

python scripts/train.py \
    --model vjepa \
    --data path/to/your/data \
    --epochs 10 \
    --batch-size 6 \
    --lr 2e-4 \
    --output models/
V-JEPA2 Configuration:
  • Resolution: 384×384
  • Frames: 64 per video
  • Default batch size: 6
  • Default learning rate: 2e-4
  • Gradient accumulation: 32 steps (effective batch size: 192)
  • Freeze ratio: 0.85 (85% of layers frozen)
  • Video encoder: facebook/vjepa-vit-giant-p16
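The 0.85 freeze ratio means the earliest encoder blocks stay fixed while the trailing 15% are fine-tuned. A minimal, framework-agnostic sketch of how such a ratio might be applied; the layer names here are illustrative, not the model's actual module names:

```python
def select_frozen_layers(layer_names, freeze_ratio=0.85):
    """Return the leading fraction of layers to freeze.

    Earlier layers hold generic features, so they are frozen first;
    the trailing layers remain trainable for task adaptation.
    """
    n_frozen = int(len(layer_names) * freeze_ratio)
    return layer_names[:n_frozen]

# Illustrative 40-block encoder: 85% of 40 = 34 blocks frozen.
blocks = [f"encoder.block_{i}" for i in range(40)]
frozen = select_frozen_layers(blocks)
```

In a PyTorch model the returned names would map to parameters whose `requires_grad` flag is set to False.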

Configuration Options

Command-Line Arguments

| Argument | Description | Default | Required |
|---|---|---|---|
| `--model` | Model type: `dover` or `vjepa` | - | Yes |
| `--data` | Path to data directory | - | Yes |
| `--epochs` | Number of training epochs | 5 | No |
| `--batch-size` | Batch size per GPU | Model default | No |
| `--lr` | Learning rate | Model default | No |
| `--output` | Output directory for checkpoints | `models` | No |
| `--resume` | Resume from checkpoint path | - | No |
| `--wandb` | Enable Weights & Biases logging | False | No |

Model-Specific Configuration

Configuration is defined in src/config/config.py:21:
DOVER_CONFIG = {
    "video_resolution": (640, 640),
    "num_frames": 64,
    "batch_size": 4,
    "learning_rate": 1e-4,
    "text_encoder": "BAAI/bge-large-en-v1.5",
    "gradient_accumulation_steps": 8,
    "effective_batch_size": 32
}
The V-JEPA2 model uses discriminative learning rates for different components: text encoder (0.1×), video encoder (0.5×), and quality head (2.0×).
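Those component multipliers translate into per-group learning rates derived from the base `--lr`. A small sketch of the arithmetic; in the actual script these groups would be handed to the optimizer as parameter groups, and the multipliers below are the ones stated above:

```python
def discriminative_lrs(base_lr, multipliers=None):
    """Scale a base learning rate per model component."""
    if multipliers is None:
        multipliers = {
            "text_encoder": 0.1,   # nearly frozen: tiny updates
            "video_encoder": 0.5,  # moderate adaptation
            "quality_head": 2.0,   # trained from scratch: larger steps
        }
    return {name: base_lr * m for name, m in multipliers.items()}

lrs = discriminative_lrs(2e-4)  # V-JEPA2 default base learning rate
```

With the 2e-4 default this yields 2e-5 for the text encoder, 1e-4 for the video encoder, and 4e-4 for the quality head.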

Monitoring Training

Weights & Biases Integration

Enable W&B logging with the --wandb flag (scripts/train.py:62):
python scripts/train.py --model dover --data data/ --wandb
Logged Metrics:
  • train_loss: Training loss per epoch
  • val_loss: Validation loss per epoch
  • vquala_score: VQualA challenge score (SROCC + PLCC) / 2
  • best_score: Best validation score achieved
  • Learning rate schedule
  • Gradient norms
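The `vquala_score` averages Spearman (SROCC) and Pearson (PLCC) correlations between predicted and ground-truth MOS. A dependency-free sketch of that computation; in practice `scipy.stats` would be the usual choice, and this simplified rank function does not handle ties:

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    """Rank positions of each element (no tie averaging)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def vquala_score(pred, target):
    """(SROCC + PLCC) / 2; Spearman is Pearson applied to ranks."""
    srocc = pearson(ranks(pred), ranks(target))
    plcc = pearson(pred, target)
    return (srocc + plcc) / 2
```

A model that perfectly preserves both the ordering and a linear relationship scores 1.0; random predictions score near 0.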

Console Output

The training script provides real-time progress:
Training DOVER model
Epochs: 5
Batch size: 4
Learning rate: 0.0001

Epoch 1/5
  Batch 50/500, Loss: 0.3421, Memory: 11.2GB
  Batch 100/500, Loss: 0.2983, Memory: 11.3GB
  
Epoch 1 Results:
  Train Loss: 0.2845
  Val Loss: 0.3012
  VQualA Score: 0.7234
  New best model saved! Score: 0.7234

Loss Components

The hybrid loss function (src/utils/training.py:90) combines three components:
  1. Smooth L1 Loss (β=0.1): Basic regression loss
  2. Ranking Loss (margin=0.2): Preserves relative quality ordering
  3. Scale-Aware Loss: Emphasizes extreme quality values
# Adaptive weighting
alpha = 0.7  # Smooth L1 weight
beta = 0.3   # Ranking weight
gamma = 0.1  # Scale weight

total_loss = alpha * smooth_l1 + beta * ranking + gamma * scale
Weights adapt automatically during training based on loss trends (src/utils/training.py:50).
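Putting the three components together, here is a scalar sketch of the combination. The component functions are simplified stand-ins (the real implementations in src/utils/training.py operate on batched tensors), and the scale-aware weighting shown is one plausible form of "emphasize extreme quality values", not the repository's exact formula:

```python
def smooth_l1(pred, target, beta=0.1):
    """Quadratic near zero, linear beyond beta (Huber-style)."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

def ranking_loss(pred_a, pred_b, target_a, target_b, margin=0.2):
    """Penalize pairs whose predicted ordering contradicts the labels."""
    sign = 1.0 if target_a > target_b else -1.0
    return max(0.0, margin - sign * (pred_a - pred_b))

def hybrid_loss(pred, target, pred2, target2,
                alpha=0.7, beta_w=0.3, gamma=0.1):
    # Illustrative scale-aware term: upweight errors near the
    # extremes of the 1-5 MOS range (assumed form, see lead-in).
    scale_weight = 1.0 + abs(target - 3.0) / 2.0
    scale = scale_weight * smooth_l1(pred, target)
    return (alpha * smooth_l1(pred, target)
            + beta_w * ranking_loss(pred, pred2, target, target2)
            + gamma * scale)
```

The ranking term compares pairs of samples, which is why a second prediction/target pair appears in the signature; in the tensor version pairs come from within the batch.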

Expected Training Times

DOVER++ Model

| GPU | Batch Size | Time per Epoch | Total (5 epochs) |
|---|---|---|---|
| A100 (40GB) | 4 | ~45 min | ~3.75 hours |
| V100 (32GB) | 4 | ~60 min | ~5 hours |
| RTX 3090 (24GB) | 2 | ~90 min | ~7.5 hours |

V-JEPA2 Model

| GPU | Batch Size | Time per Epoch | Total (10 epochs) |
|---|---|---|---|
| A100 (40GB) | 6 | ~75 min | ~12.5 hours |
| V100 (32GB) | 4 | ~90 min | ~15 hours |
| RTX 3090 (24GB) | 2 | ~120 min | ~20 hours |
Use gradient accumulation to maintain effective batch size when reducing --batch-size for memory constraints.
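The relationship between the three quantities is simple arithmetic: effective batch = per-GPU batch × accumulation steps. A small helper illustrating how the step count could be derived when you shrink `--batch-size` (the script's automatic adjustment may differ in detail):

```python
def accumulation_steps(effective_batch, batch_size):
    """Steps to accumulate gradients so the effective batch is preserved."""
    if effective_batch % batch_size != 0:
        raise ValueError("effective batch must be divisible by batch size")
    return effective_batch // batch_size

# V-JEPA2 defaults: 6 * 32 = 192; halving the batch doubles the steps.
```

Dropping V-JEPA2 from batch size 6 to 2 (as on an RTX 3090) raises the step count from 32 to 96 while keeping the effective batch at 192.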

Resource Requirements

GPU Memory

DOVER++:
  • Minimum: 12GB VRAM (batch size 2)
  • Recommended: 24GB VRAM (batch size 4)
  • Parameters: ~120M
V-JEPA2:
  • Minimum: 16GB VRAM (batch size 2)
  • Recommended: 40GB VRAM (batch size 6)
  • Parameters: ~1.1B (only 15% trainable due to freezing)

Storage

  • Model checkpoints: ~500MB per checkpoint
  • Training logs: ~10MB per run
  • Cache files: ~2GB for text embeddings

Checkpoint Management

Checkpoints are automatically saved (scripts/train.py:139):
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'best_score': best_score,
    'config': config
}, f"{args.output}/{args.model}_best.pt")
Checkpoint Location: models/{model}_best.pt

Contents:
  • Model weights
  • Optimizer state
  • Best validation score
  • Model configuration
  • Current epoch

Troubleshooting

Out of Memory

Reduce batch size:
python scripts/train.py --model vjepa --batch-size 2 --data data/
The gradient accumulation steps automatically maintain the effective batch size.

Slow Training

  1. Check data loading: Ensure videos are on fast storage (SSD)
  2. Increase workers: Set num_workers=4 in src/config/config.py:73
  3. Enable mixed precision: Enabled by default (src/config/config.py:62)

NaN Loss

Reduce learning rate:
python scripts/train.py --model dover --lr 5e-5 --data data/

Next Steps

Evaluation

Evaluate your trained models

Memory Optimization

Optimize GPU memory usage
