
Overview

DOVER++ is a state-of-the-art video quality assessment model that combines a ConvNeXt 3D backbone with quality-aware fusion for multi-modal assessment. The implementation includes separate heads for aesthetic and technical quality, with cross-modal attention for combining video and text features.

DOVERModel

The main DOVER++ model class that integrates video encoding, text understanding, quality-aware fusion, and MOS prediction.

Constructor

DOVERModel(
    dover_weights_path: str = "models/DOVER_plus_plus.pth",
    text_encoder_name: str = "BAAI/bge-large-en-v1.5",
    device: str = 'cuda'
)
dover_weights_path
str
default:"models/DOVER_plus_plus.pth"
Path to DOVER++ pretrained weights file. If the file doesn’t exist, it will be automatically downloaded from HuggingFace.
text_encoder_name
str
default:"BAAI/bge-large-en-v1.5"
HuggingFace model ID for the text encoder. Uses BGE-Large by default, with fallback to all-MiniLM-L6-v2 if loading fails.
device
str
default:"cuda"
Device to place the model on ('cuda' or 'cpu').

Methods

forward

Forward pass through the complete model with video frames and text prompts.
forward(
    frames: torch.Tensor,
    prompts: List[str]
) -> torch.Tensor
frames
torch.Tensor
required
Video frames tensor with shape (B, C, T, H, W) where:
  • B = batch size
  • C = channels (3 for RGB)
  • T = number of frames
  • H = height
  • W = width
prompts
List[str]
required
List of text prompts corresponding to each video in the batch.
return
torch.Tensor
MOS predictions with shape (B, 5) containing [Traditional, Alignment, Aesthetic, Temporal, Overall] quality scores.

get_quality_weights

Extract quality aspect weights for given text prompts.
get_quality_weights(
    prompts: List[str]
) -> torch.Tensor
prompts
List[str]
required
List of text prompts to analyze for quality aspects.
return
torch.Tensor
Quality weights tensor with shape (B, 4) representing [Traditional, Alignment, Aesthetic, Temporal] aspect weights.

extract_features

Extract intermediate features from video and text without MOS prediction.
extract_features(
    frames: torch.Tensor,
    prompts: List[str]
) -> Dict[str, torch.Tensor]
frames
torch.Tensor
required
Video frames tensor with shape (B, C, T, H, W).
prompts
List[str]
required
List of text prompts corresponding to each video.
return
Dict[str, torch.Tensor]
Dictionary containing:
  • dover_features: DOVER feature vectors (B, 1024)
  • text_features: Text embeddings (B, text_dim)
  • fused_features: Quality-aware fused features (B, 256)
  • quality_weights: Quality aspect weights (B, 4)
  • aesthetic_score: Aesthetic quality scores (B, 1)
  • technical_score: Technical quality scores (B, 1)

Usage Example

import torch
from src.models.dover_model import DOVERModel

# Initialize model
model = DOVERModel(
    dover_weights_path="models/DOVER_plus_plus.pth",
    text_encoder_name="BAAI/bge-large-en-v1.5",
    device='cuda'
)
model.eval()

# Prepare video frames (B=2, C=3, T=16, H=224, W=224)
frames = torch.randn(2, 3, 16, 224, 224).cuda()

# Text prompts
prompts = [
    "A person walking in a park",
    "Ocean waves crashing on the shore"
]

# Get MOS predictions
with torch.no_grad():
    mos_scores = model(frames, prompts)
    print(f"MOS scores: {mos_scores}")  # Shape: (2, 5)
    
# Extract quality weights
quality_weights = model.get_quality_weights(prompts)
print(f"Quality weights: {quality_weights}")  # Shape: (2, 4)

# Extract features
features = model.extract_features(frames, prompts)
print(f"DOVER features shape: {features['dover_features'].shape}")
print(f"Fused features shape: {features['fused_features'].shape}")

DOVERModelSimple

Simplified DOVER++ model with ConvNeXt 3D backbone, compatible with pretrained weights.

Constructor

DOVERModelSimple(
    device: str = 'cuda'
)
device
str
default:"cuda"
Device to place the model on ('cuda' or 'cpu').

Methods

forward

Forward pass through DOVER backbone and quality heads.
forward(
    x: torch.Tensor
) -> Dict[str, torch.Tensor]
x
torch.Tensor
required
Video tensor with shape (B, C, T, H, W).
return
Dict[str, torch.Tensor]
Dictionary containing:
  • features: Feature vectors for fusion (B, 1024)
  • aesthetic_score: Aesthetic quality scores (B, 1)
  • technical_score: Technical quality scores (B, 1)
  • backbone_features: Raw backbone features (B, 768, T', H', W')

Usage Example

import torch
from src.models.dover_model import DOVERModelSimple, DOVERModelLoader

# Load pretrained model
model = DOVERModelLoader.load_dover_model(
    weights_path="models/DOVER_plus_plus.pth",
    device='cuda'
)
model.eval()

# Process video
frames = torch.randn(1, 3, 16, 224, 224).cuda()

with torch.no_grad():
    output = model(frames)
    print(f"Aesthetic score: {output['aesthetic_score'].item():.3f}")
    print(f"Technical score: {output['technical_score'].item():.3f}")
    print(f"Features shape: {output['features'].shape}")

QualityAwareFusion

Quality-aware fusion module for combining DOVER++ and text features using cross-modal attention.

Constructor

QualityAwareFusion(
    dover_dim: int = 1024,
    text_dim: int = 1024,
    hidden_dim: int = 512
)
dover_dim
int
default:"1024"
Dimension of DOVER feature vectors.
text_dim
int
default:"1024"
Dimension of text feature vectors.
hidden_dim
int
default:"512"
Hidden dimension for cross-modal attention and fusion layers.

Methods

forward

Fuse DOVER++ and text features with quality-aware attention.
forward(
    dover_features: torch.Tensor,
    text_features: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]
dover_features
torch.Tensor
required
DOVER features with shape (B, dover_dim).
text_features
torch.Tensor
required
Text features with shape (B, text_dim).
return
Tuple[torch.Tensor, torch.Tensor]
Tuple of (fused_features, quality_weights) where:
  • fused_features: Shape (B, hidden_dim // 2)
  • quality_weights: Shape (B, 4) - weights for 4 quality aspects

Usage Example

import torch
from src.models.dover_model import QualityAwareFusion

# Initialize fusion module
fusion = QualityAwareFusion(
    dover_dim=1024,
    text_dim=1024,
    hidden_dim=512
).cuda()

# Create sample features
dover_features = torch.randn(4, 1024).cuda()
text_features = torch.randn(4, 1024).cuda()

# Fuse features
fused, weights = fusion(dover_features, text_features)
print(f"Fused features shape: {fused.shape}")  # (4, 256)
print(f"Quality weights: {weights}")  # (4, 4)

MOSPredictor

MOS prediction head that predicts 4 quality aspect scores plus an overall quality score.

Constructor

MOSPredictor(
    input_dim: int,
    hidden_dim: int = 256
)
input_dim
int
required
Input feature dimension.
hidden_dim
int
default:"256"
Hidden layer dimension.

Methods

forward

Predict MOS scores from fused features.
forward(
    features: torch.Tensor
) -> torch.Tensor
features
torch.Tensor
required
Input features with shape (B, input_dim).
return
torch.Tensor
MOS predictions with shape (B, 5) containing [Traditional, Alignment, Aesthetic, Temporal, Overall] scores.

Usage Example

import torch
from src.models.dover_model import MOSPredictor

# Initialize predictor
predictor = MOSPredictor(
    input_dim=256,
    hidden_dim=256
).cuda()

# Predict MOS scores
features = torch.randn(4, 256).cuda()
mos_scores = predictor(features)
print(f"MOS scores shape: {mos_scores.shape}")  # (4, 5)
print(f"Traditional: {mos_scores[:, 0]}")
print(f"Alignment: {mos_scores[:, 1]}")
print(f"Aesthetic: {mos_scores[:, 2]}")
print(f"Temporal: {mos_scores[:, 3]}")
print(f"Overall: {mos_scores[:, 4]}")

Architecture Details

ConvNeXt 3D Backbone

The DOVER++ model uses a ConvNeXt 3D backbone with the following architecture:
  • Stem: Conv3d(3→96, kernel=1×4×4, stride=1×4×4)
  • Stage 1: 3 ConvNeXt blocks (96 channels) → Downsample to 192
  • Stage 2: 3 ConvNeXt blocks (192 channels) → Downsample to 384
  • Stage 3: 9 ConvNeXt blocks (384 channels) → Downsample to 768
  • Stage 4: 3 ConvNeXt blocks (768 channels)
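The channel widths and downsampling strides above determine the backbone's output shape. The following is a minimal sketch of that progression, using plain Conv3d downsampling layers in place of full ConvNeXt blocks (layer names and the temporal stride of 1 in the downsamples are illustrative assumptions, not the actual implementation):

```python
import torch
import torch.nn as nn

# Sketch of the stem and per-stage channel progression (96 -> 192 -> 384 -> 768).
# Full ConvNeXt blocks preserve shape, so only the stem and downsamples matter here.
stem = nn.Conv3d(3, 96, kernel_size=(1, 4, 4), stride=(1, 4, 4))
downsamples = nn.ModuleList([
    nn.Conv3d(96, 192, kernel_size=(1, 2, 2), stride=(1, 2, 2)),
    nn.Conv3d(192, 384, kernel_size=(1, 2, 2), stride=(1, 2, 2)),
    nn.Conv3d(384, 768, kernel_size=(1, 2, 2), stride=(1, 2, 2)),
])

x = torch.randn(1, 3, 16, 224, 224)  # (B, C, T, H, W)
x = stem(x)                          # (1, 96, 16, 56, 56)
for down in downsamples:
    x = down(x)
print(x.shape)                       # (1, 768, 16, 7, 7)
```

This matches the (B, 768, T', H', W') shape of `backbone_features` returned by DOVERModelSimple.forward, with spatial resolution reduced by a factor of 32 (4 from the stem, 2 per downsample).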

Quality Aspects

The model predicts scores for 4 quality aspects plus overall quality:
  1. Traditional Quality: Standard video quality metrics (blur, noise, compression artifacts)
  2. Alignment: Semantic alignment between video content and text prompt
  3. Aesthetic Quality: Visual appeal and artistic quality
  4. Temporal Consistency: Smoothness and coherence across frames
  5. Overall Quality: Weighted combination of all aspects
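As a concrete illustration of how an overall score can be formed from the four aspect scores, here is a simple weighted combination. Note that in the actual model the combination is learned inside MOSPredictor, so the weights below are hypothetical:

```python
import torch

# Aspect scores in order [Traditional, Alignment, Aesthetic, Temporal], shape (B, 4).
aspect_scores = torch.tensor([[3.8, 4.2, 3.5, 4.0]])
# Illustrative per-aspect weights (rows sum to 1), e.g. from get_quality_weights.
quality_weights = torch.tensor([[0.3, 0.3, 0.2, 0.2]])

overall = (aspect_scores * quality_weights).sum(dim=-1, keepdim=True)  # (B, 1)
mos = torch.cat([aspect_scores, overall], dim=-1)                      # (B, 5)
print(mos)  # tensor([[3.8000, 4.2000, 3.5000, 4.0000, 3.9000]])
```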

Quality-Aware Fusion

The fusion module uses:
  • Quality Classifier: Analyzes text to determine focus areas (4 quality aspects)
  • Cross-Modal Attention: 8-head attention for video-text alignment
  • Feature Projection: Projects features to common hidden dimension
  • Fusion Layers: LayerNorm + MLP for final feature combination
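The components above can be sketched as a small module: project both modalities to the hidden dimension, run 8-head cross-modal attention, classify the text into four quality-aspect weights, and fuse with LayerNorm + MLP. This is a minimal sketch under assumed layer names and wiring, not the source implementation:

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Illustrative version of the quality-aware fusion pipeline."""

    def __init__(self, dover_dim=1024, text_dim=1024, hidden_dim=512):
        super().__init__()
        # Feature projection to a common hidden dimension
        self.video_proj = nn.Linear(dover_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # 8-head cross-modal attention for video-text alignment
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        # Quality classifier: text -> weights over 4 quality aspects
        self.quality_cls = nn.Sequential(nn.Linear(hidden_dim, 4), nn.Softmax(dim=-1))
        # Fusion layers: LayerNorm + MLP down to hidden_dim // 2
        self.fuse = nn.Sequential(
            nn.LayerNorm(hidden_dim * 2),
            nn.Linear(hidden_dim * 2, hidden_dim // 2),
            nn.GELU(),
        )

    def forward(self, dover_features, text_features):
        v = self.video_proj(dover_features).unsqueeze(1)  # (B, 1, hidden_dim)
        t = self.text_proj(text_features).unsqueeze(1)    # (B, 1, hidden_dim)
        attended, _ = self.attn(query=t, key=v, value=v)  # text attends to video
        weights = self.quality_cls(t.squeeze(1))          # (B, 4)
        fused = self.fuse(torch.cat([attended.squeeze(1), v.squeeze(1)], dim=-1))
        return fused, weights                             # (B, hidden_dim // 2), (B, 4)

fusion = FusionSketch()
fused, weights = fusion(torch.randn(2, 1024), torch.randn(2, 1024))
print(fused.shape, weights.shape)  # torch.Size([2, 256]) torch.Size([2, 4])
```

The output shapes match the documented QualityAwareFusion contract: fused features of shape (B, hidden_dim // 2) and softmax-normalized quality weights of shape (B, 4).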
