
Overview

DOVER++ (Disentangled Objective Video Quality Evaluator) is a state-of-the-art video quality assessment model that separates aesthetic and technical quality dimensions. QualiVision extends DOVER++ with quality-aware fusion mechanisms for multi-modal understanding.
DOVER++ Key Stats
  • Parameters: ~120 million
  • Input Resolution: 640×640
  • Frames: 64 per video
  • Memory: ~12GB GPU
  • Pretrained: HuggingFace weights available
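A quick back-of-envelope check of the input footprint implied by these stats (input tensor only, float32; the ~12GB figure also covers activations and weights):

```python
# One input clip from the stats above: 64 frames, 3 channels, 640x640.
C, T, H, W = 3, 64, 640, 640
elements = C * T * H * W
bytes_fp32 = elements * 4  # 4 bytes per float32 element

print(f"elements per clip: {elements:,}")              # 78,643,200
print(f"input tensor size: {bytes_fp32 / 2**20:.0f} MiB")  # 300 MiB
```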

Architecture Components

1. ConvNeXt 3D Backbone

The backbone uses a modern ConvNeXt architecture adapted for 3D video processing:
From dover_model.py:128-152, the backbone consists of 4 stages:
def _build_convnext_backbone(self) -> nn.Module:
    return nn.Sequential(
        # Stem: 3 -> 96 channels
        nn.Conv3d(3, 96, kernel_size=(1, 4, 4), stride=(1, 4, 4)),
        nn.GroupNorm(1, 96),
        
        # Stage 1: 96 channels, 3 blocks
        *[self._make_convnext_block(96) for _ in range(3)],
        nn.Conv3d(96, 192, kernel_size=(1, 2, 2), stride=(1, 2, 2)),
        
        # Stage 2: 192 channels, 3 blocks
        *[self._make_convnext_block(192) for _ in range(3)],
        nn.Conv3d(192, 384, kernel_size=(1, 2, 2), stride=(1, 2, 2)),
        
        # Stage 3: 384 channels, 9 blocks
        *[self._make_convnext_block(384) for _ in range(9)],
        nn.Conv3d(384, 768, kernel_size=(1, 2, 2), stride=(1, 2, 2)),
        
        # Stage 4: 768 channels, 3 blocks
        *[self._make_convnext_block(768) for _ in range(3)],
    )
Design Highlights:
  • Progressive channel expansion: 96 → 192 → 384 → 768
  • Spatial downsampling with (1, 2, 2) convolutions (preserves temporal)
  • 18 total ConvNeXt blocks with varying depths per stage
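The downsampling schedule above can be traced stage by stage. This small helper (not from the source) computes the feature-map shape after the stem (stride 4) and each stride-2 transition, confirming that the temporal dimension is untouched:

```python
# Trace (channels, T, H, W) through the backbone stages described above.
def trace(h=640, w=640, t=64):
    stages = []
    h, w = h // 4, w // 4                 # stem: kernel/stride (1, 4, 4)
    stages.append(("stem", 96, t, h, w))
    for name, ch in [("stage2", 192), ("stage3", 384), ("stage4", 768)]:
        h, w = h // 2, w // 2             # (1, 2, 2) stride: T preserved
        stages.append((name, ch, t, h, w))
    return stages

for name, c, t, h, w in trace():
    print(f"{name}: {c} ch, T={t}, {h}x{w}")
# stage4 ends at 768 channels, 64 x 20 x 20
```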

2. Disentangled Quality Heads

DOVER++ separates quality into aesthetic and technical dimensions:

Aesthetic Head

Evaluates artistic and visual appeal aspects:
  • Color harmony
  • Composition
  • Visual creativity
  • Artistic style

Technical Head

Assesses technical quality factors:
  • Sharpness
  • Artifacts
  • Noise levels
  • Compression quality
From dover_model.py:99-116:
# Separate heads for aesthetic and technical quality
self.aesthetic_head = nn.Sequential(
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(768, 256),
    nn.ReLU(inplace=True),
    nn.Dropout(0.1),
    nn.Linear(256, 1)
)

self.technical_head = nn.Sequential(
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(768, 256),
    nn.ReLU(inplace=True),
    nn.Dropout(0.1),
    nn.Linear(256, 1)
)
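Either head can be exercised standalone on a dummy stage-4 feature map. The sketch below copies the head definition from the excerpt above and feeds it a small `(B, 768, T', H', W')` tensor (the dummy temporal size is reduced to keep it light):

```python
import torch
import torch.nn as nn

# Standalone copy of the aesthetic/technical head shown above.
head = nn.Sequential(
    nn.AdaptiveAvgPool3d(1),   # pool (T', H', W') down to 1x1x1
    nn.Flatten(),              # -> (B, 768)
    nn.Linear(768, 256),
    nn.ReLU(inplace=True),
    nn.Dropout(0.1),
    nn.Linear(256, 1),         # one scalar quality score per clip
)

features = torch.randn(2, 768, 8, 20, 20)  # dummy stage-4 feature map
score = head(features)
print(score.shape)  # torch.Size([2, 1])
```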

3. Quality-Aware Fusion Mechanism

The fusion module combines DOVER++ video features with text embeddings using cross-modal attention:
Step 1: Quality Aspect Classification

Analyzes the text prompt to determine which quality aspects are emphasized. From dover_model.py:211-218:
self.quality_classifier = nn.Sequential(
    nn.Linear(text_dim, hidden_dim),
    nn.ReLU(inplace=True),
    nn.Dropout(0.1),
    nn.Linear(hidden_dim, 4),  # 4 quality aspects
    nn.Softmax(dim=-1)
)
Outputs 4 weights for: Traditional, Alignment, Aesthetic, Temporal
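A minimal sketch of this classifier, assuming `text_dim=1024` and `hidden_dim=512` from the specifications table (dropout disabled via `eval()` so the check is deterministic):

```python
import torch
import torch.nn as nn

# Sketch of the quality classifier above; dims are assumed from the
# specs table (text_dim=1024, hidden_dim=512), not from the source file.
quality_classifier = nn.Sequential(
    nn.Linear(1024, 512),
    nn.ReLU(inplace=True),
    nn.Dropout(0.1),
    nn.Linear(512, 4),        # Traditional, Alignment, Aesthetic, Temporal
    nn.Softmax(dim=-1),
)
quality_classifier.eval()     # disable dropout for a deterministic check

weights = quality_classifier(torch.randn(2, 1024))
print(weights.shape)          # (2, 4)
print(weights.sum(dim=-1))    # each row sums to 1 (softmax)
```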
Step 2: Feature Projection

Projects both modalities to a common dimension:
self.dover_proj = nn.Linear(dover_dim, hidden_dim)  # 1024 -> 512
self.text_proj = nn.Linear(text_dim, hidden_dim)    # 1024 -> 512
Step 3: Cross-Modal Attention

From dover_model.py:220-226:
self.cross_attention = nn.MultiheadAttention(
    embed_dim=hidden_dim,
    num_heads=8,
    dropout=0.1,
    batch_first=True
)
Text queries attend to the video features to extract relevant quality information.
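The attention call can be sketched with dummy sequences: a single pooled text embedding queries a sequence of projected video tokens (`hidden_dim=512` as configured above; the 16-token video sequence length is an assumption for illustration):

```python
import torch
import torch.nn as nn

# Cross-attention as configured above: text queries, video keys/values.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8,
                             dropout=0.1, batch_first=True)

text_q = torch.randn(2, 1, 512)     # (B, 1, D): one pooled text embedding
video_kv = torch.randn(2, 16, 512)  # (B, S, D): assumed 16 video tokens

attended, attn_weights = attn(query=text_q, key=video_kv, value=video_kv)
print(attended.shape)      # (2, 1, 512): text-conditioned video feature
print(attn_weights.shape)  # (2, 1, 16): where the text attends in the video
```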
Step 4: Feature Fusion

From dover_model.py:265-277:
# Cross-modal attention
attended_dover, _ = self.cross_attention(
    query=text_proj_seq,
    key=dover_proj_seq,
    value=dover_proj_seq
)

# Concatenate and fuse
combined_features = torch.cat([attended_dover, text_proj], dim=-1)
fused_features = self.fusion_layer(combined_features)
Why Cross-Modal Attention? Text prompts like “smooth camera motion” or “vibrant colors” guide the model to focus on specific quality aspects. The attention mechanism allows the model to selectively emphasize relevant video features based on the text description.
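The concatenate-and-fuse step can be sketched end to end. Note that `fusion_layer`'s exact definition is not shown in the excerpt above; the `Linear + GELU` projection from 1024 (two 512-dim features) down to the 256-dim fused representation is an assumption based on the dimensions quoted elsewhere in this page:

```python
import torch
import torch.nn as nn

# Assumed fusion layer: 512 + 512 concatenated features -> 256-dim output.
fusion_layer = nn.Sequential(nn.Linear(512 * 2, 256), nn.GELU())

attended_dover = torch.randn(2, 512)  # text-attended video features
text_proj = torch.randn(2, 512)       # projected text features

combined = torch.cat([attended_dover, text_proj], dim=-1)  # (2, 1024)
fused = fusion_layer(combined)
print(fused.shape)  # (2, 256): input to the MOS prediction head
```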

4. MOS Prediction Head

The final prediction head generates five MOS scores from the fused features. From dover_model.py:286-303:
class MOSPredictor(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int = 256):
        super().__init__()
        
        self.predictor = nn.Sequential(
            nn.LayerNorm(input_dim),
            nn.Linear(input_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(0.15),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.GELU(),
            nn.Dropout(0.15),
            nn.Linear(hidden_dim // 2, 5)  # 4 sub-MOS + Overall
        )
Output Scores:
  1. Traditional MOS: Overall technical quality
  2. Alignment MOS: Text-video correspondence
  3. Aesthetic MOS: Visual appeal
  4. Temporal MOS: Motion smoothness
  5. Overall MOS: Weighted combination
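A trivial helper for unpacking the `(B, 5)` output by name, in the order listed above (`MOS_NAMES` and `unpack_mos` are illustrative, not part of the source):

```python
# Names in the output order documented above.
MOS_NAMES = ["traditional", "alignment", "aesthetic", "temporal", "overall"]

def unpack_mos(row):
    """Map one 5-element prediction row to named MOS scores."""
    return dict(zip(MOS_NAMES, row))

print(unpack_mos([3.8, 4.1, 3.5, 4.0, 3.9]))
# {'traditional': 3.8, 'alignment': 4.1, 'aesthetic': 3.5,
#  'temporal': 4.0, 'overall': 3.9}
```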

Forward Pass

The complete forward pass integrates all components:
From dover_model.py:372-402:
def forward(self, frames: torch.Tensor, prompts: List[str]) -> torch.Tensor:
    """
    Args:
        frames: Video frames tensor (B, C, T, H, W)
        prompts: List of text prompts
        
    Returns:
        MOS predictions (B, 5) - [Traditional, Alignment, Aesthetic, Temporal, Overall]
    """
    # Extract DOVER++ features
    dover_output = self.dover_model(frames)
    dover_features = dover_output['features']  # (B, 1024)
    
    # Extract text features
    with torch.no_grad():
        text_features = self.text_encoder.encode(
            prompts,
            convert_to_tensor=True,
            normalize_embeddings=True,
            device=self.device
        )  # (B, 1024)
    
    # Quality-aware fusion
    fused_features, quality_weights = self.fusion(dover_features, text_features)
    # fused_features: (B, 256)
    # quality_weights: (B, 4)
    
    # Predict MOS scores
    mos_predictions = self.mos_predictor(fused_features)  # (B, 5)
    
    return mos_predictions

Technical Specifications

  • Total Parameters: ~120M
  • Trainable Parameters: ~120M
  • Input Resolution: 640×640
  • Input Frames: 64
  • Backbone Channels: 768
  • Feature Dimension: 1024
  • Hidden Dimension: 512
  • Output Dimension: 5 (MOS scores)

Pretrained Weights

DOVER++ uses pretrained weights from the original DOVER project. From dover_model.py:36-43:
# Download weights if not exists
if not os.path.exists(weights_path):
    os.makedirs(os.path.dirname(weights_path), exist_ok=True)
    print(f"Downloading DOVER++ weights to {weights_path}")
    urllib.request.urlretrieve(
        "https://huggingface.co/teowu/DOVER/resolve/main/DOVER_plus_plus.pth",
        weights_path
    )
The pretrained weights are loaded with strict=False to allow for architecture modifications in the fusion and prediction heads. Only the ConvNeXt backbone weights are transferred.
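The effect of `strict=False` can be demonstrated on toy modules: loading a checkpoint that covers only part of the model leaves the uncovered layers randomly initialized and reports the mismatches instead of raising an error (the two `nn.Sequential` stand-ins below are illustrative):

```python
import torch
import torch.nn as nn

# Toy stand-ins: a "backbone-only" checkpoint loaded into a model
# that adds an extra head on top.
backbone_only = nn.Sequential(nn.Linear(8, 8))
full_model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 1))  # extra head

result = full_model.load_state_dict(backbone_only.state_dict(), strict=False)
print(result.missing_keys)     # head weights absent from the checkpoint
print(result.unexpected_keys)  # nothing in the checkpoint went unused
```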

Feature Extraction API

Extract intermediate features for analysis:
features = model.extract_features(frames, prompts)
# Returns:
# {
#     'dover_features': (B, 1024),
#     'text_features': (B, 1024),
#     'fused_features': (B, 256),
#     'quality_weights': (B, 4),
#     'aesthetic_score': (B, 1),
#     'technical_score': (B, 1)
# }

Advantages

Disentangled Quality

Separates aesthetic and technical aspects for interpretable assessment

Cross-Modal Fusion

Attention mechanism aligns video features with text guidance

Pretrained Backbone

Leverages high-quality pretrained weights from DOVER project

Quality-Aware

Dynamically weights quality aspects based on text prompt

Related Pages

  • V-JEPA2 Model: alternative architecture with a ViT backbone
  • Quality Dimensions: understanding the 4 quality metrics
  • Architecture: overall system architecture
