
Welcome to QualiVision
QualiVision is a state-of-the-art framework for video quality assessment, designed specifically for AI-generated content. Built for the VQualA 2025 Challenge, it provides comprehensive quality evaluation across four critical dimensions.
Quick Start
Get started with QualiVision in minutes using pre-trained models
Installation
Set up your environment and install dependencies
Models Overview
Learn about DOVER++ and V-JEPA2 architectures
Quality Assessment Dimensions
QualiVision evaluates AI-generated videos across four critical quality dimensions:
Temporal Consistency
Measures the coherence and smoothness of motion across video frames. Ensures that objects and scenes maintain logical continuity throughout the video sequence.
Image Fidelity
Evaluates visual quality including sharpness, clarity, and absence of artifacts. Assesses the technical quality of individual frames and overall video rendering.
Aesthetic Appeal
Analyzes artistic and visual attractiveness including composition, color harmony, and overall visual appeal. Goes beyond technical quality to evaluate subjective beauty.
Text-Video Alignment
Determines how well the video content corresponds to the input text prompt. Critical for ensuring AI-generated videos match user intentions.
Two Powerful Models
QualiVision provides two complementary state-of-the-art architectures:
DOVER++
ConvNeXt 3D-based Architecture
- Cross-modal attention between video and text
- Quality-aware fusion mechanism
- 640×640 resolution, 64 frames
- ~120M parameters, ~12GB memory
- Robust aesthetic/technical quality separation
V-JEPA2
V-JEPA 2 (Video Joint Embedding Predictive Architecture) ViT-Giant Architecture
- Strategic layer freezing (85% frozen)
- Discriminative learning rates
- 384×384 resolution, 64 frames
- ~1.1B parameters, ~16GB memory
- Strong video representation learning
Key Features
Multi-Modal Fusion
Cross-modal attention combines video features with text-prompt embeddings produced by a BGE-Large encoder.
Hybrid Loss Function
A loss that combines smooth L1, ranking, and scale-aware components for robust training.
Pre-trained Models
Ready-to-use checkpoints trained on the VQualA 2025 Challenge dataset for immediate evaluation.
Efficient Training
Strategic model freezing and memory optimization techniques enable training on consumer GPUs
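The hybrid loss listed under Key Features can be illustrated with a minimal sketch in plain Python. It combines a smooth L1 regression term with a pairwise hinge ranking term. The function names, the weights `w_reg` and `w_rank`, and the `beta` and `margin` defaults are illustrative assumptions, not QualiVision's actual implementation. The scale-aware component is left out because its definition is not given here.

```python
from itertools import combinations


def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)


def pairwise_ranking(pred, target, margin=0.0):
    """Mean hinge penalty over all pairs; ties in the target contribute nothing."""
    pairs = list(combinations(range(len(pred)), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        # sign is +1 if item i should rank above item j, -1 if below, 0 if tied
        sign = (target[i] > target[j]) - (target[i] < target[j])
        if sign == 0:
            continue
        total += max(0.0, margin - sign * (pred[i] - pred[j]))
    return total / len(pairs)


def hybrid_loss(pred, target, w_reg=1.0, w_rank=0.5):
    """Weighted sum of the regression and ranking terms (weights are assumed)."""
    return w_reg * smooth_l1(pred, target) + w_rank * pairwise_ranking(pred, target)


if __name__ == "__main__":
    predicted = [3.0, 2.0, 1.0]   # hypothetical model scores
    reference = [1.0, 2.0, 3.0]   # hypothetical mean opinion scores
    print(hybrid_loss(predicted, reference))
```

The ranking term penalizes only pairs whose predicted order contradicts the reference order, so a model can reduce it by ranking videos correctly even when its absolute scores are off. The regression term keeps the predicted scores on the same scale as the annotations.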
Real-World Application
VQualA 2025 Challenge
QualiVision is our submission for the VQualA 2025 Challenge at ICCV 2025 Workshops. The framework is designed to handle the TaobaoVD-GC dataset, which contains thousands of AI-generated videos with comprehensive quality annotations.
Data Format
QualiVision works with structured video datasets.
Technical Innovations
Quality-Aware Fusion
Dynamic attention weighting that adapts based on text content, allowing the model to focus on relevant quality aspects for different types of videos.
Strategic Layer Freezing
Freeze 85% of V-JEPA2 layers to maintain pre-trained knowledge while enabling efficient fine-tuning on domain-specific data.
Adaptive Loss Weighting
Dynamically adjusts loss component weights during training to balance different quality objectives.
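The strategic layer freezing described above, together with the discriminative learning rates listed for V-JEPA2, can be sketched framework-independently. This is a minimal illustration under stated assumptions: the 40-block layer list, the `block_N` names, the base learning rate, and the decay factor are all hypothetical, and the sketch assumes that the earliest layers are the ones frozen. In a PyTorch model, the frozen group would typically have `requires_grad` set to `False` on its parameters.

```python
def split_frozen(layer_names, frozen_fraction=0.85):
    """Split an ordered layer list into (frozen, trainable), freezing the earliest layers."""
    if not 0.0 <= frozen_fraction <= 1.0:
        raise ValueError("frozen_fraction must be between 0 and 1")
    n_frozen = round(len(layer_names) * frozen_fraction)
    return layer_names[:n_frozen], layer_names[n_frozen:]


def layer_learning_rates(trainable, base_lr=1e-4, decay=0.8):
    """Assign the last layer base_lr and shrink the rate by `decay` per layer toward the input."""
    n = len(trainable)
    return {name: base_lr * decay ** (n - 1 - i) for i, name in enumerate(trainable)}


if __name__ == "__main__":
    # Hypothetical encoder with 40 transformer blocks, ordered input to output.
    layers = [f"block_{i}" for i in range(40)]
    frozen, trainable = split_frozen(layers, frozen_fraction=0.85)
    print(len(frozen), "frozen,", len(trainable), "trainable")
    for name, lr in layer_learning_rates(trainable).items():
        print(name, lr)
```

Freezing the early layers preserves the general-purpose features learned in pre-training, while the smaller learning rates on the deeper trainable layers closest to the frozen boundary limit how far they drift from their pre-trained values.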
Performance Metrics
Our models are evaluated using industry-standard video quality metrics:
- SROCC: Spearman Rank Order Correlation Coefficient
- PLCC: Pearson Linear Correlation Coefficient
- VQualA Score: Custom challenge metric combining multiple quality dimensions
| Model | Parameters | Memory | Resolution |
|---|---|---|---|
| DOVER++ | ~120M | ~12GB | 640×640 |
| V-JEPA2 | ~1.1B | ~16GB | 384×384 |
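The two correlation metrics named in this section can be computed with the standard library alone. The sketch below is a reference-style illustration, not the challenge's official scoring code; the function names and the sample scores are hypothetical. The VQualA Score is not reproduced because its formula is not stated here.

```python
import math


def pearson(x, y):
    """PLCC: Pearson linear correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """SROCC: Pearson correlation computed on the ranks of the scores."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    predicted = [1.0, 2.0, 3.0, 4.0, 5.0]     # hypothetical model outputs
    reference = [1.0, 4.0, 9.0, 16.0, 25.0]   # hypothetical mean opinion scores
    print("SROCC:", spearman(predicted, reference))
    print("PLCC:", pearson(predicted, reference))
```

In this example the predictions rank the videos perfectly, so SROCC reaches 1, while PLCC falls slightly below 1 because the relationship between the two score lists is monotonic but not linear.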
Next Steps
Try the Quickstart
Run your first evaluation in under 5 minutes
Install QualiVision
Set up your development environment
