Overview

QualiVision implements a sophisticated data preprocessing pipeline that handles video loading, frame sampling, resolution adaptation, text encoding, and MOS score normalization. The pipeline is optimized for GPU acceleration and supports both DOVER++ and V-JEPA2 models.

Dataset Structure

From README.md:19-43, the expected directory structure:
data/
├── train/
│   ├── labels/
│   │   └── train_labels.csv
│   └── videos/
│       ├── video001.mp4
│       ├── video002.mp4
│       └── ...
├── val/
│   ├── labels/
│   │   └── val_labels.csv
│   └── videos/
│       ├── val_video001.mp4
│       └── ...
└── test/
    ├── test_labels.csv
    └── videos/
        ├── test_video001.mp4
        └── ...
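The layout above can be verified before training with a small stdlib-only check; `check_layout` is a hypothetical helper (not part of QualiVision) that returns any expected paths that are missing:

```python
from pathlib import Path

# Hypothetical helper: verify the expected data/ layout and
# return the paths that are missing (empty list means all good).
def check_layout(root: str) -> list:
    base = Path(root)
    expected = [
        base / "train" / "labels" / "train_labels.csv",
        base / "train" / "videos",
        base / "val" / "labels" / "val_labels.csv",
        base / "val" / "videos",
        base / "test" / "test_labels.csv",
        base / "test" / "videos",
    ]
    return [str(p) for p in expected if not p.exists()]
```

Running this against a fresh dataset root catches misplaced label CSVs early, before the dataset class fails with a less obvious error.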

CSV Label Format

From README.md:46-50 and dataset.py:37:
video_name,Prompt,Traditional_MOS,Alignment_MOS,Aesthetic_MOS,Temporal_MOS,Overall_MOS
video001.mp4,"A cat playing piano",3.2,4.1,3.8,3.5,3.65
video002.mp4,"Sunset over mountains",4.5,4.2,4.8,4.1,4.4
Required Columns:
  • video_name: Video filename
  • Prompt: Text description/prompt
  • Traditional_MOS: Technical quality score (1-5)
  • Alignment_MOS: Text-video alignment score (1-5)
  • Aesthetic_MOS: Aesthetic appeal score (1-5)
  • Temporal_MOS: Temporal consistency score (1-5)
  • Overall_MOS: Overall quality score (1-5)
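A quick way to validate a label CSV against the required columns is a header check; this stdlib-only sketch (our helper name, not from the repo) fails fast before the dataset indexes into missing columns:

```python
import csv
import io

REQUIRED = ["video_name", "Prompt", "Traditional_MOS", "Alignment_MOS",
            "Aesthetic_MOS", "Temporal_MOS", "Overall_MOS"]

# Hypothetical validation helper: return required columns absent
# from the CSV header (empty list means the file is usable).
def missing_columns(csv_text: str) -> list:
    header = next(csv.reader(io.StringIO(csv_text)))
    return [col for col in REQUIRED if col not in header]
```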

Frame Sampling

All videos are uniformly sampled to 64 frames:
From dataset.py:82-98:
def _sample_frames_manual(self, video_path: Path) -> torch.Tensor:
    vr = VideoReader(str(video_path))
    total_frames = len(vr)

    # Sample num_frames uniformly spaced indices; for videos shorter
    # than num_frames, linspace repeats indices (frames are duplicated)
    indices = np.linspace(0, total_frames - 1, self.num_frames).astype(int)
    indices = np.clip(indices, 0, total_frames - 1)

    frames = vr.get_batch(indices)  # Shape: (T, H, W, C)
Process:
  1. Count total frames in video
  2. Generate 64 evenly-spaced indices using np.linspace (videos shorter than 64 frames get duplicated indices)
  3. Clip indices to valid range [0, total_frames-1]
  4. Extract frames using Decord’s batch API
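The index math above can be isolated into a standalone sketch (function name is ours; the logic mirrors `_sample_frames_manual`):

```python
import numpy as np

# Uniformly sample num_frames indices from [0, total_frames - 1].
# Short videos yield repeated indices, i.e. duplicated frames.
def uniform_indices(total_frames: int, num_frames: int) -> np.ndarray:
    indices = np.linspace(0, total_frames - 1, num_frames).astype(int)
    return np.clip(indices, 0, total_frames - 1)
```

For a 10-frame video sampled to 4 frames this yields indices `[0, 3, 6, 9]`; a 3-frame video sampled to 64 frames yields 64 indices, all within `[0, 2]`.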

Resolution Adaptation

Different models require different input resolutions:
From dataset.py:64-69 and config.py:23:
# Transform pipeline
self.transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((640, 640)),  # DOVER++ resolution
    transforms.ToTensor(),  # Converts to [0,1] range
])
Process per frame:
  1. Convert numpy array → PIL Image
  2. Resize to 640×640 (bilinear interpolation, the torchvision Resize default)
  3. Convert to PyTorch tensor (C, H, W)
  4. Normalize to [0, 1] range
Final shape: (B, C=3, T=64, H=640, W=640)
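The layout conversion alone (resizing omitted) can be sketched in NumPy: decoded frames arrive as `(T, H, W, C)` uint8 and the model side expects channels-first floats in `[0, 1]`. The function name is ours:

```python
import numpy as np

# Convert decoded frames (T, H, W, C) uint8 into the model layout
# (C, T, H, W) float32, normalized to [0, 1].
def to_model_layout(frames: np.ndarray) -> np.ndarray:
    assert frames.ndim == 4 and frames.dtype == np.uint8
    video = frames.astype(np.float32) / 255.0   # [0, 255] -> [0, 1]
    return video.transpose(3, 0, 1, 2)          # (T, H, W, C) -> (C, T, H, W)
```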

Text Processing with BGE-Large

Text prompts are encoded to dense embeddings:
From config.py:25,41:
DOVER_CONFIG = {"text_encoder": "BAAI/bge-large-en-v1.5"}
VJEPA_CONFIG = {"text_encoder": "BAAI/bge-large-en-v1.5"}
BGE-Large (BAAI General Embedding):
  • Model: Sentence Transformer
  • Architecture: BERT-based encoder
  • Dimension: 1024 (DOVER++) or 768 (V-JEPA2)
  • Training: Contrastive learning on text pairs

MOS Score Normalization

From dataset.py:182-189:
if self.has_labels:
    # Training/validation mode - include labels
    labels = pd.to_numeric(row[self.MOS_COLS], errors="coerce").fillna(3.0).astype(np.float32).values
    result["labels"] = torch.tensor(labels, dtype=torch.float32)
else:
    # Test mode - no labels available
    result["labels"] = torch.zeros(5, dtype=torch.float32)
Steps:
  1. Parse: Convert CSV strings to numeric values
  2. Handle missing: Fill NaN with 3.0 (neutral score)
  3. Cast: Convert to float32 for GPU efficiency
  4. Tensorize: Create PyTorch tensor
Range: 1.0 to 5.0 (MOS scale)

From README.md:56:
- **Quality Normalization**: MOS scores standardized to 1-5 scale
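The parse-and-fill steps above reduce to a small amount of plain Python (a pandas-free sketch; tensor conversion omitted, helper name is ours):

```python
import math

NEUTRAL = 3.0  # fallback for missing/unparseable scores

# Parse raw CSV values into floats, filling anything that fails
# to parse (empty string, None, garbage) with the neutral score.
def parse_mos(raw_values) -> list:
    out = []
    for v in raw_values:
        try:
            x = float(v)
        except (TypeError, ValueError):
            x = float("nan")
        out.append(NEUTRAL if math.isnan(x) else x)
    return out
```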

Data Loading Pipeline

TaobaoVDDataset Class

From dataset.py:29-190:
class TaobaoVDDataset(Dataset):
    MOS_COLS = ['Traditional_MOS', 'Alignment_MOS', 'Aesthetic_MOS', 'Temporal_MOS', 'Overall_MOS']
    
    def __init__(self, 
                 csv_file: str,
                 video_dir: str,
                 num_frames: int = 64,
                 resolution: int = 640,
                 mode: str = 'train',
                 video_processor: Optional[AutoVideoProcessor] = None):
        self.df = pd.read_csv(csv_file)
        self.video_dir = Path(video_dir)
        self.num_frames = num_frames
        self.resolution = resolution
        self.mode = mode
        self.video_processor = video_processor
        
        # Video transforms
        self.transform = transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize((resolution, resolution)),
            transforms.ToTensor(),
        ])
        
        # Check if we have ground truth labels
        self.has_labels = all(col in self.df.columns for col in self.MOS_COLS)
Features:
  • Supports both manual transforms (DOVER++) and video processor (V-JEPA2)
  • Automatic label detection
  • Flexible resolution
  • Mode-aware (train/val/test)
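The automatic label detection is just a column-membership check; a standalone version of the `has_labels` logic (function name is ours):

```python
MOS_COLS = ['Traditional_MOS', 'Alignment_MOS', 'Aesthetic_MOS',
            'Temporal_MOS', 'Overall_MOS']

# A CSV is treated as labeled only if every MOS column is present;
# otherwise the dataset falls back to test mode (zero labels).
def detect_labels(columns) -> bool:
    return all(col in columns for col in MOS_COLS)
```

This is what lets the same dataset class serve train/val splits (full labels) and the test split (no labels) without a separate code path.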

GPU-Optimized Collation

From dataset.py:193-288:
class OptimizedGPUCollate:
    def __init__(self, 
                 video_processor: Optional[AutoVideoProcessor] = None,
                 text_encoder: Optional[SentenceTransformer] = None,
                 device: str = 'cuda',
                 max_frames: int = 64):
        self.video_processor = video_processor
        self.text_encoder = text_encoder
        self.device = device
        self.max_frames = max_frames
Purpose: Custom collation function that:
  • Batches video frames
  • Encodes text prompts on GPU
  • Moves data to GPU early
  • Cleans up memory
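The batching step can be sketched shape-only in NumPy (text encoding and GPU transfer omitted; names are ours, not the repo's): per-item `(C, T, H, W)` arrays are stacked into a `(B, C, T, H, W)` batch.

```python
import numpy as np

# Stack per-item frame arrays (C, T, H, W) along a new leading
# batch axis, producing (B, C, T, H, W).
def collate_frames(items) -> np.ndarray:
    return np.stack([item["frames"] for item in items], axis=0)
```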

DataLoader Configuration

From dataset.py:291-375:
train_loader = DataLoader(
    train_dataset,
    batch_size=4,              # DOVER++: 4, V-JEPA2: 6
    shuffle=True,              # Shuffle for training
    collate_fn=collate_fn,     # Custom GPU collation
    num_workers=4,             # Parallel data loading
    pin_memory=True,           # Faster CPU→GPU transfer
    persistent_workers=True    # Keep workers alive
)
From config.py:31,47:
DOVER_CONFIG = {"batch_size": 4}
VJEPA_CONFIG = {"batch_size": 6}
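A small sketch of selecting the per-model batch size from these configs (the dict keys are as quoted above; the lookup function is hypothetical):

```python
# Config fragments as quoted from config.py:31,47
DOVER_CONFIG = {"batch_size": 4}
VJEPA_CONFIG = {"batch_size": 6}

# Hypothetical helper: pick the batch size for the chosen model.
def batch_size_for(model: str) -> int:
    return {"dover": DOVER_CONFIG, "vjepa": VJEPA_CONFIG}[model]["batch_size"]
```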

Complete Data Flow

1. CSV Loading

df = pd.read_csv(csv_file)
# Columns: video_name, Prompt, Traditional_MOS, ..., Overall_MOS

2. Video Reading

vr = VideoReader(str(video_path))
frames = vr.get_batch(indices)  # (64, H, W, 3)

3. Frame Transformation

transformed = [transform(frame) for frame in frames]  # Resize + ToTensor, each (3, H, W)
video_tensor = torch.stack(transformed, dim=1)  # (3, 64, H, W)

4. Batching

batch_frames = torch.stack([item['frames'] for item in batch])  # (B, 3, 64, H, W)
batch_frames = batch_frames.to('cuda')

5. Text Encoding

text_emb = text_encoder.encode(prompts, device='cuda')  # (B, 1024)

6. Label Extraction

labels = torch.tensor(mos_scores, dtype=torch.float32)  # (B, 5)

7. Return Batch

return {
    "pixel_values_videos": batch_frames,
    "text_emb": text_emb,
    "labels": labels,
    "prompts": prompts,
    "video_names": video_names
}

Performance Considerations

Common bottlenecks in video data loading:
  1. Disk I/O: Reading video files from disk
    • Solution: Use SSD storage, prefetch with num_workers
  2. Video decoding: Decompressing video codecs
    • Solution: Decord with GPU acceleration
  3. Frame transformation: Resizing and converting
    • Solution: Batch transforms, use GPU if possible
  4. Text encoding: BERT forward pass
    • Solution: Encode in collate_fn (parallel with training)
  5. CPU→GPU transfer: Moving data to GPU
    • Solution: pin_memory=True, early GPU transfer

Related Pages

  • Architecture: how preprocessed data flows through models
  • DOVER++ Model: 640×640 resolution requirements
  • V-JEPA2 Model: 384×384 resolution and video processor
