Purpose
The evaluation suite enables quantitative analysis of:
- Accuracy: How closely do estimated offsets match ground truth?
- Confidence Reliability: Do confidence scores correlate with error magnitude?
- Cross-Method Agreement: How consistently do audio and visual methods agree?
- Efficiency: Runtime performance and resource usage characteristics
- Sensitivity: How do different video characteristics (motion level, audio energy, length) affect performance?
Architecture
The evaluation pipeline consists of four independent stages:

Offset Generation
Generate synthetic test cases by applying known temporal shifts to original videos
- Produces 24 test cases from 4 source videos × 6 offsets
- Creates ground truth metadata with sensitivity tags
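Conceptually, the test matrix is the Cartesian product of source videos and offsets. A minimal sketch of that enumeration (the video names and offset values below are illustrative stand-ins, not the actual dataset):

```python
import itertools

# Illustrative source videos and offsets -- the real names and offset
# values are not specified in this document.
videos = ["vid_a", "vid_b", "vid_c", "vid_d"]
offsets_ms = [-2000, -500, -100, 100, 500, 2000]

# 4 source videos x 6 offsets -> 24 test cases with known ground truth
cases = [
    {"video_id": v, "true_offset_ms": o}
    for v, o in itertools.product(videos, offsets_ms)
]
```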
Batch Synchronization
Run both audio and visual sync on every test case
- 48 synchronization runs total (24 cases × 2 methods)
- Captures accuracy, confidence, runtime, and resource usage
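A sketch of how the runtime and memory capture around each run could look, using only the standard library (`run_sync` is a hypothetical stand-in for either sync method, not the pipeline's actual API):

```python
import time
import tracemalloc

def run_sync(case_path: str) -> dict:
    """Hypothetical stand-in for one audio or visual sync run."""
    return {"estimated_offset_ms": 0.0}

# Wrap a single run with wall-clock timing and peak-memory tracking
tracemalloc.start()
t0 = time.perf_counter()
result = run_sync("cases/case_01.mp4")
runtime_seconds = time.perf_counter() - t0
peak_memory_mb = tracemalloc.get_traced_memory()[1] / 1e6
tracemalloc.stop()

result.update(runtime_seconds=runtime_seconds, peak_memory_mb=peak_memory_mb)
```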
Metrics Computation
Aggregate results into statistical summaries
- MAE, RMSE, cross-method agreement
- Confidence validation (Pearson correlation)
- Efficiency and resource analysis
Plot Generation
Render the evaluation plots from the computed metrics
- High-DPI visualizations: regression analysis, boxplots, per-case diagnostics
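The aggregation step can be sketched as follows; the error/confidence pairs are made-up values, not real results, and the helper is a plain Pearson formula rather than the pipeline's actual implementation:

```python
import math

# Made-up per-case results: (absolute_error_ms, confidence_score)
results = [(12.0, 0.95), (48.0, 0.70), (5.0, 0.99), (90.0, 0.40)]
errors = [e for e, _ in results]
confidences = [c for _, c in results]

# Accuracy: mean absolute error and root-mean-square error
mae = sum(errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Confidence validation: a strongly negative correlation means high
# confidence predicts low error, i.e. confidence scores are reliable.
r = pearson(confidences, errors)
```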
Directory Structure
Key Features
Ground Truth Validation
Synthetic dataset with known offsets enables precise accuracy measurement without manual annotation
Comprehensive Metrics
Tracks accuracy, confidence, efficiency, and resource usage across all test cases
Reproducible
Fully scripted pipeline ensures consistent, repeatable results
Publication-Ready
High-DPI plots with regression analysis, boxplots, and per-case diagnostics
Sensitivity Tags
Each test case is annotated with metadata to enable sensitivity analysis:

| Tag | Description | Use Case |
|---|---|---|
| motion_level | Mean frame-to-frame pixel difference (0-1) | Assess visual sync performance on low-motion vs high-motion videos |
| audio_energy_level | RMS energy of audio signal (0-1) | Assess audio sync performance on quiet vs loud recordings |
| video_length_sec | Duration of the original video | Analyze runtime scaling with video length |
Sensitivity tags are computed automatically during offset generation by sampling the first 10 seconds (audio) or 300 frames (motion) of each video.
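The two tag computations reduce to simple signal statistics. A sketch under toy inputs (the helper names and data are illustrative; the real pipeline samples the first 10 seconds of audio and 300 frames of video):

```python
def audio_energy_level(samples):
    """RMS energy of audio samples normalized to [-1, 1]."""
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def motion_level(frames):
    """Mean frame-to-frame absolute pixel difference, pixels in [0, 1]."""
    diffs = [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(prev)
        for prev, cur in zip(frames, frames[1:])
    ]
    return sum(diffs) / len(diffs)

# Toy data: three 4-pixel "frames" -- one large jump, then no motion
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, 0.5, 0.5],
]
```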
Quick Start
Output Artifacts
synthetic_metadata.csv
24-row CSV with ground truth offsets and sensitivity tags for each test case.
Columns: video_id, synthetic_file_path, true_offset_ms, video_length_sec, motion_level, audio_energy_level
results.csv
48-row CSV with per-method synchronization results.
Columns: video_id, true_offset_ms, method_type, estimated_offset_ms, absolute_error_ms, confidence_score, runtime_seconds, peak_cpu_percent, peak_memory_mb, plus sensitivity tags
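Because the artifacts are plain CSV, downstream analysis needs only standard tooling. For example, per-method MAE can be pulled from results.csv with the stdlib csv module (the inline rows below are made-up stand-ins for the real file):

```python
import csv
import io
from collections import defaultdict

# Made-up rows following the results.csv schema (subset of columns)
raw = """video_id,method_type,absolute_error_ms
v1,audio,12.0
v1,visual,40.0
v2,audio,8.0
v2,visual,60.0
"""

per_method = defaultdict(list)
for row in csv.DictReader(io.StringIO(raw)):
    per_method[row["method_type"]].append(float(row["absolute_error_ms"]))

# Mean absolute error per synchronization method
mae_by_method = {m: sum(v) / len(v) for m, v in per_method.items()}
```

With a real run, replace the `io.StringIO(raw)` stream with an open file handle on results.csv.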
metrics_summary.json
Structured JSON with aggregated statistics across six categories:
accuracy, cross_method_agreement, confidence_validation, efficiency, resource_usage, grouped_by_tag
plots/ (8 PNG files + subdirectories)
High-DPI (300 DPI) publication-ready visualizations covering all evaluation dimensions.
All scripts use logging to stdout for progress tracking. Intermediate results are cached so you can re-run later stages without regenerating earlier artifacts.
Next Steps
Workflow
Detailed walkthrough of each pipeline stage
Metrics
Complete reference for all computed metrics
Visualization
Gallery and descriptions of all plot types