The evaluation suite is a fully script-driven, reproducible pipeline for assessing both audio (GCC-PHAT) and visual (motion-based) synchronization methods across multiple dimensions.

Purpose

The evaluation suite enables quantitative analysis of:
  • Accuracy: How closely do estimated offsets match ground truth?
  • Confidence Reliability: Do confidence scores correlate with error magnitude?
  • Cross-Method Agreement: How consistently do audio and visual methods agree?
  • Efficiency: How fast is each method, and what resources does it consume?
  • Sensitivity: How do different video characteristics (motion level, audio energy, length) affect performance?

Architecture

The evaluation pipeline consists of four independent stages:
1. Offset Generation

Generate synthetic test cases by applying known temporal shifts to original videos
  • Produces 24 test cases from 4 source videos × 6 offsets
  • Creates ground truth metadata with sensitivity tags
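The known-shift idea behind this stage can be sketched with ffmpeg's `-itsoffset` option, which delays one input's timestamps. The helper below (a hypothetical function, not the actual `offset_generation.py` code) builds a command that shifts the audio track by a known number of milliseconds while copying the video stream unchanged:

```python
import subprocess
from pathlib import Path

def build_shift_command(src: Path, dst: Path, offset_ms: int) -> list[str]:
    """Build an ffmpeg command that delays the audio by offset_ms
    (negative values advance it), leaving the video stream untouched.

    Illustrative only: the real offset_generation.py may construct its
    synthetic cases differently.
    """
    return [
        "ffmpeg", "-y",
        "-i", str(src),                    # input 0: video (and original audio)
        "-itsoffset", f"{offset_ms / 1000:.3f}",
        "-i", str(src),                    # input 1: same file, time-shifted
        "-map", "0:v", "-map", "1:a",      # video from input 0, audio from input 1
        "-c", "copy",                      # no re-encode, just remux
        str(dst),
    ]

cmd = build_shift_command(Path("originals/a.mp4"),
                          Path("synthetic/a_shift250.mp4"), 250)
# subprocess.run(cmd, check=True)  # uncomment when ffmpeg is installed
```

Because the shift is applied by the generator itself, the true offset is known exactly and becomes the ground truth recorded in the metadata CSV.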
2. Batch Synchronization

Run both audio and visual sync on every test case
  • 48 synchronization runs total (24 cases × 2 methods)
  • Captures accuracy, confidence, runtime, and resource usage
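For orientation, the audio method's core algorithm (GCC-PHAT) can be sketched in a few lines of NumPy. This is a minimal standalone version, not the suite's actual implementation, and the confidence heuristic here (peak mass relative to total correlation mass) is an assumption:

```python
import numpy as np

def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int) -> tuple[float, float]:
    """Estimate the delay of `sig` relative to `ref`, in seconds, via GCC-PHAT."""
    n = len(sig) + len(ref)                  # zero-pad to avoid circular wrap-around
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-15                   # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # lags -max..+max
    peak = int(np.argmax(np.abs(cc)))
    offset_s = (peak - max_shift) / fs
    confidence = float(np.abs(cc[peak]) / (np.abs(cc).sum() + 1e-15))
    return offset_s, confidence
```

Given two recordings of the same event, the lag of the correlation peak is the estimated offset; the batch runner records that estimate alongside the known ground truth for each of the 48 runs.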
3. Metrics Computation

Aggregate results into statistical summaries
  • MAE, RMSE, cross-method agreement
  • Confidence validation (Pearson correlation)
  • Efficiency and resource analysis
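The headline statistics are standard; a short sketch shows what they summarize. This mirrors the metric names (MAE, RMSE, Pearson r between confidence and error) but is not necessarily `compute_metrics.py`'s exact aggregation:

```python
import numpy as np

def summarize(true_ms, est_ms, confidence) -> dict[str, float]:
    """Accuracy and confidence-validation summary over a batch of runs."""
    err = np.asarray(est_ms, float) - np.asarray(true_ms, float)
    abs_err = np.abs(err)
    mae = float(abs_err.mean())                    # mean absolute error
    rmse = float(np.sqrt((err ** 2).mean()))       # penalizes large misses more
    # A reliable confidence score should correlate NEGATIVELY with error:
    # high confidence -> small error.
    r = float(np.corrcoef(confidence, abs_err)[0, 1])
    return {"mae_ms": mae, "rmse_ms": rmse, "confidence_error_pearson_r": r}
```

A strongly negative Pearson r indicates that confidence scores are informative; an r near zero means confidence tells you little about accuracy.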
4. Visualization

Generate publication-ready plots
  • 8 plot types covering accuracy, confidence, efficiency
  • Per-case diagnostics (motion signals, timelines)
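One of the plot types, a per-method error boxplot, can be sketched as follows. The styling, filename, and function name here are assumptions, not `visualize_results.py`'s actual code; only the 300 DPI output matches the suite's stated convention:

```python
import matplotlib
matplotlib.use("Agg")   # headless backend: render without a display
import matplotlib.pyplot as plt

def plot_error_boxplot(errors_by_method: dict[str, list[float]], out_path: str) -> None:
    """Save a per-method absolute-error boxplot as a 300 DPI PNG."""
    fig, ax = plt.subplots(figsize=(5, 4))
    ax.boxplot(list(errors_by_method.values()))
    ax.set_xticklabels(list(errors_by_method))     # one tick per method
    ax.set_ylabel("absolute error (ms)")
    ax.set_title("Sync error by method")
    fig.savefig(out_path, dpi=300, bbox_inches="tight")
    plt.close(fig)
```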

Directory Structure

evaluation/
├── originals/          # Place your 4 source videos here
├── synthetic/          # Generated offset-shifted videos
├── metadata/
│   └── synthetic_metadata.csv
├── results/
│   └── results.csv
├── metrics/
│   └── metrics_summary.json
├── plots/              # Publication-ready PNG plots
├── offset_generation.py
├── run_batch.py
├── compute_metrics.py
└── visualize_results.py

Key Features

Ground Truth Validation

Synthetic dataset with known offsets enables precise accuracy measurement without manual annotation

Comprehensive Metrics

Tracks accuracy, confidence, efficiency, and resource usage across all test cases

Reproducible

Fully scripted pipeline ensures consistent, repeatable results

Publication-Ready

High-DPI plots with regression analysis, boxplots, and per-case diagnostics

Sensitivity Tags

Each test case is annotated with metadata to enable sensitivity analysis:
  • motion_level: mean frame-to-frame pixel difference (0-1). Use case: assess visual sync performance on low-motion vs. high-motion videos.
  • audio_energy_level: RMS energy of the audio signal (0-1). Use case: assess audio sync performance on quiet vs. loud recordings.
  • video_length_sec: duration of the original video in seconds. Use case: analyze how runtime scales with video length.
Sensitivity tags are computed automatically during offset generation by sampling the first 10 seconds (audio) or 300 frames (motion) of each video.
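The two normalized tags amount to a mean frame difference and an RMS computation. The sketch below assumes grayscale frames scaled to [0, 1] and audio samples scaled to [-1, 1]; the suite's actual sampling and normalization may differ:

```python
import numpy as np

def motion_level(frames: np.ndarray) -> float:
    """Mean frame-to-frame absolute pixel difference over a (T, H, W) stack.

    Sketch of the motion_level tag (the real script samples the first
    300 frames); 0.0 = static video, 1.0 = every pixel flips fully.
    """
    diffs = np.abs(np.diff(frames.astype(float), axis=0))
    return float(diffs.mean())

def audio_energy_level(samples: np.ndarray) -> float:
    """RMS energy of an audio signal normalized to [-1, 1].

    Sketch of the audio_energy_level tag (first 10 s in the real script).
    """
    return float(np.sqrt(np.mean(samples.astype(float) ** 2)))
```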

Quick Start

# 1. Place 4 source videos in evaluation/originals/

# 2. Run the complete pipeline
python -m evaluation.offset_generation
python -m evaluation.run_batch
python -m evaluation.compute_metrics
python -m evaluation.visualize_results

Output Artifacts

  • metadata/synthetic_metadata.csv: 24-row CSV with ground truth offsets and sensitivity tags for each test case. Columns: video_id, synthetic_file_path, true_offset_ms, video_length_sec, motion_level, audio_energy_level.
  • results/results.csv: 48-row CSV with per-method synchronization results. Columns: video_id, true_offset_ms, method_type, estimated_offset_ms, absolute_error_ms, confidence_score, runtime_seconds, peak_cpu_percent, peak_memory_mb, plus the sensitivity tags.
  • metrics/metrics_summary.json: structured JSON with aggregated statistics across six categories: accuracy, cross_method_agreement, confidence_validation, efficiency, resource_usage, and grouped_by_tag.
  • plots/: high-DPI (300 DPI) publication-ready visualizations covering all evaluation dimensions.
All scripts use logging to stdout for progress tracking. Intermediate results are cached so you can re-run later stages without regenerating earlier artifacts.
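The caching behavior can be approximated by a simple skip-if-exists check around each stage. This is an illustrative pattern only; the real scripts' cache-invalidation rules (if any) are not documented on this page, and `cached_stage` is a hypothetical helper:

```python
from pathlib import Path
from typing import Callable

def cached_stage(artifact: Path, build: Callable[[], str]) -> str:
    """Run `build` only if `artifact` is missing; otherwise reuse it.

    Sketch of the skip-if-exists caching idea: re-running a later stage
    does not regenerate an earlier stage's existing output.
    """
    if artifact.exists():
        return artifact.read_text()          # reuse earlier output
    artifact.parent.mkdir(parents=True, exist_ok=True)
    result = build()
    artifact.write_text(result)              # persist for future runs
    return result
```

With this pattern, deleting an artifact (e.g. `results/results.csv`) is how you force that stage, and everything downstream of it, to recompute.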

Next Steps

Workflow

Detailed walkthrough of each pipeline stage

Metrics

Complete reference for all computed metrics

Visualization

Gallery and descriptions of all plot types
