Purpose
The evaluation suite enables quantitative analysis of:
- Accuracy: How closely do estimated offsets match ground truth?
- Confidence Reliability: Do confidence scores correlate with error magnitude?
- Cross-Method Agreement: How consistently do audio and visual methods agree?
- Efficiency: Runtime performance and resource usage characteristics
- Sensitivity: How do different video characteristics (motion level, audio energy, length) affect performance?
Architecture
The evaluation pipeline consists of four independent stages:

Offset Generation
Generate synthetic test cases by applying known temporal shifts to original videos
- Produces 24 test cases from 4 source videos × 6 offsets
- Creates ground truth metadata with sensitivity tags
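Conceptually, the test matrix is the Cartesian product of source videos and offsets. A minimal sketch of that enumeration (the video names and offset values below are illustrative stand-ins, not the actual dataset):

```python
import itertools

# Illustrative source videos and offsets -- the real names and offset
# values are not specified in this document.
videos = ["vid_a", "vid_b", "vid_c", "vid_d"]
offsets_ms = [-2000, -500, -100, 100, 500, 2000]

# 4 source videos x 6 offsets -> 24 test cases with known ground truth
cases = [
    {"video_id": v, "true_offset_ms": o}
    for v, o in itertools.product(videos, offsets_ms)
]
```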
Batch Synchronization
Run both audio and visual sync on every test case
- 48 synchronization runs total (24 cases × 2 methods)
- Captures accuracy, confidence, runtime, and resource usage
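A sketch of how the runtime and memory capture around each run could look, using only the standard library (`run_sync` is a hypothetical stand-in for either sync method, not the pipeline's actual API):

```python
import time
import tracemalloc

def run_sync(case_path: str) -> dict:
    """Hypothetical stand-in for one audio or visual sync run."""
    return {"estimated_offset_ms": 0.0}

# Wrap a single run with wall-clock timing and peak-memory tracking
tracemalloc.start()
t0 = time.perf_counter()
result = run_sync("cases/case_01.mp4")
runtime_seconds = time.perf_counter() - t0
peak_memory_mb = tracemalloc.get_traced_memory()[1] / 1e6
tracemalloc.stop()

result.update(runtime_seconds=runtime_seconds, peak_memory_mb=peak_memory_mb)
```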
Metrics Computation
Aggregate results into statistical summaries
- MAE, RMSE, cross-method agreement
- Confidence validation (Pearson correlation)
- Efficiency and resource analysis
Plot Generation
Render the evaluation plots from the computed metrics
- High-DPI visualizations: regression analysis, boxplots, per-case diagnostics
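The aggregation step can be sketched as follows; the error/confidence pairs are made-up values, not real results, and the helper is a plain Pearson formula rather than the pipeline's actual implementation:

```python
import math

# Made-up per-case results: (absolute_error_ms, confidence_score)
results = [(12.0, 0.95), (48.0, 0.70), (5.0, 0.99), (90.0, 0.40)]
errors = [e for e, _ in results]
confidences = [c for _, c in results]

# Accuracy: mean absolute error and root-mean-square error
mae = sum(errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Confidence validation: a strongly negative correlation means high
# confidence predicts low error, i.e. confidence scores are reliable.
r = pearson(confidences, errors)
```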
Directory Structure
Key Features
Ground Truth Validation
Synthetic dataset with known offsets enables precise accuracy measurement without manual annotation
Comprehensive Metrics
Tracks accuracy, confidence, efficiency, and resource usage across all test cases
Reproducible
Fully scripted pipeline ensures consistent, repeatable results
Publication-Ready
High-DPI plots with regression analysis, boxplots, and per-case diagnostics
Sensitivity Tags
Each test case is annotated with metadata to enable sensitivity analysis:

| Tag | Description | Use Case |
|---|---|---|
| motion_level | Mean frame-to-frame pixel difference (0-1) | Assess visual sync performance on low-motion vs high-motion videos |
| audio_energy_level | RMS energy of audio signal (0-1) | Assess audio sync performance on quiet vs loud recordings |
| video_length_sec | Duration of the original video | Analyze runtime scaling with video length |
Sensitivity tags are computed automatically during offset generation by sampling the first 10 seconds (audio) or 300 frames (motion) of each video.
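The two tag computations reduce to simple signal statistics. A sketch under toy inputs (the helper names and data are illustrative; the real pipeline samples the first 10 seconds of audio and 300 frames of video):

```python
def audio_energy_level(samples):
    """RMS energy of audio samples normalized to [-1, 1]."""
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def motion_level(frames):
    """Mean frame-to-frame absolute pixel difference, pixels in [0, 1]."""
    diffs = [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(prev)
        for prev, cur in zip(frames, frames[1:])
    ]
    return sum(diffs) / len(diffs)

# Toy data: three 4-pixel "frames" -- one large jump, then no motion
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, 0.5, 0.5],
]
```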
Quick Start
Output Artifacts
synthetic_metadata.csv
24-row CSV with ground truth offsets and sensitivity tags for each test case.
Columns: video_id, synthetic_file_path, true_offset_ms, video_length_sec, motion_level, audio_energy_level
results.csv
48-row CSV with per-method synchronization results.
Columns: video_id, true_offset_ms, method_type, estimated_offset_ms, absolute_error_ms, confidence_score, runtime_seconds, peak_cpu_percent, peak_memory_mb, plus sensitivity tags
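Because the artifacts are plain CSV, downstream analysis needs only standard tooling. For example, per-method MAE can be pulled from results.csv with the stdlib csv module (the inline rows below are made-up stand-ins for the real file):

```python
import csv
import io
from collections import defaultdict

# Made-up rows following the results.csv schema (subset of columns)
raw = """video_id,method_type,absolute_error_ms
v1,audio,12.0
v1,visual,40.0
v2,audio,8.0
v2,visual,60.0
"""

per_method = defaultdict(list)
for row in csv.DictReader(io.StringIO(raw)):
    per_method[row["method_type"]].append(float(row["absolute_error_ms"]))

# Mean absolute error per synchronization method
mae_by_method = {m: sum(v) / len(v) for m, v in per_method.items()}
```

With a real run, replace the `io.StringIO(raw)` stream with an open file handle on results.csv.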
metrics_summary.json
Structured JSON with aggregated statistics across six categories:
accuracy, cross_method_agreement, confidence_validation, efficiency, resource_usage, grouped_by_tag
plots/ (8 PNG files + subdirectories)
High-DPI (300 DPI) publication-ready visualizations covering all evaluation dimensions.
All scripts use logging to stdout for progress tracking. Intermediate results are cached so you can re-run later stages without regenerating earlier artifacts.
Next Steps
Workflow
Detailed walkthrough of each pipeline stage
Metrics
Complete reference for all computed metrics
Visualization
Gallery and descriptions of all plot types