The compute_metrics.py script aggregates results from results.csv into six metric categories. All metrics are written to metrics_summary.json and printed to the console.
Metric Categories
- Accuracy — error magnitude and distribution
- Cross-Method Agreement — audio vs. visual consistency
- Confidence Validation — confidence score reliability
- Efficiency — runtime performance
- Resource Usage — CPU and memory consumption
- Grouped (Sensitivity) — breakdown by video characteristics
1. Accuracy Metrics
Measures how closely estimated offsets match ground truth, computed separately for the audio and visual methods.

Per-Method Metrics
Mean Absolute Error (MAE) — average of |estimated - true| across all test cases. Lower is better. Typical values: 5-20 ms for audio, 10-50 ms for visual.
Root Mean Square Error (RMSE) — sqrt(mean((estimated - true)^2)). Penalizes large errors more heavily than MAE. Always ≥ MAE.
Median Absolute Error — 50th percentile of the error distribution. More robust to outliers than MAE. If the median is much lower than the MAE, outliers are driving up the mean.
Maximum Error — worst-case error across all test cases. Useful for understanding failure modes and worst-case latency bounds.
Also reported: the number of test cases evaluated for this method.
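The four per-method metrics above can be sketched with NumPy (an illustrative helper, not the script's actual code; the function name is hypothetical):

```python
import numpy as np

def accuracy_metrics(estimated, true):
    """Compute MAE, RMSE, median absolute error, and max error (illustrative)."""
    err = np.abs(np.asarray(estimated, dtype=float) - np.asarray(true, dtype=float))
    return {
        "mae": float(err.mean()),                      # mean of |estimated - true|
        "rmse": float(np.sqrt((err ** 2).mean())),     # always >= mae
        "median_ae": float(np.median(err)),            # robust to outliers
        "max_error": float(err.max()),                 # worst case
        "n": int(err.size),
    }

m = accuracy_metrics([510, -980, 12], [500, -1000, 0])
# errors are 10, 20, 12 ms -> mae = 14.0, median_ae = 12.0, max_error = 20.0
```

Note that RMSE equals MAE only when every error has the same magnitude; any spread pushes RMSE above MAE.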
Per-Offset Breakdown
Dictionary mapping each true offset (e.g., -1000, +500) to the MAE for that offset magnitude.
Use Case: Identify whether accuracy degrades at extreme offsets or in specific offset ranges.
Interpretation
Expected Results:
- Audio (GCC-PHAT): MAE typically 5-15 ms for clear audio. Degrades with noisy or low-energy signals.
- Visual (Motion): MAE typically 10-50 ms depending on motion level. Low-motion videos may have higher error due to weak correlation peaks.
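The per-offset breakdown described above amounts to grouping absolute errors by the true offset. A minimal pandas sketch, with made-up column names and data:

```python
import pandas as pd

# Hypothetical rows from results.csv: two test cases per true offset.
df = pd.DataFrame({
    "true_offset_ms": [-1000, -1000, 500, 500],
    "abs_error_ms": [8.0, 12.0, 20.0, 30.0],
})

# MAE per true offset, as a dictionary keyed by offset.
per_offset_mae = df.groupby("true_offset_ms")["abs_error_ms"].mean().to_dict()
print(per_offset_mae)  # {-1000: 10.0, 500: 25.0}
```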
2. Cross-Method Agreement
Measures how consistently the audio and visual methods agree on the estimated offset for the same test case.

Mean of |audio_estimate - visual_estimate| across all test cases. Lower indicates stronger cross-method agreement. Typical values: 20-50 ms.
Median of the cross-method differences. More robust to outliers than the mean.
Percentage of test cases where |audio_estimate - visual_estimate| < threshold_ms. Default threshold: 100 ms. Typical values: 60-90%.
threshold_ms — the agreement threshold in milliseconds (default: 100).
Number of test cases with both audio and visual results.
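These agreement statistics can be sketched as follows (an illustrative helper with assumed field names, not the script's actual output schema):

```python
import numpy as np

def agreement_metrics(audio_est, visual_est, threshold_ms=100.0):
    """Cross-method agreement between audio and visual offset estimates."""
    diff = np.abs(np.asarray(audio_est, dtype=float) - np.asarray(visual_est, dtype=float))
    return {
        "mean_diff_ms": float(diff.mean()),
        "median_diff_ms": float(np.median(diff)),
        # fraction of cases whose methods agree within the threshold
        "agreement_rate_pct": float(100.0 * (diff < threshold_ms).mean()),
        "threshold_ms": float(threshold_ms),
        "n": int(diff.size),
    }

m = agreement_metrics([500, -990, 30], [520, -900, 200])
# diffs are 20, 90, 170 ms -> 2 of 3 cases agree within the 100 ms default
```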
Interpretation
3. Confidence Validation
Assesses whether confidence scores reliably predict error magnitude. A good confidence metric should have negative correlation with error (high confidence = low error).

Per-Method Metrics
Pearson correlation coefficient between confidence score and absolute error.
- Range: -1 (perfect negative correlation) to +1 (perfect positive correlation)
- Target: Negative values indicate confidence is a useful predictor
- Typical values: -0.3 to -0.6 for well-calibrated methods
MAE across all test cases (no filtering).
MAE after removing the bottom 20% of cases by confidence. Use Case: Simulate a filtering strategy where low-confidence results are rejected. If mae_filtered < mae_all, confidence filtering improves accuracy.
The 20th percentile confidence score (cases below this are filtered).
Number of test cases remaining after filtering (should be ~80% of n_total).
Total number of test cases before filtering.
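A sketch of the validation logic described above, using NumPy for the correlation and the quantile cutoff (assumed logic and field names, not verbatim from compute_metrics.py):

```python
import numpy as np

def confidence_filtered_mae(confidence, abs_error, quantile=0.20):
    """Pearson correlation plus MAE before/after dropping low-confidence cases."""
    conf = np.asarray(confidence, dtype=float)
    err = np.asarray(abs_error, dtype=float)
    thresh = np.quantile(conf, quantile)  # 20th-percentile confidence score
    keep = conf >= thresh                 # cases at or above the cutoff survive
    return {
        "pearson_r": float(np.corrcoef(conf, err)[0, 1]),
        "mae_all": float(err.mean()),
        "mae_filtered": float(err[keep].mean()),
        "confidence_threshold": float(thresh),
        "n_filtered": int(keep.sum()),
        "n_total": int(err.size),
    }

stats = confidence_filtered_mae([0.9, 0.8, 0.7, 0.6, 0.1], [5, 8, 10, 12, 80])
# The one low-confidence case carries the 80 ms error, so filtering
# drops mae from 23.0 to 8.75 and pearson_r is strongly negative.
```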
Interpretation
Good calibration: Pearson < -0.3 and mae_filtered significantly lower than mae_all.
Poor calibration: Pearson near 0 or positive, with minimal improvement from filtering. In this case, confidence scores are not reliable predictors and should not be used for filtering or weighting.

4. Efficiency Metrics
Measures runtime performance characteristics.

Per-Method Metrics
Average processing time per test case. Typical values:
- Audio: 2-5 seconds (depends on FFmpeg extraction + GCC-PHAT computation)
- Visual: 3-10 seconds (depends on video length and frame rate)
Median runtime per test case (more robust to outliers).
Sum of all runtimes for this method. Use Case: Estimate total pipeline execution time.
Mean runtime normalized by video length: runtime_seconds / (video_length_sec / 60). Use Case: Compare efficiency across videos of different lengths. Only present if video_length_sec is available in results.csv.

Interpretation
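The per-minute normalization is just the stated formula; as a one-line sketch:

```python
def runtime_per_minute(runtime_seconds, video_length_sec):
    """Seconds of processing per minute of video (formula from the text)."""
    return runtime_seconds / (video_length_sec / 60.0)

runtime_per_minute(6.0, 120.0)  # 6 s spent on a 2-minute video -> 3.0 s per minute
```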
5. Resource Usage Metrics
Tracks peak CPU and memory consumption during synchronization.

Per-Method Metrics
Average peak CPU usage across all test cases. Note: this is per-process CPU%, not system-wide; values > 100% indicate multi-core utilization.
Maximum peak CPU usage observed across all test cases.
Average peak memory (RSS) in megabytes.
Maximum peak memory observed. Use Case: Ensure the pipeline fits within available system memory.
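For reference, peak RSS for the current process can be read from the standard library on Unix systems (a sketch only; the pipeline may collect these numbers differently, e.g. by polling with psutil):

```python
import resource
import sys

def peak_rss_mb():
    """Peak resident set size of the current process, in megabytes (Unix only)."""
    ru = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        return ru / (1024 * 1024)
    return ru / 1024
```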
Interpretation
Typical Resource Usage:
- Audio: 200-500 MB memory, 30-60% CPU (single-threaded NumPy/SciPy operations)
- Visual: 150-400 MB memory, 30-50% CPU (OpenCV frame extraction is I/O-bound)
6. Grouped Metrics (Sensitivity Analysis)
Breaks down accuracy metrics by sensitivity tags to identify which video characteristics affect performance.

Available Groupings
- video_length_sec
- motion_level
- audio_energy_level
Bins: <30s, 30-60s, 60-120s, >120s. Use Case: Assess whether accuracy degrades for very short or very long videos.

Structure
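The video_length_sec bins listed above map naturally onto pd.cut. An illustrative sketch (the exact bin edges and edge handling are assumptions, not necessarily what compute_metrics.py uses):

```python
import numpy as np
import pandas as pd

# Hypothetical video lengths, one per bin.
lengths = pd.Series([12, 45, 90, 300], name="video_length_sec")

binned = pd.cut(
    lengths,
    bins=[0, 30, 60, 120, np.inf],          # intervals are (0,30], (30,60], ...
    labels=["<30s", "30-60s", "60-120s", ">120s"],
)
print(binned.tolist())  # ['<30s', '30-60s', '60-120s', '>120s']
```

Grouping accuracy metrics by the resulting labels (e.g. df.groupby(binned)) then yields one MAE per bin.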
Example: motion_level grouping
Interpretation
Example Output
Metric Computation Details
Pearson Correlation
Computed using NumPy's corrcoef:
compute_metrics.py:38-42
Confidence Filtering
Bottom 20% quantile: compute_metrics.py:138-145
Sensitivity Binning
Continuous tags are binned using pd.cut:
compute_metrics.py:217-221
Next Steps
Visualization
See how metrics are visualized in publication-ready plots
Workflow
Return to the step-by-step pipeline guide