The compute_metrics.py script aggregates results from results.csv into six metric categories. All metrics are written to metrics_summary.json and printed to the console.

Metric Categories

Accuracy: error magnitude and distribution

Cross-Method Agreement: audio vs. visual consistency

Confidence Validation: confidence score reliability

Efficiency: runtime performance

Resource Usage: CPU and memory consumption

Grouped (Sensitivity): breakdown by video characteristics

1. Accuracy Metrics

Measures how closely estimated offsets match ground truth, computed separately for audio and visual methods.

Per-Method Metrics

mae_ms (float)
Mean Absolute Error — average of |estimated - true| across all test cases. Lower is better. Typical values: 5-20 ms for audio, 10-50 ms for visual.
rmse_ms (float)
Root Mean Square Error — sqrt(mean((estimated - true)^2)). Penalizes large errors more heavily than MAE. Always ≥ MAE.
median_error_ms (float)
Median Absolute Error — 50th percentile of the error distribution. More robust to outliers than MAE. If the median is much lower than the MAE, outliers are driving up the mean.
max_error_ms (float)
Maximum Error — worst-case error across all test cases. Useful for understanding failure modes and worst-case latency bounds.
count (int)
Number of test cases evaluated for this method.
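
The four statistics above can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the exact code from compute_metrics.py; the function name accuracy_metrics is hypothetical.

```python
import numpy as np

def accuracy_metrics(estimated_ms: np.ndarray, true_ms: np.ndarray) -> dict:
    """Per-method accuracy summary mirroring the fields described above."""
    errors = np.abs(estimated_ms - true_ms)
    return {
        "mae_ms": float(np.mean(errors)),
        "rmse_ms": float(np.sqrt(np.mean((estimated_ms - true_ms) ** 2))),
        "median_error_ms": float(np.median(errors)),
        "max_error_ms": float(np.max(errors)),
        "count": int(len(errors)),
    }

# Example: three audio estimates vs. their ground-truth offsets
m = accuracy_metrics(np.array([-992.0, 511.0, 105.0]),
                     np.array([-1000.0, 500.0, 100.0]))
```

Note that RMSE ≥ MAE always holds here, since RMSE weights the squared deviations.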

Per-Offset Breakdown

mae_per_offset_ms (object)
Dictionary mapping each true offset (e.g., -1000, +500) to the MAE for that offset magnitude. Use Case: identify whether accuracy degrades at extreme offsets or in specific offset ranges.
Example
{
  "-1000": 8.2,
  "-500": 11.3,
  "-100": 15.7,
  "100": 14.2,
  "500": 10.8,
  "1000": 9.1
}
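
The per-offset breakdown is a group-by over the true offset. A minimal sketch (the helper name mae_per_offset is illustrative, not from compute_metrics.py):

```python
from collections import defaultdict

def mae_per_offset(true_ms: list, estimated_ms: list) -> dict:
    """Group absolute errors by true offset and average within each group."""
    buckets = defaultdict(list)
    for t, e in zip(true_ms, estimated_ms):
        buckets[str(int(t))].append(abs(e - t))
    return {k: sum(v) / len(v) for k, v in buckets.items()}

# Two cases at -1000 ms, one at +500 ms
per_offset = mae_per_offset([-1000, -1000, 500], [-992, -1004, 489])
```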

Interpretation

Expected Results:
  • Audio (GCC-PHAT): MAE typically 5-15 ms for clear audio. Degrades with noisy or low-energy signals.
  • Visual (Motion): MAE typically 10-50 ms depending on motion level. Low-motion videos may have higher error due to weak correlation peaks.
If MAE significantly exceeds 50 ms for audio or 100 ms for visual, check for:
  • Extreme offsets beyond the correlation window (max_offset_sec)
  • Very low motion or audio energy
  • Codec issues or corrupted source videos

2. Cross-Method Agreement

Measures how consistently audio and visual methods agree on the estimated offset for the same test case.
mean_audio_video_diff_ms (float)
Mean of |audio_estimate - visual_estimate| across all test cases. Lower indicates stronger cross-method agreement. Typical values: 20-50 ms.
median_audio_video_diff_ms (float)
Median of the cross-method differences. More robust to outliers than the mean.
pct_within_threshold (float)
Percentage of test cases where |audio_estimate - visual_estimate| < threshold_ms. Default threshold: 100 ms. Typical values: 60-90%.
threshold_ms (float)
Agreement threshold in milliseconds (default: 100).
n_pairs (int)
Number of test cases with both audio and visual results.
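
The agreement fields above reduce to simple statistics over the per-case differences. A sketch with NumPy (the function name agreement_metrics is hypothetical):

```python
import numpy as np

def agreement_metrics(audio_ms, visual_ms, threshold_ms: float = 100.0) -> dict:
    """Cross-method agreement summary over paired audio/visual estimates."""
    diffs = np.abs(np.asarray(audio_ms, float) - np.asarray(visual_ms, float))
    return {
        "mean_audio_video_diff_ms": float(diffs.mean()),
        "median_audio_video_diff_ms": float(np.median(diffs)),
        # fraction of pairs closer than the threshold, as a percentage
        "pct_within_threshold": float(100.0 * (diffs < threshold_ms).mean()),
        "threshold_ms": threshold_ms,
        "n_pairs": int(diffs.size),
    }

a = agreement_metrics([-995, 510, 80], [-1010, 505, 230])
```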

Interpretation

High Agreement (pct_within_threshold > 80%):
  • Both methods are likely estimating the correct offset
  • Confidence scores from both methods can be combined (e.g., weighted average)
Low Agreement (pct_within_threshold < 50%):
  • One or both methods may be failing on certain video types
  • Check grouped metrics to identify which sensitivity tags correlate with disagreement

3. Confidence Validation

Assesses whether confidence scores reliably predict error magnitude. A good confidence metric should have negative correlation with error (high confidence = low error).

Per-Method Metrics

pearson_confidence_vs_error (float)
Pearson correlation coefficient between confidence score and absolute error.
  • Range: -1 (perfect negative correlation) to +1 (perfect positive correlation)
  • Target: negative values indicate confidence is a useful predictor
  • Typical values: -0.3 to -0.6 for well-calibrated methods
mae_all_ms (float)
MAE across all test cases (no filtering).
mae_filtered_ms (float)
MAE after removing the bottom 20% of cases by confidence. Use Case: simulate a filtering strategy where low-confidence results are rejected. If mae_filtered < mae_all, confidence filtering improves accuracy.
confidence_threshold (float)
The 20th percentile confidence score (cases below this are filtered out).
n_after_filter (int)
Number of test cases remaining after filtering (should be ~80% of n_total).
n_total (int)
Total number of test cases before filtering.
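
The fields above can be sketched end-to-end with NumPy. This mirrors the described logic but is not the exact compute_metrics.py code; in particular it omits the degenerate-input guard the script applies, and the function name confidence_validation is illustrative.

```python
import numpy as np

def confidence_validation(confs, errors, quantile: float = 0.20) -> dict:
    """Correlate confidence with error, then simulate bottom-20% filtering."""
    confs = np.asarray(confs, float)
    errors = np.asarray(errors, float)
    threshold = np.quantile(confs, quantile)  # 20th-percentile confidence
    mask = confs >= threshold                 # keep the top ~80% of cases
    return {
        "pearson_confidence_vs_error": float(np.corrcoef(confs, errors)[0, 1]),
        "mae_all_ms": float(errors.mean()),
        "mae_filtered_ms": float(errors[mask].mean()),
        "confidence_threshold": float(threshold),
        "n_after_filter": int(mask.sum()),
        "n_total": int(confs.size),
    }

# High-confidence cases have low error here, so the correlation is negative
cv = confidence_validation([0.9, 0.8, 0.7, 0.3, 0.2], [5, 8, 10, 30, 40])
```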

Interpretation

{
  "audio": {
    "pearson_confidence_vs_error": -0.52,
    "mae_all_ms": 12.3,
    "mae_filtered_ms": 8.7,
    "confidence_threshold": 0.75,
    "n_after_filter": 19,
    "n_total": 24
  }
}
Good calibration: Pearson < -0.3 and mae_filtered significantly lower than mae_all.
Poor calibration: Pearson near 0 or positive, with minimal improvement from filtering. In that case, confidence scores are not reliable predictors and should not be used for filtering or weighting.

4. Efficiency Metrics

Measures runtime performance characteristics.

Per-Method Metrics

mean_runtime_seconds (float)
Average processing time per test case. Typical values:
  • Audio: 2-5 seconds (depends on FFmpeg extraction + GCC-PHAT computation)
  • Visual: 3-10 seconds (depends on video length and frame rate)
median_runtime_seconds (float)
Median runtime per test case (more robust to outliers).
total_runtime_seconds (float)
Sum of all runtimes for this method. Use Case: estimate total pipeline execution time.
runtime_per_video_minute (float, optional)
Mean runtime normalized by video length: runtime_seconds / (video_length_sec / 60). Use Case: compare efficiency across videos of different lengths. Only present if video_length_sec is available in results.csv.
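
The normalization is just the formula above, expressed per minute of video. A one-line sketch (the helper name is illustrative):

```python
def runtime_per_video_minute(runtime_seconds: float, video_length_sec: float) -> float:
    """Seconds of processing per minute of source video."""
    return runtime_seconds / (video_length_sec / 60.0)

# A 90-second clip processed in 3.0 s costs 2.0 s per video-minute
r = runtime_per_video_minute(3.0, 90.0)
```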

Interpretation

Expected Runtime Scaling:
  • Audio: Roughly linear with video length (FFmpeg extraction is the bottleneck)
  • Visual: Superlinear with video length due to frame-by-frame processing and cross-correlation window size
If runtime_per_video_minute is constant, the method scales linearly. If it increases with video length, there’s superlinear scaling.

5. Resource Usage Metrics

Tracks peak CPU and memory consumption during synchronization.

Per-Method Metrics

mean_peak_cpu_percent (float)
Average peak CPU usage across all test cases. Note: this is per-process CPU%, not system-wide; values > 100% indicate multi-core utilization.
max_peak_cpu_percent (float)
Maximum peak CPU usage observed across all test cases.
mean_peak_memory_mb (float)
Average peak memory (RSS) in megabytes.
max_peak_memory_mb (float)
Maximum peak memory observed. Use Case: ensure the pipeline fits within available system memory.
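
The four fields are plain aggregates over the per-case peaks recorded in results.csv. A sketch under that assumption (the function name resource_usage is illustrative):

```python
def resource_usage(peak_cpu_percent: list, peak_memory_mb: list) -> dict:
    """Aggregate per-case peak readings into the summary fields above."""
    return {
        "mean_peak_cpu_percent": sum(peak_cpu_percent) / len(peak_cpu_percent),
        "max_peak_cpu_percent": max(peak_cpu_percent),
        "mean_peak_memory_mb": sum(peak_memory_mb) / len(peak_memory_mb),
        "max_peak_memory_mb": max(peak_memory_mb),
    }

res = resource_usage([40.0, 50.0, 60.0], [300.0, 320.0, 280.0])
```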

Interpretation

Typical Resource Usage:
  • Audio: 200-500 MB memory, 30-60% CPU (single-threaded NumPy/SciPy operations)
  • Visual: 150-400 MB memory, 30-50% CPU (OpenCV frame extraction is I/O-bound)
Both methods use temporary directories that are cleaned up after each case, so memory does not accumulate across test cases.
If peak memory exceeds 1 GB, consider:
  • Processing shorter videos
  • Reducing the correlation window size (max_offset_sec)
  • Downsampling audio or video resolution before correlation

6. Grouped Metrics (Sensitivity Analysis)

Breaks down accuracy metrics by sensitivity tags to identify which video characteristics affect performance.

Available Groupings

video_length_sec: binned into <30s, 30-60s, 60-120s, >120s. Use Case: assess whether accuracy degrades for very short or very long videos.
Other continuous sensitivity tags (e.g., motion_level) are binned into low, medium, high, and very_high.

Structure

Example: motion_level grouping
{
  "grouped_by_tag": {
    "motion_level": {
      "audio": {
        "low": { "mae_ms": 12.3, "rmse_ms": 15.2, "count": 6 },
        "medium": { "mae_ms": 11.8, "rmse_ms": 14.5, "count": 12 },
        "high": { "mae_ms": 10.5, "rmse_ms": 13.1, "count": 6 }
      },
      "visual": {
        "low": { "mae_ms": 45.7, "rmse_ms": 58.3, "count": 6 },
        "medium": { "mae_ms": 28.2, "rmse_ms": 35.1, "count": 12 },
        "high": { "mae_ms": 18.4, "rmse_ms": 22.9, "count": 6 }
      }
    }
  }
}

Interpretation

Low motion:    MAE = 45.7 ms  (weak correlation peaks)
Medium motion: MAE = 28.2 ms
High motion:   MAE = 18.4 ms  (strong correlation peaks)

→ Visual sync accuracy improves with motion level.

Using Grouped Metrics:
  1. Identify which tag bins have significantly higher MAE
  2. Check if cross-method agreement is lower for those bins
  3. Consider hybrid strategies: use audio sync for low-motion videos, visual sync for silent videos

Example Output

{
  "accuracy": {
    "audio": {
      "mae_ms": 12.34,
      "rmse_ms": 15.67,
      "median_error_ms": 10.5,
      "max_error_ms": 45.2,
      "count": 24,
      "mae_per_offset_ms": { "-1000": 8.2, "-500": 11.3, ... }
    },
    "visual": { ... }
  },
  "cross_method_agreement": {
    "mean_audio_video_diff_ms": 23.45,
    "pct_within_threshold": 87.5,
    "threshold_ms": 100.0
  },
  "confidence_validation": {
    "audio": {
      "pearson_confidence_vs_error": -0.52,
      "mae_all_ms": 12.34,
      "mae_filtered_ms": 8.7
    },
    "visual": { ... }
  },
  "efficiency": {
    "audio": { "mean_runtime_seconds": 3.42, "runtime_per_video_minute": 2.1 },
    "visual": { "mean_runtime_seconds": 5.12, "runtime_per_video_minute": 3.8 }
  },
  "resource_usage": {
    "audio": { "mean_peak_cpu_percent": 45.2, "mean_peak_memory_mb": 312.4 },
    "visual": { "mean_peak_cpu_percent": 38.1, "mean_peak_memory_mb": 256.7 }
  },
  "grouped_by_tag": { ... }
}

Metric Computation Details

Pearson Correlation

Computed using NumPy’s corrcoef:
compute_metrics.py:38-42
def _safe_pearson(x: np.ndarray, y: np.ndarray) -> float:
    if len(x) < 3 or np.std(x) < 1e-12 or np.std(y) < 1e-12:
        return 0.0
    return float(np.corrcoef(x, y)[0, 1])

Confidence Filtering

Bottom 20% quantile:
compute_metrics.py:138-145
threshold = np.quantile(confs, CONFIDENCE_FILTER_QUANTILE)
mask = confs >= threshold
if mask.sum() > 0:
    mae_filtered = np.mean(errors[mask])

Sensitivity Binning

Continuous tags are binned using pd.cut:
compute_metrics.py:217-221
if tag_col == "video_length_sec":
    bins = [0, 30, 60, 120, float("inf")]
    labels = ["<30s", "30-60s", "60-120s", ">120s"]
else:
    bins = [0, 0.25, 0.5, 0.75, float("inf")]
    labels = ["low", "medium", "high", "very_high"]
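
Applying those bins looks like this with pandas. The sample lengths are invented for illustration; pd.cut assigns each value to the half-open interval (left, right] that contains it.

```python
import pandas as pd

# Bin a few hypothetical video lengths the way the snippet above does
lengths = pd.Series([12.0, 45.0, 95.0, 300.0], name="video_length_sec")
bins = [0, 30, 60, 120, float("inf")]
labels = ["<30s", "30-60s", "60-120s", ">120s"]
binned = pd.cut(lengths, bins=bins, labels=labels)
```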

Next Steps

Visualization: see how metrics are visualized in publication-ready plots.

Workflow: return to the step-by-step pipeline guide.
