The compute_metrics.py script aggregates results from results.csv into six metric categories. All metrics are written to metrics_summary.json and printed to the console.
Metric Categories
- Accuracy — error magnitude and distribution
- Cross-Method Agreement — audio vs. visual consistency
- Confidence Validation — confidence score reliability
- Efficiency — runtime performance
- Resource Usage — CPU and memory consumption
- Grouped (Sensitivity) — breakdown by video characteristics
1. Accuracy Metrics
Measures how closely estimated offsets match ground truth, computed separately for the audio and visual methods.

Per-Method Metrics
Mean Absolute Error (MAE) — average of |estimated - true| across all test cases. Lower is better. Typical values: 5-20 ms for audio, 10-50 ms for visual.
Root Mean Square Error (RMSE) — sqrt(mean((estimated - true)^2)). Penalizes large errors more heavily than MAE. Always ≥ MAE.
Median Absolute Error — 50th percentile of the error distribution. More robust to outliers than MAE. If the median is much lower than the MAE, outliers are driving up the mean.
Maximum Error — worst-case error across all test cases. Useful for understanding failure modes and worst-case latency bounds.
Also reported: the number of test cases evaluated for this method.
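The four per-method metrics above can be sketched with NumPy (an illustrative helper, not the script's actual code; the function name is hypothetical):

```python
import numpy as np

def accuracy_metrics(estimated, true):
    """Compute MAE, RMSE, median absolute error, and max error (illustrative)."""
    err = np.abs(np.asarray(estimated, dtype=float) - np.asarray(true, dtype=float))
    return {
        "mae": float(err.mean()),                      # mean of |estimated - true|
        "rmse": float(np.sqrt((err ** 2).mean())),     # always >= mae
        "median_ae": float(np.median(err)),            # robust to outliers
        "max_error": float(err.max()),                 # worst case
        "n": int(err.size),
    }

m = accuracy_metrics([510, -980, 12], [500, -1000, 0])
# errors are 10, 20, 12 ms -> mae = 14.0, median_ae = 12.0, max_error = 20.0
```

Note that RMSE equals MAE only when every error has the same magnitude; any spread pushes RMSE above MAE.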
Per-Offset Breakdown
Dictionary mapping each true offset (e.g., -1000, +500) to the MAE for that offset magnitude.
Use Case: Identify whether accuracy degrades at extreme offsets or in specific offset ranges.
Interpretation
Expected Results:
- Audio (GCC-PHAT): MAE typically 5-15 ms for clear audio. Degrades with noisy or low-energy signals.
- Visual (Motion): MAE typically 10-50 ms depending on motion level. Low-motion videos may have higher error due to weak correlation peaks.
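The per-offset breakdown described above amounts to grouping absolute errors by the true offset. A minimal pandas sketch, with made-up column names and data:

```python
import pandas as pd

# Hypothetical rows from results.csv: two test cases per true offset.
df = pd.DataFrame({
    "true_offset_ms": [-1000, -1000, 500, 500],
    "abs_error_ms": [8.0, 12.0, 20.0, 30.0],
})

# MAE per true offset, as a dictionary keyed by offset.
per_offset_mae = df.groupby("true_offset_ms")["abs_error_ms"].mean().to_dict()
print(per_offset_mae)  # {-1000: 10.0, 500: 25.0}
```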
2. Cross-Method Agreement
Measures how consistently the audio and visual methods agree on the estimated offset for the same test case.

Mean of |audio_estimate - visual_estimate| across all test cases. Lower indicates stronger cross-method agreement. Typical values: 20-50 ms.
Median of the cross-method differences. More robust to outliers than the mean.
Percentage of test cases where |audio_estimate - visual_estimate| < threshold_ms. Default threshold: 100 ms. Typical values: 60-90%.
threshold_ms — the agreement threshold in milliseconds (default: 100).
Number of test cases with both audio and visual results.
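These agreement statistics can be sketched as follows (an illustrative helper with assumed field names, not the script's actual output schema):

```python
import numpy as np

def agreement_metrics(audio_est, visual_est, threshold_ms=100.0):
    """Cross-method agreement between audio and visual offset estimates."""
    diff = np.abs(np.asarray(audio_est, dtype=float) - np.asarray(visual_est, dtype=float))
    return {
        "mean_diff_ms": float(diff.mean()),
        "median_diff_ms": float(np.median(diff)),
        # fraction of cases whose methods agree within the threshold
        "agreement_rate_pct": float(100.0 * (diff < threshold_ms).mean()),
        "threshold_ms": float(threshold_ms),
        "n": int(diff.size),
    }

m = agreement_metrics([500, -990, 30], [520, -900, 200])
# diffs are 20, 90, 170 ms -> 2 of 3 cases agree within the 100 ms default
```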
Interpretation
3. Confidence Validation
Assesses whether confidence scores reliably predict error magnitude. A good confidence metric should have negative correlation with error (high confidence = low error).

Per-Method Metrics
Pearson correlation coefficient between confidence score and absolute error.
- Range: -1 (perfect negative correlation) to +1 (perfect positive correlation)
- Target: Negative values indicate confidence is a useful predictor
- Typical values: -0.3 to -0.6 for well-calibrated methods
MAE across all test cases (no filtering).
MAE after removing the bottom 20% of cases by confidence. Use Case: Simulate a filtering strategy where low-confidence results are rejected. If mae_filtered < mae_all, confidence filtering improves accuracy.
The 20th percentile confidence score (cases below this are filtered).
Number of test cases remaining after filtering (should be ~80% of n_total).
Total number of test cases before filtering.
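A sketch of the validation logic described above, using NumPy for the correlation and the quantile cutoff (assumed logic and field names, not verbatim from compute_metrics.py):

```python
import numpy as np

def confidence_filtered_mae(confidence, abs_error, quantile=0.20):
    """Pearson correlation plus MAE before/after dropping low-confidence cases."""
    conf = np.asarray(confidence, dtype=float)
    err = np.asarray(abs_error, dtype=float)
    thresh = np.quantile(conf, quantile)  # 20th-percentile confidence score
    keep = conf >= thresh                 # cases at or above the cutoff survive
    return {
        "pearson_r": float(np.corrcoef(conf, err)[0, 1]),
        "mae_all": float(err.mean()),
        "mae_filtered": float(err[keep].mean()),
        "confidence_threshold": float(thresh),
        "n_filtered": int(keep.sum()),
        "n_total": int(err.size),
    }

stats = confidence_filtered_mae([0.9, 0.8, 0.7, 0.6, 0.1], [5, 8, 10, 12, 80])
# The one low-confidence case carries the 80 ms error, so filtering
# drops mae from 23.0 to 8.75 and pearson_r is strongly negative.
```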
Interpretation
Good calibration: Pearson < -0.3 and mae_filtered significantly lower than mae_all.
Poor calibration: Pearson near 0 or positive, with minimal improvement from filtering. In this case, confidence scores are not reliable predictors and should not be used for filtering or weighting.

4. Efficiency Metrics
Measures runtime performance characteristics.

Per-Method Metrics
Average processing time per test case. Typical values:
- Audio: 2-5 seconds (depends on FFmpeg extraction + GCC-PHAT computation)
- Visual: 3-10 seconds (depends on video length and frame rate)
Median runtime per test case (more robust to outliers).
Sum of all runtimes for this method. Use Case: Estimate total pipeline execution time.
Mean runtime normalized by video length: runtime_seconds / (video_length_sec / 60). Use Case: Compare efficiency across videos of different lengths. Only present if video_length_sec is available in results.csv.

Interpretation
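The per-minute normalization is just the stated formula; as a one-line sketch:

```python
def runtime_per_minute(runtime_seconds, video_length_sec):
    """Seconds of processing per minute of video (formula from the text)."""
    return runtime_seconds / (video_length_sec / 60.0)

runtime_per_minute(6.0, 120.0)  # 6 s spent on a 2-minute video -> 3.0 s per minute
```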
5. Resource Usage Metrics
Tracks peak CPU and memory consumption during synchronization.

Per-Method Metrics
Average peak CPU usage across all test cases. Note: this is per-process CPU%, not system-wide; values > 100% indicate multi-core utilization.
Maximum peak CPU usage observed across all test cases.
Average peak memory (RSS) in megabytes.
Maximum peak memory observed. Use Case: Ensure the pipeline fits within available system memory.
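For reference, peak RSS for the current process can be read from the standard library on Unix systems (a sketch only; the pipeline may collect these numbers differently, e.g. by polling with psutil):

```python
import resource
import sys

def peak_rss_mb():
    """Peak resident set size of the current process, in megabytes (Unix only)."""
    ru = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        return ru / (1024 * 1024)
    return ru / 1024
```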
Interpretation
Typical Resource Usage:
- Audio: 200-500 MB memory, 30-60% CPU (single-threaded NumPy/SciPy operations)
- Visual: 150-400 MB memory, 30-50% CPU (OpenCV frame extraction is I/O-bound)
6. Grouped Metrics (Sensitivity Analysis)
Breaks down accuracy metrics by sensitivity tags to identify which video characteristics affect performance.

Available Groupings
- video_length_sec
- motion_level
- audio_energy_level
Bins: <30s, 30-60s, 60-120s, >120s. Use Case: Assess whether accuracy degrades for very short or very long videos.

Structure
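The video_length_sec bins listed above map naturally onto pd.cut. An illustrative sketch (the exact bin edges and edge handling are assumptions, not necessarily what compute_metrics.py uses):

```python
import numpy as np
import pandas as pd

# Hypothetical video lengths, one per bin.
lengths = pd.Series([12, 45, 90, 300], name="video_length_sec")

binned = pd.cut(
    lengths,
    bins=[0, 30, 60, 120, np.inf],          # intervals are (0,30], (30,60], ...
    labels=["<30s", "30-60s", "60-120s", ">120s"],
)
print(binned.tolist())  # ['<30s', '30-60s', '60-120s', '>120s']
```

Grouping accuracy metrics by the resulting labels (e.g. df.groupby(binned)) then yields one MAE per bin.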
Example: motion_level grouping
Interpretation
Example Output
Metric Computation Details
Pearson Correlation
Computed using NumPy's corrcoef:
compute_metrics.py:38-42
Confidence Filtering
Bottom 20% quantile: compute_metrics.py:138-145
Sensitivity Binning
Continuous tags are binned using pd.cut:
compute_metrics.py:217-221
Next Steps
Visualization
See how metrics are visualized in publication-ready plots
Workflow
Return to the step-by-step pipeline guide