Overview
The compute_metrics module analyzes the results CSV from run_batch and computes comprehensive performance metrics including accuracy (MAE, RMSE), cross-method agreement, confidence validation, efficiency, resource usage, and sensitivity analysis.
Functions
compute_metrics
compute_metrics(
results_csv: str = RESULTS_CSV,
metrics_dir: str = METRICS_DIR,
) -> dict
Compute all evaluation metrics from results CSV.
results_csv (str, default: "evaluation/results/results.csv")
    Path to the results CSV generated by run_batch.
metrics_dir (str, default: "evaluation/metrics")
    Output directory for the metrics JSON file.
Returns: Dictionary containing all computed metrics.
Output Files:
- metrics_dir/metrics_summary.json: Complete metrics in JSON format
- Stdout: Human-readable summary table
Metrics Computed:
- Accuracy (per method):
  - Mean Absolute Error (MAE)
  - Root Mean Square Error (RMSE)
  - Median Error
  - Maximum Error
  - MAE per offset magnitude
- Cross-Method Agreement:
  - Mean |audio − visual| difference
  - Median |audio − visual| difference
  - Percentage within threshold (default: 100 ms)
  - Number of comparable pairs
- Confidence Validation (per method):
  - Pearson correlation between confidence and error
  - MAE for all samples
  - MAE after filtering out the bottom 20% by confidence
  - Confidence threshold used for filtering
- Efficiency (per method):
  - Mean runtime
  - Median runtime
  - Total runtime
  - Runtime per video minute
- Resource Usage (per method):
  - Mean peak CPU percentage
  - Max peak CPU percentage
  - Mean peak memory (MB)
  - Max peak memory (MB)
- Grouped Metrics (by sensitivity tags):
  - MAE and RMSE grouped by:
    - Video length: <30s, 30-60s, 60-120s, >120s
    - Motion level: low, medium, high, very_high
    - Audio energy level: low, medium, high, very_high
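The per-method accuracy metrics above can be sketched as follows. This is a minimal illustration, not the module's actual code: it assumes the results CSV loads into a pandas DataFrame with `method` and `error_ms` columns, which are hypothetical names.

```python
# Hedged sketch: accuracy metrics per method.
# Column names "method" and "error_ms" are assumptions, not the module's API.
import numpy as np
import pandas as pd

def accuracy_per_method(df: pd.DataFrame) -> dict:
    out = {}
    for method, grp in df.groupby("method"):
        # Work with absolute errors in milliseconds.
        err = grp["error_ms"].abs().to_numpy(dtype=float)
        out[method] = {
            "mae_ms": float(np.mean(err)),
            "rmse_ms": float(np.sqrt(np.mean(err ** 2))),
            "median_error_ms": float(np.median(err)),
            "max_error_ms": float(np.max(err)),
            "count": int(err.size),
        }
    return out
```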
Helper Functions
_safe_pearson
_safe_pearson(x: np.ndarray, y: np.ndarray) -> float
Compute Pearson correlation coefficient, returning 0.0 for degenerate cases.
Returns: Correlation coefficient in [-1, 1], or 0.0 if inputs are degenerate.
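One way such a guard can be implemented (the checks shown are illustrative, not necessarily the module's exact logic):

```python
# Sketch of a degenerate-safe Pearson correlation; the specific guard
# conditions are assumptions about what "degenerate" covers.
import numpy as np

def safe_pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson r in [-1, 1], or 0.0 for degenerate inputs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Fewer than two points, or mismatched lengths: no defined correlation.
    if x.size < 2 or x.size != y.size:
        return 0.0
    # Zero variance in either input makes r undefined.
    if np.std(x) == 0.0 or np.std(y) == 0.0:
        return 0.0
    r = np.corrcoef(x, y)[0, 1]
    return float(r) if np.isfinite(r) else 0.0
```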
_safe_float
_safe_float(val) -> float
Convert to float, replacing NaN/Inf with 0.0 and rounding to 4 decimal places.
Returns: Sanitized float value.
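A minimal sketch of this sanitizer, matching the described behavior (the handling of unconvertible values is an assumption):

```python
# Sketch: sanitize a value for JSON output. NaN/Inf -> 0.0, round to 4 dp.
import math

def safe_float(val) -> float:
    try:
        f = float(val)
    except (TypeError, ValueError):
        # Assumption: unconvertible values also collapse to 0.0.
        return 0.0
    if not math.isfinite(f):
        return 0.0
    return round(f, 4)
```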
_print_summary
_print_summary(metrics: dict)
Print a human-readable summary table to stdout.
metrics: Metrics dictionary returned by compute_metrics.
Usage Example
import logging
from evaluation.compute_metrics import compute_metrics
logging.basicConfig(level=logging.INFO)
# Compute metrics from results
metrics = compute_metrics(
results_csv="evaluation/results/results.csv",
metrics_dir="evaluation/metrics",
)
# Access specific metrics
print(f"Audio MAE: {metrics['accuracy']['audio']['mae_ms']:.2f} ms")
print(f"Visual MAE: {metrics['accuracy']['visual']['mae_ms']:.2f} ms")
print(f"Cross-method agreement: {metrics['cross_method_agreement']['pct_within_threshold']:.1f}%")
CLI Usage
python -m evaluation.compute_metrics
Outputs:
- Metrics JSON to evaluation/metrics/metrics_summary.json
- Human-readable summary table to stdout
Configuration
Default Paths
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
RESULTS_CSV = os.path.join(BASE_DIR, "results", "results.csv")
METRICS_DIR = os.path.join(BASE_DIR, "metrics")
Constants
AGREEMENT_THRESHOLD_MS = 100.0 # Threshold for cross-method agreement reporting
CONFIDENCE_FILTER_QUANTILE = 0.20 # Bottom 20% confidence filtering
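How these two constants are typically applied can be sketched as follows (the function and array names are assumptions for illustration):

```python
# Sketch: applying AGREEMENT_THRESHOLD_MS and CONFIDENCE_FILTER_QUANTILE.
import numpy as np

AGREEMENT_THRESHOLD_MS = 100.0
CONFIDENCE_FILTER_QUANTILE = 0.20

def pct_within_threshold(audio_ms: np.ndarray, visual_ms: np.ndarray) -> float:
    """Percentage of pairs where |audio - visual| is within the threshold."""
    diff = np.abs(audio_ms - visual_ms)
    return float(100.0 * np.mean(diff <= AGREEMENT_THRESHOLD_MS))

def filtered_mae(errors_ms: np.ndarray, confidence: np.ndarray) -> float:
    """MAE after dropping the bottom 20% of samples by confidence."""
    cutoff = np.quantile(confidence, CONFIDENCE_FILTER_QUANTILE)
    keep = confidence >= cutoff
    return float(np.mean(np.abs(errors_ms[keep])))
```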
Example Output
JSON Structure
{
"accuracy": {
"audio": {
"mae_ms": 42.35,
"rmse_ms": 58.12,
"median_error_ms": 31.50,
"max_error_ms": 215.00,
"count": 36,
"mae_per_offset_ms": {
"-1000": 38.20,
"-500": 28.45,
"100": 15.30,
"500": 45.60,
"1000": 52.10
}
},
"visual": { /* ... */ }
},
"cross_method_agreement": {
"mean_audio_video_diff_ms": 65.23,
"median_audio_video_diff_ms": 48.15,
"pct_within_threshold": 72.5,
"threshold_ms": 100.0,
"n_pairs": 36
},
"confidence_validation": { /* ... */ },
"efficiency": { /* ... */ },
"resource_usage": { /* ... */ },
"grouped_by_tag": { /* ... */ }
}
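Since the summary is plain JSON, it can be loaded back with the standard json module for downstream analysis (the default path shown matches the CLI output location):

```python
# Load a previously written metrics summary back into a dict.
import json
from pathlib import Path

def load_metrics(path: str = "evaluation/metrics/metrics_summary.json") -> dict:
    return json.loads(Path(path).read_text())
```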
Console Output
======================================================================
EVALUATION METRICS SUMMARY
======================================================================
── Accuracy ───────────────────────────────────────────
[AUDIO]
MAE: 42.35 ms
RMSE: 58.12 ms
Median: 31.50 ms
Max: 215.00 ms
Count: 36
Per-offset MAE:
-1000 ms → 38.20 ms error
-500 ms → 28.45 ms error
100 ms → 15.30 ms error
500 ms → 45.60 ms error
1000 ms → 52.10 ms error
[VISUAL]
...
── Cross-Method Agreement ─────────────────────────────
Mean |audio − visual| diff: 65.23 ms
% within 100ms: 72.5%
── Confidence Validation ──────────────────────────────
[AUDIO]
Pearson(confidence, error): -0.4523
MAE (all): 42.35 ms
MAE (filtered, top 80%): 35.12 ms
── Efficiency ─────────────────────────────────────────
[AUDIO]
Mean runtime: 3.45s
Per video-min: 2.12s
── Resource Usage ─────────────────────────────────────
[AUDIO]
Mean peak CPU: 45.2%
Max peak CPU: 78.5%
Mean peak memory: 256.3 MB
Max peak memory: 412.8 MB
======================================================================
Notes
- Error Metrics: All error metrics are in milliseconds for easy interpretation.
- Confidence Correlation: Negative Pearson correlation indicates higher confidence correlates with lower error (desired behavior).
- Filtering: Confidence filtering removes the bottom 20% of results by confidence score, demonstrating the value of confidence metrics.
- Sensitivity Analysis: Grouped metrics reveal performance patterns across different video characteristics.