The SSRL-ECG framework evaluates cardiovascular disease classification across five diagnostic superclasses using four complementary metrics drawn from clinical practice. Each metric addresses a different aspect of model quality, from class discrimination to signal-level robustness, giving you a complete picture of how a checkpoint will behave under real-world conditions.
Metric Definitions
Understanding what each metric measures, and why it matters clinically, is essential before interpreting results or comparing checkpoints.
AUROC (Area Under the ROC Curve)
AUROC summarises a model’s ability to discriminate between positive and negative cases across all decision thresholds. A value of 1.0 indicates perfect separation; 0.5 is random. In multi-label ECG classification, macro AUROC averages the per-class AUC scores, giving equal weight to each of the five cardiovascular conditions regardless of class frequency. This makes it the primary headline metric for comparing SSL and supervised checkpoints.
F1 Score (Macro)
F1 is the harmonic mean of precision and recall. Using macro averaging treats all five classes equally, preventing common classes like NORM from dominating the score. Sigmoid probabilities are binarised at the default threshold of 0.5 before F1 is computed. F1 is especially useful for quantifying the real-world trade-off when both false positives and false negatives carry clinical cost.
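The thresholding and macro averaging described above can be sketched in plain Python. This is an illustrative sketch, not the project's implementation: the helper names `binarise`, `f1_for_class`, and `macro_f1` are hypothetical, and the toy labels and probabilities are invented for the example.

```python
def binarise(probs, threshold=0.5):
    """Turn probabilities into 0/1 predictions (at or above threshold -> 1)."""
    return [[1 if p >= threshold else 0 for p in row] for row in probs]


def f1_for_class(y_true, y_pred, cls):
    """F1 for one class column, returning 0.0 when the score is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[cls] == 1 and p[cls] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[cls] == 0 and p[cls] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[cls] == 1 and p[cls] == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, probs, threshold=0.5):
    """Unweighted mean of the per-class F1 scores."""
    y_pred = binarise(probs, threshold)
    n_classes = len(y_true[0])
    per_class = [f1_for_class(y_true, y_pred, c) for c in range(n_classes)]
    return sum(per_class) / n_classes


# Toy data (invented): column 0 is a frequent class, column 1 a rare one.
y_true = [[1, 0], [1, 0], [1, 1], [0, 0]]
probs = [[0.9, 0.2], [0.8, 0.1], [0.7, 0.4], [0.3, 0.6]]

score = macro_f1(y_true, probs)
print(score)
```

In this toy case the frequent class is predicted perfectly (F1 of 1.0) while the rare class is missed entirely (F1 of 0.0), so the macro score is 0.5. The frequent class cannot mask the failure on the rare one.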
Sensitivity (Micro)
Sensitivity, the true positive rate, measures what fraction of actual disease cases the model correctly flags. It is computed micro-averaged across all classes and samples, meaning a single missed MI or HYP prediction directly lowers the score. For cardiovascular screening, sensitivity is critical: missed diagnoses carry higher clinical cost than false alarms.
Specificity (Micro)
Specificity, the true negative rate, measures what fraction of healthy samples are correctly left unflagged. High specificity reduces alert fatigue in clinical workflows. Like sensitivity, it is computed micro-averaged across all five classes.
multilabel_metrics() Function
All four metrics are computed by the multilabel_metrics utility in ssrl_ecg.utils. Probabilities are converted to binary predictions using a default threshold of 0.5.
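As a rough illustration of the AUROC, sensitivity, and specificity behaviour documented in this section, the NumPy sketch below reimplements those three computations from their definitions. It is not the code in ssrl_ecg.utils: the function names are hypothetical, the pairwise AUROC formula is a swapped-in textbook formulation, and returning NaN when no class is valid is an assumption.

```python
import numpy as np

EPS = 1e-8  # small constant guarding against division by zero


def auroc_one_class(y_true, y_prob):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive/negative pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


def macro_auroc(y_true, y_prob):
    """Mean per-class AUROC over classes that contain both labels."""
    n_samples, n_classes = y_true.shape
    valid = [c for c in range(n_classes) if 0 < y_true[:, c].sum() < n_samples]
    if not valid:
        return float("nan")  # assumption: behaviour with no valid class is unspecified
    return float(np.mean([auroc_one_class(y_true[:, c], y_prob[:, c]) for c in valid]))


def micro_sens_spec(y_true, y_prob, threshold=0.5):
    """Pool TP/FN/TN/FP over every sample and class, then take the two ratios."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    return tp / (tp + fn + EPS), tn / (tn + fp + EPS)


# Toy data (invented): the third class has no positives, so AUROC skips it.
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 0]])
y_prob = np.array([[0.9, 0.2, 0.1], [0.4, 0.7, 0.2], [0.6, 0.3, 0.6], [0.1, 0.4, 0.3]])

auroc = macro_auroc(y_true, y_prob)
sens, spec = micro_sens_spec(y_true, y_prob)
print(auroc, sens, spec)
```

Because the counts are pooled before dividing, one missed positive anywhere in the label matrix lowers sensitivity directly, which is the micro-averaging behaviour the metric definitions describe.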
The function takes three inputs:
- Ground-truth binary label matrix of shape (N, C), where N is the number of samples and C is the number of classes (5 for PTB-XL).
- Predicted probability matrix of shape (N, C), the output of torch.sigmoid(logits).
- Decision threshold for converting probabilities to binary predictions. Values at or above the threshold are predicted positive.
It reports four values:
- Macro-averaged F1 score across all five cardiovascular classes. Computed with zero_division=0 to handle classes absent from a batch.
- Macro-averaged AUROC. Per-class AUC is computed only for classes that have both positive and negative samples; the mean is taken over those valid classes.
- Micro-averaged sensitivity (true positive rate): TP / (TP + FN). A small epsilon (1e-8) prevents division by zero.
- Micro-averaged specificity (true negative rate): TN / (TN + FP). A small epsilon prevents division by zero.
evaluate.py CLI
The evaluate module runs a trained checkpoint against the PTB-XL test set (fold 10) and prints all four metrics. It also supports signal corruption via --noise-std and --mask-ratio for robustness testing.
Arguments
- Path to the model checkpoint file (.pt). The checkpoint must contain a model key with the full ECGClassifier state dict.
- Root directory of the PTB-XL dataset. Must contain ptbxl_database.csv, scp_statements.csv, and the records100/ or records500/ subdirectory.
- Number of samples per inference batch. Reduce if GPU memory is limited.
- Number of time-steps per ECG sample (1,000 corresponds to 10 seconds at 100 Hz).
- Standard deviation of additive Gaussian noise applied to each signal before inference (--noise-std). Set to 0.1 to simulate moderate measurement noise. Uses CorruptedWrapper internally.
- Fraction of the time dimension zeroed out as a contiguous block (--mask-ratio). Set to 0.2 to simulate 20% signal dropout. Uses CorruptedWrapper internally.
Robustness Testing with CorruptedWrapper
The CorruptedWrapper class wraps any PTBXLRecordDataset and applies on-the-fly signal degradation to test how well a checkpoint generalises under real-world noise sources such as motion artifacts or electrode dropout.
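Three corruption modes are available:
- Additive Noise: Gaussian noise with a chosen standard deviation is added to the signal.
- Temporal Masking: a contiguous block of the time dimension is zeroed out.
- Combined: both corruptions are applied to the same signal.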
The corruption is applied after loading the raw signal and before passing it to the model. This mirrors realistic inference conditions where signal quality cannot be controlled.
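The two corruption types can be sketched as a standalone NumPy function. This is a hedged illustration and not the CorruptedWrapper source: the function name `corrupt_signal`, the (leads, time) array layout, applying noise before masking, and masking the same block on every lead are all assumptions.

```python
import numpy as np


def corrupt_signal(signal, noise_std=0.0, mask_ratio=0.0, rng=None):
    """Return a corrupted copy of a (leads, time) ECG array."""
    if rng is None:
        rng = np.random.default_rng()
    out = signal.astype(np.float32, copy=True)  # never modify the input in place

    if noise_std > 0:
        # Additive Gaussian noise with zero mean and the requested std.
        out += rng.normal(0.0, noise_std, size=out.shape).astype(np.float32)

    if mask_ratio > 0:
        # Zero one contiguous block covering mask_ratio of the time axis.
        n_time = out.shape[-1]
        mask_len = int(round(mask_ratio * n_time))
        start = int(rng.integers(0, n_time - mask_len + 1))
        out[..., start:start + mask_len] = 0.0

    return out


# Toy input (invented): 12 leads, 1,000 time-steps, constant amplitude.
clean = np.ones((12, 1000))
masked = corrupt_signal(clean, mask_ratio=0.2, rng=np.random.default_rng(0))
noisy = corrupt_signal(clean, noise_std=0.1, rng=np.random.default_rng(0))

print((masked == 0).sum(axis=1))  # zeroed time-steps per lead
```

With a mask ratio of 0.2 on 1,000 time-steps, each lead loses a single 200-step block. Passing a seeded generator keeps the corruption reproducible between evaluation runs.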
Benchmark Results
The table below shows clean-signal test-set performance for all three training strategies evaluated on PTB-XL fold 10 (2,194 samples across 5 cardiovascular classes).
| Method | AUROC | F1 | Sensitivity | Specificity |
|---|---|---|---|---|
| Supervised (Focal+Oversample) | 0.8606 | 0.5750 | 0.6772 | 0.9357 |
| SimCLR + Augmentations | 0.8717 | 0.6448 | 0.6831 | 0.9411 |
| BYOL + Augmentations | 0.8565 | 0.6301 | 0.6648 | 0.9278 |
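
To make the comparison against the supervised baseline explicit, the short script below subtracts the supervised row from each SSL row. The numbers are copied from the table above; the dictionary layout and variable names are only illustrative.

```python
# Scores copied from the benchmark table.
results = {
    "Supervised (Focal+Oversample)": {
        "AUROC": 0.8606, "F1": 0.5750, "Sensitivity": 0.6772, "Specificity": 0.9357,
    },
    "SimCLR + Augmentations": {
        "AUROC": 0.8717, "F1": 0.6448, "Sensitivity": 0.6831, "Specificity": 0.9411,
    },
    "BYOL + Augmentations": {
        "AUROC": 0.8565, "F1": 0.6301, "Sensitivity": 0.6648, "Specificity": 0.9278,
    },
}

baseline_name = "Supervised (Focal+Oversample)"
baseline = results[baseline_name]

# Absolute difference from the supervised baseline, rounded to 4 decimals.
deltas = {
    name: {metric: round(score - baseline[metric], 4) for metric, score in scores.items()}
    for name, scores in results.items()
    if name != baseline_name
}

for name, diff in deltas.items():
    print(name, diff)
```

SimCLR improves on the supervised baseline in all four metrics, with the largest gain in F1 (+0.0698). BYOL also improves F1 (+0.0551) but falls slightly below the baseline on AUROC, sensitivity, and specificity.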
Per-Class Coverage
All five diagnostic classes exceed the 0.61 sensitivity threshold after SimCLR fine-tuning:
- NORM: Normal sinus rhythm, the largest class (9,514 training samples)
- MI: Myocardial infarction (5,469 training samples)
- STTC: ST/T-wave changes (5,235 training samples)
- HYP: Left ventricular hypertrophy (2,649 training samples)
- CD: Conduction disturbance (4,898 training samples)
Multi-Seed Validation
To confirm that results are not seed-dependent, SimCLR fine-tuning was repeated across 10 random seeds using scripts/run_multiseed_training.py.
| Metric | Mean | Std | 95% CI |
|---|---|---|---|
| AUROC (macro) | 0.8717 | ±0.0032 | 0.8671 – 0.8763 |
| F1 (macro) | 0.6448 | ±0.0181 | — |
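The narrow 95% confidence interval on AUROC (0.0092 wide) indicates that the SimCLR result is stable across seeds rather than the product of a single lucky seed. The supervised baseline's AUROC of 0.8606 in the benchmark table also falls below the interval's lower bound of 0.8671.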
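
For readers who want to summarise their own multi-seed runs, the sketch below computes a mean, sample standard deviation, and a Student-t 95% interval using only the standard library. The page does not state how its interval was derived, so the t-interval is an assumption and is not expected to reproduce the bounds in the table. The per-seed values are invented placeholders, and `summarise_seeds` is a hypothetical helper name.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 9 degrees of freedom (10 seeds).
T_CRIT_DF9 = 2.262


def summarise_seeds(values, t_crit):
    """Return (mean, sample std, (ci_low, ci_high)) for per-seed scores."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    half_width = t_crit * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)


# Placeholder AUROC values for 10 seeds, invented for illustration only.
seed_aurocs = [0.870, 0.874, 0.868, 0.875, 0.871, 0.873, 0.869, 0.876, 0.872, 0.870]

mean, std, (ci_low, ci_high) = summarise_seeds(seed_aurocs, T_CRIT_DF9)
print(f"mean={mean:.4f} std={std:.4f} 95% CI=({ci_low:.4f}, {ci_high:.4f})")
```

The critical value depends on the number of seeds, so it should be replaced if a different seed count is used.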