Self-supervised learning (SSL) offers a powerful solution to the chronic label scarcity in clinical cardiology. Rather than relying on expensive expert annotations for every waveform, SSL first teaches a neural network to recognize structural similarities between different augmented views of the same ECG recording. The resulting encoder captures rhythm, morphology, and inter-lead relationships that transfer directly to downstream disease classification, with no labels needed during pretraining.
The Two-View Contrastive Framework
The core idea is elegantly simple: for every ECG sample in a training batch, apply two independent stochastic augmentation pipelines to produce two “views”, perturbed versions that look different on the surface but must share the same underlying cardiac identity. The encoder is then trained so that embeddings of the same sample’s two views are pulled together in representation space, while embeddings of different samples are pushed apart.
ECGAugmentations.__call__() encapsulates exactly this step. It accepts a tensor of shape [batch, channels, time] and returns two independently augmented copies (x1, x2) ready for the SSL objective.
Creating Two Views with SimCLRAugmentations
SimCLRAugmentations is a thin wrapper around ECGAugmentations that forwards both calls through the same domain-adaptive pipeline. The prob=0.8 parameter controls how often the heavier strong-augmentation branch fires (see Domain-Adaptive Augmentations for the full breakdown).
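The two-view step can be sketched without any project code. The dependency-free example below uses nested lists laid out as [batch][channels][time] in place of a tensor. The helper names (jitter, scale_and_jitter, augment, two_views) and the noise levels are illustrative placeholders, not the augmentations SSRL-ECG actually applies. Only the prob=0.8 gate and the two independent passes come from the description above; drawing the gate once per recording is an assumption.

```python
import random

def jitter(signal, rng, sigma=0.01):
    """Weak placeholder augmentation: add small Gaussian noise to every sample."""
    return [x + rng.gauss(0.0, sigma) for x in signal]

def scale_and_jitter(signal, rng, sigma=0.05):
    """Strong placeholder augmentation: random amplitude scaling plus heavier noise."""
    gain = rng.uniform(0.8, 1.2)
    return [gain * x + rng.gauss(0.0, sigma) for x in signal]

def augment(recording, rng, prob=0.8):
    """Augment one [channels][time] recording; the strong branch fires with probability prob."""
    branch = scale_and_jitter if rng.random() < prob else jitter
    return [branch(lead, rng) for lead in recording]

def two_views(batch, rng, prob=0.8):
    """Return two independently augmented copies of a [batch][channels][time] nested list."""
    x1 = [augment(rec, rng, prob) for rec in batch]
    x2 = [augment(rec, rng, prob) for rec in batch]
    return x1, x2

rng = random.Random(0)
# Toy batch: 4 recordings, 12 leads, 1000 time steps each (sizes chosen for illustration).
batch = [[[0.0] * 1000 for _ in range(12)] for _ in range(4)]
x1, x2 = two_views(batch, rng)
```

Because each pass draws its own random numbers, x1 and x2 differ even though they come from the same recordings, and both keep the input layout.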
SimCLR: Contrastive Learning Without Labels
SimCLR (Chen et al., ICML 2020) is the recommended SSL framework in SSRL-ECG, achieving AUROC 0.8717 on PTB-XL after fine-tuning on just 10% of labeled data.
Architecture
The SimCLRModel wraps any encoder backbone (default: ECGEncoder1DCNN with 256-dim output) with a two-layer SimCLRProjectionHead.
The encoder produces the 256-dimensional representation h after global average pooling. The projection head maps this to a 128-dimensional unit sphere vector z, which is used exclusively during the contrastive pretraining phase and discarded at fine-tuning time.
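To make the shapes concrete, here is a rough, dependency-free sketch of a two-layer head that maps a 256-dim h to a 128-dim unit-norm z. The hidden width of 256, the ReLU nonlinearity, the random initialization and the function names are assumptions for illustration; the actual SimCLRProjectionHead may differ in all of them.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: weights is [out][in], x is a vector of length in."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    """Small uniform random weights and zero biases (illustrative initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def projection_head(h, layer1, layer2):
    """Two-layer MLP followed by L2 normalization onto the unit sphere."""
    hidden = [max(0.0, v) for v in linear(h, *layer1)]  # ReLU is an assumption
    z = linear(hidden, *layer2)
    norm = math.sqrt(sum(v * v for v in z)) or 1.0      # guard against a zero vector
    return [v / norm for v in z]

rng = random.Random(0)
layer1 = init_layer(256, 256, rng)   # hidden width 256 is an assumption
layer2 = init_layer(256, 128, rng)
h = [rng.gauss(0.0, 1.0) for _ in range(256)]  # stand-in for an encoder output
z = projection_head(h, layer1, layer2)
```

Normalizing z means that the dot product of two projections is their cosine similarity, which is the quantity the contrastive loss compares.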
NT-Xent Loss
The Normalized Temperature-scaled Cross Entropy (NT-Xent) loss treats the two augmented views of the same sample as a positive pair, and all other 2(N−1) samples in the batch as negatives:
- Normalize both projection vectors: z̃ = z / ‖z‖₂
- Build a 2N × 2N cosine similarity matrix across the concatenated batch
- Scale by temperature τ = 0.07 (a low τ sharpens the distribution, enforcing tight clusters)
- Apply cross-entropy: the “correct class” for each view is its paired counterpart
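The four steps above can be sketched in plain Python. This is a minimal reference version written with lists instead of tensors, not the project's implementation; the function names are illustrative, and excluding each sample's similarity with itself from the softmax follows the standard SimCLR formulation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def nt_xent(z1, z2, temperature=0.07):
    """NT-Xent loss for two lists of N projection vectors (view 1 and view 2)."""
    n = len(z1)
    z = [l2_normalize(v) for v in z1 + z2]        # concatenated batch of 2N unit vectors
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)              # index of the paired view
        # Temperature-scaled cosine similarities to every other sample (self excluded).
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in range(2 * n) if j != i}
        peak = max(logits.values())               # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += log_denom - logits[positive]     # cross-entropy with the pair as target
    return total / (2 * n)

# Views that agree per sample versus views paired with the wrong sample.
aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the aligned pairs give a loss close to zero, while the mismatched pairs are penalized heavily because the low temperature amplifies the similarity gap.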
The projection head output z (dim=128) is used only for the NT-Xent loss during pretraining. Downstream classification always uses the encoder representation h (dim=256), which retains richer structural information.
BYOL: Momentum-Based Learning Without Negatives
BYOL (Bootstrap Your Own Latent, Grill et al., NeurIPS 2020) eliminates the need for negative pairs entirely, instead training an online network to predict the representations produced by a slowly updating target network.
Online vs. Target Network
Online Network
Parameters are updated by gradient descent every step. Consists of: encoder → online_projector → online_predictor. The predictor is the key asymmetry: it exists only in the online branch.
Target Network
Parameters are never directly trained; they are updated only via exponential moving average (EMA) of the online encoder and projector weights. No predictor head. Produces stable regression targets.
Momentum Update
After each gradient step, the target network weights are updated with EMA: θ_target ← τ · θ_target + (1 − τ) · θ_online. With momentum-tau = 0.999, the target network changes very slowly; this slow-moving teacher helps prevent representational collapse without any explicit negative pairs.
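A small sketch of the momentum update, assuming parameters are stored as a flat name-to-value mapping. The function name ema_update and the toy parameter names are hypothetical; in a PyTorch model the same rule would be applied tensor by tensor without tracking gradients.

```python
def ema_update(target, online, tau=0.999):
    """In-place EMA: target <- tau * target + (1 - tau) * online, per parameter."""
    for name, online_value in online.items():
        target[name] = tau * target[name] + (1.0 - tau) * online_value
    return target

# Toy parameters: the online weights are held fixed to show how slowly the target follows.
online = {"encoder.w": 1.0, "projector.w": -2.0}
target = {"encoder.w": 0.0, "projector.w": 0.0}
for _ in range(1000):
    ema_update(target, online)
```

With the online weights held fixed, the target closes a fraction 1 − 0.999^1000 of the gap after 1000 steps, which is roughly 63%. This illustrates how slowly the teacher moves at τ = 0.999.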
BYOL Loss
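As a reference for the loss this section describes, here is a dependency-free sketch of the normalized regression objective. The function names are illustrative. Summing the two view orderings follows the original BYOL paper and is an assumption about this project, as is treating the target projections as constants with no gradient flowing through them.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def byol_regression(prediction, target_projection):
    """Squared L2 distance between unit vectors, equal to 2 - 2 * cosine similarity."""
    q = l2_normalize(prediction)
    z = l2_normalize(target_projection)
    return sum((a - b) ** 2 for a, b in zip(q, z))

def byol_loss(q1, z2_target, q2, z1_target):
    """Symmetrized loss: each view's online prediction regresses onto the other view's target."""
    return byol_regression(q1, z2_target) + byol_regression(q2, z1_target)

same_direction = byol_regression([1.0, 0.0], [2.0, 0.0])   # parallel vectors
orthogonal = byol_regression([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = byol_regression([1.0, 0.0], [-1.0, 0.0])        # antiparallel vectors
```

The loss depends only on the angle between the two vectors, so it ranges from 0 for parallel vectors to 4 for antiparallel ones regardless of their magnitudes.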
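The loss minimizes the L2 distance between the normalized output of the online predictor and the normalized output of the target projector, with each output computed on a different view of the same sample. For unit-normalized vectors q (online prediction) and z′ (target projection), the squared distance equals 2 − 2·cos(q, z′).
SimCLR vs. BYOL: Side-by-Side Comparison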
| Property | SimCLR | BYOL |
|---|---|---|
| Algorithm | Contrastive (NT-Xent) | Momentum / Bootstrapping |
| Loss type | NT-Xent (cross-entropy over similarities) | Normalized regression (L2) |
| Negative pairs required | Yes — all other batch samples | No — target network prevents collapse |
| Temperature parameter | τ = 0.07 | — |
| Momentum parameter | — | τ = 0.999 |
| Projection dim | 128 | 256 |
| Recommended batch size | 128 | 256 |
| PTB-XL AUROC (10% labels) | 0.8717 | 0.8565 |
| PTB-XL F1 (10% labels) | 0.6448 | 0.6301 |
Fine-Tuning Protocol: Label-Efficient Classification
After pretraining, SSRL-ECG adopts a linear probing protocol to measure the quality of the learned representations in a label-efficient setting.
Pretrain the encoder (SSL phase)
Run SimCLR or BYOL pretraining on the full unlabeled PTB-XL training set (folds 1–8, 17,489 samples). Only the augmented views and the SSL objective are used — no class labels are seen.
Freeze the encoder weights
Load the pretrained encoder checkpoint. All encoder parameters are frozen, so no gradients flow back into the backbone while the classifier is trained. This tests whether the representation is already linearly separable for the 5-class ECG task.
Train a linear classifier on 10% labeled data
A single linear layer is added on top of the frozen 256-dim encoder output and trained on only 1,747 labeled samples (10% of the training folds). This simulates a realistic low-annotation clinical deployment scenario.
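The freeze-then-probe idea can be illustrated with a toy, dependency-free sketch. Everything here is a stand-in: the encoder is a fixed random tanh projection rather than a pretrained network, the data and labels are synthetic, and softmax cross-entropy is assumed as the objective (a multi-label setup would use one sigmoid per class instead). The point is only that the encoder weights are never touched while the linear head is trained.

```python
import math
import random

def frozen_encoder(x, enc_weights):
    """Stand-in for the pretrained backbone: a fixed tanh projection, never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in enc_weights]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_linear_probe(features, labels, n_classes, epochs=50, lr=0.1):
    """Fit only the linear head (W, b) with SGD on softmax cross-entropy."""
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    history = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for h, y in zip(features, labels):
            logits = [sum(w * v for w, v in zip(W[c], h)) + b[c] for c in range(n_classes)]
            probs = softmax(logits)
            epoch_loss -= math.log(probs[y])
            for c in range(n_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)  # gradient w.r.t. logit c
                b[c] -= lr * grad
                for k in range(dim):
                    W[c][k] -= lr * grad * h[k]
        history.append(epoch_loss / len(features))
    return W, b, history

rng = random.Random(0)
enc_weights = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]  # toy 8 -> 16 encoder
snapshot = [row[:] for row in enc_weights]
inputs = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(40)]
features = [frozen_encoder(x, enc_weights) for x in inputs]   # computed once, no gradients
# Toy labels derived from the features so that 5 classes are linearly separable.
labels = [max(range(5), key=lambda c: h[c]) for h in features]
W, b, history = train_linear_probe(features, labels, n_classes=5)
```

Since the features are computed once from the frozen encoder, the probe's accuracy reflects only how linearly separable the pretrained representation already is.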
The label-efficient setting uses 1,747 labeled samples (10% of folds 1–8). This is intentionally constrained to demonstrate SSL’s advantage over fully supervised training, which achieves only AUROC 0.8606 / F1 0.5750 under the same budget.