Benchmarking SSL vs Supervised Learning with Limited Labels

Annotating ECG recordings for cardiovascular disease requires a cardiologist to review each 10-second trace and assign one or more of the five diagnostic superclasses. At scale this is expensive, time-consuming, and subject to inter-annotator variability. The label scarcity benchmark in SSRL-ECG quantifies exactly how much SSL pretraining helps when only a small fraction of those annotations is available at fine-tuning time.

The Label Scarcity Problem in Clinical ECG Annotation

The PTB-XL training set contains 17,489 labelled ECG recordings. In practice, a new clinical deployment might start with far fewer confirmed labels, perhaps a few hundred from a local institution. The critical question is whether an encoder pretrained on the unlabelled signal structure (via SimCLR or BYOL) can compensate for the lack of ground-truth annotations during supervised fine-tuning. The benchmark sweeps label fractions from 1% to 100%, from roughly 175 samples up to the full 17,489. At 10% (about 1,747 samples), SimCLR fine-tuning already achieves an AUROC of 0.8717, matching the supervised baseline trained on the complete labelled set.

10% of the 17,489-sample training set is about 1,747 labelled samples, spread across five cardiovascular superclasses with natural class imbalance (NORM is 3.32× more frequent than HYP).

`LabelScarcityBenchmark` Class

LabelScarcityBenchmark orchestrates three parallel training tracks at each label fraction:

Supervised baseline: an ImprovedECGClassifier trained from scratch using only the available labels, with no pretraining.
SSL fine-tuned: the pretrained SSL encoder with all weights unfrozen during fine-tuning on the labelled subset.
SSL frozen: the pretrained SSL encoder with its weights frozen; only the classification head is trained (linear probing).

Results for each track and seed are written to label_scarcity_results/label_scarcity_benchmark.json.

from pathlib import Path
from ssrl_ecg.label_scarcity_benchmark import LabelScarcityBenchmark

benchmark = LabelScarcityBenchmark(
    data_root=Path("data/PTB-XL"),
    checkpoint_dir=Path("checkpoints"),
    results_dir=Path("label_scarcity_results"),
)

results = benchmark.run_label_scarcity_benchmark(
    label_fractions=[0.01, 0.05, 0.1, 0.25, 1.0],
    seeds=[42, 52, 62],
    epochs=40,
)

Constructor Parameters

data_root (Path, required)
Root directory of the PTB-XL dataset. Passed to load_ptbxl_metadata and PTBXLRecordDataset.

checkpoint_dir (Path, required)
Directory containing pretrained SSL checkpoints. The benchmark looks for ssl_masked.pt by default. If the checkpoint is absent, only the supervised track runs.

results_dir (Path, required)
Output directory for the JSON results file. Created automatically if it does not exist.

`run_label_scarcity_benchmark()` Parameters

label_fractions (list[float], default: [0.01, 0.05, 0.1, 0.25, 1.0])
List of label fractions to sweep. Each value is the proportion of the training set used, e.g. 0.1 = 10% ≈ 1,747 samples.

seeds (list[int], default: [42, 52, 62])
Random seeds for reproducible sampling of labelled indices. Mean and standard deviation are reported across seeds.

epochs (int, default: 40)
Maximum training epochs per run. Early stopping with a patience of 10 epochs is applied automatically.
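
The interaction between the epoch cap and early stopping can be sketched as follows. This is a minimal illustration, assuming the monitored quantity is a validation score where higher is better; the EarlyStopper class and the epochs_run helper are hypothetical names and are not taken from the SSRL-ECG code.

```python
class EarlyStopper:
    """Hypothetical helper: stop when the monitored score stops improving."""

    def __init__(self, patience: int = 10) -> None:
        self.patience = patience
        self.best_score = float("-inf")
        self.epochs_without_improvement = 0

    def should_stop(self, score: float) -> bool:
        """Record this epoch's score; return True once patience is exhausted."""
        if score > self.best_score:
            self.best_score = score
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


def epochs_run(val_scores: list[float], max_epochs: int = 40, patience: int = 10) -> int:
    """Return how many epochs run before early stopping or the epoch cap."""
    stopper = EarlyStopper(patience=patience)
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if stopper.should_stop(score):
            return epoch
    return min(len(val_scores), max_epochs)
```

With a patience of 10, a run whose best validation score occurs at epoch 2 stops at epoch 12, well before the cap of 40.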

Running the Benchmark

The benchmark is integrated into the run_experiments.py script at the project root, which runs all phases in sequence. The individual phases can also be run by hand:

Train the SSL encoder

Pretrain a SimCLR encoder on the full PTB-XL training set (unlabelled).

python -m ssrl_ecg.train_ssl_simclr \
  --data-root data/PTB-XL \
  --epochs 20 \
  --batch-size 128 \
  --out checkpoints/ssl_masked.pt

Run the label scarcity benchmark

Sweep label fractions and compare SSL vs supervised across three seeds.

python -m ssrl_ecg.label_scarcity_benchmark

This uses the defaults: data_root=data/PTB-XL, checkpoint_dir=checkpoints, results_dir=label_scarcity_results, fractions [0.05, 0.1, 0.25, 1.0], seeds [42, 52, 62], epochs 30.

Review results

Results are printed to stdout as a summary table and saved to label_scarcity_results/label_scarcity_benchmark.json.

[1.0% Labeled Data]
  supervised           AUROC: 0.7234±0.0187 | F1: 0.3812±0.0241
  ssl_finetuned        AUROC: 0.8051±0.0143 | F1: 0.5234±0.0198
  ssl_frozen           AUROC: 0.7889±0.0156 | F1: 0.4967±0.0211

[10.0% Labeled Data]
  supervised           AUROC: 0.8606±0.0034 | F1: 0.5750±0.0121
  ssl_finetuned        AUROC: 0.8717±0.0032 | F1: 0.6448±0.0181
  ssl_frozen           AUROC: 0.8512±0.0041 | F1: 0.5981±0.0163
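
Each summary line above condenses the per-seed scores into a mean and a standard deviation. The sketch below shows one way such a line could be produced. It assumes the population standard deviation (the NumPy default) and a four-decimal format; the summarise helper is hypothetical, and the per-seed values are placeholders, not benchmark output.

```python
from statistics import mean, pstdev


def summarise(scores: list[float]) -> str:
    """Format per-seed scores as 'mean±std' with four decimals."""
    return f"{mean(scores):.4f}±{pstdev(scores):.4f}"


# Placeholder AUROC values for seeds 42, 52, 62 (illustrative only).
per_seed_auroc = [0.80, 0.82, 0.84]
print(f"  {'ssl_finetuned':<20} AUROC: {summarise(per_seed_auroc)}")
```

With only three seeds the standard deviation is itself a noisy estimate, so small differences between tracks should be read with care.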

Why SSL Gains Are Largest at Low Label Fractions

At high label fractions (e.g. 100%), a supervised model trained from scratch has enough data to learn a good representation on its own, so the SSL advantage narrows but remains positive. At very low label fractions (1–5%), a supervised model struggles to capture the temporal structure of ECG waveforms from a few hundred examples. The SSL encoder, pretrained on the full unlabelled training set of 17,489 signals, already encodes heartbeat morphology, frequency bands, and inter-channel correlations, so fine-tuning mostly has to fit a classification head and adapt existing features rather than learn them from scratch.

At 1–5% label fractions, expect SSL fine-tuned to outperform supervised by roughly 5–8 AUROC points (0.8051 vs 0.7234 at 1% in the sample output above). The gap shrinks progressively as more labels are added (about 1 point at 10%), and the two tracks converge near the 100% mark.

The finetune_ssl method supports an optional freeze_encoder=True flag for linear probing, which is faster but slightly weaker than full fine-tuning at low label fractions:

# Full fine-tuning: all encoder weights updated
model = benchmark.finetune_ssl(ssl_encoder, label_fraction=0.05, freeze_encoder=False)

# Linear probing: only the classification head is trained
model = benchmark.finetune_ssl(ssl_encoder, label_fraction=0.05, freeze_encoder=True)
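
What freezing means in practice can be sketched in plain PyTorch. This is an illustration under stated assumptions, not the internals of finetune_ssl: the set_encoder_trainable helper is hypothetical, the encoder and head are small stand-in modules, and the input shape assumes 12-lead traces sampled at 100 Hz for 10 seconds.

```python
import torch
from torch import nn


def set_encoder_trainable(encoder: nn.Module, trainable: bool) -> None:
    """Enable or disable gradient updates for every encoder parameter."""
    for param in encoder.parameters():
        param.requires_grad = trainable


# Stand-in modules; the real SSL encoder and head in SSRL-ECG will differ.
encoder = nn.Sequential(
    nn.Conv1d(12, 16, kernel_size=7, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
)
head = nn.Linear(16, 5)  # one logit per diagnostic superclass

set_encoder_trainable(encoder, trainable=False)  # linear probing

# Hand only the parameters that still require gradients to the optimizer.
all_params = list(encoder.parameters()) + list(head.parameters())
trainable_params = [p for p in all_params if p.requires_grad]
optimizer = torch.optim.Adam(trainable_params, lr=1e-3)

logits = head(encoder(torch.randn(4, 12, 1000)))  # batch of 4 traces
```

Calling set_encoder_trainable(encoder, trainable=True) before building the optimizer gives the full fine-tuning variant, in which the encoder weights are updated as well.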

Data Split Details

Training pool

17,489 samples (PTB-XL folds 1–8). Label fraction is applied here via stratified sampling with sample_labelled_indices.
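
The idea behind stratified sampling of a label fraction can be sketched with the standard library. This is not the sample_labelled_indices implementation: the stratified_sample function is hypothetical, and it assumes each record has a single class label, whereas PTB-XL records can carry several superclasses.

```python
import random
from collections import defaultdict


def stratified_sample(labels: list[str], fraction: float, seed: int) -> list[int]:
    """Pick roughly `fraction` of the indices from each class, reproducibly."""
    rng = random.Random(seed)
    by_class: dict[str, list[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen: list[int] = []
    for label in sorted(by_class):  # sorted for a deterministic class order
        indices = by_class[label]
        k = max(1, round(fraction * len(indices)))  # at least one per class
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy pool with the 3.32:1 NORM-to-HYP ratio mentioned earlier.
labels = ["NORM"] * 332 + ["HYP"] * 100
picked = stratified_sample(labels, fraction=0.1, seed=42)
```

Sampling per class keeps the class ratio of the subset close to that of the full pool, which matters most at the 1% fraction where a rare class could otherwise vanish.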

Validation set

2,154 samples (PTB-XL fold 9). Used for early stopping and learning rate scheduling during all training runs.

Test set

2,194 samples (PTB-XL fold 10). Held out completely; used only for final metric computation via evaluate_model.
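
For readers unfamiliar with the metric, AUROC on multi-label predictions can be sketched as below. This is not the evaluate_model implementation: the binary_auroc and macro_auroc functions are hypothetical, and macro-averaging across classes is an assumption about how the per-class scores are combined.

```python
def binary_auroc(y_true: list[int], y_score: list[float]) -> float:
    """AUROC as the chance a positive outranks a negative (ties count half)."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUROC is undefined without both classes present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def macro_auroc(y_true: list[list[int]], y_score: list[list[float]]) -> float:
    """Average the per-class AUROC over the label columns."""
    n_classes = len(y_true[0])
    per_class = [
        binary_auroc([row[c] for row in y_true], [row[c] for row in y_score])
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes
```

The pairwise comparison is quadratic in the number of samples, which is acceptable for a sketch; a library routine such as scikit-learn's roc_auc_score is the usual choice in practice.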

Label fraction reference

1% ≈ 175 samples · 5% ≈ 875 samples · 10% ≈ 1,747 samples · 25% ≈ 4,372 samples · 100% = 17,489 samples

The benchmark can be time-consuming at high epoch counts and many seeds. For a quick exploratory run, use label_fractions=[0.05, 0.1, 1.0], seeds=[42], and epochs=15 to get indicative results in under an hour on a modern GPU.
