
MoCaE (Mixture of Calibrated Experts) is MammoMix’s ensemble strategy. Instead of relying on a single model, it trains one YOLOS expert per mammography dataset, calibrates each expert’s raw confidence scores against true IoU, then merges all predictions with Soft-NMS and Score Voting.

Pipeline overview

1. Train one expert per dataset

Each expert is a YOLOS model fine-tuned on a specific mammography dataset. MoCaE ships with three experts configured in CONFIGS:
mocae.py
CONFIGS = [
    {
        'dataset_name': 'CSAW',
        'saved_dir': './Weights/yolos_CSAW',
        'model': None,
        'image_processor': None,
        'calibrator': None,
        'calibrator_dataset': None,
    },
    {
        'dataset_name': 'DDSM',
        'saved_dir': './Weights/yolos_DDSM',
        'model': None,
        'image_processor': None,
        'calibrator': None,
        'calibrator_dataset': None,
    },
    {
        'dataset_name': 'DMID',
        'saved_dir': './Weights/yolos_DMID',
        'model': None,
        'image_processor': None,
        'calibrator': None,
        'calibrator_dataset': None,
    },
]
At startup, mocae.py loads each expert with AutoModelForObjectDetection.from_pretrained pointing at saved_dir, sets the model to .eval(), and moves it to the available device.
All three weight directories (./Weights/yolos_CSAW, ./Weights/yolos_DDSM, ./Weights/yolos_DMID) must exist and contain a valid Hugging Face checkpoint before running mocae.py; the script raises an OSError if any path is missing.
2. Build calibration datasets and train calibrators

A raw confidence score from a detector does not reliably predict IoU with the ground truth. MoCaE fits a RandomForest calibrator per expert to map (image_embedding, confidence) → predicted IoU.

Feature construction

For every predicted box that passes the 0.5 confidence threshold, MoCaE constructs a 513-dimensional input vector:
mocae.py
calibrator_input = np.concatenate([
    [embedding.cpu().numpy() for _ in range(len(pred_boxes))],  # 512-dim ResNet-18 embedding
    pred_scores.numpy().reshape(-1, 1)                           # 1-dim raw confidence
], axis=-1)
# shape: [num_boxes, 513]
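The same construction can be reproduced in isolation with dummy data to confirm the shape; here np.tile stands in for the list comprehension above, and the array sizes are illustrative:

```python
import numpy as np

# Illustrative stand-ins: one 512-dim image embedding, 4 predicted boxes.
embedding = np.random.rand(512).astype(np.float32)
pred_scores = np.random.rand(4).astype(np.float32)

# Repeat the image embedding once per box, then append each box's raw
# confidence as a final column, giving a [num_boxes, 513] matrix.
calibrator_input = np.concatenate([
    np.tile(embedding, (len(pred_scores), 1)),  # [4, 512]
    pred_scores.reshape(-1, 1),                 # [4, 1]
], axis=-1)

print(calibrator_input.shape)  # (4, 513)
```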
The 512-dimensional image embedding comes from a ResNet-18 with its classification head replaced by torch.nn.Identity():
mocae.py
feature_extractor = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
feature_extractor.fc = torch.nn.Identity()
feature_extractor.eval()
feature_extractor.to(device)
Calibrator training
mocae.py
calibrator = RandomForestRegressor(n_estimators=300, n_jobs=-1)
calibrator.fit(inputs_val, ious_val)

calibrator_path = os.path.join(config['saved_dir'], 'calibrator.pkl')
with open(calibrator_path, 'wb') as f:
    pickle.dump(calibrator, f)
config['calibrator'] = calibrator
The calibrator is fitted on validation-split IoU values and serialised as calibrator.pkl inside the expert’s saved_dir. At inference time it replaces the expert’s raw pred_scores with predicted IoU values, making scores from different experts directly comparable.
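As a hedged end-to-end sketch of the fit-then-predict cycle with synthetic data (the shapes, random features, and 50-tree forest here are illustrative; the script fits 300 trees on real validation features):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 calibration rows of (512-dim embedding + confidence)
# paired with their matched IoU targets in [0, 1].
inputs_val = rng.random((200, 513)).astype(np.float32)
ious_val = rng.random(200).astype(np.float32)

# Smaller forest than the real 300 trees, purely for a quick sketch.
calibrator = RandomForestRegressor(n_estimators=50, n_jobs=-1)
calibrator.fit(inputs_val, ious_val)

# At inference, the raw-confidence column of each expert's feature rows is
# replaced by the calibrator's predicted IoU, putting all experts on one scale.
test_inputs = rng.random((4, 513)).astype(np.float32)
calibrated_scores = calibrator.predict(test_inputs)
print(calibrated_scores.shape)  # (4,)
```

Because the regression targets are IoU values in [0, 1], the calibrated scores stay in that range, which is what makes scores from different experts comparable.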
3. Combine predictions with Soft-NMS and Score Voting

At inference time combine_predictions merges all expert outputs for each image.

Soft-NMS

Standard NMS hard-removes any box whose IoU with the top-scoring box exceeds a threshold. Soft-NMS decays scores instead, preserving potentially valid detections:
mocae.py
combined_boxes, combined_scores = soft_nms(
    torch.cat(combined_boxes, dim=0),
    torch.cat(combined_scores, dim=0),
    sigma_nms=0.08,
    iou_nms=0.65,
    score_thresh=0,
    method='gaussian',
)
| Parameter | Value | Effect |
| --- | --- | --- |
| sigma_nms | 0.08 | Gaussian decay width; smaller values suppress overlapping boxes faster |
| iou_nms | 0.65 | IoU threshold used only in linear mode; ignored in the default gaussian mode |
| method | 'gaussian' | Score decay formula: score × exp(−iou² / sigma_nms) |
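The body of soft_nms is not shown here; a minimal NumPy sketch of the gaussian variant it describes (the function names and IoU helper below are illustrative, not mocae.py's actual code) could look like:

```python
import numpy as np

def iou_1_to_many(box, boxes):
    # IoU of one [x1, y1, x2, y2] box against each row of `boxes`.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-8)

def soft_nms_gaussian(boxes, scores, sigma_nms=0.08, score_thresh=0.0):
    boxes, scores = boxes.copy(), scores.copy()
    kept_boxes, kept_scores = [], []
    while scores.size > 0:
        i = int(np.argmax(scores))          # take the current best box
        kept_boxes.append(boxes[i])
        kept_scores.append(scores[i])
        boxes = np.delete(boxes, i, axis=0)
        scores = np.delete(scores, i)
        if scores.size == 0:
            break
        overlaps = iou_1_to_many(kept_boxes[-1], boxes)
        # Gaussian decay instead of hard removal: score *= exp(-iou^2 / sigma)
        scores = scores * np.exp(-(overlaps ** 2) / sigma_nms)
        keep = scores > score_thresh
        boxes, scores = boxes[keep], scores[keep]
    return np.stack(kept_boxes), np.array(kept_scores)

# Two heavily overlapping detections: the weaker one survives but decays.
boxes = np.array([[0., 0., 10., 10.], [1., 1., 11., 11.]])
scores = np.array([0.9, 0.8])
out_boxes, out_scores = soft_nms_gaussian(boxes, scores)
print(out_scores)  # top score kept at 0.9; second score strongly suppressed
```

With sigma_nms=0.08 the decay is steep, so near-duplicate boxes end up with near-zero scores while genuinely distinct detections are untouched.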
Score Voting

After Soft-NMS, Score Voting refines each surviving box’s coordinates by taking a weighted average of all nearby boxes, weighted by IoU similarity and calibrated score:
mocae.py
combined_boxes, combined_scores = score_voting(
    combined_boxes, combined_scores, sigma_sv=0.08
)
The sigma_sv parameter controls the width of the IoU-based Gaussian weight:
mocae.py
def score_voting(boxes, scores, sigma_sv=0.1):
    iou_matrix = box_iou(boxes, boxes)
    iou_weights = torch.exp(-((1 - iou_matrix) ** 2) / sigma_sv)
    iou_weights.fill_diagonal_(0)   # exclude self-influence
    weights = scores.unsqueeze(1) * iou_weights
    numerator = (weights.unsqueeze(2) * boxes.unsqueeze(1)).sum(dim=1)
    denominator = weights.sum(dim=1, keepdim=True)
    refined_boxes = numerator / (denominator + 1e-8)
    refined_scores = (scores.unsqueeze(1) * iou_weights).sum(dim=1) / (iou_weights.sum(dim=1) + 1e-8)
    return refined_boxes, refined_scores
Score Voting is especially effective when multiple experts detect the same lesion at slightly different box coordinates. The weighted average nudges the final box toward the consensus location, reducing localisation error.
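To see the coordinate refinement numerically, the torch arithmetic above can be replayed in NumPy on three synthetic, overlapping detections (the boxes, scores, and IoU helper here are illustrative):

```python
import numpy as np

def box_iou_np(a, b):
    # Pairwise IoU matrix for two sets of [x1, y1, x2, y2] boxes.
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-8)

# Three experts detect the same lesion at slightly shifted coordinates.
boxes = np.array([[0., 0., 10., 10.],
                  [1., 1., 11., 11.],
                  [2., 2., 12., 12.]])
scores = np.array([0.8, 0.9, 0.7])

# Same steps as score_voting above, in NumPy.
sigma_sv = 0.08
iou_matrix = box_iou_np(boxes, boxes)
iou_weights = np.exp(-((1 - iou_matrix) ** 2) / sigma_sv)
np.fill_diagonal(iou_weights, 0)                      # exclude self-influence
weights = scores[:, None] * iou_weights
numerator = (weights[:, :, None] * boxes[None, :, :]).sum(axis=1)
denominator = weights.sum(axis=1, keepdims=True)
refined_boxes = numerator / (denominator + 1e-8)
print(refined_boxes[1])  # the middle box sits at the neighbours' consensus
```

Here the middle box's neighbours are symmetric around it, so its refined coordinates land exactly at [1, 1, 11, 11], while the outer boxes are pulled toward the cluster.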

Running the full ensemble

After all three steps above complete, mocae.py evaluates the combined pipeline on each dataset’s test split:
mocae.py
models = [config['model'] for config in CONFIGS]
calibrators = [config['calibrator'] for config in CONFIGS]
image_processors = {config['dataset_name']: config['image_processor'] for config in CONFIGS}

for dataset_name in tqdm(DATASET_NAMES):
    print(combine_predictions(image_processors, models, calibrators, dataset_name, SPLITS_DIR))
combine_predictions returns the output of MeanAveragePrecision.compute() from torchmetrics, which includes map, map_50, map_75, and size-based variants.

Component summary

Expert models

Three YOLOS models fine-tuned on CSAW, DDSM, and DMID mammography datasets, each stored in ./Weights/.

ResNet-18 embeddings

512-dimensional image features extracted from a pretrained ResNet-18 with its classification head removed, used as calibrator input.

Calibrators

Per-expert RandomForestRegressor (300 trees) that maps (embedding, confidence) to predicted IoU. Saved as calibrator.pkl.

Soft-NMS + Score Voting

Gaussian Soft-NMS (sigma_nms=0.08, iou_nms=0.65) suppresses duplicates softly; Score Voting (sigma_sv=0.08) refines surviving box coordinates.
