MammoMix is a two-stage breast cancer detection system built on transformer-based object detectors. It trains separate YOLOS and Deformable DETR models on three mammography datasets — CSAW, DMID, and DDSM — then combines their predictions at inference time using MoCaE (Mixture of Calibrated Experts), a post-processing ensemble that applies score calibration, Soft-NMS, and Score Voting to produce a single refined set of detections per image.
Detection pipeline
Each model follows the same end-to-end pipeline from raw image to evaluation metric:
Raw mammography image
DICOM-derived images are stored as JPEG or PNG files. Each image has a paired Pascal VOC XML annotation file containing the bounding box coordinates of any cancer region.
Augmentation
During training, the BreastCancerDataset.__getitem__ method applies an albumentations pipeline (elastic deformation, perspective distortion, flips, noise, and blur). Validation and test images use A.NoOp(), i.e. no augmentation. If augmentation removes all bounding boxes, the sample is retried automatically.
Image processor
AutoImageProcessor resizes and pads the image to a fixed square (max_size × max_size) and normalises pixel values. The processor is loaded from the HuggingFace model hub and is the same object used during training and inference.
utils.py
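The retry-on-empty-boxes behaviour described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the `transform` callable stands in for the albumentations pipeline, and the function name and retry cap are assumptions.

```python
def apply_augmentation(image, boxes, transform, max_retries=10):
    """Apply an albumentations-style pipeline, retrying when augmentation
    drops every bounding box. `transform` must return a dict with 'image'
    and 'bboxes' keys; the retry cap of 10 is an illustrative assumption."""
    for _ in range(max_retries):
        out = transform(image=image, bboxes=boxes)
        if out["bboxes"]:              # at least one box survived
            return out["image"], out["bboxes"]
    return image, boxes                # fall back to the unaugmented sample
```

The fallback on the final line mirrors the documented behaviour: a training sample is never silently left without its cancer box.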
Model forward pass
The processed batch is passed to AutoModelForObjectDetection. Both YOLOS and Deformable DETR produce a set of predicted boxes and logits. The single detection class is cancer (id=0).
Post-processing
image_processor.post_process_object_detection converts raw logits and normalised box coordinates into absolute-pixel boxes filtered by a confidence threshold (default 0.5).
YOLOS
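The conversion that post_process_object_detection performs can be sketched in NumPy for the single-class case. This is a simplified illustration: it uses sigmoid scoring (as Deformable DETR does), whereas YOLOS-style post-processing applies a softmax over classes plus a no-object class.

```python
import numpy as np

def postprocess(logits, boxes_cxcywh, img_w, img_h, threshold=0.5):
    """Turn raw single-class logits and normalised (cx, cy, w, h) boxes
    into absolute-pixel (x0, y0, x1, y1) boxes above `threshold`.
    Sigmoid scoring mirrors Deformable DETR; this is a sketch, not the
    HuggingFace implementation."""
    scores = 1.0 / (1.0 + np.exp(-logits))            # sigmoid confidence
    keep = scores >= threshold
    cx, cy, w, h = boxes_cxcywh[keep].T
    x0 = (cx - w / 2) * img_w                          # rescale to pixels
    y0 = (cy - h / 2) * img_h
    x1 = (cx + w / 2) * img_w
    y1 = (cy + h / 2) * img_h
    return scores[keep], np.stack([x0, y0, x1, y1], axis=1)
```

A centred box (0.5, 0.5, 0.2, 0.2) on a 100 × 200 image maps to the pixel box (40, 80, 60, 120).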
YOLOS (You Only Look One-level Series) reformulates object detection as a sequence-to-sequence task on top of a Vision Transformer. MammoMix uses the hustvl/yolos-base checkpoint from HuggingFace, loaded via AutoModelForObjectDetection:
train.py
The model is fine-tuned for a single class (cancer, id=0) and runs at a maximum resolution of 640 × 640. Training uses the HuggingFace Trainer API with a cosine_with_restarts scheduler, gradient accumulation of 2 steps (effective batch size 16), and fp16 mixed precision when a GPU is available.
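The shape of the cosine_with_restarts schedule can be illustrated with a small standalone function. This reproduces only the schedule's shape (hard restarts, no warmup) as an assumption, not HuggingFace's exact implementation.

```python
import math

def cosine_with_restarts(step, total_steps, num_cycles=1, base_lr=1e-4):
    """Learning rate for a hard-restart cosine schedule (warmup omitted).
    The LR decays from base_lr to 0 within each cycle, then restarts."""
    progress = step / max(1, total_steps)
    if progress >= 1.0:
        return 0.0
    cycle_pos = (progress * num_cycles) % 1.0          # position inside cycle
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * cycle_pos))
```

With two cycles over 100 steps, the LR returns to its full value at step 50, giving the periodic "restart" that distinguishes this schedule from plain cosine decay.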
Deformable DETR
Deformable DETR extends DETR with multi-scale deformable attention, which reduces the quadratic complexity of standard attention and improves detection of small objects. MammoMix uses the SenseTime/deformable-detr checkpoint at a maximum resolution of 800 × 800.
Because Deformable DETR is more memory-intensive than YOLOS, train_detrd.py hard-codes a physical batch size of 1 with gradient accumulation of 32, producing an effective batch size of 32 while keeping GPU memory usage manageable:
train_detrd.py
The script also clips gradients (max_grad_norm=5.0) and uses a simpler cosine scheduler (no restarts) to improve training stability.
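The gradient-accumulation trick above can be sketched with a scalar parameter: each micro-batch gradient is scaled by 1/accum_steps, and the optimiser steps only once per accumulation window, so a physical batch of 1 behaves like a batch of 32. This is a conceptual sketch, not the Trainer's internals.

```python
def accumulated_updates(grads, accum_steps=32, lr=5e-4, param=0.0):
    """Apply one SGD step per `accum_steps` micro-batch gradients,
    averaging them so the update matches a single large-batch step."""
    buffer, updates = 0.0, []
    for i, g in enumerate(grads, start=1):
        buffer += g / accum_steps           # scale each micro-batch gradient
        if i % accum_steps == 0:
            param -= lr * buffer            # one step on the averaged gradient
            updates.append(param)
            buffer = 0.0
    return param, updates
```

Sixty-four micro-batches with accum_steps=32 therefore produce exactly two optimiser steps, each equivalent to one step on the mean gradient of 32 samples.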
Model comparison
| Property | YOLOS | Deformable DETR |
|---|---|---|
| HuggingFace ID | hustvl/yolos-base | SenseTime/deformable-detr |
| Max input size | 640 × 640 | 800 × 800 |
| Physical batch size | 8 | 1 |
| Gradient accumulation | 2 | 32 |
| Effective batch size | 16 | 32 |
| Learning rate | 1e-4 | 5e-4 |
| Weight decay | 5e-4 | 1e-5 |
| LR scheduler | cosine_with_restarts | cosine |
| fp16 | Yes (when GPU available) | No |
| Num object queries | Default (100) | 300 |
| Config file | configs/config_yolos.yaml | configs/config_d_detr.yaml |
MoCaE ensemble
MoCaE (Mixture of Calibrated Experts) combines the predictions of all three per-dataset YOLOS models at inference time. It has three components:
ResNet-18 feature extractor
A pretrained ResNet-18 with its classification head replaced by an identity layer extracts a 512-dimensional image embedding for each image in the batch. These embeddings capture visual context independently of the detector output.
mocae.py
Score calibrator
A RandomForestRegressor (300 trees) is trained to predict the IoU between a predicted box and the nearest ground-truth box. The input to the calibrator is the concatenation of the 512-dim image embedding and the raw detector confidence score (513 features total). At inference time, the calibrator replaces the raw confidence with a calibrated score that reflects predicted localisation quality.
mocae.py
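The calibrator's feature construction and fit/predict cycle can be sketched with scikit-learn. The function names are illustrative; only the feature layout (512-dim embedding concatenated with the raw score, 513 features total) and the forest size follow the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_calibrator(embeddings, raw_scores, target_ious, n_trees=300):
    """Fit a random-forest IoU regressor on [embedding | raw_score]
    features (512 + 1 = 513 dims per detection)."""
    X = np.hstack([embeddings, raw_scores[:, None]])
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    model.fit(X, target_ious)
    return model

def calibrate(model, embedding, raw_score):
    """Replace a raw detector confidence with the predicted IoU."""
    x = np.concatenate([embedding, [raw_score]])[None, :]
    return float(model.predict(x)[0])
```

Because the forest averages IoU targets in [0, 1], the calibrated score is itself a value in [0, 1] that tracks expected localisation quality rather than raw classifier confidence.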
Soft-NMS and Score Voting
Calibrated detections from all experts are merged with Gaussian Soft-NMS (sigma=0.08, iou_thresh=0.65). Score Voting then refines each surviving box by computing a weighted average of nearby boxes, where the weight is the product of the calibrated score and a Gaussian IoU similarity, with self-influence removed.
mocae.py
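The two refinement steps can be sketched in NumPy. This is an illustration of the technique, not the repository's mocae.py: the final score threshold in Soft-NMS and the voting IoU cutoff are assumptions, while the Gaussian decay, the score-times-similarity weights, and the zeroed self-weight follow the description above.

```python
import numpy as np

def iou_matrix(boxes):
    """Pairwise IoU for an (N, 4) array of (x0, y0, x1, y1) boxes."""
    x0 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y0 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x1 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y1 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area[:, None] + area[None, :] - inter)

def gaussian_soft_nms(boxes, scores, sigma=0.08, score_thresh=0.001):
    """Gaussian Soft-NMS: repeatedly take the highest-scoring box and
    decay overlapping boxes by exp(-IoU^2 / sigma). The stopping
    threshold is an illustrative assumption."""
    ious = iou_matrix(boxes)
    s = scores.astype(float)
    keep, active = [], list(range(len(s)))
    while active:
        i = max(active, key=lambda k: s[k])
        if s[i] < score_thresh:
            break
        keep.append(i)
        active.remove(i)
        for j in active:
            s[j] *= np.exp(-(ious[i, j] ** 2) / sigma)
    return keep, s

def score_vote(boxes, scores, keep, sigma=0.08, vote_iou=0.65):
    """Refine each kept box as a weighted average of overlapping boxes,
    weighting by calibrated score times a Gaussian IoU similarity and
    zeroing the box's own weight (self-influence removed)."""
    ious = iou_matrix(boxes)
    refined = []
    for i in keep:
        w = np.where(ious[i] > vote_iou,
                     scores * np.exp(-((1.0 - ious[i]) ** 2) / sigma), 0.0)
        w[i] = 0.0                         # remove self-influence
        refined.append((w[:, None] * boxes).sum(0) / w.sum()
                       if w.sum() > 0 else boxes[i])
    return np.array(refined)
```

Soft-NMS decays rather than discards overlapping detections, so heavily overlapped duplicates survive only with near-zero scores; Score Voting then nudges each kept box toward the consensus of its high-IoU neighbours.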
Related pages
Training YOLOS
Run YOLOS training with train.py and config_yolos.yaml.
Training Deformable DETR
Run Deformable DETR training with train_detrd.py and config_d_detr.yaml.
MoCaE ensemble inference
Combine per-dataset experts using score calibration and Soft-NMS.
Evaluation and metrics
Compute mAP, mAP@50, and mAP@75 on test splits.