MammoMix is a two-stage breast cancer detection system built on transformer-based object detectors. It trains separate YOLOS and Deformable DETR models on three mammography datasets — CSAW, DMID, and DDSM — then combines their predictions at inference time using MoCaE (Mixture of Calibrated Experts), a post-processing ensemble that applies score calibration, Soft-NMS, and Score Voting to produce a single refined set of detections per image.
Detection pipeline
Each model follows the same end-to-end pipeline from raw image to evaluation metric:
Raw mammography image
DICOM-derived images are stored as JPEG or PNG files. Each image has a paired Pascal VOC XML annotation file containing the bounding box coordinates of any cancer region.
Augmentation
During training, the BreastCancerDataset.__getitem__ method applies an albumentations pipeline (elastic deformation, perspective distortion, flips, noise, and blur). Validation and test images use A.NoOp(), i.e. no augmentation. If augmentation removes all bounding boxes, the sample is retried automatically.
Image processor
AutoImageProcessor resizes and pads the image to a fixed square (max_size × max_size) and normalises pixel values. The processor is loaded from the HuggingFace model hub and is the same object used during training and inference.
utils.py
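The retry-on-empty-boxes behaviour described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the `transform` callable stands in for the albumentations pipeline, and the function name and retry cap are assumptions.

```python
def apply_augmentation(image, boxes, transform, max_retries=10):
    """Apply an albumentations-style pipeline, retrying when augmentation
    drops every bounding box. `transform` must return a dict with 'image'
    and 'bboxes' keys; the retry cap of 10 is an illustrative assumption."""
    for _ in range(max_retries):
        out = transform(image=image, bboxes=boxes)
        if out["bboxes"]:              # at least one box survived
            return out["image"], out["bboxes"]
    return image, boxes                # fall back to the unaugmented sample
```

The fallback on the final line mirrors the documented behaviour: a training sample is never silently left without its cancer box.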
Model forward pass
The processed batch is passed to AutoModelForObjectDetection. Both YOLOS and Deformable DETR produce a set of predicted boxes and logits. The single detection class is cancer (id=0).
Post-processing
image_processor.post_process_object_detection converts raw logits and normalised box coordinates into absolute-pixel boxes filtered by a confidence threshold (default 0.5).
YOLOS
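The conversion that post_process_object_detection performs can be sketched in NumPy for the single-class case. This is a simplified illustration: it uses sigmoid scoring (as Deformable DETR does), whereas YOLOS-style post-processing applies a softmax over classes plus a no-object class.

```python
import numpy as np

def postprocess(logits, boxes_cxcywh, img_w, img_h, threshold=0.5):
    """Turn raw single-class logits and normalised (cx, cy, w, h) boxes
    into absolute-pixel (x0, y0, x1, y1) boxes above `threshold`.
    Sigmoid scoring mirrors Deformable DETR; this is a sketch, not the
    HuggingFace implementation."""
    scores = 1.0 / (1.0 + np.exp(-logits))            # sigmoid confidence
    keep = scores >= threshold
    cx, cy, w, h = boxes_cxcywh[keep].T
    x0 = (cx - w / 2) * img_w                          # rescale to pixels
    y0 = (cy - h / 2) * img_h
    x1 = (cx + w / 2) * img_w
    y1 = (cy + h / 2) * img_h
    return scores[keep], np.stack([x0, y0, x1, y1], axis=1)
```

A centred box (0.5, 0.5, 0.2, 0.2) on a 100 × 200 image maps to the pixel box (40, 80, 60, 120).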
YOLOS (You Only Look One-level Series) reformulates object detection as a sequence-to-sequence task on top of a Vision Transformer. MammoMix uses the hustvl/yolos-base checkpoint from HuggingFace, loaded via AutoModelForObjectDetection:
train.py
The model is fine-tuned for a single class (cancer, id=0) and runs at a maximum resolution of 640 × 640. Training uses the HuggingFace Trainer API with a cosine_with_restarts scheduler, gradient accumulation of 2 steps (effective batch size 16), and fp16 mixed precision when a GPU is available.
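The shape of the cosine_with_restarts schedule can be illustrated with a small standalone function. This reproduces only the schedule's shape (hard restarts, no warmup) as an assumption, not HuggingFace's exact implementation.

```python
import math

def cosine_with_restarts(step, total_steps, num_cycles=1, base_lr=1e-4):
    """Learning rate for a hard-restart cosine schedule (warmup omitted).
    The LR decays from base_lr to 0 within each cycle, then restarts."""
    progress = step / max(1, total_steps)
    if progress >= 1.0:
        return 0.0
    cycle_pos = (progress * num_cycles) % 1.0          # position inside cycle
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * cycle_pos))
```

With two cycles over 100 steps, the LR returns to its full value at step 50, giving the periodic "restart" that distinguishes this schedule from plain cosine decay.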
Deformable DETR
Deformable DETR extends DETR with multi-scale deformable attention, which reduces the quadratic complexity of standard attention and improves detection of small objects. MammoMix uses the SenseTime/deformable-detr checkpoint at a maximum resolution of 800 × 800.
Because Deformable DETR is more memory-intensive than YOLOS, train_detrd.py hard-codes a physical batch size of 1 with gradient accumulation of 32, producing an effective batch size of 32 while keeping GPU memory usage manageable:
train_detrd.py
The script also clips gradients (max_grad_norm=5.0) and uses a simpler cosine scheduler (no restarts) to improve training stability.
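The gradient-accumulation trick above can be sketched with a scalar parameter: each micro-batch gradient is scaled by 1/accum_steps, and the optimiser steps only once per accumulation window, so a physical batch of 1 behaves like a batch of 32. This is a conceptual sketch, not the Trainer's internals.

```python
def accumulated_updates(grads, accum_steps=32, lr=5e-4, param=0.0):
    """Apply one SGD step per `accum_steps` micro-batch gradients,
    averaging them so the update matches a single large-batch step."""
    buffer, updates = 0.0, []
    for i, g in enumerate(grads, start=1):
        buffer += g / accum_steps           # scale each micro-batch gradient
        if i % accum_steps == 0:
            param -= lr * buffer            # one step on the averaged gradient
            updates.append(param)
            buffer = 0.0
    return param, updates
```

Sixty-four micro-batches with accum_steps=32 therefore produce exactly two optimiser steps, each equivalent to one step on the mean gradient of 32 samples.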
Model comparison
| Property | YOLOS | Deformable DETR |
|---|---|---|
| HuggingFace ID | hustvl/yolos-base | SenseTime/deformable-detr |
| Max input size | 640 × 640 | 800 × 800 |
| Physical batch size | 8 | 1 |
| Gradient accumulation | 2 | 32 |
| Effective batch size | 16 | 32 |
| Learning rate | 1e-4 | 5e-4 |
| Weight decay | 5e-4 | 1e-5 |
| LR scheduler | cosine_with_restarts | cosine |
| fp16 | Yes (when GPU available) | No |
| Num object queries | Default (100) | 300 |
| Config file | configs/config_yolos.yaml | configs/config_d_detr.yaml |
MoCaE ensemble
MoCaE (Mixture of Calibrated Experts) combines the predictions of all three per-dataset YOLOS models at inference time. It has three components:
ResNet-18 feature extractor
A pretrained ResNet-18 with its classification head replaced by an identity layer extracts a 512-dimensional image embedding for each image in the batch. These embeddings capture visual context independently of the detector output.
mocae.py
Score calibrator
A RandomForestRegressor (300 trees) is trained to predict the IoU between a predicted box and the nearest ground-truth box. The input to the calibrator is the concatenation of the 512-dim image embedding and the raw detector confidence score (513 features total). At inference time, the calibrator replaces the raw confidence with a calibrated score that reflects predicted localisation quality.
mocae.py
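The calibrator's feature construction and fit/predict cycle can be sketched with scikit-learn. The function names are illustrative; only the feature layout (512-dim embedding concatenated with the raw score, 513 features total) and the forest size follow the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_calibrator(embeddings, raw_scores, target_ious, n_trees=300):
    """Fit a random-forest IoU regressor on [embedding | raw_score]
    features (512 + 1 = 513 dims per detection)."""
    X = np.hstack([embeddings, raw_scores[:, None]])
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    model.fit(X, target_ious)
    return model

def calibrate(model, embedding, raw_score):
    """Replace a raw detector confidence with the predicted IoU."""
    x = np.concatenate([embedding, [raw_score]])[None, :]
    return float(model.predict(x)[0])
```

Because the forest averages IoU targets in [0, 1], the calibrated score is itself a value in [0, 1] that tracks expected localisation quality rather than raw classifier confidence.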
Soft-NMS and Score Voting
Calibrated detections from all experts are merged with Gaussian Soft-NMS (sigma=0.08, iou_thresh=0.65). Score Voting then refines each surviving box by computing a weighted average of nearby boxes, where the weight is the product of the calibrated score and a Gaussian IoU similarity, with self-influence removed.
mocae.py
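The two refinement steps can be sketched in NumPy. This is an illustration of the technique, not the repository's mocae.py: the final score threshold in Soft-NMS and the voting IoU cutoff are assumptions, while the Gaussian decay, the score-times-similarity weights, and the zeroed self-weight follow the description above.

```python
import numpy as np

def iou_matrix(boxes):
    """Pairwise IoU for an (N, 4) array of (x0, y0, x1, y1) boxes."""
    x0 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y0 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x1 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y1 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area[:, None] + area[None, :] - inter)

def gaussian_soft_nms(boxes, scores, sigma=0.08, score_thresh=0.001):
    """Gaussian Soft-NMS: repeatedly take the highest-scoring box and
    decay overlapping boxes by exp(-IoU^2 / sigma). The stopping
    threshold is an illustrative assumption."""
    ious = iou_matrix(boxes)
    s = scores.astype(float)
    keep, active = [], list(range(len(s)))
    while active:
        i = max(active, key=lambda k: s[k])
        if s[i] < score_thresh:
            break
        keep.append(i)
        active.remove(i)
        for j in active:
            s[j] *= np.exp(-(ious[i, j] ** 2) / sigma)
    return keep, s

def score_vote(boxes, scores, keep, sigma=0.08, vote_iou=0.65):
    """Refine each kept box as a weighted average of overlapping boxes,
    weighting by calibrated score times a Gaussian IoU similarity and
    zeroing the box's own weight (self-influence removed)."""
    ious = iou_matrix(boxes)
    refined = []
    for i in keep:
        w = np.where(ious[i] > vote_iou,
                     scores * np.exp(-((1.0 - ious[i]) ** 2) / sigma), 0.0)
        w[i] = 0.0                         # remove self-influence
        refined.append((w[:, None] * boxes).sum(0) / w.sum()
                       if w.sum() > 0 else boxes[i])
    return np.array(refined)
```

Soft-NMS decays rather than discards overlapping detections, so heavily overlapped duplicates survive only with near-zero scores; Score Voting then nudges each kept box toward the consensus of its high-IoU neighbours.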
Related pages
Training YOLOS
Run YOLOS training with train.py and config_yolos.yaml.
Training Deformable DETR
Run Deformable DETR training with train_detrd.py and config_d_detr.yaml.
MoCaE ensemble inference
Combine per-dataset experts using score calibration and Soft-NMS.
Evaluation and metrics
Compute mAP, mAP@50, and mAP@75 on test splits.