MammoMix supports three publicly available mammography datasets: CSAW (a Swedish Karolinska screening cohort), DMID (Digital Mammography Image Database), and DDSM (Digital Database for Screening Mammography). Each dataset uses the same directory layout, annotation format, and single-class label scheme, so the same training code handles all three without modification. Dataset selection is controlled at runtime by a single --dataset flag.
Datasets
CSAW
CSAW is a large-scale screening mammography dataset collected at Karolinska University Hospital in Stockholm, Sweden. It contains full-field digital mammography (FFDM) images from routine screening examinations, annotated with bounding boxes around visible cancer regions. CSAW represents a real-world clinical distribution with a wide range of tissue densities.
DMID
DMID (Digital Mammography Image Database) is a curated research dataset of digitised mammograms with radiologist-verified bounding box annotations. It covers a variety of lesion types and imaging conditions typical of digital mammography systems.
DDSM
DDSM (Digital Database for Screening Mammography) is one of the oldest and most widely used mammography benchmarks, originally collected at Massachusetts General Hospital. It contains scanned film mammograms with detailed case-level and pixel-level annotations. MammoMix uses the bounding-box subset of DDSM annotations, converted to Pascal VOC XML format.
Directory structure
splitting.py converts each raw dataset into a standardised layout under splits_dir. The raw source is expected to have separate train/ and test/ subdirectories, each containing images/ and labels/ folders alongside a bbox_annotations.csv for cross-validation:
After running splitting.py, the processed output under splits_dir looks like this:
The val/ split is carved out of train/ automatically (default 20%) using a stratified random split with random_state=42, and the corresponding image and label files are physically moved to the val/ subdirectory.
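The carve-out described above can be sketched roughly as follows. This is an illustration using only the standard library, not the code in splitting.py. The helper names, the strata mapping (the attribute the split is stratified on is not stated here), and the .jpg image extension are all assumptions, and random.Random(42) stands in for whichever seeded splitter the script actually uses.

```python
import random
import shutil
from collections import defaultdict
from pathlib import Path


def pick_val_stems(strata, val_fraction=0.2, seed=42):
    """Choose validation samples per stratum.

    `strata` maps an image stem to a stratum key (hypothetical input).
    """
    groups = defaultdict(list)
    for stem in sorted(strata):
        groups[strata[stem]].append(stem)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    chosen = []
    for key in sorted(groups):
        stems = groups[key]
        rng.shuffle(stems)
        # Take the same fraction from every stratum.
        chosen.extend(stems[: round(len(stems) * val_fraction)])
    return sorted(chosen)


def move_to_val(dataset_dir, stems, image_ext=".jpg"):
    """Physically move image/label pairs from train/ to val/."""
    dataset_dir = Path(dataset_dir)
    for sub, ext in (("images", image_ext), ("labels", ".xml")):
        dest = dataset_dir / "val" / sub
        dest.mkdir(parents=True, exist_ok=True)
        for stem in stems:
            src = dataset_dir / "train" / sub / f"{stem}{ext}"
            shutil.move(str(src), str(dest / src.name))
```

Because the files are moved rather than copied, running such a step twice on the same folder would shrink train/ a second time, so it is best treated as a one-off preprocessing step.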
Split file format
Each .txt file lists one image path per line, relative to splits_dir:
BreastCancerDataset reads the appropriate .txt file and prefixes each line with splits_dir to obtain the absolute path:
loader.py
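A minimal sketch of this path-prefixing step is shown below. It is not the BreastCancerDataset implementation: the function name and the idea of passing the split file path explicitly are assumptions made for illustration.

```python
import os


def load_image_paths(splits_dir, split_file):
    """Return the image paths listed in a split .txt file, prefixed with splits_dir.

    The result is absolute only if splits_dir itself is absolute.
    """
    with open(split_file, encoding="utf-8") as fh:
        # Skip blank lines; every remaining line is a path relative to splits_dir.
        return [os.path.join(splits_dir, line.strip()) for line in fh if line.strip()]
```

Stripping each line guards against trailing newlines and stray whitespace, which would otherwise end up inside the joined path.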
Selecting a dataset for training
Pass --dataset to train.py or train_detrd.py to override the dataset name in the config file:
If --dataset is omitted, the value from the config file’s dataset.name key is used:
config_yolos.yaml
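The override-then-fallback behaviour can be sketched as follows. This is an assumption-laden illustration, not the argument handling in train.py: the function name is hypothetical, and the config is represented as an already-loaded dictionary that mirrors the dataset.name key rather than being read from YAML.

```python
import argparse


def resolve_dataset_name(argv, config):
    """Prefer --dataset from the command line, else fall back to config['dataset']['name']."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", default=None)
    # parse_known_args ignores any other training flags in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    if args.dataset is not None:
        return args.dataset
    return config["dataset"]["name"]


config = {"dataset": {"name": "CSAW"}}  # stand-in for the parsed config file
print(resolve_dataset_name(["--dataset", "DDSM"], config))
print(resolve_dataset_name([], config))
```

Using None as the default makes it possible to tell "flag not given" apart from any real dataset name.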
How dataset_name is used in BreastCancerDataset
BreastCancerDataset uses dataset_name to locate the correct split file and to build label paths from image paths. Labels are stored in a sibling labels/ directory alongside each images/ directory; the class maps the image path by replacing images with labels and swapping the extension to .xml:
loader.py
In __getitem__, the label path is derived automatically:
loader.py
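The image-to-label mapping described above can be illustrated with a short sketch. The function name is hypothetical, and the sketch deliberately swaps only the directory component named images, which may differ from how loader.py performs the replacement.

```python
from pathlib import PurePosixPath


def label_path_for(image_path):
    """Map .../images/<name>.<ext> to .../labels/<name>.xml."""
    path = PurePosixPath(image_path)
    # Replace only whole path components, never substrings of a filename.
    parts = ["labels" if part == "images" else part for part in path.parts]
    return str(PurePosixPath(*parts).with_suffix(".xml"))
```

Working on path components avoids one pitfall of a plain string replacement, which would also rewrite the substring inside a filename such as images_01.jpg.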
Annotation format
Annotations use Pascal VOC XML. Each.xml file contains the image dimensions and one <object> element per cancer region:
utils.parse_voc_xml reads these files, and utils.xml2dicts converts each bounding box to a dictionary with class_id=0 regardless of the object name in the XML:
utils.py
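As a rough illustration of the parsing and conversion steps, the sketch below reads a Pascal VOC annotation with the standard library and forces class_id=0 on every box. The function names, the sample annotation, and the bbox dictionary key are assumptions for illustration; they are not the signatures or output schema of utils.parse_voc_xml and utils.xml2dicts.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation following the standard Pascal VOC element names.
SAMPLE_XML = """<annotation>
  <size><width>1024</width><height>2048</height></size>
  <object>
    <name>mass</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>300</xmax><ymax>450</ymax></bndbox>
  </object>
</annotation>"""


def parse_boxes(xml_text):
    """Return ((width, height), [(xmin, ymin, xmax, ymax), ...])."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width = int(size.findtext("width"))
    height = int(size.findtext("height"))
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        boxes.append(tuple(
            int(float(bndbox.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        ))
    return (width, height), boxes


def boxes_to_dicts(boxes):
    """Assign class_id=0 to every box, ignoring the object name in the XML."""
    return [{"class_id": 0, "bbox": list(box)} for box in boxes]


size, boxes = parse_boxes(SAMPLE_XML)
print(size, boxes_to_dicts(boxes))
```

The object name in the sample is mass, yet the resulting dictionary still carries class_id=0, which mirrors the single-class behaviour described above.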
Merging datasets
momo.py provides a merge_datasets utility that copies all three datasets into a single output folder with dataset-prefixed filenames (e.g. CSAW_image_001.jpg) and merges the .txt split files. Use this to train a single model on the combined corpus:
The merged dataset is written under input_dir and can be passed to train.py with --dataset MammoMix_all.
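The general idea of copying files under dataset-prefixed names and producing one combined split list can be sketched as below. This is not the merge_datasets function from momo.py: the function name, the <dataset>/<split>/images and labels layout, and the <split>.txt output filename are assumptions, and the sketch rebuilds the split list from the copied files instead of concatenating existing .txt files.

```python
import shutil
from pathlib import Path


def merge_split(input_dir, dataset_names, split, out_name="MammoMix_all"):
    """Copy one split of several datasets into a combined folder with prefixed names."""
    input_dir = Path(input_dir)
    out_root = input_dir / out_name
    for sub in ("images", "labels"):
        (out_root / split / sub).mkdir(parents=True, exist_ok=True)

    lines = []
    for name in dataset_names:
        src = input_dir / name / split
        for image in sorted((src / "images").iterdir()):
            # Prefix with the dataset name so identical filenames cannot collide.
            new_stem = f"{name}_{image.stem}"
            shutil.copy2(image, out_root / split / "images" / f"{new_stem}{image.suffix}")
            label = src / "labels" / f"{image.stem}.xml"
            if label.exists():
                shutil.copy2(label, out_root / split / "labels" / f"{new_stem}.xml")
            lines.append(f"{out_name}/{split}/images/{new_stem}{image.suffix}")

    # One image path per line, relative to input_dir.
    (out_root / f"{split}.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Prefixing matters because separate datasets can easily reuse the same filename, and an unprefixed copy would silently overwrite one image with another.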
All three datasets use a single class: cancer with class_id=0. Multi-class detection is not supported. Every bounding box in every XML annotation is assigned class_id=0 by xml2dicts, irrespective of the object name stored in the XML file.