Supported datasets: CSAW, DMID, and DDSM

MammoMix supports three publicly available mammography datasets: CSAW (a Swedish Karolinska screening cohort), DMID (Digital Mammography Image Database), and DDSM (Digital Database for Screening Mammography). Each dataset uses the same directory layout, annotation format, and single-class label scheme, so the same training code handles all three without modification. Dataset selection is controlled at runtime by a single --dataset flag.

Datasets

CSAW

CSAW is a large-scale screening mammography dataset collected at Karolinska University Hospital in Stockholm, Sweden. It contains full-field digital mammography (FFDM) images from routine screening examinations, annotated with bounding boxes around visible cancer regions. CSAW represents a real-world clinical distribution with a wide range of tissue densities.

DMID

DMID (Digital Mammography Image Database) is a curated research dataset of digitised mammograms with radiologist-verified bounding box annotations. It covers a variety of lesion types and imaging conditions typical of digital mammography systems.

DDSM

DDSM (Digital Database for Screening Mammography) is one of the oldest and most widely used mammography benchmarks, assembled through a collaboration between Massachusetts General Hospital, the University of South Florida, and Sandia National Laboratories. It contains scanned film mammograms with detailed case-level and pixel-level annotations. MammoMix uses the bounding-box subset of DDSM annotations, converted to Pascal VOC XML format.

Directory structure

splitting.py converts each raw dataset into a standardised layout under splits_dir. Each raw dataset folder is expected to have separate train/ and test/ subdirectories, each containing images/ and labels/ folders, plus a dataset-level bbox_annotations.csv used for cross-validation:

raw_data_dir/
  CSAW/
    train/
      images/
      labels/     (Pascal VOC XML files)
    test/
      images/
      labels/
    bbox_annotations.csv
  DMID/
    ...
  DDSM/
    ...

After running splitting.py, the processed output under splits_dir looks like this:

splits_dir/
  CSAW/
    train/
      images/
      labels/
    val/
      images/
      labels/
    test/
      images/
      labels/
    train.txt
    val.txt
    test.txt
  DMID/
    ...
  DDSM/
    ...

The val/ split is carved out of train/ automatically (default 20%) using a stratified random split with random_state=42, and the corresponding image and label files are physically moved to the val/ subdirectory.
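
The carve-out described above can be sketched with the standard library alone. This is a minimal illustration, not the code in splitting.py: the function name stratified_val_split is hypothetical, and the stratum key is an assumption, since the text does not say which attribute the split is stratified on. Only the 20% fraction and the seed of 42 come from the description.

```python
import random
from collections import defaultdict


def stratified_val_split(items, strata, val_fraction=0.2, seed=42):
    """Split items into (train, val), sampling val_fraction from each stratum.

    `strata` is a hypothetical per-item grouping key (for example, a lesion
    type); the key actually used by splitting.py is not documented here.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    groups = defaultdict(list)
    for item, stratum in zip(items, strata):
        groups[stratum].append(item)

    train, val = [], []
    for stratum in sorted(groups):  # sorted for a deterministic visiting order
        members = groups[stratum][:]
        rng.shuffle(members)
        n_val = round(len(members) * val_fraction)
        val.extend(members[:n_val])
        train.extend(members[n_val:])
    return train, val


items = [f"img_{i:03d}.jpg" for i in range(100)]
strata = ["pos" if i < 40 else "neg" for i in range(100)]
train, val = stratified_val_split(items, strata)
print(len(train), len(val))
```

Sampling within each stratum keeps the proportion of each group roughly equal in train/ and val/, which a plain random split does not guarantee on small datasets.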

Split file format

Each .txt file lists one image path per line, relative to splits_dir:

CSAW/train/images/image_001.jpg
CSAW/train/images/image_002.jpg
...

BreastCancerDataset reads the appropriate .txt file and joins each line onto splits_dir to obtain the full image path (absolute only if splits_dir itself is absolute):

loader.py

split_file = os.path.join(splits_dir, dataset_name, f'{split}.txt')
with open(split_file, 'r') as f:
    self.image_paths = [os.path.join(splits_dir, line.strip()) for line in f.readlines()]
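
One detail of the snippet above is that a blank line in the .txt file (for example, a trailing empty line) would yield a path pointing at splits_dir itself. A slightly more defensive variant, offered as a sketch rather than as the repository's code, skips empty lines. The helper name read_split_paths is hypothetical.

```python
import os


def read_split_paths(splits_dir, dataset_name, split):
    """Return the image paths for one split, ignoring blank lines."""
    split_file = os.path.join(splits_dir, dataset_name, f"{split}.txt")
    with open(split_file, "r") as f:
        # Entries are relative to splits_dir, so join each one onto it.
        return [
            os.path.join(splits_dir, line.strip())
            for line in f
            if line.strip()
        ]
```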

Selecting a dataset for training

Pass --dataset to train.py or train_detrd.py to override the dataset name in the config file:

# Train YOLOS on CSAW
python train.py --config configs/config_yolos.yaml --dataset CSAW

# Train YOLOS on DDSM
python train.py --config configs/config_yolos.yaml --dataset DDSM

# Train Deformable DETR on DMID
python train_detrd.py --config configs/config_d_detr.yaml --dataset DMID

If --dataset is omitted, the value from the config file’s dataset.name key is used:

config_yolos.yaml

dataset:
  name: CSAW
  splits_dir: ../dataset
  max_size: 640

How dataset_name is used in BreastCancerDataset

BreastCancerDataset uses dataset_name to locate the correct split file. Label paths are then derived from the image paths: labels are stored in a labels/ directory that is a sibling of each images/ directory, so the class replaces images with labels in the image path and changes the extension to .xml:

loader.py

class BreastCancerDataset(Dataset):
    def __init__(self, split, splits_dir, dataset_name, image_processor):
        self.split = split
        self.splits_dir = splits_dir
        self.dataset_name = dataset_name

        split_file = os.path.join(splits_dir, dataset_name, f'{split}.txt')
        with open(split_file, 'r') as f:
            self.image_paths = [os.path.join(splits_dir, line.strip()) for line in f.readlines()]

Inside __getitem__, the label path is derived automatically:

loader.py

label_path = base.replace('images', 'labels') + '.xml'
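
The definition of base is not shown in the line above; it is presumably the image path with its extension stripped. Because str.replace substitutes every occurrence of 'images', a splits_dir or filename that happens to contain that word would be rewritten as well. The following sketch, which is an alternative and not the code in loader.py, swaps only the immediate parent directory. The helper name label_path_for is hypothetical.

```python
from pathlib import Path


def label_path_for(image_path):
    """Map .../images/<name>.<ext> to .../labels/<name>.xml."""
    image_path = Path(image_path)
    if image_path.parent.name != "images":
        raise ValueError(f"expected an images/ directory, got: {image_path}")
    # Swap only the immediate parent directory, then change the extension.
    return image_path.parent.with_name("labels") / (image_path.stem + ".xml")


print(label_path_for("CSAW/train/images/image_001.jpg"))
```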

Annotation format

Annotations use Pascal VOC XML. Each .xml file contains the image dimensions and one <object> element per cancer region:

<annotation>
  <filename>image_001.jpg</filename>
  <size>
    <width>2048</width>
    <height>2560</height>
    <depth>3</depth>
  </size>
  <object>
    <name>cancer</name>
    <bndbox>
      <xmin>512</xmin>
      <ymin>768</ymin>
      <xmax>640</xmax>
      <ymax>896</ymax>
    </bndbox>
  </object>
</annotation>

utils.parse_voc_xml reads these files, and utils.xml2dicts converts each bounding box to a dictionary with class_id=0 regardless of the object name in the XML:

utils.py

def xml2dicts(bboxes, width, height):
    detr_bboxes = []
    for bbox in bboxes:
        class_id = 0  # Single class 'cancer'
        detr_bboxes.append({
            'class_id': class_id,
            'xmin': bbox['xmin'],
            'ymin': bbox['ymin'],
            'xmax': bbox['xmax'],
            'ymax': bbox['ymax']
        })
    return detr_bboxes
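
To make the end-to-end mapping concrete, here is a minimal sketch that reads a Pascal VOC annotation with the standard library and emits dictionaries in the same shape as xml2dicts. It is not the implementation of utils.parse_voc_xml, whose signature is not shown here; the function name parse_voc_boxes and the sample annotation are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<annotation>
  <filename>image_001.jpg</filename>
  <size><width>2048</width><height>2560</height><depth>3</depth></size>
  <object>
    <name>mass</name>
    <bndbox><xmin>512</xmin><ymin>768</ymin><xmax>640</xmax><ymax>896</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc_boxes(xml_text):
    """Return (width, height, boxes) from a Pascal VOC annotation string."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width = int(size.findtext("width"))
    height = int(size.findtext("height"))
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        box = {"class_id": 0}  # single class, whatever <name> says
        for key in ("xmin", "ymin", "xmax", "ymax"):
            # int(float(...)) tolerates coordinates written as "512.0".
            box[key] = int(float(bndbox.findtext(key)))
        boxes.append(box)
    return width, height, boxes


print(parse_voc_boxes(SAMPLE))
```

The sample deliberately uses an object name other than cancer to show that the name does not affect the resulting class_id.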

Merging datasets

momo.py provides a merge_datasets utility that copies all three datasets into a single output folder with dataset-prefixed filenames (e.g. CSAW_image_001.jpg) and merges the .txt split files. Use this to train a single model on the combined corpus:

python momo.py --input_dir /path/to/splits --name MammoMix_all

The merged dataset folder appears alongside the individual dataset folders under input_dir and can be passed to train.py with --dataset MammoMix_all.
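
The renaming step can be pictured with the sketch below, which rewrites split-file entries for a merged dataset. It is an illustration only, not the code in momo.py: the helper names are hypothetical, and it assumes the merged folder mirrors the per-dataset layout (split/images/file), which the text implies but does not spell out.

```python
from pathlib import PurePosixPath


def merged_entry(entry, merged_name):
    """Rewrite '<dataset>/<split>/images/<file>' for the merged dataset."""
    dataset, split, kind, filename = PurePosixPath(entry).parts
    # Prefix the filename with its source dataset to avoid name collisions.
    return str(PurePosixPath(merged_name, split, kind, f"{dataset}_{filename}"))


def merge_split_lists(entries_by_dataset, merged_name):
    """Concatenate per-dataset split entries into one merged list."""
    merged = []
    for dataset in sorted(entries_by_dataset):
        merged.extend(
            merged_entry(entry, merged_name)
            for entry in entries_by_dataset[dataset]
        )
    return merged


print(merged_entry("CSAW/train/images/image_001.jpg", "MammoMix_all"))
```

Prefixing matters because two source datasets may both contain a file such as image_001.jpg; without the prefix, one copy would overwrite the other in the merged images/ folder.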

All three datasets use a single class: cancer with class_id=0. Multi-class detection is not supported. Every bounding box in every XML annotation is assigned class_id=0 by xml2dicts, irrespective of the object name stored in the XML file.
