Merge multiple mammography datasets with momo.py

Training on a single mammography dataset risks producing models that overfit to that dataset’s imaging protocol, scanner characteristics, and patient demographics. momo.py merges CSAW, DMID, and DDSM into one unified dataset so that a model can learn features that generalise across acquisition conditions.

`merge_datasets` function

momo.py

def merge_datasets(input_dir, output_name):
    datasets = ['CSAW', 'DMID', 'DDSM']
    ...

Parameter	Type	Description
`input_dir`	`str`	Parent directory that contains the `CSAW`, `DMID`, and `DDSM` split folders (output of `splitting.py`).
`output_name`	`str`	Name of the merged output folder, created inside `input_dir`.

The list of datasets (['CSAW', 'DMID', 'DDSM']) is hardcoded inside merge_datasets. Any dataset folder not present in input_dir is skipped with a warning; it does not cause an error.
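The skip-on-missing behaviour can be sketched as follows. This is a minimal illustration, not the code in momo.py: the helper name `find_available_datasets` is hypothetical, and using `warnings.warn` is an assumption, since the script may just as well print its warning.

```python
import os
import tempfile
import warnings

DATASETS = ['CSAW', 'DMID', 'DDSM']  # same hardcoded list as in merge_datasets


def find_available_datasets(input_dir, datasets=DATASETS):
    """Return the dataset names whose folder exists under input_dir.

    Hypothetical helper: a missing folder triggers a warning, not an error.
    """
    available = []
    for dataset in datasets:
        dataset_dir = os.path.join(input_dir, dataset)
        if not os.path.isdir(dataset_dir):
            warnings.warn(f"{dataset_dir} not found, skipping {dataset}")
            continue
        available.append(dataset)
    return available


# Demo in a throwaway directory that contains only CSAW and DDSM.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'CSAW'))
    os.makedirs(os.path.join(tmp, 'DDSM'))
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        found = find_available_datasets(tmp)
    print(found)        # ['CSAW', 'DDSM']
    print(len(caught))  # 1 (the missing DMID folder)
```

Because the list is hardcoded, merging a different set of datasets would require editing the list in momo.py itself.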

CLI usage

python momo.py --input_dir /path/to/splits --name MammoMix_merged

Flag	Required	Description
`--input_dir`	Yes	Path to the directory containing the individual dataset split folders.
`--name`	Yes	Name for the merged output folder.

With the command above, the merged dataset is written to /path/to/splits/MammoMix_merged/, that is, inside input_dir alongside the source dataset folders.
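A command-line interface with these two required flags could be wired up roughly as below. Only the flag names `--input_dir` and `--name` come from this page; the use of `argparse`, the `build_parser` helper, and the help strings are assumptions made for illustration.

```python
import argparse
import os


def build_parser():
    """Build a parser with the two required flags documented above."""
    parser = argparse.ArgumentParser(
        description="Merge per-dataset mammography splits into one dataset")
    parser.add_argument('--input_dir', required=True,
                        help="Directory containing the dataset split folders")
    parser.add_argument('--name', required=True,
                        help="Name for the merged output folder")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ['--input_dir', '/path/to/splits', '--name', 'MammoMix_merged'])

# The output folder is created inside input_dir.
output_root = os.path.join(args.input_dir, args.name)
print(output_root)
```

In a real script, `parse_args()` would be called without arguments so that it reads `sys.argv`, and the two values would then be passed to `merge_datasets(args.input_dir, args.name)`.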

What it does

Create output directory structure

Directories for train, val, and test splits are created under input_dir/output_name/, each containing images/ and labels/ subdirectories.

input_dir/
└── MammoMix_merged/
    ├── train/
    │   ├── images/
    │   └── labels/
    ├── val/
    │   ├── images/
    │   └── labels/
    └── test/
        ├── images/
        └── labels/
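The tree above can be created with a pair of nested loops. The following is a sketch under the assumption that `os.makedirs` is used; the helper name `create_output_dirs` is hypothetical.

```python
import os
import tempfile


def create_output_dirs(input_dir, output_name):
    """Create <input_dir>/<output_name>/<split>/{images,labels} for each split."""
    output_root = os.path.join(input_dir, output_name)
    for split in ('train', 'val', 'test'):
        for sub in ('images', 'labels'):
            # exist_ok=True makes the call safe to repeat on an existing tree.
            os.makedirs(os.path.join(output_root, split, sub), exist_ok=True)
    return output_root


# Demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = create_output_dirs(tmp, 'MammoMix_merged')
    created = sorted(os.listdir(root))
    train_subdirs = sorted(os.listdir(os.path.join(root, 'train')))
    print(created)        # ['test', 'train', 'val']
    print(train_subdirs)  # ['images', 'labels']
```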

Copy files with dataset prefix

Every image and label file is copied to the merged split directory with its source dataset name prepended to the filename, preventing collisions between datasets that share identical filenames.

momo.py

new_fname = f"{dataset}_{fname}"   # e.g. "CSAW_image_001.jpg"
dst_img = os.path.join(output_root, split, 'images', new_fname)
shutil.copy2(src_img, dst_img)

For example, image_001.jpg from CSAW becomes CSAW_image_001.jpg in the merged folder, while the same filename from DDSM becomes DDSM_image_001.jpg.
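To see the collision avoidance end to end, the sketch below builds two tiny fake datasets that both contain image_001.jpg and merges them. The helper `copy_with_prefix` and the source layout `<dataset>/train/images/` are assumptions for illustration; only the prefixing scheme and the use of `shutil.copy2` come from the excerpt above.

```python
import os
import shutil
import tempfile


def copy_with_prefix(src_dir, dst_dir, dataset):
    """Copy every file in src_dir to dst_dir as '<dataset>_<filename>'."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for fname in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, fname)
        if not os.path.isfile(src):
            continue  # ignore nested directories
        new_fname = f"{dataset}_{fname}"
        shutil.copy2(src, os.path.join(dst_dir, new_fname))
        copied.append(new_fname)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    # Two datasets that share the same image filename.
    for dataset in ('CSAW', 'DDSM'):
        src_dir = os.path.join(tmp, dataset, 'train', 'images')
        os.makedirs(src_dir)
        with open(os.path.join(src_dir, 'image_001.jpg'), 'w') as f:
            f.write(dataset)  # placeholder content, not a real image

    merged_dir = os.path.join(tmp, 'merged', 'train', 'images')
    for dataset in ('CSAW', 'DDSM'):
        copy_with_prefix(
            os.path.join(tmp, dataset, 'train', 'images'), merged_dir, dataset)

    merged_files = sorted(os.listdir(merged_dir))
    print(merged_files)  # ['CSAW_image_001.jpg', 'DDSM_image_001.jpg']
```

Without the prefix, the second copy would silently overwrite the first, because `shutil.copy2` replaces an existing destination file.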

Merge split .txt files

For each split, the corresponding list file (train.txt, val.txt, or test.txt) from every source dataset is concatenated into one merged file of the same name. Each line is rewritten to point to the prefixed filename in the merged output directory.

momo.py

new_filename = f"{dataset}_{filename}"
new_path = f"{output_root}/{split}/images/{new_filename}"
outfile.write(new_path + '\n')
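A fuller version of this step might look like the sketch below. It assumes that each source list lives at `<input_dir>/<dataset>/<split>.txt` and that each line holds an image path or bare filename; neither detail is confirmed on this page, and the helper name `merge_split_lists` is hypothetical.

```python
import os
import tempfile


def merge_split_lists(input_dir, output_root, datasets, split):
    """Concatenate each dataset's <split>.txt into <output_root>/<split>.txt."""
    merged_path = os.path.join(output_root, f"{split}.txt")
    with open(merged_path, 'w') as outfile:
        for dataset in datasets:
            src_txt = os.path.join(input_dir, dataset, f"{split}.txt")  # assumed location
            if not os.path.isfile(src_txt):
                continue  # dataset has no list for this split
            with open(src_txt) as infile:
                for line in infile:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    filename = os.path.basename(line)
                    new_filename = f"{dataset}_{filename}"
                    new_path = f"{output_root}/{split}/images/{new_filename}"
                    outfile.write(new_path + '\n')
    return merged_path


with tempfile.TemporaryDirectory() as tmp:
    # Fake source lists: one with a path, one with a bare filename.
    os.makedirs(os.path.join(tmp, 'CSAW'))
    os.makedirs(os.path.join(tmp, 'DDSM'))
    with open(os.path.join(tmp, 'CSAW', 'train.txt'), 'w') as f:
        f.write('CSAW/train/images/image_001.jpg\n')
    with open(os.path.join(tmp, 'DDSM', 'train.txt'), 'w') as f:
        f.write('image_007.png\n')

    output_root = os.path.join(tmp, 'MammoMix_merged')
    os.makedirs(output_root)
    merged = merge_split_lists(tmp, output_root, ['CSAW', 'DMID', 'DDSM'], 'train')

    with open(merged) as f:
        merged_names = [os.path.basename(line.strip()) for line in f]
    print(merged_names)  # ['CSAW_image_001.jpg', 'DDSM_image_007.png']
```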

Output structure

input_dir/
├── CSAW/
├── DMID/
├── DDSM/
└── MammoMix_merged/
    ├── train.txt                  # Merged paths for all training images
    ├── val.txt
    ├── test.txt
    ├── train/
    │   ├── images/
    │   │   ├── CSAW_image_001.jpg
    │   │   ├── DMID_image_042.jpg
    │   │   ├── DDSM_image_007.png
    │   │   └── ...
    │   └── labels/
    │       ├── CSAW_image_001.xml
    │       ├── DMID_image_042.xml
    │       ├── DDSM_image_007.xml
    │       └── ...
    ├── val/
    └── test/

Run splitting.py on each individual dataset before calling momo.py, so that the split folders and .txt files are in place.

momo.py uses shutil.copy2, not shutil.move, so the original per-dataset split directories are left intact. Merging is non-destructive.
