Training on a single mammography dataset risks producing models that overfit to that dataset’s imaging protocol, scanner characteristics, and patient demographics. momo.py merges CSAW, DMID, and DDSM into one unified dataset so that a model can learn features that generalise across acquisition conditions.

merge_datasets function

momo.py
def merge_datasets(input_dir, output_name):
    datasets = ['CSAW', 'DMID', 'DDSM']
    ...
| Parameter | Type | Description |
| --- | --- | --- |
| input_dir | str | Parent directory that contains the CSAW, DMID, and DDSM split folders (output of splitting.py). |
| output_name | str | Name of the merged output folder, created inside input_dir. |

The list of datasets (['CSAW', 'DMID', 'DDSM']) is hardcoded inside merge_datasets. Any dataset folder not present in input_dir is skipped with a warning; it does not cause an error.
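
The skip-with-warning behaviour can be sketched roughly as follows. This is an illustration only, not the code from momo.py: the helper name find_available_datasets is hypothetical, and using the warnings module (rather than, say, a print call) is an assumption.

```python
import os
import tempfile
import warnings

# Same hardcoded list that the documentation shows inside merge_datasets.
DATASETS = ['CSAW', 'DMID', 'DDSM']


def find_available_datasets(input_dir, datasets=DATASETS):
    """Return the dataset names that have a folder under input_dir.

    Hypothetical helper: missing folders trigger a warning and are skipped
    instead of raising an error.
    """
    available = []
    for name in datasets:
        if os.path.isdir(os.path.join(input_dir, name)):
            available.append(name)
        else:
            warnings.warn(f"Dataset folder not found, skipping: {name}")
    return available


# Usage: only CSAW and DDSM exist here, so DMID is skipped with a warning.
with tempfile.TemporaryDirectory() as input_dir:
    os.makedirs(os.path.join(input_dir, 'CSAW'))
    os.makedirs(os.path.join(input_dir, 'DDSM'))
    print(find_available_datasets(input_dir))
```

The order of the returned names follows the hardcoded list, which keeps any later merge deterministic.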

CLI usage

python momo.py --input_dir /path/to/splits --name MammoMix_merged
| Flag | Required | Description |
| --- | --- | --- |
| --input_dir | Yes | Path to the directory containing the individual dataset split folders. |
| --name | Yes | Name for the merged output folder. |

The merged dataset is written to input_dir/MammoMix_merged/ (i.e. alongside the source dataset folders).
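
A command-line interface with these two required flags could be declared as in the sketch below. It assumes argparse is used, which the page does not confirm; build_parser is a hypothetical name, and the argument list is passed explicitly so the example runs without a real command line.

```python
import argparse
import os


def build_parser():
    """Hypothetical parser mirroring the two documented flags."""
    parser = argparse.ArgumentParser(
        description="Merge per-dataset split folders into one dataset.")
    parser.add_argument('--input_dir', required=True,
                        help="Directory containing the dataset split folders.")
    parser.add_argument('--name', required=True,
                        help="Name for the merged output folder.")
    return parser


# Parse the same arguments as the documented command.
args = build_parser().parse_args(
    ['--input_dir', '/path/to/splits', '--name', 'MammoMix_merged'])

# The merged folder lives inside input_dir, next to the source datasets.
output_root = os.path.join(args.input_dir, args.name)
print(output_root)
```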

What it does

1. Create output directory structure

Directories for train, val, and test splits are created under input_dir/output_name/, each containing images/ and labels/ subdirectories.
input_dir/
└── MammoMix_merged/
    ├── train/
    │   ├── images/
    │   └── labels/
    ├── val/
    │   ├── images/
    │   └── labels/
    └── test/
        ├── images/
        └── labels/
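
One way to build this layout is sketched below. It assumes os.makedirs with exist_ok=True so that repeated runs do not fail; the helper name make_output_dirs is hypothetical and not taken from momo.py.

```python
import os
import tempfile

SPLITS = ('train', 'val', 'test')
SUBDIRS = ('images', 'labels')


def make_output_dirs(input_dir, output_name):
    """Create <input_dir>/<output_name>/<split>/{images,labels} and return the root."""
    output_root = os.path.join(input_dir, output_name)
    for split in SPLITS:
        for sub in SUBDIRS:
            # exist_ok=True makes the call safe to repeat.
            os.makedirs(os.path.join(output_root, split, sub), exist_ok=True)
    return output_root


# Usage: build the tree in a temporary directory and list the split folders.
with tempfile.TemporaryDirectory() as input_dir:
    root = make_output_dirs(input_dir, 'MammoMix_merged')
    print(sorted(os.listdir(root)))
```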
2. Copy files with dataset prefix

Every image and label file is copied to the merged split directory with its source dataset name prepended to the filename, preventing collisions between datasets that share identical filenames.
momo.py
new_fname = f"{dataset}_{fname}"   # e.g. "CSAW_image_001.jpg"
dst_img = os.path.join(output_root, split, 'images', new_fname)
shutil.copy2(src_img, dst_img)
For example, image_001.jpg from CSAW becomes CSAW_image_001.jpg in the merged folder, while the same filename from DDSM becomes DDSM_image_001.jpg.
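
To show the excerpt in context, here is a minimal, self-contained sketch of copying one source folder into a shared destination with a dataset prefix. The loop and the helper name copy_with_prefix are assumptions; only the prefixing and the shutil.copy2 call come from the excerpt above.

```python
import os
import shutil
import tempfile


def copy_with_prefix(src_dir, dst_dir, dataset):
    """Copy every file in src_dir to dst_dir as '<dataset>_<original name>'."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for fname in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, fname)
        if not os.path.isfile(src):
            continue  # ignore nested folders in this sketch
        new_fname = f"{dataset}_{fname}"
        # copy2 also preserves file metadata and leaves the source in place.
        shutil.copy2(src, os.path.join(dst_dir, new_fname))
        copied.append(new_fname)
    return copied


# Usage: two datasets share the filename image_001.jpg without colliding.
with tempfile.TemporaryDirectory() as tmp:
    merged = os.path.join(tmp, 'merged_images')
    for dataset in ('CSAW', 'DDSM'):
        src_dir = os.path.join(tmp, dataset)
        os.makedirs(src_dir)
        with open(os.path.join(src_dir, 'image_001.jpg'), 'wb') as f:
            f.write(dataset.encode())
        copy_with_prefix(src_dir, merged, dataset)
    print(sorted(os.listdir(merged)))
```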
3. Merge split .txt files

For each split, the corresponding .txt file (train.txt, val.txt, or test.txt) from every source dataset is concatenated into one merged file of the same name. Each line is rewritten to point to the prefixed filename in the merged output directory.
momo.py
new_filename = f"{dataset}_{filename}"
new_path = f"{output_root}/{split}/images/{new_filename}"
outfile.write(new_path + '\n')
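The rewriting step can be sketched end to end as below. This is an illustration under stated assumptions: each line of a source .txt file is taken to hold a path or filename of one image, the per-dataset file is assumed to sit at <input_dir>/<dataset>/<split>.txt, and merge_split_lists is a hypothetical name.

```python
import os
import tempfile


def merge_split_lists(input_dir, output_root, datasets, split):
    """Concatenate <dataset>/<split>.txt files into <output_root>/<split>.txt."""
    merged_path = os.path.join(output_root, f"{split}.txt")
    with open(merged_path, 'w') as outfile:
        for dataset in datasets:
            src_txt = os.path.join(input_dir, dataset, f"{split}.txt")
            if not os.path.isfile(src_txt):
                continue  # dataset absent or has no list for this split
            with open(src_txt) as infile:
                for line in infile:
                    line = line.strip()
                    if not line:
                        continue
                    # Keep only the filename, then apply the dataset prefix.
                    filename = os.path.basename(line)
                    new_filename = f"{dataset}_{filename}"
                    new_path = f"{output_root}/{split}/images/{new_filename}"
                    outfile.write(new_path + '\n')
    return merged_path


# Usage: merge the train lists of two small fake datasets.
with tempfile.TemporaryDirectory() as input_dir:
    for dataset in ('CSAW', 'DDSM'):
        os.makedirs(os.path.join(input_dir, dataset))
        with open(os.path.join(input_dir, dataset, 'train.txt'), 'w') as f:
            f.write('image_001.jpg\n')
    output_root = os.path.join(input_dir, 'MammoMix_merged')
    os.makedirs(output_root)
    with open(merge_split_lists(input_dir, output_root,
                                ['CSAW', 'DMID', 'DDSM'], 'train')) as f:
        print(f.read())
```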

Output structure

input_dir/
├── CSAW/
├── DMID/
├── DDSM/
└── MammoMix_merged/
    ├── train.txt                  # Merged paths for all training images
    ├── val.txt
    ├── test.txt
    ├── train/
    │   ├── images/
    │   │   ├── CSAW_image_001.jpg
    │   │   ├── DMID_image_042.jpg
    │   │   ├── DDSM_image_007.png
    │   │   └── ...
    │   └── labels/
    │       ├── CSAW_image_001.xml
    │       ├── DMID_image_042.xml
    │       ├── DDSM_image_007.xml
    │       └── ...
    ├── val/
    └── test/
Run splitting.py on each individual dataset before calling momo.py, so that the split folders and .txt files are in place.
momo.py uses shutil.copy2, not shutil.move, so the original per-dataset split directories are left intact. Merging is non-destructive.
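
After a merge, a quick sanity check is to confirm that every path listed in a merged .txt file points to a file that exists. The sketch below is not part of momo.py; missing_listed_images is a hypothetical helper that assumes one image path per line, as in the merged lists described above.

```python
import os
import tempfile


def missing_listed_images(output_root, split):
    """Return the paths listed in <output_root>/<split>.txt that do not exist."""
    txt_path = os.path.join(output_root, f"{split}.txt")
    with open(txt_path) as f:
        paths = [line.strip() for line in f if line.strip()]
    return [p for p in paths if not os.path.isfile(p)]


# Usage: one listed image exists, the other does not.
with tempfile.TemporaryDirectory() as output_root:
    images = os.path.join(output_root, 'train', 'images')
    os.makedirs(images)
    present = os.path.join(images, 'CSAW_image_001.jpg')
    absent = os.path.join(images, 'DDSM_image_007.png')
    open(present, 'wb').close()
    with open(os.path.join(output_root, 'train.txt'), 'w') as f:
        f.write(present + '\n' + absent + '\n')
    print(missing_listed_images(output_root, 'train'))
```

An empty result means the list and the copied images agree for that split.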