
splitting.py takes raw Pascal VOC annotated datasets and produces clean, reproducible train/val/test splits. Validation images are carved from the training set after cross-validating every XML annotation against bbox_annotations.csv.

split_dataset function

splitting.py
def split_dataset(dataset_name, raw_data_dir, splits_dir, val_split=0.2):
    ...
Parameter      Type   Default  Description
dataset_name   str    -        Name of the dataset folder (e.g. "CSAW").
raw_data_dir   str    -        Root directory that contains the raw dataset folders.
splits_dir     str    -        Output root where processed splits are written.
val_split      float  0.2      Fraction of training images reserved for validation.

Raw directory structure

The function expects each dataset to follow this layout inside raw_data_dir:
raw_data_dir/
└── CSAW/
    ├── bbox_annotations.csv       # Ground-truth bounding boxes for cross-validation
    ├── train/
    │   ├── images/
    │   │   ├── image_001.jpg
    │   │   └── ...
    │   └── labels/
    │       ├── image_001.xml      # Pascal VOC XML annotations
    │       └── ...
    └── test/
        ├── images/
        └── labels/
Only .jpg and .png images are processed. Images without a matching .xml annotation file are skipped with a warning.
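The image/label matching rule above can be sketched as a small helper. This is a hypothetical illustration (splitting.py does not necessarily expose a function with this name); it assumes only the split directory layout shown above:

```python
from pathlib import Path

def pair_images_with_labels(split_dir):
    """Yield (image, xml) pairs from a raw split directory.

    Mirrors the documented rule: only .jpg/.png images are considered,
    and images without a matching .xml annotation are skipped.
    """
    images_dir = Path(split_dir) / "images"
    labels_dir = Path(split_dir) / "labels"
    for img in sorted(images_dir.iterdir()):
        if img.suffix.lower() not in {".jpg", ".png"}:
            continue  # only .jpg and .png images are processed
        xml = labels_dir / (img.stem + ".xml")
        if not xml.exists():
            print(f"Warning: no annotation for {img.name}, skipping")
            continue
        yield img, xml
```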

Output structure

After running, splits_dir will contain:
splits_dir/
└── CSAW/
    ├── train.txt                  # Relative paths to training images
    ├── val.txt                    # Relative paths to validation images
    ├── test.txt                   # Relative paths to test images
    ├── train/
    │   ├── images/
    │   └── labels/
    ├── val/
    │   ├── images/
    │   └── labels/
    └── test/
        ├── images/
        └── labels/
Each .txt file contains one image path per line in dataset/split/images/filename format, ready to be consumed by loader.py.
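A consumer can resolve those relative entries by joining them onto splits_dir. The helper name below is hypothetical (loader.py's actual API is not shown here); only the one-path-per-line format is taken from the source:

```python
from pathlib import Path

def read_split_file(txt_path, splits_dir):
    """Resolve the dataset/split/images/filename entries in a split .txt
    file to absolute paths under splits_dir, skipping blank lines."""
    with open(txt_path) as f:
        return [Path(splits_dir) / line.strip() for line in f if line.strip()]
```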

CSV cross-validation

Before copying any file, split_dataset calls validate_with_csv to confirm that the image dimensions in the XML annotation match the values recorded in bbox_annotations.csv.
splitting.py
def validate_with_csv(xml_data, csv_path):
    csv_data = pd.read_csv(csv_path)
    csv_row = csv_data[csv_data['name'] == xml_data['image_name']]
    if csv_row.empty:
        logger.warning(f"No CSV entry found for {xml_data['image_name']}")
        return True  # Proceed, but log warning

    csv_width, csv_height = csv_row['width'].iloc[0], csv_row['height'].iloc[0]
    if csv_width != xml_data['width'] or csv_height != xml_data['height']:
        logger.error(
            f"Size mismatch for {xml_data['image_name']}: "
            f"XML ({xml_data['width']}, {xml_data['height']}) "
            f"vs CSV ({csv_width}, {csv_height})"
        )
        return False
    return True
  • Images with a size mismatch are dropped from the split.
  • Images missing from the CSV are retained with a warning logged at WARNING level.
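The comparison logic can be exercised in isolation. The sketch below mirrors validate_with_csv's decision rules but uses a plain dict in place of the pandas DataFrame, so it runs without the CSV file; the function name and dict shape are assumptions for illustration only:

```python
import logging

logger = logging.getLogger("splitting")

def check_size_against_csv(xml_data, csv_rows):
    """Stand-in for validate_with_csv's comparison step.

    csv_rows maps image name -> (width, height). Returns False on a size
    mismatch (image dropped), True when the sizes match or the image is
    absent from the CSV (absence is only a warning).
    """
    entry = csv_rows.get(xml_data["image_name"])
    if entry is None:
        logger.warning("No CSV entry found for %s", xml_data["image_name"])
        return True  # proceed, but log warning
    if entry != (xml_data["width"], xml_data["height"]):
        logger.error("Size mismatch for %s", xml_data["image_name"])
        return False
    return True
```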

Validation split mechanics

1. Process train and test splits

All valid images from the raw train/ and test/ folders are copied to splits_dir and their paths are written to train.txt and test.txt.

2. Carve validation images from training data

sklearn.model_selection.train_test_split is called on the entries in train.txt with random_state=42 to ensure reproducibility.
splitting.py
train_images, val_images = train_test_split(
    train_images, test_size=val_split, random_state=42
)

3. Move validation files

Paths in val_images are rewritten from train/ to val/, then the corresponding image and XML label files are physically moved from train/ to val/.

4. Update split .txt files

train.txt is overwritten with the reduced training set; val.txt is written with the validation paths.
The fixed random_state=42 guarantees that re-running split_dataset on the same data always produces identical splits.
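Steps 2 and 3 can be sketched end to end. This is a dependency-free stand-in: splitting.py uses sklearn's train_test_split with random_state=42, whereas the sketch below substitutes random.Random(seed) so it has no external dependency; the function name is hypothetical:

```python
import random

def carve_validation(train_paths, val_split=0.2, seed=42):
    """Deterministically carve a validation subset from the training
    paths (step 2) and rewrite its paths from train/ to val/ (step 3)."""
    rng = random.Random(seed)  # fixed seed => identical splits every run
    shuffled = train_paths[:]
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_split))
    val = sorted(shuffled[:n_val])
    train = sorted(shuffled[n_val:])
    # Step 3: rewrite the split segment from train/ to val/
    val = [p.replace("/train/", "/val/", 1) for p in val]
    return train, val
```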

Configuration

The bottom of splitting.py defines the variables used when the script is run directly:
splitting.py
DATASETS    = ['CSAW', 'DDSM', 'DMID']
RAW_DATA_DIR = 'AJCAI25/raw'
SPLITS_DIR   = 'AJCAI25/splits'
VAL_SPLIT    = 0.2

os.makedirs(SPLITS_DIR, exist_ok=True)
for dataset in DATASETS:
    split_dataset(dataset, RAW_DATA_DIR, SPLITS_DIR, VAL_SPLIT)
Run the script directly to process all three datasets in sequence:
python splitting.py
Adjust RAW_DATA_DIR, SPLITS_DIR, or VAL_SPLIT at the top of the file before running to customise paths or the validation fraction.
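If you would rather not edit the file, a thin wrapper could expose the same settings as flags. This wrapper is hypothetical (splitting.py itself defines no command-line interface); the defaults mirror the constants shown above:

```python
import argparse

def parse_args(argv=None):
    """Hypothetical CLI wrapper around split_dataset's configuration."""
    p = argparse.ArgumentParser(description="Split raw datasets into train/val/test")
    p.add_argument("--raw-data-dir", default="AJCAI25/raw")
    p.add_argument("--splits-dir", default="AJCAI25/splits")
    p.add_argument("--val-split", type=float, default=0.2)
    p.add_argument("--datasets", nargs="+", default=["CSAW", "DDSM", "DMID"])
    return p.parse_args(argv)
```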
