
splitting.py takes raw Pascal VOC annotated datasets and produces clean, reproducible train/val/test splits. Validation images are carved from the training set after cross-validating every XML annotation against bbox_annotations.csv.

split_dataset function

splitting.py
def split_dataset(dataset_name, raw_data_dir, splits_dir, val_split=0.2):
    ...
Parameter      Type   Default  Description
dataset_name   str    -        Name of the dataset folder (e.g. "CSAW").
raw_data_dir   str    -        Root directory that contains the raw dataset folders.
splits_dir     str    -        Output root where processed splits are written.
val_split      float  0.2      Fraction of training images reserved for validation.

Raw directory structure

The function expects each dataset to follow this layout inside raw_data_dir:
raw_data_dir/
└── CSAW/
    ├── bbox_annotations.csv       # Ground-truth bounding boxes for cross-validation
    ├── train/
    │   ├── images/
    │   │   ├── image_001.jpg
    │   │   └── ...
    │   └── labels/
    │       ├── image_001.xml      # Pascal VOC XML annotations
    │       └── ...
    └── test/
        ├── images/
        └── labels/
Only .jpg and .png images are processed. Images without a matching .xml annotation file are skipped with a warning.
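The image/label matching rule above can be sketched as a small helper. This is a hypothetical illustration (splitting.py does not necessarily expose a function with this name); it assumes only the split directory layout shown above:

```python
from pathlib import Path

def pair_images_with_labels(split_dir):
    """Yield (image, xml) pairs from a raw split directory.

    Mirrors the documented rule: only .jpg/.png images are considered,
    and images without a matching .xml annotation are skipped.
    """
    images_dir = Path(split_dir) / "images"
    labels_dir = Path(split_dir) / "labels"
    for img in sorted(images_dir.iterdir()):
        if img.suffix.lower() not in {".jpg", ".png"}:
            continue  # only .jpg and .png images are processed
        xml = labels_dir / (img.stem + ".xml")
        if not xml.exists():
            print(f"Warning: no annotation for {img.name}, skipping")
            continue
        yield img, xml
```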

Output structure

After running, splits_dir will contain:
splits_dir/
└── CSAW/
    ├── train.txt                  # Relative paths to training images
    ├── val.txt                    # Relative paths to validation images
    ├── test.txt                   # Relative paths to test images
    ├── train/
    │   ├── images/
    │   └── labels/
    ├── val/
    │   ├── images/
    │   └── labels/
    └── test/
        ├── images/
        └── labels/
Each .txt file contains one image path per line in dataset/split/images/filename format, ready to be consumed by loader.py.
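A consumer can resolve those relative entries by joining them onto splits_dir. The helper name below is hypothetical (loader.py's actual API is not shown here); only the one-path-per-line format is taken from the source:

```python
from pathlib import Path

def read_split_file(txt_path, splits_dir):
    """Resolve the dataset/split/images/filename entries in a split .txt
    file to absolute paths under splits_dir, skipping blank lines."""
    with open(txt_path) as f:
        return [Path(splits_dir) / line.strip() for line in f if line.strip()]
```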

CSV cross-validation

Before copying any file, split_dataset calls validate_with_csv to confirm that the image dimensions in the XML annotation match the values recorded in bbox_annotations.csv.
splitting.py
def validate_with_csv(xml_data, csv_path):
    csv_data = pd.read_csv(csv_path)
    csv_row = csv_data[csv_data['name'] == xml_data['image_name']]
    if csv_row.empty:
        logger.warning(f"No CSV entry found for {xml_data['image_name']}")
        return True  # Proceed, but log warning

    csv_width, csv_height = csv_row['width'].iloc[0], csv_row['height'].iloc[0]
    if csv_width != xml_data['width'] or csv_height != xml_data['height']:
        logger.error(
            f"Size mismatch for {xml_data['image_name']}: "
            f"XML ({xml_data['width']}, {xml_data['height']}) "
            f"vs CSV ({csv_width}, {csv_height})"
        )
        return False
    return True
  • Images with a size mismatch are dropped from the split.
  • Images missing from the CSV are retained with a warning logged at WARNING level.
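The comparison logic can be exercised in isolation. The sketch below mirrors validate_with_csv's decision rules but uses a plain dict in place of the pandas DataFrame, so it runs without the CSV file; the function name and dict shape are assumptions for illustration only:

```python
import logging

logger = logging.getLogger("splitting")

def check_size_against_csv(xml_data, csv_rows):
    """Stand-in for validate_with_csv's comparison step.

    csv_rows maps image name -> (width, height). Returns False on a size
    mismatch (image dropped), True when the sizes match or the image is
    absent from the CSV (absence is only a warning).
    """
    entry = csv_rows.get(xml_data["image_name"])
    if entry is None:
        logger.warning("No CSV entry found for %s", xml_data["image_name"])
        return True  # proceed, but log warning
    if entry != (xml_data["width"], xml_data["height"]):
        logger.error("Size mismatch for %s", xml_data["image_name"])
        return False
    return True
```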

Validation split mechanics

1. Process train and test splits

All valid images from the raw train/ and test/ folders are copied to splits_dir and their paths are written to train.txt and test.txt.

2. Carve validation images from training data

sklearn.model_selection.train_test_split is called on the entries in train.txt with random_state=42 to ensure reproducibility.
splitting.py
train_images, val_images = train_test_split(
    train_images, test_size=val_split, random_state=42
)

3. Move validation files

Paths in val_images are rewritten from train/ to val/, then the corresponding image and XML label files are physically moved from train/ to val/.

4. Update split .txt files

train.txt is overwritten with the reduced training set; val.txt is written with the validation paths.
The fixed random_state=42 guarantees that re-running split_dataset on the same data always produces identical splits.
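Steps 2 and 3 can be sketched end to end. This is a dependency-free stand-in: splitting.py uses sklearn's train_test_split with random_state=42, whereas the sketch below substitutes random.Random(seed) so it has no external dependency; the function name is hypothetical:

```python
import random

def carve_validation(train_paths, val_split=0.2, seed=42):
    """Deterministically carve a validation subset from the training
    paths (step 2) and rewrite its paths from train/ to val/ (step 3)."""
    rng = random.Random(seed)  # fixed seed => identical splits every run
    shuffled = train_paths[:]
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_split))
    val = sorted(shuffled[:n_val])
    train = sorted(shuffled[n_val:])
    # Step 3: rewrite the split segment from train/ to val/
    val = [p.replace("/train/", "/val/", 1) for p in val]
    return train, val
```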

Configuration

The bottom of splitting.py defines the variables used when the script is run directly:
splitting.py
DATASETS    = ['CSAW', 'DDSM', 'DMID']
RAW_DATA_DIR = 'AJCAI25/raw'
SPLITS_DIR   = 'AJCAI25/splits'
VAL_SPLIT    = 0.2

os.makedirs(SPLITS_DIR, exist_ok=True)
for dataset in DATASETS:
    split_dataset(dataset, RAW_DATA_DIR, SPLITS_DIR, VAL_SPLIT)
Run the script directly to process all three datasets in sequence:
python splitting.py
Adjust RAW_DATA_DIR, SPLITS_DIR, or VAL_SPLIT at the top of the file before running to customise paths or the validation fraction.
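If you would rather not edit the file, a thin wrapper could expose the same settings as flags. This wrapper is hypothetical (splitting.py itself defines no command-line interface); the defaults mirror the constants shown above:

```python
import argparse

def parse_args(argv=None):
    """Hypothetical CLI wrapper around split_dataset's configuration."""
    p = argparse.ArgumentParser(description="Split raw datasets into train/val/test")
    p.add_argument("--raw-data-dir", default="AJCAI25/raw")
    p.add_argument("--splits-dir", default="AJCAI25/splits")
    p.add_argument("--val-split", type=float, default=0.2)
    p.add_argument("--datasets", nargs="+", default=["CSAW", "DDSM", "DMID"])
    return p.parse_args(argv)
```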
