splitting.py takes raw Pascal VOC annotated datasets and produces clean, reproducible train/val/test splits. Validation images are carved from the training set after cross-validating every XML annotation against bbox_annotations.csv.
split_dataset function
| Parameter | Type | Default | Description |
|---|---|---|---|
| dataset_name | str | — | Name of the dataset folder (e.g. "CSAW"). |
| raw_data_dir | str | — | Root directory that contains the raw dataset folders. |
| splits_dir | str | — | Output root where processed splits are written. |
| val_split | float | 0.2 | Fraction of training images reserved for validation. |
Raw directory structure
The function expects each dataset to follow this layout inside raw_data_dir:
Only .jpg and .png images are processed. Images without a matching .xml annotation file are skipped with a warning.
Output structure
After running, splits_dir will contain:
Each .txt file contains one image path per line in dataset/split/images/filename format, ready to be consumed by loader.py.
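As a rough illustration of how such an index file could be produced from the images that pass the extension and annotation checks, here is a minimal sketch. The helper names `collect_valid_images` and `write_index` are hypothetical, and the sketch assumes each .xml file sits next to its image with the same stem, which the actual raw layout may not follow.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)
IMAGE_EXTENSIONS = {".jpg", ".png"}


def collect_valid_images(image_dir: Path) -> list[Path]:
    """Return .jpg/.png files that have a same-stem .xml file beside them.

    Assumption: annotations live in the same folder as the images.
    """
    valid = []
    for path in sorted(image_dir.iterdir()):
        if path.suffix not in IMAGE_EXTENSIONS:
            continue  # only .jpg and .png images are processed
        if not path.with_suffix(".xml").exists():
            logger.warning("Skipping %s: no matching .xml annotation", path.name)
            continue
        valid.append(path)
    return valid


def write_index(images: list[Path], dataset: str, split: str, out_file: Path) -> None:
    """Write one dataset/split/images/filename entry per line."""
    with out_file.open("w") as fh:
        for image in images:
            fh.write(f"{dataset}/{split}/images/{image.name}\n")
```

Writing relative entries rather than absolute paths keeps the index portable across machines, provided the loader joins them onto its own root directory.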
CSV cross-validation
Before copying any file, split_dataset calls validate_with_csv to confirm that the image dimensions in the XML annotation match the values recorded in bbox_annotations.csv.
- Images with a size mismatch are dropped from the split.
- Images missing from the CSV are retained with a warning logged at WARNING level.
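The check described above could look roughly like the sketch below. It is not the actual validate_with_csv: the function names, the signature, and the CSV column names (`filename`, `width`, `height`) are all assumptions made for illustration. The only thing taken from the Pascal VOC format itself is the `<size>` element with `<width>` and `<height>` children.

```python
import csv
import logging
import xml.etree.ElementTree as ET
from pathlib import Path

logger = logging.getLogger(__name__)


def load_csv_sizes(csv_path: Path) -> dict[str, tuple[int, int]]:
    """Map filename -> (width, height). Column names are assumed."""
    with csv_path.open(newline="") as fh:
        return {
            row["filename"]: (int(row["width"]), int(row["height"]))
            for row in csv.DictReader(fh)
        }


def size_matches_csv(
    xml_path: Path, image_name: str, csv_sizes: dict[str, tuple[int, int]]
) -> bool:
    """Return False only when the XML size disagrees with the CSV."""
    if image_name not in csv_sizes:
        # Missing from the CSV: keep the image, but log a warning.
        logger.warning("%s is not listed in the CSV; keeping it", image_name)
        return True
    size = ET.parse(xml_path).getroot().find("size")
    xml_size = (int(size.findtext("width")), int(size.findtext("height")))
    if xml_size != csv_sizes[image_name]:
        # Size mismatch: the caller should drop this image from the split.
        logger.warning(
            "%s: XML size %s != CSV size %s", image_name, xml_size, csv_sizes[image_name]
        )
        return False
    return True
```

Loading the CSV once into a dictionary avoids re-reading the file for every image.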
Validation split mechanics
Process train and test splits
All valid images from the raw train/ and test/ folders are copied to splits_dir, and their paths are written to train.txt and test.txt.
Carve validation images from training data
sklearn.model_selection.train_test_split is called on the entries in train.txt with random_state=42 to ensure reproducibility.
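A minimal sketch of this step, together with the path rewrite and file move that follow it, is given below. The helper names are hypothetical, and passing val_split as test_size is an assumption about how the fraction is wired in. Only the use of train_test_split with random_state=42 comes from the description above.

```python
import shutil
from pathlib import Path

from sklearn.model_selection import train_test_split


def carve_validation(train_entries, val_split=0.2):
    """Split train.txt entries into (train, val) lists with a fixed seed."""
    train_images, val_images = train_test_split(
        train_entries, test_size=val_split, random_state=42
    )
    return train_images, val_images


def to_val_entry(entry):
    """Rewrite dataset/train/images/name to dataset/val/images/name."""
    return entry.replace("/train/", "/val/", 1)


def move_to_val(entry, splits_dir):
    """Move one file from train/ to val/ and return its rewritten entry."""
    new_entry = to_val_entry(entry)
    destination = Path(splits_dir) / new_entry
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(Path(splits_dir) / entry), str(destination))
    return new_entry
```

Because the seed is fixed, repeated runs over the same ordered list of entries select the same validation images. The matching XML label would be moved in the same way as the image.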
Move validation files
Paths in val_images are rewritten from train/ to val/, then the corresponding image and XML label files are physically moved from train/ to val/.
Configuration
The bottom of splitting.py defines the variables used when the script is run directly:
Edit RAW_DATA_DIR, SPLITS_DIR, or VAL_SPLIT in splitting.py before running to customise paths or the validation fraction.
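For orientation, a script entry point of this kind might be laid out as in the sketch below. The path values are hypothetical placeholders, and split_dataset is a stand-in stub that only mirrors the documented parameters. The real function body and the real values live in splitting.py.

```python
def split_dataset(dataset_name: str, raw_data_dir: str, splits_dir: str,
                  val_split: float = 0.2) -> None:
    """Stand-in with the documented parameters; not the real implementation."""
    print(f"Would split {dataset_name!r}: {raw_data_dir} -> {splits_dir} "
          f"(val_split={val_split})")


# Hypothetical placeholder values; replace them with real locations.
RAW_DATA_DIR = "path/to/raw"
SPLITS_DIR = "path/to/splits"
VAL_SPLIT = 0.2  # matches the documented default

if __name__ == "__main__":
    # "CSAW" is the example dataset name used in the parameter table.
    split_dataset("CSAW", RAW_DATA_DIR, SPLITS_DIR, val_split=VAL_SPLIT)
```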