
loader.py provides the PyTorch Dataset implementation for mammography data and the collate function used with DataLoader. It handles split-based file discovery, VOC XML annotation parsing, Albumentations augmentation, and DETR-compatible encoding.

BreastCancerDataset

A torch.utils.data.Dataset that loads mammography images and bounding-box annotations for DETR-based object detection models. Augmentation is applied automatically for the train split.
from loader import BreastCancerDataset
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained(
    "hustvl/yolos-base",
    do_resize=True,
    do_pad=True,
    use_fast=True,
    size={"max_height": 640, "max_width": 640},
    pad_size={"height": 640, "width": 640},
)

dataset = BreastCancerDataset(
    split="train",
    splits_dir="AJCAI25/splits",
    dataset_name="CSAW",
    image_processor=image_processor,
)

Constructor parameters

split
string
required
Dataset split to load. Must be one of "train", "val", or "test". Raises ValueError for any other value. The "train" split activates the full Albumentations augmentation pipeline; "val" and "test" apply a no-op identity transform.
splits_dir
string
required
Path to the root directory that contains per-dataset split files. The constructor expects a file at {splits_dir}/{dataset_name}/{split}.txt. Each line in that file is a relative path to one image. Raises FileNotFoundError if the file does not exist.
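The discovery logic described above can be sketched as follows (the function name and internals are illustrative, not the module's actual implementation):

```python
from pathlib import Path

def load_split_paths(splits_dir: str, dataset_name: str, split: str) -> list[str]:
    # Validate the split name, mirroring the ValueError described above.
    if split not in {"train", "val", "test"}:
        raise ValueError(f"split must be 'train', 'val', or 'test', got {split!r}")
    split_file = Path(splits_dir) / dataset_name / f"{split}.txt"
    # A missing split file surfaces as FileNotFoundError, as documented.
    if not split_file.exists():
        raise FileNotFoundError(f"Split file not found: {split_file}")
    # Each non-empty line is a relative path to one image.
    return [line.strip() for line in split_file.read_text().splitlines() if line.strip()]
```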
dataset_name
string
required
Name of the dataset subdirectory inside splits_dir. Supported values used in MammoMix are "CSAW", "DMID", and "DDSM".
image_processor
AutoImageProcessor
required
A HuggingFace AutoImageProcessor instance (e.g. from hustvl/yolos-base or a DETR checkpoint). Used to resize, pad, and normalise images, and to encode COCO-format annotations into the tensors expected by DETR.

Return value — __getitem__

Each call to dataset[idx] returns a Python dict with the following fields.
pixel_values
torch.Tensor
Preprocessed image tensor of shape (3, H, W) after resizing, padding, and normalisation. The batch dimension from the image processor is squeezed out.
labels
dict
DETR-compatible annotation dict produced by the image processor. Contains at minimum class_labels (integer class indices, one per box) and boxes (normalised (cx, cy, w, h) coordinates); processors typically also include image_id, area, iscrowd, orig_size, and size.
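For orientation, a single labels dict has roughly the following shape. This is an illustrative sketch: plain lists stand in for the torch tensors the image processor actually returns, the sample values are invented, and the exact key set can vary by processor version.

```python
# Illustrative labels dict for one image with two boxes (plain lists stand in
# for the torch tensors produced by the image processor).
labels = {
    "class_labels": [0, 0],                  # one class index per box
    "boxes": [
        [0.42, 0.37, 0.10, 0.08],            # normalised (cx, cy, w, h)
        [0.61, 0.55, 0.05, 0.04],
    ],
    "image_id": [7],
    "area": [3150.0, 920.0],                 # box areas in pixels
    "iscrowd": [0, 0],
    "orig_size": [2800, 2082],               # (height, width) before resizing
    "size": [640, 640],                      # (height, width) after resizing
}
```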

Training augmentation pipeline

When split="train", get_transforms() returns an albumentations.Compose with the following transforms applied to both the image and bounding boxes:
Transform: key parameters
ElasticTransform: alpha=50, sigma=5, p=0.5
Perspective: scale=(0.05, 0.1), p=0.5
HorizontalFlip: p=0.5
Rotate: limit=10, p=0.5
RandomScale: scale_limit=0.2, p=0.5
Affine: scale, translate, rotate, shear, p=0.5
RandomBrightnessContrast: brightness_limit=0.2, contrast_limit=0.2, p=0.5
GaussNoise: std_range=(0.05, 0.05), p=0.5
GaussianBlur: p=0.5
Bounding-box params: format pascal_voc, min_area=25, min_visibility=0.1, clip=True. If all boxes are removed by augmentation, the item is retried automatically.
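The retry behaviour can be sketched as below. The function and parameter names are illustrative; the real pipeline invokes the Albumentations Compose described above, which may drop boxes that fall under min_area or min_visibility.

```python
def augment_with_retry(image, boxes, transform, max_tries: int = 10):
    """Re-run the augmentation until at least one bounding box survives.

    `transform` mimics an albumentations.Compose call: it accepts
    image/bboxes keyword arguments and may drop boxes during augmentation.
    """
    for _ in range(max_tries):
        out = transform(image=image, bboxes=boxes)
        if out["bboxes"]:  # at least one box survived this attempt
            return out
    # Fall back to the unaugmented sample if every attempt removed all boxes.
    return {"image": image, "bboxes": boxes}
```

The cap on attempts guards against pathological samples where augmentation almost always removes every box.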

collate_fn

Collates a list of dataset samples into a batch suitable for a DataLoader.
from torch.utils.data import DataLoader
from loader import BreastCancerDataset, collate_fn

loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
    collate_fn=collate_fn,
)

Parameters

batch
list[dict]
required
A list of sample dicts as returned by BreastCancerDataset.__getitem__. Each dict must contain pixel_values and labels, and may optionally contain pixel_mask.

Return value

pixel_values
torch.Tensor
Stacked image tensor of shape (B, 3, H, W) produced by torch.stack.
labels
list[dict]
List of per-image label dicts (length B). Kept as a Python list because each image may have a different number of bounding boxes and DETR expects this structure directly.
pixel_mask
torch.Tensor
Stacked attention mask of shape (B, H, W), present only when pixel_mask exists in the first sample. Each value is 1 for real pixels and 0 for padding.
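A simplified collate function with the same output structure might look like the following sketch, which uses plain Python lists where the real implementation calls torch.stack on the image tensors and masks:

```python
def collate_fn(batch):
    # Simplified sketch: the real code stacks these into one tensor with
    # torch.stack([s["pixel_values"] for s in batch]).
    collated = {
        "pixel_values": [s["pixel_values"] for s in batch],
        # Labels stay a Python list: each image can have a different
        # number of boxes, and DETR consumes the list directly.
        "labels": [s["labels"] for s in batch],
    }
    # pixel_mask is optional; include it only when the samples carry one.
    if "pixel_mask" in batch[0]:
        collated["pixel_mask"] = [s["pixel_mask"] for s in batch]
    return collated
```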
