
MammoMix stores bounding box annotations as Pascal VOC XML files — one .xml file per image. These files are parsed by utils.py, cross-validated against bbox_annotations.csv, and converted to COCO format before being passed to the image processor in loader.py.
MammoMix is a single-class detection task. Every annotated object is a cancer lesion and is assigned class_id = 0.

Pascal VOC XML format

Each XML file lives in the labels/ directory alongside its corresponding image in images/. The filename stem must match (e.g. image_001.jpg ↔ image_001.xml).
image_001.xml
<annotation>
  <filename>image_001.jpg</filename>
  <size>
    <width>1024</width>
    <height>768</height>
    <depth>3</depth>
  </size>
  <object>
    <name>cancer</name>
    <bndbox>
      <xmin>120</xmin>
      <ymin>200</ymin>
      <xmax>310</xmax>
      <ymax>390</ymax>
    </bndbox>
  </object>
</annotation>
An image may contain multiple <object> elements if more than one lesion is present. MammoMix silently drops any bounding box where xmin >= xmax or ymin >= ymax.

parse_voc_xml return structure

parse_voc_xml in utils.py parses a single XML file and returns a plain Python dictionary:
utils.py
import xml.etree.ElementTree as ET

def parse_voc_xml(xml_path):
    tree = ET.parse(xml_path)
    root = tree.getroot()
    image_name = root.find('filename').text
    size = root.find('size')
    width = int(size.find('width').text)
    height = int(size.find('height').text)
    bboxes = []
    for obj in root.findall('object'):
        name = obj.find('name').text
        bbox = obj.find('bndbox')
        xmin = float(bbox.find('xmin').text)
        ymin = float(bbox.find('ymin').text)
        xmax = float(bbox.find('xmax').text)
        ymax = float(bbox.find('ymax').text)
        if xmin < xmax and ymin < ymax:  # drop degenerate boxes
            bboxes.append({
                'class': name,
                'xmin': xmin, 'ymin': ymin,
                'xmax': xmax, 'ymax': ymax
            })
    return {
        'image_name': image_name,
        'width': width,
        'height': height,
        'bboxes': bboxes
    }
The returned dictionary has the following shape:
{
    'image_name': 'image_001.jpg',   # str  — value of <filename>
    'width': 1024,                   # int  — image width in pixels
    'height': 768,                   # int  — image height in pixels
    'bboxes': [                      # list — one entry per valid <object>
        {
            'class': 'cancer',       # str   — value of <name>
            'xmin': 120.0,           # float — left edge
            'ymin': 200.0,           # float — top edge
            'xmax': 310.0,           # float — right edge
            'ymax': 390.0,           # float — bottom edge
        }
    ]
}
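The structure can be verified by round-tripping the sample file from the previous section. The sketch below inlines a condensed copy of the function so it runs without the repo on the path:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

SAMPLE = """<annotation>
  <filename>image_001.jpg</filename>
  <size><width>1024</width><height>768</height><depth>3</depth></size>
  <object>
    <name>cancer</name>
    <bndbox><xmin>120</xmin><ymin>200</ymin><xmax>310</xmax><ymax>390</ymax></bndbox>
  </object>
</annotation>"""

def parse_voc_xml(xml_path):
    # Condensed copy of the utils.py function shown above.
    root = ET.parse(xml_path).getroot()
    size = root.find('size')
    bboxes = []
    for obj in root.findall('object'):
        b = obj.find('bndbox')
        xmin, ymin, xmax, ymax = (float(b.find(t).text)
                                  for t in ('xmin', 'ymin', 'xmax', 'ymax'))
        if xmin < xmax and ymin < ymax:  # same validity rule as utils.py
            bboxes.append({'class': obj.find('name').text,
                           'xmin': xmin, 'ymin': ymin,
                           'xmax': xmax, 'ymax': ymax})
    return {'image_name': root.find('filename').text,
            'width': int(size.find('width').text),
            'height': int(size.find('height').text),
            'bboxes': bboxes}

with tempfile.NamedTemporaryFile('w', suffix='.xml', delete=False) as f:
    f.write(SAMPLE)
    path = f.name
result = parse_voc_xml(path)
os.unlink(path)
```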

xml2dicts output structure

xml2dicts converts the raw parse_voc_xml bboxes list to a format suitable for DETR/YOLOS, assigning the hardcoded class_id = 0 to every object:
utils.py
def xml2dicts(bboxes, width, height):
    # width and height are accepted but not used in this conversion
    detr_bboxes = []
    for bbox in bboxes:
        class_id = 0  # Single class 'cancer'
        detr_bboxes.append({
            'class_id': class_id,
            'xmin': bbox['xmin'],
            'ymin': bbox['ymin'],
            'xmax': bbox['xmax'],
            'ymax': bbox['ymax'],
        })
    return detr_bboxes
Each element in the returned list has the following shape:
{
    'class_id': 0,       # int   — always 0 (cancer)
    'xmin': 120.0,       # float — left edge (pixels)
    'ymin': 200.0,       # float — top edge (pixels)
    'xmax': 310.0,       # float — right edge (pixels)
    'ymax': 390.0,       # float — bottom edge (pixels)
}
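The conversion can be exercised end-to-end with a small sketch. The function body below is a copy of the listing above, and the sample input mirrors the bboxes list that parse_voc_xml produces:

```python
def xml2dicts(bboxes, width, height):
    # Copy of the utils.py function shown above; width and height are unused.
    detr_bboxes = []
    for bbox in bboxes:
        detr_bboxes.append({'class_id': 0,  # single class: 'cancer'
                            'xmin': bbox['xmin'], 'ymin': bbox['ymin'],
                            'xmax': bbox['xmax'], 'ymax': bbox['ymax']})
    return detr_bboxes

# One valid bbox, in the shape produced by parse_voc_xml
parsed = [{'class': 'cancer', 'xmin': 120.0, 'ymin': 200.0,
           'xmax': 310.0, 'ymax': 390.0}]
detr = xml2dicts(parsed, 1024, 768)
```

Note that the class name string from the XML is discarded here; only the hardcoded class_id survives into the DETR/YOLOS pipeline.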

bbox_annotations.csv cross-validation file

Every dataset directory must contain a bbox_annotations.csv file at its root. splitting.py reads this file to verify that the image dimensions recorded in each XML annotation match a trusted ground-truth source before the image is included in any split. Expected columns:
Column   Type   Description
name     str    Image filename (must match <filename> in the XML).
width    int    Expected image width in pixels.
height   int    Expected image height in pixels.
If the width or height in the XML does not match the CSV, the image is dropped from the split and an error is logged. If the image is absent from the CSV entirely, it is kept and a warning is logged.
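The validation rule can be sketched as a small standalone function. Both the function name check_dimensions and the in-memory CSV are illustrative assumptions, not code from splitting.py:

```python
import csv
import io

def check_dimensions(parsed, csv_text):
    """Classify one parse_voc_xml result against bbox_annotations.csv.

    Returns 'ok', 'mismatch' (image dropped, error logged), or
    'missing' (image kept, warning logged). Hypothetical helper.
    """
    rows = {r['name']: r for r in csv.DictReader(io.StringIO(csv_text))}
    row = rows.get(parsed['image_name'])
    if row is None:
        return 'missing'
    if (int(row['width']) != parsed['width']
            or int(row['height']) != parsed['height']):
        return 'mismatch'
    return 'ok'
```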

COCO-style format for the image processor

loader.py converts the output of xml2dicts into the COCO annotation format expected by AutoImageProcessor before passing it to the model:
loader.py
annotations = {
    'image_id': idx,
    'annotations': [
        {
            'image_id': idx,
            'category_id': label,                            # class_id (0 = cancer)
            'bbox': [
                bbox[0],               # xmin
                bbox[1],               # ymin
                bbox[2] - bbox[0],     # width  (xmax - xmin)
                bbox[3] - bbox[1],     # height (ymax - ymin)
            ],
            'area': (bbox[2] - bbox[0]) * (bbox[3] - bbox[1]),
            'iscrowd': 0,
        }
        for bbox, label in zip(bboxes, labels)
    ]
}
Key points:
  • bbox uses [xmin, ymin, width, height] format — not the [xmin, ymin, xmax, ymax] format stored in the XML.
  • area is computed as width × height of the bounding box.
  • iscrowd is always 0; MammoMix does not use crowd annotations.
  • category_id maps directly to class_id from xml2dicts (always 0).
This dictionary is passed directly to image_processor(images=image, annotations=annotations, ...), which handles resizing, padding, and normalisation before the tensor is fed to the model.
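As a sanity check on the coordinate conversion, the dictionary comprehension above can be wrapped in a standalone helper. The name to_coco is hypothetical; loader.py builds the dictionary inline:

```python
def to_coco(idx, bboxes, labels):
    """Build a COCO-style annotation dict from [xmin, ymin, xmax, ymax] boxes."""
    return {
        'image_id': idx,
        'annotations': [
            {'image_id': idx,
             'category_id': label,                        # class_id (0 = cancer)
             'bbox': [b[0], b[1],
                      b[2] - b[0],                        # width  = xmax - xmin
                      b[3] - b[1]],                       # height = ymax - ymin
             'area': (b[2] - b[0]) * (b[3] - b[1]),
             'iscrowd': 0}
            for b, label in zip(bboxes, labels)
        ],
    }

coco = to_coco(0, [[120.0, 200.0, 310.0, 390.0]], [0])
```

For the sample lesion this yields a COCO bbox of [120.0, 200.0, 190.0, 190.0] with area 36100.0, confirming the corner-to-width/height conversion.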
