

MammoMix supports two evaluation workflows: automatic evaluation hooked into the Hugging Face Trainer loop, and a standalone function for running inference and computing mAP on any test dataset.

Evaluation approaches

1. Automatic evaluation via Trainer

During training, pass the metrics function returned by get_eval_compute_metrics_fn to the Trainer as compute_metrics. The Trainer calls it at the end of each evaluation pass with an EvalPrediction object containing batched predictions and ground-truth labels.
evaluation.py
from transformers import Trainer, TrainingArguments
from evaluation import get_eval_compute_metrics_fn

compute_metrics = get_eval_compute_metrics_fn(image_processor)

training_args = TrainingArguments(
    output_dir="./output",
    eval_do_concat_batches=False,
    metric_for_best_model="eval_map_50",
    # ...other args
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    # ...
)
trainer.train()
You must set eval_do_concat_batches=False in TrainingArguments. The compute_metrics function iterates over individual batches from evaluation_results.predictions and evaluation_results.label_ids. Concatenating batches before this step produces incorrect image-size tensors and breaks post-processing.
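With batch concatenation disabled, the EvalPrediction fields arrive as lists with one entry per evaluation batch rather than single concatenated arrays. A minimal sketch of the resulting iteration pattern (the exact contents of each batch entry are an assumption here, not the actual MammoMix implementation):

def compute_metrics(evaluation_results, image_processor, threshold, id2label):
    # .predictions and .label_ids are per-batch lists because
    # eval_do_concat_batches=False (structure assumed for illustration)
    for batch_preds, batch_labels in zip(
        evaluation_results.predictions, evaluation_results.label_ids
    ):
        ...  # post-process batch_preds per image, accumulate against batch_labels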
2. Standalone inference with mAP evaluation

Use run_model_inference_with_map to evaluate any trained model against a test dataset outside the Trainer loop. This is the recommended path for final benchmark runs.
from evaluation import run_model_inference_with_map

metrics = run_model_inference_with_map(
    model=model,
    test_dataset=test_dataset,
    image_processor=image_processor,
    device=device,
    batch_size=8,
)
print(metrics)
# {'map': 0.42, 'map_50': 0.71, 'map_75': 0.38, ...}
Signature
evaluation.py
def run_model_inference_with_map(
    model,           # Trained AutoModelForObjectDetection
    test_dataset,    # torch Dataset yielding pixel_values + labels
    image_processor, # AutoImageProcessor used during training
    device,          # torch.device
    batch_size=8,    # Images per forward pass
) -> dict[str, float]:
    ...
Internally, the function (see the sketch after this list):
  1. Wraps test_dataset in a DataLoader using collate_fn.
  2. Runs model.eval() and collects outputs under torch.no_grad().
  3. Delegates metric computation to calculate_custom_map_metrics.
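
A minimal sketch of those three steps, assuming collate_fn pads pixel_values into a batch dict and calculate_custom_map_metrics accepts the collected outputs and targets (both signatures are assumptions, not the actual implementation):

import torch
from torch.utils.data import DataLoader

def inference_with_map_sketch(model, test_dataset, image_processor,
                              device, batch_size=8):
    # collate_fn is the MammoMix helper referenced above (imported elsewhere)
    loader = DataLoader(test_dataset, batch_size=batch_size,
                        collate_fn=collate_fn)              # step 1
    model.eval()                                            # step 2
    all_outputs, all_targets = [], []
    with torch.no_grad():
        for batch in loader:
            pixel_values = batch["pixel_values"].to(device)
            all_outputs.append(model(pixel_values=pixel_values))
            all_targets.append(batch["labels"])
    return calculate_custom_map_metrics(                    # step 3
        all_outputs, all_targets, image_processor)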

get_eval_compute_metrics_fn

evaluation.py
from functools import partial

def get_eval_compute_metrics_fn(image_processor):
    # compute_metrics is defined elsewhere in evaluation.py
    return partial(
        compute_metrics, image_processor=image_processor,
        threshold=0.5, id2label={0: 'cancer'}
    )
The factory returns a partially applied version of compute_metrics with two fixed parameters:
Parameter   Value           Purpose
threshold   0.5             Confidence cutoff: boxes below this score are discarded before mAP accumulation
id2label    {0: 'cancer'}   Single-class mapping used by the image processor during post-processing
Pass the returned callable directly to Trainer(compute_metrics=...).

Bounding box conversion: YOLO → Pascal VOC

Ground-truth labels are stored and fed to YOLOS in YOLO format: (x_center, y_center, width, height) normalised to [0, 1]. Before computing IoU-based metrics, MammoMix converts all boxes to Pascal VOC format: (x_min, y_min, x_max, y_max) in absolute pixel coordinates.
evaluation.py
import torch
from transformers.image_transforms import center_to_corners_format

def convert_bbox_yolo_to_pascal(boxes, image_size):
    # (cx, cy, w, h) -> (x1, y1, x2, y2), still normalised to [0, 1]
    boxes = center_to_corners_format(boxes)
    height, width = image_size
    # scale normalised corners to absolute pixel coordinates
    boxes = boxes * torch.tensor([[width, height, width, height]])
    return boxes
The conversion runs for both targets and model predictions before they are passed to torchmetrics.
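
A quick worked example of the conversion on a 640×480 image (the box values are illustrative):

import torch

box = torch.tensor([[0.5, 0.5, 0.25, 0.5]])          # (cx, cy, w, h), normalised
print(convert_bbox_yolo_to_pascal(box, (480, 640)))  # image_size is (height, width)
# tensor([[240., 120., 400., 360.]])  -> (x_min, y_min, x_max, y_max) in pixels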

Output metrics

compute_metrics returns a dictionary filtered to keys that start with map:
{
    'map':        0.42,  # COCO-style mAP averaged over IoU thresholds 0.50–0.95
    'map_50':     0.71,  # mAP at IoU threshold 0.50
    'map_75':     0.38,  # mAP at IoU threshold 0.75
    'map_small':  0.09,  # mAP for objects with area < 32² px
    'map_medium': 0.35,  # mAP for objects with area 32²–96² px
    'map_large':  0.58,  # mAP for objects with area > 96² px
}
map_per_class is explicitly removed before returning because MammoMix is a single-class detector (cancer only). See Object detection metrics for a full explanation of each key.
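
A sketch of what this filtering step might look like, assuming the metrics come from torchmetrics' MeanAveragePrecision (the variable names are illustrative, not the actual implementation):

from torchmetrics.detection.mean_ap import MeanAveragePrecision

map_metric = MeanAveragePrecision(box_format="xyxy")
# ... map_metric.update(predictions, targets) once per batch ...
results = map_metric.compute()
results.pop("map_per_class", None)  # redundant for a single-class detector
metrics = {k: v.item() for k, v in results.items() if k.startswith("map")}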

ModelOutput dataclass

Post-processing via image_processor.post_process_object_detection requires a model output object with specific attributes. When running inference manually, MammoMix wraps raw tensors in a lightweight dataclass:
evaluation.py
from dataclasses import dataclass
import torch

@dataclass
class ModelOutput:
    logits: torch.Tensor    # [batch, num_queries, num_classes + 1]
    pred_boxes: torch.Tensor  # [batch, num_queries, 4] in YOLO format
This mirrors the shape of a real YolosObjectDetectionOutput and satisfies the image processor’s interface without importing the full model output class.
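
An illustrative call showing how the wrapper feeds into post-processing (tensor shapes and the target size are placeholders):

import torch

outputs = ModelOutput(
    logits=torch.randn(1, 100, 2),       # 1 class + "no object"
    pred_boxes=torch.rand(1, 100, 4),    # normalised (cx, cy, w, h)
)
results = image_processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=torch.tensor([[1024, 768]])
)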
