
evaluation.py provides mAP computation, HuggingFace Trainer-compatible metric callbacks, and a standalone inference loop for DETR-based mammography models.

ModelOutput

A lightweight dataclass that mirrors the output structure expected by AutoImageProcessor.post_process_object_detection. Use it to wrap raw logits and predicted boxes when calling the image processor outside a HuggingFace model forward pass.
from evaluation import ModelOutput
import torch

output = ModelOutput(
    logits=torch.zeros(1, 100, 2),
    pred_boxes=torch.zeros(1, 100, 4),
)
logits
torch.Tensor
Class logits produced by the detection head. Shape (B, num_queries, num_classes + 1) where the last dimension includes the no-object class.
pred_boxes
torch.Tensor
Predicted bounding boxes in YOLO normalised format (cx, cy, w, h). Shape (B, num_queries, 4).
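To illustrate how post-processing reads these two fields, here is a hedged sketch of standard DETR-style decoding (not MammoMix's exact post-processing code): per-query scores come from a softmax over the class dimension with the trailing no-object column dropped.

```python
import torch

# Sketch of standard DETR-style decoding of these fields (illustrative,
# not the exact MammoMix post-processing code).
logits = torch.zeros(1, 100, 2)       # (B, num_queries, num_classes + 1)
pred_boxes = torch.zeros(1, 100, 4)   # (B, num_queries, 4), YOLO-normalised

probs = logits.softmax(-1)[..., :-1]  # drop the trailing no-object class
scores, labels = probs.max(-1)        # best foreground class per query
print(scores.shape)                   # torch.Size([1, 100])
```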

convert_bbox_yolo_to_pascal

Converts bounding boxes from normalised YOLO format to absolute Pascal VOC coordinates.
from evaluation import convert_bbox_yolo_to_pascal
import torch

yolo_boxes = torch.tensor([[0.5, 0.5, 0.4, 0.3]])  # (cx, cy, w, h)
pascal_boxes = convert_bbox_yolo_to_pascal(yolo_boxes, image_size=(640, 640))
# tensor([[192., 224., 448., 416.]])  — (x_min, y_min, x_max, y_max)

Parameters

boxes
torch.Tensor
required
Bounding boxes in YOLO format (cx, cy, w, h) with values normalised to [0, 1]. Shape (N, 4).
image_size
Tuple[int, int]
required
Target image dimensions as (height, width) used to scale the normalised coordinates to absolute pixel values.

Returns

boxes
torch.Tensor
Bounding boxes in Pascal VOC format (x_min, y_min, x_max, y_max) in absolute pixel coordinates. Shape (N, 4).
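The conversion is simple enough to sketch. Below is a hypothetical re-implementation (function name and body assumed, not taken from the source): centre/size coordinates become corners, which are then scaled by the image dimensions. It reproduces the example above.

```python
import torch

def yolo_to_pascal(boxes: torch.Tensor, image_size: tuple) -> torch.Tensor:
    """Hypothetical re-implementation of convert_bbox_yolo_to_pascal."""
    height, width = image_size
    cx, cy, w, h = boxes.unbind(-1)
    # centre/size -> corners, still normalised to [0, 1]
    x_min, y_min = cx - w / 2, cy - h / 2
    x_max, y_max = cx + w / 2, cy + h / 2
    # scale normalised coordinates to absolute pixel values
    return torch.stack(
        [x_min * width, y_min * height, x_max * width, y_max * height], dim=-1
    )

boxes = yolo_to_pascal(torch.tensor([[0.5, 0.5, 0.4, 0.3]]), (640, 640))
print(boxes)  # (x_min, y_min, x_max, y_max) in pixels, approx (192, 224, 448, 416)
```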

compute_metrics

Computes mean average precision (mAP) and related detection metrics from HuggingFace EvalPrediction objects. Decorated with @torch.no_grad().
from evaluation import compute_metrics

metrics = compute_metrics(
    evaluation_results=eval_prediction,
    image_processor=image_processor,
    threshold=0.5,
    id2label={0: "cancer"},
    max_size=640,
)
# {"map": 0.42, "map_50": 0.71, "map_75": 0.38, ...}

Parameters

evaluation_results
EvalPrediction
required
HuggingFace EvalPrediction object with .predictions (batched model outputs) and .label_ids (batched ground-truth annotations) as populated by Trainer.evaluate().
image_processor
AutoImageProcessor
required
The same image processor used during training. Called with post_process_object_detection to convert raw model outputs to scored, filtered bounding boxes.
threshold
float
default:"0.0"
Confidence threshold for filtering predicted boxes before metric computation. Boxes with scores below this value are discarded.
id2label
dict
default:"None"
Mapping from integer class id to string label name (e.g. {0: "cancer"}). Passed through to the image processor’s post-processing step.
max_size
int
default:"640"
The spatial dimension used as both height and width when constructing target_sizes for post-processing. Should match the pad_size used during preprocessing.

Returns

metrics
Mapping[str, float]
Dictionary of mAP metrics with keys prefixed by "map". Common keys include "map", "map_50", "map_75", "map_small", "map_medium", and "map_large".
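The effect of the threshold parameter can be illustrated in isolation (a hypothetical sketch, not the function's internals): boxes whose confidence score falls below the threshold are dropped before any metric is computed.

```python
import torch

# Hypothetical illustration of the threshold parameter: scores below the
# cut-off are masked out, and only the surviving boxes reach mAP computation.
scores = torch.tensor([0.91, 0.12, 0.55])
boxes = torch.tensor([
    [10., 10., 50., 50.],
    [20., 20., 60., 60.],
    [30., 30., 70., 70.],
])
threshold = 0.5
keep = scores >= threshold
kept_boxes = boxes[keep]
print(kept_boxes.shape[0])  # 2
```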

get_eval_compute_metrics_fn

Returns a functools.partial of compute_metrics pre-configured for MammoMix’s single-class cancer detection task. Pass the result directly as the compute_metrics argument to Trainer.
from evaluation import get_eval_compute_metrics_fn
from transformers import Trainer, TrainingArguments

compute_metrics_fn = get_eval_compute_metrics_fn(image_processor)

trainer = Trainer(
    model=model,
    args=TrainingArguments(...),
    compute_metrics=compute_metrics_fn,
)

Parameters

image_processor
AutoImageProcessor
required
Image processor used to post-process predictions inside compute_metrics.

Returns

fn
Callable[[EvalPrediction], Mapping[str, float]]
A partial of compute_metrics with threshold=0.5 and id2label={0: "cancer"} already bound. Accepts a single EvalPrediction argument.
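The factory's construction can be sketched with functools.partial (the compute_metrics stand-in below only mimics the signature documented above; the real binding may differ in detail): the single-class defaults are bound once, leaving only the EvalPrediction argument free for Trainer to supply.

```python
from functools import partial

def compute_metrics(evaluation_results, image_processor=None,
                    threshold=0.0, id2label=None, max_size=640):
    """Stand-in for evaluation.compute_metrics (signature assumed from above)."""
    ...

def get_eval_compute_metrics_fn(image_processor):
    # Likely construction: pre-bind the single-class cancer-detection defaults,
    # so the result is directly usable as Trainer's compute_metrics argument.
    return partial(
        compute_metrics,
        image_processor=image_processor,
        threshold=0.5,
        id2label={0: "cancer"},
    )

fn = get_eval_compute_metrics_fn(image_processor="processor-stub")
print(fn.keywords["threshold"], fn.keywords["id2label"])  # 0.5 {0: 'cancer'}
```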

calculate_custom_map_metrics

An alternative mAP implementation that processes raw model output objects directly, bypassing the EvalPrediction serialisation path used by compute_metrics. Useful when evaluating outside Trainer or when the standard path encounters tensor-shape issues.
from evaluation import calculate_custom_map_metrics

metrics = calculate_custom_map_metrics(
    predictions=all_predictions,
    targets=all_targets,
    image_processor=image_processor,
    device=device,
    max_size=640,
)
All tensors are moved to CPU before computing metrics for compatibility with torchmetrics. Returns zeroed metrics as a fallback if any exception is raised.

Parameters

predictions
list
required
List of model output objects. Each element must expose .logits and .pred_boxes attributes (e.g. instances of ModelOutput or native HuggingFace model outputs).
targets
list[dict]
required
List of ground-truth annotation dicts, each with keys "boxes" (YOLO format Tensor) and "class_labels" (Tensor).
image_processor
AutoImageProcessor
required
Image processor used to post-process predictions via post_process_object_detection with threshold=0.5.
device
torch.device
required
The device on which intermediate tensors are created. Final metric computation is performed on CPU.
max_size
int
default:"640"
Image spatial dimension used as target size during post-processing.

Returns

metrics
dict[str, float]
mAP metrics dict with the same keys as compute_metrics ("map", "map_50", "map_75", "map_small", "map_medium", "map_large"). All values are Python float. Returns all zeros on error.
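Predictions and targets are typically accumulated during a manual inference loop before calling the function. A hedged sketch of inputs with the expected shapes (ModelOutput redefined locally for illustration; one image, 100 queries, one foreground class plus the no-object channel):

```python
import torch
from dataclasses import dataclass

@dataclass
class ModelOutput:
    logits: torch.Tensor
    pred_boxes: torch.Tensor

# Hypothetical one-batch accumulation matching the shapes documented above.
all_predictions = [
    ModelOutput(logits=torch.randn(1, 100, 2), pred_boxes=torch.rand(1, 100, 4))
]
all_targets = [
    {"boxes": torch.tensor([[0.5, 0.5, 0.2, 0.2]]),  # YOLO-normalised (cx, cy, w, h)
     "class_labels": torch.tensor([0])}
]
```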

run_model_inference_with_map

End-to-end inference loop: iterates over a test dataset, collects model predictions, and returns mAP metrics. This is the primary entry point for evaluating a trained checkpoint on a held-out split.
from evaluation import run_model_inference_with_map

metrics = run_model_inference_with_map(
    model=model,
    test_dataset=test_dataset,
    image_processor=image_processor,
    device=device,
    batch_size=8,
)
print(metrics)
# {"map": 0.45, "map_50": 0.73, ...}

Parameters

model
nn.Module
required
A trained HuggingFace detection model (e.g. AutoModelForObjectDetection). The function calls model.eval() before running inference and wraps the loop in torch.no_grad().
test_dataset
BreastCancerDataset
required
A BreastCancerDataset instance initialised with split="test". Wrapped in a DataLoader internally.
image_processor
AutoImageProcessor
required
Image processor forwarded to calculate_custom_map_metrics for post-processing.
device
torch.device
required
Device to move pixel_values and label tensors to before the forward pass.
batch_size
int
default:"8"
Number of images per inference batch.

Returns

metrics
dict[str, float]
mAP metrics dict returned by calculate_custom_map_metrics, aggregated over the full test set.
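The shape of the loop can be sketched as follows. This is a hedged approximation of what the function likely does internally (the dataset and model here are stand-ins, and the final call to calculate_custom_map_metrics is omitted), not the source code itself.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

@torch.no_grad()
def inference_loop(model, dataset, device, batch_size=8):
    """Hedged sketch of run_model_inference_with_map's inner loop;
    metric computation via calculate_custom_map_metrics is omitted."""
    model.eval()
    loader = DataLoader(dataset, batch_size=batch_size)
    outputs = []
    for batch in loader:
        pixel_values = batch["pixel_values"].to(device)
        outputs.append(model(pixel_values))  # collect raw per-batch outputs
    return outputs

# Stand-in dataset and model, just to exercise the batching behaviour.
dataset = [{"pixel_values": torch.zeros(3, 32, 32)} for _ in range(4)]
outputs = inference_loop(nn.Identity(), dataset, torch.device("cpu"), batch_size=2)
print(len(outputs))  # 2 batches of 2 images each
```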
