Train Deformable DETR for mammography detection

Deformable DETR (SenseTime/deformable-detr) extends the original DETR architecture with multi-scale deformable attention, making it especially effective for detecting small, diffuse lesions in high-resolution mammograms. Because the model is significantly more memory-intensive than YOLOS, the train_detrd.py script uses a dedicated training configuration: batch_size=1 with gradient_accumulation_steps=32 to reach an effective batch size of 32, disabled mixed precision (fp16=False) for numerical stability, and gradient norm clipping at 5.0 to handle the long attention spans.

Quickstart

python train_detrd.py --config configs/config_d_detr.yaml --dataset CSAW

python train_detrd.py --config configs/config_d_detr.yaml --dataset CSAW

Overriding the epoch count

python train_detrd.py --config configs/config_d_detr.yaml --dataset CSAW --epoch 100

Deformable DETR requires at least 16 GB of VRAM. Training with the default settings (max_size=800, batch_size=1, gradient_accumulation_steps=32) was validated on a single A100-80 GB. On smaller GPUs, reduce dataset.max_size to 640 and verify that CUDA does not OOM during the first evaluation step.

Memory optimization strategy

The pipeline explicitly trades per-step throughput for memory headroom:

Parameter	Value	Effect
`per_device_train_batch_size`	`1`	Minimum VRAM per step
`gradient_accumulation_steps`	`32`	Effective batch size = 32
`fp16`	`False`	Avoids NaN losses in deformable attention
`gradient_checkpointing`	`False`	Disabled to avoid recomputation overhead
`max_grad_norm`	`5.0`	Clips exploding gradients
`dataloader_num_workers`	`0`	Prevents shared-memory conflicts with large images

Gradient checkpointing is intentionally disabled. Deformable DETR’s multi-scale attention layers are memory-heavy, but recomputing them during the backward pass introduces enough latency that training wall-clock time can increase by 40–60 % with only modest VRAM savings. The batch_size=1 + accumulation strategy achieves a better trade-off.

Key differences from YOLOS training

The table below summarizes where train_detrd.py diverges from the YOLOS pipeline in train.py:

Setting	YOLOS (`train.py`)	Deformable DETR (`train_detrd.py`)
Learning rate	`0.0001`	`0.0005`
Mixed precision	`fp16=True` (auto)	`fp16=False`
Best-model metric	`eval_map_50`	`eval_loss`
`greater_is_better`	`True`	`False`
Logging strategy	`epoch`	`steps` (every 10 steps)
`save_total_limit`	`1`	`2`

The higher learning rate (0.0005) is intentional: Deformable DETR’s deformable attention modules need a larger gradient signal to adapt the sampling offsets from their ImageNet-pretrained initialization to the narrow distribution of mammography lesions. Using eval_loss as the best-model metric (rather than eval_map_50) is a practical choice. Because the Deformable DETR trainer does not attach the custom mAP compute_metrics function during training (the compute_metrics line is commented out in train_detrd.py:189), validation mAP is computed separately after training using calculate_custom_map_metrics. Tracking eval_loss ensures the best checkpoint is still selected automatically.

What happens during training

Dataset and processor loading

BreastCancerDataset is loaded for train and val splits, identical to the YOLOS pipeline. The image processor uses max_size=800 by default (configurable via dataset.max_size), which is the standard resolution for Deformable DETR.

Model loading

load_deformable_detr_model calls AutoModelForObjectDetection.from_pretrained('SenseTime/deformable-detr') with id2label={0: 'cancer'}. If a deformable_detr section is present in the config, num_queries is read from it (default 300). If loading fails with the custom config, the function falls back to a minimal configuration automatically.

Training

Trainer.train() runs with the memory-optimized arguments. Loss and learning rate are logged to W&B every 10 steps. Checkpoints are saved at the end of each epoch, keeping the best 2 by eval_loss.

Model saving

After training the best checkpoint is saved to:

../deformable_detr_{DATASET_NAME}_{DDMMYY}

Custom mAP evaluation

calculate_custom_map_metrics runs inference over the entire test split and computes mAP using torchvision’s box_iou. Results are printed to stdout:

Custom Test mAP Results:
------------------------------
map: 0.381
map_50: 0.612
map_75: 0.344

Model output path

../deformable_detr_{DATASET_NAME}_{DDMMYY}

For example, a DMID run completed on 12 May 2026 saves to:

../deformable_detr_DMID_120526/

Get Started

Concepts

Training

Evaluation & Inference

Data Pipeline

Train Deformable DETR for mammography detection

Quickstart

Overriding the epoch count

Memory optimization strategy

Key differences from YOLOS training

What happens during training

Model output path

Build docs developers (and LLMs) love

Get Started

Concepts

Training

Evaluation & Inference

Data Pipeline

Documentation Index

​Quickstart

​Overriding the epoch count

​Memory optimization strategy

​Key differences from YOLOS training

​What happens during training

​Model output path

Build docs developers (and LLMs) love

Quickstart

Overriding the epoch count

Memory optimization strategy

Key differences from YOLOS training

What happens during training

Model output path