TorchGeo implements a variety of model architectures optimized for remote sensing tasks. Models are organized by their primary use case: classification backbones, segmentation, change detection, and foundation models.
Classification Backbones
Pre-trained encoders suitable for transfer learning and feature extraction.
ResNet
Residual Networks for image classification and feature extraction.
Available Variants:
- resnet18: 18-layer ResNet (11.7M parameters)
- resnet50: 50-layer ResNet (25.6M parameters)
- resnet152: 152-layer ResNet (60.2M parameters)
Key Features:
- Residual connections for training deep networks
- Multiple pre-trained weights for different sensors
- Support for arbitrary input channels via the in_chans parameter
- Based on the timm implementation
Vision Transformer (ViT)
Transformer-based architecture for image classification.
Available Variants:
- vit_small_patch16_224: Small ViT with 16x16 patches (22M parameters)
- vit_base_patch16_224: Base ViT with 16x16 patches (86M parameters)
- vit_large_patch16_224: Large ViT with 16x16 patches (304M parameters)
- vit_huge_patch14_224: Huge ViT with 14x14 patches (632M parameters)
- vit_small_patch14_dinov2: Small DINOv2 ViT with 14x14 patches
- vit_base_patch14_dinov2: Base DINOv2 ViT with 14x14 patches
Key Features:
- Pure transformer architecture without convolutions
- Self-attention mechanisms for global context
- Extensive pre-trained weights from SSL4EO-S12 and SSL4EO-L
- Support for MAE, DINO, MoCo, and other SSL methods
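
The variant names above encode the patch size and the input resolution, which together determine how many tokens the transformer processes. The following is a minimal sketch of that arithmetic, assuming square inputs that divide evenly into patches and ignoring any class token; `num_patch_tokens` is a hypothetical helper, not part of TorchGeo.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Count the non-overlapping square patches that tile a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# Patch size and resolution are read off the variant names listed above.
for name, image_size, patch_size in [
    ("vit_base_patch16_224", 224, 16),  # 14 x 14 = 196 tokens
    ("vit_huge_patch14_224", 224, 14),  # 16 x 16 = 256 tokens
]:
    print(name, num_patch_tokens(image_size, patch_size))
```

Smaller patches yield more tokens for the same image, which raises the cost of self-attention.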
Swin Transformer
Hierarchical vision transformer with shifted windows.
Available Variants:
- swin_t: Tiny Swin Transformer
- swin_s: Small Swin Transformer
- swin_b: Base Swin Transformer
- swin_v2_t: Swin Transformer V2 Tiny
- swin_v2_b: Swin Transformer V2 Base
Key Features:
- Hierarchical feature maps at multiple scales
- Shifted window attention for efficiency
- Pre-trained on SatlasPretrain dataset
- Support for both RGB and multispectral inputs
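
To make "hierarchical feature maps at multiple scales" concrete, the sketch below computes the spatial resolution of each stage. It assumes the configuration of the original Swin Transformer design (patch size 4, four stages, 2x patch merging between stages); the TorchGeo variants may be configured differently, and `swin_stage_resolutions` is a hypothetical helper.

```python
def swin_stage_resolutions(
    image_size: int = 224, patch_size: int = 4, num_stages: int = 4
) -> list[tuple[int, int]]:
    """Return the (height, width) of the token grid at each stage."""
    side = image_size // patch_size  # token grid after patch embedding
    resolutions = []
    for _ in range(num_stages):
        resolutions.append((side, side))
        side //= 2  # patch merging halves each spatial dimension
    return resolutions


print(swin_stage_resolutions())
# [(56, 56), (28, 28), (14, 14), (7, 7)]
```

The coarse-to-fine pyramid is what lets a Swin encoder feed decoders that expect multi-scale features.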
Foundation Models
Large-scale models trained on diverse geospatial data with specialized capabilities.
DOFA (Dynamic One-For-All)
A dynamic model that adapts to any number of spectral bands via wavelength-conditioned convolutions.
Available Variants:
- dofa_small_patch16_224: Small DOFA (22M parameters)
- dofa_base_patch16_224: Base DOFA (86M parameters)
- dofa_large_patch16_224: Large DOFA (304M parameters)
- dofa_huge_patch14_224: Huge DOFA (632M parameters)
Key Features:
- Dynamic channel adaptation: Works with any spectral bands by conditioning on wavelengths
- Trained on SatlasPretrain, Five-Billion-Pixels, and HySpecNet-11k
- Transformer architecture with dynamic weight generator
- Pre-trained with MAE (Masked Autoencoding)
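
The idea behind wavelength conditioning can be illustrated with a toy example: if the projection weights for each band are generated from that band's wavelength, the same model accepts any number of bands and still produces a fixed-size embedding. This is only a conceptual sketch. DOFA's generator is a learned network, whereas the fixed sinusoidal function here is a stand-in, and `wavelength_embedding` and `embed_pixel` are hypothetical names.

```python
import math


def wavelength_embedding(wavelength_um: float, dim: int) -> list[float]:
    """Toy stand-in for a learned weight generator: map a wavelength to dim weights."""
    return [
        math.sin(wavelength_um * (k + 1)) if k % 2 == 0 else math.cos(wavelength_um * (k + 1))
        for k in range(dim)
    ]


def embed_pixel(pixel: list[float], wavelengths_um: list[float], dim: int = 4) -> list[float]:
    """Project a pixel with any number of bands to a fixed-size vector."""
    if len(pixel) != len(wavelengths_um):
        raise ValueError("one wavelength per band is required")
    out = [0.0] * dim
    for value, wavelength in zip(pixel, wavelengths_um):
        weights = wavelength_embedding(wavelength, dim)  # weights depend only on wavelength
        for k in range(dim):
            out[k] += value * weights[k]
    return out


# Approximate Sentinel-2 central wavelengths in micrometres (red, green, blue, NIR).
rgb = embed_pixel([0.20, 0.30, 0.10], [0.665, 0.560, 0.490])
rgbn = embed_pixel([0.20, 0.30, 0.10, 0.45], [0.665, 0.560, 0.490, 0.842])
print(len(rgb), len(rgbn))  # both embeddings have the same length
```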
Presto
Pretrained Remote Sensing Transformer for Sentinel-1/2 time series.
Key Features:
- Temporal transformer for satellite image time series
- Encoder-decoder architecture with masked token prediction
- Multi-modal: Sentinel-1 SAR + Sentinel-2 optical
- Includes auxiliary inputs: Dynamic World labels, lat/lon, month
- Pre-trained on LEM (Presto pretraining dataset)
Band Groups:
- S1: Sentinel-1 (VV, VH)
- S2_RGB, S2_Red_Edge, S2_NIR, S2_SWIR: Sentinel-2 bands
- ERA5: Climate reanalysis data
- SRTM: Elevation data
- NDVI: Vegetation index
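
NDVI is derived from the optical bands rather than measured directly. The sketch below uses the standard definition, (NIR - Red) / (NIR + Red); pairing Sentinel-2 band B8 with B4 is the usual convention and is an assumption here, since this page does not state how Presto computes the index. `ndvi` is a hypothetical helper.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    denominator = nir + red
    if denominator == 0:
        return 0.0  # avoid division by zero for empty or no-data pixels
    return (nir - red) / denominator


# Dense vegetation reflects strongly in the NIR and weakly in the red.
print(round(ndvi(nir=0.5, red=0.1), 4))
```

Values near 1 suggest dense green vegetation, while values near 0 or below are typical of bare soil, water, or built-up surfaces.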
CopernicusFM
Copernicus Foundation Model for multi-temporal satellite imagery.
Available Variants:
- copernicusfm_base: Base CopernicusFM model
Key Features:
- Multi-temporal Sentinel-2 processing
- Foundation model trained on Copernicus data
- Supports various downstream tasks
ScaleMAE
Scale-aware Masked Autoencoder for multi-resolution imagery.
Available Variants:
- scalemae_large_patch16: Large ScaleMAE with patch size 16
Key Features:
- Multi-scale masked autoencoding
- Handles images at different spatial resolutions
- Vision transformer backbone
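
Masked autoencoding hides a random subset of patches from the encoder and trains a decoder to reconstruct them. The sketch below shows only the generic masking step, assuming the 75% mask ratio commonly used for masked autoencoders; it does not model ScaleMAE's scale-aware components, and `random_patch_mask` is a hypothetical helper.

```python
import random


def random_patch_mask(num_patches: int, mask_ratio: float = 0.75, seed: int = 0) -> list[bool]:
    """Return a mask in which True marks a patch hidden from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    rng = random.Random(seed)  # seeded for reproducibility
    masked = set(rng.sample(range(num_patches), num_masked))
    return [index in masked for index in range(num_patches)]


# 196 patches corresponds to a 224x224 image with 16x16 patches.
mask = random_patch_mask(196)
print(sum(mask), "of", len(mask), "patches masked")
# 147 of 196 patches masked
```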
Other Foundation Models
CROMA (Contrastive Radar-Optical Masked Autoencoders):
- croma_base: Base CROMA model
- croma_large: Large CROMA model
- Multi-modal contrastive learning
aurora_swin_unet: Swin-UNet for weather forecasting
panopticon_vitb14: Vision transformer for global monitoring
earthloc: Model for geographic location prediction
tessera: Tessellated earth observation model
tilenet: Tile-based representation learning
Segmentation Models
Dense prediction models for pixel-wise classification tasks.
U-Net
U-shaped encoder-decoder architecture for semantic segmentation.
Key Features:
- Encoder-decoder with skip connections
- Multiple encoder backbones (EfficientNet, ResNet, etc.)
- Pre-trained weights for field boundary detection
- Based on segmentation_models_pytorch (smp)
Pre-trained Weights:
- Field boundary detection (2-class and 3-class)
- Various EfficientNet encoders (B3, B5, B7)
- Commercial and non-commercial licenses
FCN (Fully Convolutional Network)
Simple fully convolutional architecture for semantic segmentation.
Key Features:
- 5-layer fully convolutional architecture
- LeakyReLU activations
- Lightweight and fast
- Customizable number of filters
FarSeg
Foreground-Aware Relation Network for object segmentation.
Key Features:
- ResNet backbone with FPN (Feature Pyramid Network)
- Foreground-scene relation module
- Designed for building, road, ship segmentation
- Can be extended for change detection
Change Detection Models
Models specialized for detecting changes between bi-temporal images.
ChangeStar
Change detection model combining segmentation and change prediction.
Key Features:
- Combines semantic segmentation with change detection
- ChangeMixin module for binary change prediction
- Bi-directional change detection (t1→t2 and t2→t1)
- Architecture reusability: works with any segmentation backbone
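
Bi-directional change detection means the change head sees the two feature sets in both orders. The sketch below shows that data flow with toy stand-ins: a shared encoder runs on each date, and the head is applied to the concatenation (t1, t2) and again to (t2, t1). The encoder and head here are hypothetical placeholders, not the ChangeMixin implementation.

```python
from typing import Callable


def bidirectional_change(
    encode: Callable[[list[float]], list[float]],
    change_head: Callable[[list[float]], float],
    image_t1: list[float],
    image_t2: list[float],
) -> dict[str, float]:
    """Apply one change head to both temporal orderings of shared-encoder features."""
    features_t1 = encode(image_t1)  # the same encoder is used for both dates
    features_t2 = encode(image_t2)
    return {
        "t1->t2": change_head(features_t1 + features_t2),
        "t2->t1": change_head(features_t2 + features_t1),
    }


def toy_encoder(image: list[float]) -> list[float]:
    """Hypothetical encoder: summarize an image by its mean and maximum."""
    return [sum(image) / len(image), max(image)]


def toy_head(pair: list[float]) -> float:
    """Hypothetical order-sensitive head: second half minus first half, summed."""
    half = len(pair) // 2
    return sum(later - earlier for earlier, later in zip(pair[:half], pair[half:]))


scores = bidirectional_change(toy_encoder, toy_head, [0.1, 0.2, 0.3], [0.4, 0.8, 0.6])
print(scores)
```

Because any encoder can be passed in, the same pattern works with whichever segmentation backbone produces the features.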
ChangeStarFarSeg
Pre-configured ChangeStar with FarSeg backbone.
FCSiamDiff / FCSiamConc
Siamese fully convolutional networks for change detection.
Variants:
- FCSiamDiff: Difference-based fusion
- FCSiamConc: Concatenation-based fusion
Key Features:
- Siamese architecture with shared weights
- Process bi-temporal images
- Lightweight and efficient
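
The two variants differ only in how features from the two dates are fused. The following is a minimal sketch of both fusion strategies on flat feature lists, assuming a toy shared encoder in place of the real convolutional one; `encode`, `fuse_diff`, and `fuse_conc` are hypothetical names.

```python
def encode(image: list[float]) -> list[float]:
    """Toy shared encoder: the same function (same weights) runs on both dates."""
    return [2.0 * value for value in image]


def fuse_diff(features_t1: list[float], features_t2: list[float]) -> list[float]:
    """Difference-based fusion: element-wise absolute difference, same size as the input."""
    return [abs(a - b) for a, b in zip(features_t1, features_t2)]


def fuse_conc(features_t1: list[float], features_t2: list[float]) -> list[float]:
    """Concatenation-based fusion: keep both feature sets, doubling the size."""
    return features_t1 + features_t2


features_t1 = encode([0.1, 0.5, 0.9])
features_t2 = encode([0.1, 0.7, 0.2])
print(len(fuse_diff(features_t1, features_t2)), len(fuse_conc(features_t1, features_t2)))
# 3 6
```

Difference fusion keeps the decoder narrower, while concatenation preserves the individual features at the cost of more channels.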
ChangeViT
Vision transformer-based change detection model.
Time Series Models
Models for temporal sequence processing.
ConvLSTM
Convolutional LSTM for spatiotemporal sequence modeling.
Key Features:
- Combines CNN and LSTM for spatiotemporal patterns
- Processes sequences of images
- Maintains spatial structure through convolutions
LTAE
Lightweight Temporal Attention Encoder.
Key Features:
- Attention-based temporal encoding
- Lightweight and efficient
- Designed for satellite time series
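
Attention-based temporal encoding collapses a sequence of per-date features into one vector by weighting each date according to its relevance. The sketch below is a simplified single-head version, assuming a fixed query vector and no positional encoding; it illustrates the pooling mechanism only and is not the LTAE implementation. `temporal_attention_pool` is a hypothetical helper.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def temporal_attention_pool(
    sequence: list[list[float]], query: list[float]
) -> tuple[list[float], list[float]]:
    """Pool T feature vectors into one using scaled dot-product attention weights."""
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, step)) / math.sqrt(dim) for step in sequence
    ]
    weights = softmax(scores)
    pooled = [
        sum(weight * step[k] for weight, step in zip(weights, sequence))
        for k in range(dim)
    ]
    return pooled, weights


# Three acquisition dates with two features each; the last date matches the query best.
pooled, weights = temporal_attention_pool(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], query=[1.0, 1.0]
)
print(pooled, weights)
```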
Other Models
RCF / MOSAIKS
Random Convolutional Features for large-scale geospatial analysis.
BTC
Behavioral cloning model for trajectory prediction.
Model Selection Guide
For Image Classification
Small datasets (under 10k images):
- ResNet18/50 with pre-trained weights
- ViT Small with MAE pre-training
Large datasets:
- ResNet152 or ViT Large/Huge
Multi-spectral data:
- DOFA for multi-sensor/multi-spectral
Time series:
- Presto for Sentinel-1/2 time series
- ConvLSTM or LTAE
For Semantic Segmentation
General segmentation:
- U-Net with EfficientNet encoder
- FarSeg for object-aware segmentation
- U-Net with FTW pre-trained weights
- Swin Transformer with Satlas weights
For Change Detection
Binary change:
- ChangeStar with FarSeg
- FCSiamDiff/FCSiamConc
- ChangeStar for multi-class change
- ChangeViT for transformer-based change detection
For Multi-Sensor Fusion
Any spectral bands:
- DOFA (wavelength-conditioned)
- Presto (temporal)
- Separate encoders with late fusion
For Transfer Learning
Best pre-trained weights:
- ResNet50: SENTINEL2_ALL_MOCO or SENTINEL2_ALL_DINO
- ViT: SENTINEL2_ALL_MAE or SENTINEL2_ALL_FGMAE
- DOFA: DOFA_MAE for any bands
- FTW weights for agricultural fields
- Satlas weights for global mapping
- SSL4EO-L for Landsat applications