Convolutional neural networks for image classification
Convolutional Neural Networks (CNNs) are the backbone of modern image classification. Unlike fully-connected networks, CNNs exploit spatial structure through weight sharing and local connectivity, making them far more parameter-efficient on image data.
A convolution slides a small filter (kernel) across the input, computing a dot product at each position:

    output[i, j] = Σ_{m,n} input[i + m, j + n] · kernel[m, n]

Key hyperparameters: kernel size, stride, padding, and number of filters (channels).
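A minimal NumPy sketch of the formula above (strictly speaking this is cross-correlation, which is what deep-learning frameworks call "convolution" since the kernel is never flipped). The `conv2d` helper and sample arrays are illustrative, not part of the original text:

```python
import numpy as np

def conv2d(inp, kernel, stride=1):
    """Valid cross-correlation of a 2D input with a 2D kernel."""
    kh, kw = kernel.shape
    oh = (inp.shape[0] - kh) // stride + 1
    ow = (inp.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Dot product between the kernel and the current patch
            patch = inp[i * stride : i * stride + kh,
                        j * stride : j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((2, 2))
print(conv2d(x, k).shape)  # (3, 3): a 4x4 input with a 2x2 kernel, stride 1
```

Note how the output shrinks to (H − kh)/stride + 1 per dimension; padding the input restores the original size when needed.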
Pooling reduces spatial dimensions while retaining dominant features. Max pooling selects the largest value in each local window, making representations more robust to small translations.
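A small NumPy sketch of 2×2 max pooling with stride 2; the `max_pool2d` helper and sample input are illustrative assumptions, not from the original text:

```python
import numpy as np

def max_pool2d(inp, size=2, stride=2):
    """Max pooling: keep the largest value in each local window."""
    oh = (inp.shape[0] - size) // stride + 1
    ow = (inp.shape[1] - size) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = inp[i * stride : i * stride + size,
                            j * stride : j * stride + size].max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 0],
              [7, 2, 9, 8],
              [0, 1, 3, 5]], dtype=float)
print(max_pool2d(x))  # [[6. 4.]
                      #  [7. 9.]]
```

Shifting the input by one pixel leaves most window maxima unchanged, which is why pooled representations tolerate small translations.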
Training from scratch requires large datasets. Transfer learning repurposes a model pretrained on ImageNet (1.2 M images, 1000 classes) by replacing only the final classification head.
import torch
import torch.nn as nn
import torchvision.transforms as transforms
from torchvision import datasets, models

num_classes = 10  # set to the number of classes in your dataset

# Load pretrained model and swap in a new classification head
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(512, num_classes)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
Freeze the backbone layers initially (param.requires_grad = False for all but model.fc), train the head for a few epochs, then unfreeze and fine-tune end-to-end with a lower learning rate (e.g., 1e-5).