Convolutional Neural Networks (CNNs) are the backbone of modern image classification. Unlike fully-connected networks, CNNs exploit spatial structure through weight sharing and local connectivity, making them far more parameter-efficient on image data.

CNN architecture

A typical CNN consists of three stages:

1. Convolutional layers

A convolution slides a small filter (kernel) across the input, computing a dot product at each position:

$$\text{output}[i,j] = \sum_{m,n} \text{input}[i+m,\, j+n] \cdot \text{kernel}[m,n]$$

Key hyperparameters: kernel size, stride, padding, and number of filters (output channels).
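The formula above can be sketched in plain Python. This is a minimal "valid" convolution (no padding, stride 1), just to make the index arithmetic concrete; real frameworks use optimized implementations.

```python
def conv2d(inp, kernel):
    """Valid cross-correlation: output[i][j] = sum over m,n of inp[i+m][j+n] * kernel[m][n]."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(inp) - kh + 1, len(inp[0]) - kw + 1
    return [[sum(inp[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(ow)]
            for i in range(oh)]

# A 3x3 input with a 2x2 kernel yields a 2x2 output (no padding, stride 1)
image  = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(conv2d(image, kernel))  # [[-4, -4], [-4, -4]]
```

Note how the output shrinks from 3×3 to 2×2; padding exists precisely to preserve spatial size when desired.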

2. Pooling layers

Pooling reduces spatial dimensions while retaining dominant features. Max pooling selects the largest value in each local window, making representations more robust to small translations.
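A minimal sketch of non-overlapping max pooling in plain Python (window size equals stride, as in the `MaxPool2d(2)` layers used later):

```python
def max_pool2d(inp, window=2):
    """Keep the largest value in each non-overlapping window x window tile."""
    return [[max(inp[i + m][j + n]
                 for m in range(window) for n in range(window))
             for j in range(0, len(inp[0]), window)]
            for i in range(0, len(inp), window)]

fmap = [[1, 3, 2, 0],
        [5, 6, 1, 2],
        [7, 2, 9, 4],
        [0, 1, 3, 8]]
print(max_pool2d(fmap))  # [[6, 2], [7, 9]]
```

A 4×4 map becomes 2×2: each output entry survives small shifts of the dominant value within its window, which is the translation robustness mentioned above.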

3. Fully connected layers

After several conv+pool blocks, the feature maps are flattened and passed through standard dense layers to produce class logits.
Input (H×W×3)
  → Conv + ReLU  → Feature maps
  → MaxPool      → Reduced maps
  → Conv + ReLU
  → MaxPool
  → Flatten
  → Linear + ReLU
  → Linear (num_classes)
  → Softmax

Training a CNN with PyTorch

Step 1: Prepare data loaders

Use torchvision.datasets and DataLoader to load and batch your images with on-the-fly augmentations.
import torchvision.transforms as transforms
from torchvision import datasets
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

train_dataset = datasets.ImageFolder('data/train', transform=transform)
train_loader  = DataLoader(train_dataset, batch_size=32, shuffle=True)

Step 2: Define the model

Build a custom CNN or load a pretrained architecture from torchvision.models.
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 * 56 * 56, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)
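
The `64 * 56 * 56` in the first `Linear` layer is not arbitrary: assuming 224×224 inputs (matching the `Resize` above), each `padding=1`, 3×3 conv preserves H and W, while each `MaxPool2d(2)` halves them. A quick arithmetic check:

```python
# Two conv+pool blocks: convs keep H and W, each pool halves them
h = w = 224
for _ in range(2):
    h //= 2
    w //= 2
flat_features = 64 * h * w  # 64 channels after the second conv
print(h, w, flat_features)  # 56 56 200704
```

Feeding inputs of a different size (or adding a pool stage) changes this number, so the `Linear` layer must be updated to match.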

Step 3: Set loss and optimizer

Cross-entropy loss is standard for multi-class classification. Adam is a reliable default optimizer.
import torch.nn as nn
import torch.optim as optim

model     = SimpleCNN(num_classes=10).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

Step 4: Run the training loop

Iterate over epochs, perform forward and backward passes, and update weights.
import torch

num_epochs = 10  # number of passes over the training set; tune for your task

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.cuda(), labels.cuda()

        optimizer.zero_grad()
        outputs = model(images)
        loss    = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {running_loss/len(train_loader):.4f}")

Transfer learning with pretrained models

Training from scratch requires large datasets. Transfer learning repurposes a model pretrained on ImageNet (1.2 M images, 1000 classes) by replacing only the final classification head.
import torch
import torch.nn as nn
import torchvision.transforms as transforms
from torchvision import datasets, models

# Load pretrained model (downloads ImageNet weights on first use)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Replace the classification head to match your dataset
num_classes = 10  # match your dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)  # in_features is 512 for ResNet-18

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
Freeze the backbone layers initially (param.requires_grad = False for all but model.fc), train the head for a few epochs, then unfreeze and fine-tune end-to-end with a lower learning rate (e.g., 1e-5).

Common pretrained architectures

| Model | Top-1 accuracy (ImageNet) | Parameters | Notes |
|---|---|---|---|
| ResNet-18 | 69.8% | 11 M | Fast, good baseline |
| ResNet-50 | 76.1% | 25 M | Strong general-purpose model |
| VGG-16 | 71.6% | 138 M | Simple architecture, large |
| EfficientNet-B0 | 77.1% | 5.3 M | Best accuracy/size trade-off |
| MobileNetV3 | 74.0% | 5.4 M | Optimized for edge devices |

Evaluation

After training, evaluate on a held-out test set:
model.eval()
correct = total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.cuda(), labels.cuda()
        outputs = model(images)
        _, predicted = outputs.max(1)
        total   += labels.size(0)
        correct += predicted.eq(labels).sum().item()

print(f"Test accuracy: {100 * correct / total:.2f}%")

Resources

Exercise E05: CNN Training

Hands-on exercise: train a CNN from scratch in Google Colab.

VisionColab: Image Classification

Collection of CNN examples and notebooks from the course.

Video: CNN Lecture (2021)

Recorded lecture covering CNN architecture and training.

Video: Complementary CNN

Additional video resource on convolutional neural networks.
