Generative models learn the underlying data distribution and can synthesize new, realistic samples. The two dominant paradigms in computer vision are Generative Adversarial Networks (GANs) and Diffusion Models.

GAN fundamentals

A GAN consists of two networks trained in opposition:
  • Generator $G$: maps random noise $\mathbf{z} \sim p_z$ to a synthetic image $G(\mathbf{z})$.
  • Discriminator $D$: classifies inputs as real (from the dataset) or fake (from $G$).

Training objective (minimax game)

$$\min_G \max_D \; \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D(G(\mathbf{z})))]$$

At the Nash equilibrium, $G$ produces samples indistinguishable from real data and $D$ outputs $\frac{1}{2}$ everywhere.
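As a quick sanity check on the equilibrium claim, we can evaluate the objective when $D$ outputs $\frac{1}{2}$ on every input: both expectation terms reduce to $\log\frac{1}{2}$, so the value of the game is $-2\log 2$.

```python
import math

# With D(x) = 1/2 everywhere, both terms of the minimax objective
# evaluate to log(1/2), giving the equilibrium value -2*log(2).
value_at_equilibrium = math.log(0.5) + math.log(1 - 0.5)
print(value_at_equilibrium)  # ≈ -1.3863
```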

GAN training dynamics

1. Train the discriminator: sample a minibatch of real images and a minibatch of generated images, then update $D$ to maximize the log-likelihood of correctly classifying both.
2. Train the generator: sample new noise vectors, then update $G$ to fool $D$ by maximizing $\log D(G(\mathbf{z}))$ (the non-saturating variant).
3. Repeat: alternate $D$ and $G$ updates for many iterations, monitoring FID (Fréchet Inception Distance) to track generation quality.

Basic GAN in PyTorch

import torch
import torch.nn as nn

latent_dim = 100
img_dim    = 784  # 28×28 flattened

# Generator
generator = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 512),
    nn.LeakyReLU(0.2),
    nn.Linear(512, img_dim),
    nn.Tanh()          # output in [-1, 1]
)

# Discriminator
discriminator = nn.Sequential(
    nn.Linear(img_dim, 512),
    nn.LeakyReLU(0.2),
    nn.Dropout(0.3),
    nn.Linear(512, 256),
    nn.LeakyReLU(0.2),
    nn.Dropout(0.3),
    nn.Linear(256, 1),
    nn.Sigmoid()       # probability of being real
)

criterion = nn.BCELoss()
opt_G = torch.optim.Adam(generator.parameters(),     lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def train_step(real_imgs):
    batch_size = real_imgs.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # --- Train discriminator ---
    z         = torch.randn(batch_size, latent_dim)
    fake_imgs = generator(z).detach()
    loss_D    = criterion(discriminator(real_imgs), real_labels) + \
                criterion(discriminator(fake_imgs), fake_labels)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # --- Train generator ---
    z         = torch.randn(batch_size, latent_dim)
    fake_imgs = generator(z)
    loss_G    = criterion(discriminator(fake_imgs), real_labels)  # fool D
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

    return loss_D.item(), loss_G.item()

GAN variants

DCGAN

Replaces linear layers with transposed convolutions (generator) and strided convolutions (discriminator), and uses batch normalization throughout. DCGAN produces sharper images and trains more stably than the original GAN. Key architectural guidelines:
  • No pooling layers — use strided convolutions for downsampling.
  • Batch norm in both GG and DD (except the output of GG and input of DD).
  • ReLU activations in GG; LeakyReLU in DD.
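The guidelines above can be sketched as a small DCGAN-style generator. This is an illustrative sketch for 32×32 RGB output, not the exact layer configuration from the DCGAN paper; the channel widths are assumptions.

```python
import torch
import torch.nn as nn

latent_dim = 100

# DCGAN-style generator sketch: transposed convolutions upsample, batch norm
# and ReLU everywhere except the output layer (which uses only Tanh).
generator = nn.Sequential(
    # project noise (N, 100, 1, 1) up to (N, 256, 4, 4)
    nn.ConvTranspose2d(latent_dim, 256, kernel_size=4, stride=1, padding=0, bias=False),
    nn.BatchNorm2d(256),
    nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),  # -> (N, 128, 8, 8)
    nn.BatchNorm2d(128),
    nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),   # -> (N, 64, 16, 16)
    nn.BatchNorm2d(64),
    nn.ReLU(True),
    nn.ConvTranspose2d(64, 3, 4, 2, 1, bias=False),     # -> (N, 3, 32, 32)
    nn.Tanh(),  # no batch norm on the generator's output, per the guidelines
)

z = torch.randn(8, latent_dim, 1, 1)
imgs = generator(z)
print(imgs.shape)  # torch.Size([8, 3, 32, 32])
```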
Conditional GAN (cGAN)

Conditions both $G$ and $D$ on a class label $y$ (or any auxiliary information), so the generator produces images of a specific class: $G(\mathbf{z}, y)$. Useful for class-conditional image synthesis and data augmentation.
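One common way to implement the conditioning is to embed the label and concatenate it with the noise vector before the generator's first layer. A minimal sketch, where the embedding size and layer widths are illustrative assumptions:

```python
import torch
import torch.nn as nn

latent_dim, n_classes, img_dim = 100, 10, 784

# learn a small vector per class and feed it alongside the noise
label_emb = nn.Embedding(n_classes, n_classes)
cond_generator = nn.Sequential(
    nn.Linear(latent_dim + n_classes, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, img_dim),
    nn.Tanh(),
)

def G(z, y):
    # G(z, y): concatenate noise with the label embedding
    return cond_generator(torch.cat([z, label_emb(y)], dim=1))

z = torch.randn(16, latent_dim)
y = torch.randint(0, n_classes, (16,))
print(G(z, y).shape)  # torch.Size([16, 784])
```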
Pix2Pix and CycleGAN

Image-to-image translation GANs. Pix2Pix requires paired training images (input→output). CycleGAN learns the translation without paired data by adding a cycle-consistency loss.
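The cycle-consistency idea: with $G: X \to Y$ and $F: Y \to X$, translating an image to the other domain and back should recover the original. A toy sketch, where the tiny linear "generators" are placeholders for real networks:

```python
import torch
import torch.nn as nn

# Placeholder generators: G maps domain X -> Y, F maps Y -> X.
G = nn.Linear(64, 64)
F = nn.Linear(64, 64)

x = torch.randn(8, 64)  # batch from domain X
y = torch.randn(8, 64)  # batch from domain Y

# Cycle-consistency loss: L1 distance between the input and its
# round-trip translation, in both directions.
l1 = nn.L1Loss()
cycle_loss = l1(F(G(x)), x) + l1(G(F(y)), y)
print(cycle_loss.item())  # scalar; minimizing it enforces F(G(x)) ≈ x
```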

Stable Diffusion and diffusion models

Diffusion models have surpassed GANs in image quality and diversity. They operate by learning to reverse a gradual noising process.

The diffusion process

Forward process: add Gaussian noise over $T$ timesteps until the image is pure noise:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t \mathbf{I})$$

Reverse process: a neural network $\epsilon_\theta$ learns to predict the noise added at each step, enabling denoising. It is trained with the DDPM objective:

$$\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{t, \mathbf{x}_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(\mathbf{x}_t, t)\|^2\right]$$
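A useful property of this forward process is that $\mathbf{x}_t$ can be sampled directly from $\mathbf{x}_0$ in closed form: with $\bar{\alpha}_t = \prod_{s \le t}(1 - \beta_s)$, we have $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$. A minimal sketch, assuming a linear noise schedule:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule (assumption)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal retention

def q_sample(x0, t, eps):
    # one-step sample from q(x_t | x_0)
    a = alpha_bar[t]
    return a.sqrt() * x0 + (1 - a).sqrt() * eps

x0  = torch.randn(4, 3, 32, 32)
eps = torch.randn_like(x0)
x_early = q_sample(x0, t=10, eps=eps)     # still close to x0
x_late  = q_sample(x0, t=T - 1, eps=eps)  # almost pure noise
```

Training then reduces to drawing a random $t$, computing $\mathbf{x}_t$ this way, and regressing $\epsilon_\theta(\mathbf{x}_t, t)$ onto $\epsilon$ with the MSE loss above.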

Latent Diffusion (Stable Diffusion)

Running diffusion in pixel space is computationally expensive. Latent Diffusion Models (LDM) compress images to a low-dimensional latent space using a pretrained VAE, then run diffusion there:
  1. Encode: $\mathbf{z} = \mathcal{E}(\mathbf{x})$
  2. Diffuse/denoise in latent space to obtain $\mathbf{z}_0$
  3. Decode: $\hat{\mathbf{x}} = \mathcal{D}(\mathbf{z}_0)$
Text conditioning is provided via a CLIP text encoder, enabling text-to-image generation.
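The three steps above can be sketched at the shape level. The 8× spatial downsampling and 4 latent channels match Stable Diffusion's VAE, but the tiny convolutional stand-ins here are placeholders, not the real encoder, decoder, or denoising U-Net:

```python
import torch
import torch.nn as nn

# Placeholder E / D: a single (transposed) conv each, chosen only to
# reproduce the 8x downsampling and 4 latent channels of SD's VAE.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)           # E: pixels -> latent
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)  # D: latent -> pixels

x = torch.randn(1, 3, 512, 512)
z = encoder(x)        # 1. encode: 512x512 pixels -> 64x64 latent
print(z.shape)        # torch.Size([1, 4, 64, 64])

z0 = z                # 2. diffusion/denoising would run here, on z,
                      #    at 1/64 the spatial cost of pixel space

x_hat = decoder(z0)   # 3. decode back to pixel space
print(x_hat.shape)    # torch.Size([1, 3, 512, 512])
```

The point of the sketch: every denoising step operates on the 64×64×4 latent rather than the 512×512×3 image, which is where the efficiency of latent diffusion comes from.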

Applications

| Application | Method | Notes |
|---|---|---|
| Synthetic data augmentation | Conditional GAN / diffusion | Generate rare or minority-class examples |
| Style transfer | CycleGAN, neural style | Transform image appearance |
| Super-resolution | SRGAN, ESRGAN | Upsample low-resolution images |
| Inpainting | LaMa, diffusion | Fill masked regions |
| Text-to-image | Stable Diffusion, DALL-E | Generate from text prompts |

Resources

Exercise E09: Image Generation with GAN

Hands-on exercise: train a GAN to generate images from a dataset.

VisionColab: GAN Examples

Collection of GAN examples from the course repository.

Diffusion Models Blog

Accessible overview of diffusion models, DDPM, and latent diffusion.

Video: UNet, GAN & Anomaly Detection

Recorded lecture covering GANs alongside UNet and anomaly detection.
