
Prerequisites

Before you begin, make sure you have:

Python 3.10-3.12

Any version from 3.10 through 3.12 should work

CUDA GPU (recommended)

Optional but strongly recommended for faster training. CPU training is supported but much slower.
The code automatically detects CUDA and uses GPU acceleration when available. No manual configuration needed.
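Under the hood this amounts to a standard PyTorch device check, roughly like the following (a sketch; the exact variable names in the scripts may differ):

```python
import torch

# Use the GPU when CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")

# The model and each batch are then moved to this device, e.g.:
# model = model.to(device)
# images = images.to(device)
```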

Local installation

1. Clone the repository

First, clone the repository to your local machine:
git clone https://github.com/your-username/stable-diffusion-scratch.git
cd stable-diffusion-scratch

2. Create a virtual environment

It’s recommended to use a virtual environment to avoid dependency conflicts:
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
You should see (.venv) appear at the start of your terminal prompt, indicating the virtual environment is active.

3. Install dependencies

Install all required packages from requirements.txt:
pip install -r requirements.txt
This installs:
  • torch - PyTorch deep learning framework
  • torchvision - Image datasets and transformations
  • torchaudio - Audio processing utilities
  • matplotlib - Plotting and visualization
  • tqdm - Progress bars for training loops

4. Verify installation

Check that PyTorch can detect your GPU (if available):
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
Expected output:
CUDA available: True  # If you have a CUDA-enabled GPU
CUDA available: False # If running on CPU
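For a slightly fuller check, you can also print the PyTorch version and the name of the detected GPU (a quick sketch; output varies by machine):

```python
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    # Name of the first visible GPU, e.g. an RTX or A100 card
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```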

Installing PyTorch with CUDA

If you have a CUDA-enabled GPU but the above shows CUDA available: False, you may need to install PyTorch with CUDA support explicitly.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Make sure your CUDA version matches the PyTorch wheel you install. Check your CUDA version with nvcc --version or nvidia-smi.

Dataset setup

The datasets are automatically downloaded on the first run. No manual setup required!
When you run python src/training/train_diffusion.py for the first time, the code will automatically:
  1. Download MNIST from torchvision.datasets
  2. Save it to data/MNIST/
  3. Process and cache the data
from torchvision import datasets

dataset = datasets.MNIST(
    root="./data",
    train=True,
    download=True,  # Automatic download!
    transform=transform,  # defined earlier in the training script
)
Dataset size: ~50 MB
Similarly, CIFAR-10 is downloaded automatically when running the CIFAR training script:
dataset = datasets.CIFAR10(
    root="./data",
    train=True,
    download=True,  # Automatic download!
    transform=transform
)
Dataset size: ~170 MB
The first run will take a few extra minutes to download the datasets. Subsequent runs will use the cached data.

HPC cluster setup

If you’re using an HPC cluster with SLURM, you can use the provided batch scripts.

Environment modules

Most clusters use environment modules for CUDA and Python:
module load cuda/11.8
module load python/3.11
Module names vary by cluster. Check your cluster’s documentation or run module avail to see available modules.

SLURM scripts

The repository includes ready-to-use SLURM scripts in the slurm/ directory:
sbatch slurm/run_diffusion_mnist.slurm

Example SLURM configuration

Here’s what a typical SLURM script looks like:
#!/bin/bash
#SBATCH --job-name=mnist-ddpm
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --time=02:00:00
#SBATCH --mem=16G

module load cuda/11.8
module load python/3.11

# Create and activate virtual environment
python -m venv $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate

# Install dependencies
pip install --no-index torch torchvision matplotlib tqdm

# Run training
python src/training/train_diffusion.py
Use $SLURM_TMPDIR for temporary files and virtual environments on compute nodes. It’s much faster than your home directory.

Project structure

After installation, your directory should look like this:
.
├── src/
│   ├── models/
│   │   ├── diffusion.py          # MNIST U-Net and diffusion process
│   │   └── diffusion_cifar.py    # CIFAR-10 U-Net with EMA
│   ├── training/
│   │   ├── train_diffusion.py    # Train MNIST DDPM
│   │   └── train_diffusion_cifar.py   # Train CIFAR-10 DDPM
│   └── utilities/
│       ├── ddim_comparison_mnist.py   # DDPM vs DDIM benchmarks
│       ├── ddim_comparison_cifar.py
│       └── interpolation_and_timesteps.py
├── slurm/                        # SLURM batch scripts
├── data/                         # Auto-created on first run
├── samples/                      # Training visualizations
├── requirements.txt
└── README.md
src/models/ contains the U-Net architectures and diffusion processes:
  • diffusion.py - MNIST model with cosine beta schedule
  • diffusion_cifar.py - CIFAR-10 model with linear schedule and EMA

src/training/ holds the main entry points for training:
  • train_diffusion.py - Train MNIST DDPM (50 epochs, ~5-10 min)
  • train_diffusion_cifar.py - Train CIFAR-10 DDPM (2000 epochs, GPU required)

src/utilities/ provides analysis and comparison scripts:
  • ddim_comparison_mnist.py - Benchmark DDPM vs DDIM on MNIST
  • ddim_comparison_cifar.py - Benchmark DDPM vs DDIM on CIFAR-10
  • interpolation_and_timesteps.py - Latent interpolation and timestep analysis

slurm/ contains ready-to-use batch scripts for HPC clusters:
  • run_diffusion_mnist.slurm - MNIST training job
  • run_diffusion_cifar.slurm - CIFAR-10 training job
  • run_ddim_comparison.slurm - DDPM vs DDIM comparison jobs

Performance optimization

The code includes several optimizations for faster training:

Mixed precision training

Automatic mixed precision (AMP) is enabled when using CUDA:
if self.device.type == 'cuda':
    self.grad_scaler = torch.amp.GradScaler('cuda')
    self.autocast_ctx = lambda: torch.amp.autocast('cuda')
else:
    self.grad_scaler = torch.amp.GradScaler('cuda', enabled=False)
    self.autocast_ctx = lambda: nullcontext()
This can speed up training by 2-3x on modern GPUs with minimal precision loss.

CUDNN benchmarking

The training scripts automatically enable CUDNN benchmarking:
if device.type == "cuda":
    torch.backends.cudnn.benchmark = True
    if hasattr(torch, "set_float32_matmul_precision"):
        torch.set_float32_matmul_precision("high")

Data loading

Efficient data loading with multiple workers and pinned memory:
num_workers = min(8, os.cpu_count() or 4)
loader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    num_workers=num_workers,
    pin_memory=(device.type == "cuda"),
    persistent_workers=num_workers > 0,
)

Troubleshooting

If you encounter CUDA out-of-memory errors:
  1. Reduce batch_size in the training script (default is 128)
  2. Reduce hidden_dims for a smaller model
  3. Use gradient accumulation to simulate larger batches
# Reduce batch size
batch_size = 64  # or 32

# Smaller model
hidden_dims = [64, 128, 256]  # instead of [128, 256, 512]
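Gradient accumulation (option 3 above) is not shown in the snippets; a minimal sketch of the idea, with illustrative names, looks like this:

```python
import torch

model = torch.nn.Linear(8, 1)             # stand-in for the diffusion model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 4    # 4 micro-batches of 32 simulate an effective batch of 128
batch_size = 32    # small enough to fit in GPU memory

optimizer.zero_grad(set_to_none=True)
for step in range(accum_steps):
    x = torch.randn(batch_size, 8)
    y = torch.randn(batch_size, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()       # average gradients across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                  # one update per effective batch
        optimizer.zero_grad(set_to_none=True)
```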
CPU training is significantly slower than GPU training. For MNIST:
  • GPU: ~5-10 minutes
  • CPU: ~30-60 minutes
Consider:
  1. Using a smaller model with fewer hidden_dims
  2. Reducing the number of epochs
  3. Using Google Colab or Kaggle for free GPU access
If you see ModuleNotFoundError, make sure you:
  1. Activated your virtual environment
  2. Installed all dependencies from requirements.txt
  3. Are running scripts from the project root directory
# Verify you're in the project root
pwd  # Should show .../stable-diffusion-scratch

# Verify virtual environment is active
which python  # Should show .../.venv/bin/python

Next steps

  • Quick start - Train your first MNIST diffusion model in under 10 minutes
  • Introduction - Learn about the architecture and design philosophy
