
Prerequisites

Before you begin, make sure you have:

Python 3.10-3.12

Any version from 3.10 through 3.12 should work

CUDA GPU (recommended)

Optional but strongly recommended for faster training. CPU training is supported but much slower.
The code automatically detects CUDA and uses GPU acceleration when available. No manual configuration needed.
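Under the hood this amounts to a standard PyTorch device check, roughly like the following (a sketch; the exact variable names in the scripts may differ):

```python
import torch

# Use the GPU when CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")

# The model and each batch are then moved to this device, e.g.:
# model = model.to(device)
# images = images.to(device)
```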

Local installation

1. Clone the repository

First, clone the repository to your local machine:
git clone https://github.com/your-username/stable-diffusion-scratch.git
cd stable-diffusion-scratch

2. Create a virtual environment

It’s recommended to use a virtual environment to avoid dependency conflicts:
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
You should see (.venv) appear at the start of your terminal prompt, indicating the virtual environment is active.

3. Install dependencies

Install all required packages from requirements.txt:
pip install -r requirements.txt
This installs:
  • torch - PyTorch deep learning framework
  • torchvision - Image datasets and transformations
  • torchaudio - Audio processing utilities
  • matplotlib - Plotting and visualization
  • tqdm - Progress bars for training loops

4. Verify installation

Check that PyTorch can detect your GPU (if available):
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
Expected output:
CUDA available: True  # If you have a CUDA-enabled GPU
CUDA available: False # If running on CPU
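For a slightly fuller check, you can also print the PyTorch version and the name of the detected GPU (a quick sketch; output varies by machine):

```python
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    # Name of the first visible GPU, e.g. an RTX or A100 card
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```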

Installing PyTorch with CUDA

If you have a CUDA-enabled GPU but the above shows CUDA available: False, you may need to install PyTorch with CUDA support explicitly.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Make sure your CUDA version matches the PyTorch wheel you install. Check your CUDA version with nvcc --version or nvidia-smi.

Dataset setup

The datasets are automatically downloaded on the first run. No manual setup required!
When you run python src/training/train_diffusion.py for the first time, the code will automatically:
  1. Download MNIST from torchvision.datasets
  2. Save it to data/MNIST/
  3. Process and cache the data
from torchvision import datasets

dataset = datasets.MNIST(
    root="./data",
    train=True,
    download=True,  # Automatic download!
    transform=transform,  # defined earlier in the training script
)
Dataset size: ~50 MB
Similarly, CIFAR-10 is downloaded automatically when running the CIFAR training script:
dataset = datasets.CIFAR10(
    root="./data",
    train=True,
    download=True,  # Automatic download!
    transform=transform
)
Dataset size: ~170 MB
The first run will take a few extra minutes to download the datasets. Subsequent runs will use the cached data.

HPC cluster setup

If you’re using an HPC cluster with SLURM, you can use the provided batch scripts.

Environment modules

Most clusters use environment modules for CUDA and Python:
module load cuda/11.8
module load python/3.11
Module names vary by cluster. Check your cluster’s documentation or run module avail to see available modules.

SLURM scripts

The repository includes ready-to-use SLURM scripts in the slurm/ directory:
sbatch slurm/run_diffusion_mnist.slurm

Example SLURM configuration

Here’s what a typical SLURM script looks like:
#!/bin/bash
#SBATCH --job-name=mnist-ddpm
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --time=02:00:00
#SBATCH --mem=16G

module load cuda/11.8
module load python/3.11

# Create and activate virtual environment
python -m venv $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate

# Install dependencies
pip install --no-index torch torchvision matplotlib tqdm

# Run training
python src/training/train_diffusion.py
Use $SLURM_TMPDIR for temporary files and virtual environments on compute nodes. It’s much faster than your home directory.

Project structure

After installation, your directory should look like this:
.
├── src/
│   ├── models/
│   │   ├── diffusion.py          # MNIST U-Net and diffusion process
│   │   └── diffusion_cifar.py    # CIFAR-10 U-Net with EMA
│   ├── training/
│   │   ├── train_diffusion.py    # Train MNIST DDPM
│   │   └── train_diffusion_cifar.py   # Train CIFAR-10 DDPM
│   └── utilities/
│       ├── ddim_comparison_mnist.py   # DDPM vs DDIM benchmarks
│       ├── ddim_comparison_cifar.py
│       └── interpolation_and_timesteps.py
├── slurm/                        # SLURM batch scripts
├── data/                         # Auto-created on first run
├── samples/                      # Training visualizations
├── requirements.txt
└── README.md
src/models/ contains the U-Net architectures and diffusion processes:
  • diffusion.py - MNIST model with cosine beta schedule
  • diffusion_cifar.py - CIFAR-10 model with linear schedule and EMA

src/training/ holds the main entry points for training:
  • train_diffusion.py - Train MNIST DDPM (50 epochs, ~5-10 min)
  • train_diffusion_cifar.py - Train CIFAR-10 DDPM (2000 epochs, GPU required)

src/utilities/ provides analysis and comparison scripts:
  • ddim_comparison_mnist.py - Benchmark DDPM vs DDIM on MNIST
  • ddim_comparison_cifar.py - Benchmark DDPM vs DDIM on CIFAR-10
  • interpolation_and_timesteps.py - Latent interpolation and timestep analysis

slurm/ contains ready-to-use batch scripts for HPC clusters:
  • run_diffusion_mnist.slurm - MNIST training job
  • run_diffusion_cifar.slurm - CIFAR-10 training job
  • run_ddim_comparison.slurm - DDPM vs DDIM comparison jobs

Performance optimization

The code includes several optimizations for faster training:

Mixed precision training

Automatic mixed precision (AMP) is enabled when using CUDA:
if self.device.type == 'cuda':
    self.grad_scaler = torch.amp.GradScaler('cuda')
    self.autocast_ctx = lambda: torch.amp.autocast('cuda')
else:
    self.grad_scaler = torch.amp.GradScaler('cuda', enabled=False)
    self.autocast_ctx = lambda: nullcontext()
This can speed up training by 2-3x on modern GPUs with minimal precision loss.

CUDNN benchmarking

The training scripts automatically enable CUDNN benchmarking:
if device.type == "cuda":
    torch.backends.cudnn.benchmark = True
    if hasattr(torch, "set_float32_matmul_precision"):
        torch.set_float32_matmul_precision("high")

Data loading

Efficient data loading with multiple workers and pinned memory:
num_workers = min(8, os.cpu_count() or 4)
loader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    num_workers=num_workers,
    pin_memory=(device.type == "cuda"),
    persistent_workers=num_workers > 0,
)

Troubleshooting

If you encounter CUDA out-of-memory errors:
  1. Reduce batch_size in the training script (default is 128)
  2. Reduce hidden_dims for a smaller model
  3. Use gradient accumulation to simulate larger batches
# Reduce batch size
batch_size = 64  # or 32

# Smaller model
hidden_dims = [64, 128, 256]  # instead of [128, 256, 512]
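Gradient accumulation (option 3 above) is not shown in the snippets; a minimal sketch of the idea, with illustrative names, looks like this:

```python
import torch

model = torch.nn.Linear(8, 1)             # stand-in for the diffusion model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 4    # 4 micro-batches of 32 simulate an effective batch of 128
batch_size = 32    # small enough to fit in GPU memory

optimizer.zero_grad(set_to_none=True)
for step in range(accum_steps):
    x = torch.randn(batch_size, 8)
    y = torch.randn(batch_size, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()       # average gradients across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                  # one update per effective batch
        optimizer.zero_grad(set_to_none=True)
```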
CPU training is significantly slower than GPU training. For MNIST:
  • GPU: ~5-10 minutes
  • CPU: ~30-60 minutes
Consider:
  1. Using a smaller model with fewer hidden_dims
  2. Reducing the number of epochs
  3. Using Google Colab or Kaggle for free GPU access
If you see ModuleNotFoundError, make sure you:
  1. Activated your virtual environment
  2. Installed all dependencies from requirements.txt
  3. Are running scripts from the project root directory
# Verify you're in the project root
pwd  # Should show .../stable-diffusion-scratch

# Verify virtual environment is active
which python  # Should show .../.venv/bin/python

Next steps

  • Quick start - Train your first MNIST diffusion model in under 10 minutes
  • Introduction - Learn about the architecture and design philosophy
