This guide covers running diffusion training on HPC clusters using SLURM (Simple Linux Utility for Resource Management). The examples target TACC Lonestar6 but apply to most SLURM-based HPC systems.

Quick start

1. Update SLURM script

Edit the SLURM script to add your information:
nano slurm/run_diffusion_cifar.slurm

Update these required fields:
#SBATCH --mail-user=your.email@domain.com
#SBATCH -A your-project-allocation

2. Check your allocation

Verify you have access to GPU resources:
squeue -u $USER
taccinfo  # TACC-specific

3. Submit the job

Submit your training job to the queue:

# H100 GPU (fastest)
sbatch slurm/run_diffusion_cifar.slurm

# A100 GPU (more available)
sbatch slurm/run_diffusion_a100.slurm

# MNIST quick test
sbatch slurm/run_diffusion_mnist.slurm

4. Monitor your job

# Check job status
squeue -u $USER

# View output in real-time
tail -f diffusion_cifar_<JOBID>.out

# Check errors
tail -f diffusion_cifar_<JOBID>.err

SLURM script anatomy

CIFAR-10 on A100 GPUs

Here’s a complete SLURM script for CIFAR-10 training:
slurm/run_diffusion_cifar.slurm
#!/bin/bash
#SBATCH -J diffusion_cifar          # Job name
#SBATCH -o diffusion_cifar_%j.out   # Output file (%j = job ID)
#SBATCH -e diffusion_cifar_%j.err   # Error file
#SBATCH -p gpu-a100                 # Partition (gpu-h100 or gpu-a100)
#SBATCH -N 1                        # Number of nodes
#SBATCH -n 1                        # Number of tasks (processes)
#SBATCH -t 48:00:00                 # Wall clock time (48 hours)
#SBATCH --mail-user=aymanmahfuz27@utexas.edu
#SBATCH --mail-type=all             # Email notifications
#SBATCH -A ASC25078                 # Allocation name

# Load required modules
module purge
module load cuda/12.8
module load python/3.12.11

# Install dependencies (only first time)
pip3 install --user torch torchvision torchaudio matplotlib tqdm 2>&1 | grep -v "already satisfied" || true

# Navigate to working directory
cd $SLURM_SUBMIT_DIR

# Print environment info
echo "Job started at: $(date)"
echo "Running on node: $(hostname)"
echo "Working directory: $(pwd)"
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
echo "Checkpoints will be saved to: $WORK/stable-diffusion-cifar/"

# Set PyTorch optimizations
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TORCH_CUDA_ARCH_LIST="8.0;9.0"  # A100=8.0, H100=9.0

# Run training with GPU binding
srun --gpu-bind=single:1 -u python3 src/training/train_diffusion_cifar.py

echo "Job finished at: $(date)"
echo "Checkpoints: $WORK/stable-diffusion-cifar/checkpoints/"
echo "Samples: $WORK/stable-diffusion-cifar/cifar_samples/"

MNIST quick test

For rapid testing, use the MNIST script with shorter runtime:
slurm/run_diffusion_mnist.slurm
#!/bin/bash
#SBATCH -J diffusion_mnist
#SBATCH -o diffusion_mnist_%j.out
#SBATCH -e diffusion_mnist_%j.err
#SBATCH -p gpu-a100
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 2:00:00  # 2 hours is enough for MNIST
#SBATCH --mail-user=aymanmahfuz27@utexas.edu
#SBATCH --mail-type=all
#SBATCH -A ASC25078

module purge
module load cuda/12.8
module load python/3.12.11

cd $SLURM_SUBMIT_DIR

echo "Job started at: $(date)"
echo "Running on node: $(hostname)"

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TORCH_CUDA_ARCH_LIST="8.0;9.0"

# Run MNIST training
srun --gpu-bind=single:1 -u python3 src/training/train_diffusion.py

echo "Job finished at: $(date)"
echo "MNIST samples saved to samples/ directory"

Key SLURM directives

Resource allocation

#SBATCH -p gpu-a100      # Partition/queue name
#SBATCH -N 1             # Number of nodes
#SBATCH -n 1             # Number of MPI tasks
#SBATCH --gpus-per-node=1  # GPUs per node
#SBATCH -t 48:00:00      # Max runtime (HH:MM:SS)

Job identification

#SBATCH -J diffusion_cifar          # Job name
#SBATCH -o diffusion_cifar_%j.out   # stdout (%j = job ID)
#SBATCH -e diffusion_cifar_%j.err   # stderr

Notifications

#SBATCH --mail-user=your.email@domain.com
#SBATCH --mail-type=all  # all, begin, end, fail
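
If you only want mail for specific events, --mail-type also accepts a comma-separated list:
#SBATCH --mail-type=end,fail  # email only when the job completes or fails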

Account/allocation

#SBATCH -A your-project-allocation
The allocation name is required and must match your active TACC project. Find yours with taccinfo -p.

GPU queue information

Lonestar6 GPU queues

Queue      GPUs   VRAM   Nodes   Max Time   Best For
gpu-h100   H100   80GB   5       48h        Fastest training, newest
gpu-a100   A100   40GB   46      48h        More availability, fast
vm-small   A40    48GB   -       2h         Quick testing only

Check queue availability

# See available GPUs in each queue
sinfo -p gpu-h100
sinfo -p gpu-a100

# See queue limits and policies
qlimits

# See jobs in queue
squeue -p gpu-a100

Module management

Required modules

Lonestar6 requires specific modules for GPU training:
module purge          # Clear existing modules
module load cuda/12.8
module load python/3.12.11

Check loaded modules

module list

Available versions

module avail cuda
module avail python

Python environment setup

User installation (simplest)

Install packages to your home directory:
pip3 install --user torch torchvision torchaudio matplotlib tqdm

Virtual environment

Create an isolated environment:
module load python/3.12.11
python3 -m venv ~/venv-diffusion
source ~/venv-diffusion/bin/activate
pip install torch torchvision torchaudio matplotlib tqdm
Then activate it in your SLURM script by adding (or uncommenting) this line:
source ~/venv-diffusion/bin/activate
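
Putting it together, the environment setup in your SLURM script would then look roughly like this (a sketch, replacing the pip3 install --user line):
module purge
module load cuda/12.8
module load python/3.12.11
source ~/venv-diffusion/bin/activate  # use the pre-built venv instead of --user packages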

Conda environment

If you prefer conda:
module load conda
conda create -n diffusion python=3.12
conda activate diffusion
pip install torch torchvision torchaudio matplotlib tqdm
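
To use this environment in a batch job, activate it in the SLURM script the same way (a sketch; adjust the environment name if yours differs):
module load conda
# if activation fails in batch mode, first run: source $(conda info --base)/etc/profile.d/conda.sh
conda activate diffusion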

Job management

Submit a job

sbatch slurm/run_diffusion_cifar.slurm
Returns: Submitted batch job 123456
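
If you want the job ID in a shell variable for follow-up commands, sbatch --parsable prints just the ID:
JOBID=$(sbatch --parsable slurm/run_diffusion_cifar.slurm)
echo "Submitted job $JOBID"
tail -f "diffusion_cifar_${JOBID}.out"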

Check job status

# Your jobs
squeue -u $USER

# Specific job
squeue -j 123456

# Detailed job info
scontrol show job 123456

Monitor job output

# Follow stdout in real-time
tail -f diffusion_cifar_123456.out

# Check for errors
tail -f diffusion_cifar_123456.err

# View full output
less diffusion_cifar_123456.out

Cancel a job

# Cancel specific job
scancel 123456

# Cancel all your jobs
scancel -u $USER

Job history

# Recent jobs
sacct -u $USER

# Detailed job info
sacct -j 123456 --format=JobID,JobName,Partition,State,ExitCode,Elapsed

File system paths

Important directories

Variable   Path                      Purpose               Backed Up   Quota
$HOME      /home1/12345/username     Code, scripts         Yes         10GB
$WORK      /work2/12345/username     Checkpoints, models   No          1TB
$SCRATCH   /scratch/12345/username   Temporary data        No          Unlimited

Output location

The training script saves outputs to $WORK:
slurm/run_diffusion_cifar.slurm
echo "HOME: $HOME"
echo "WORK: $WORK"
echo "SCRATCH: $SCRATCH"
echo "Checkpoints will be saved to: $WORK/stable-diffusion-cifar/"
Store large files (checkpoints, datasets) in $WORK or $SCRATCH, not $HOME. $HOME has a strict 10GB quota.
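
One common layout, sketched below with example paths, keeps the code in $HOME and symlinks an outputs directory to $WORK:
mkdir -p $WORK/stable-diffusion-cifar
ln -s $WORK/stable-diffusion-cifar ~/stable-diffusion/outputs  # hypothetical repo path; adjust to yours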

Check disk usage

# Your quota and usage
quota -s

# Directory sizes
du -sh $WORK/*
du -sh $SCRATCH/*

Training configuration

Resume from checkpoint

Resume training by setting environment variables in the SLURM script:
# Add before srun command
export RESUME_FROM_BEST=1
export EPOCHS=3000

srun --gpu-bind=single:1 -u python3 src/training/train_diffusion_cifar.py

Disable early stopping

For long training runs:
export EARLY_STOP=0
export EPOCHS=2000

Custom checkpoint path

export RESUME_FROM="$WORK/stable-diffusion-cifar/checkpoints/checkpoint_epoch1000.pt"
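
These variables can be combined. For example, to pick up a specific checkpoint and keep training to 3000 epochs with early stopping disabled:
export RESUME_FROM="$WORK/stable-diffusion-cifar/checkpoints/checkpoint_epoch1000.pt"
export EARLY_STOP=0
export EPOCHS=3000

srun --gpu-bind=single:1 -u python3 src/training/train_diffusion_cifar.py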

Performance tuning

GPU binding

Bind each task to a single GPU for optimal performance:
srun --gpu-bind=single:1 -u python3 src/training/train_diffusion_cifar.py

CUDA optimizations

slurm/run_diffusion_cifar.slurm
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TORCH_CUDA_ARCH_LIST="8.0;9.0"  # A100=8.0, H100=9.0

PyTorch sanity check

The SLURM script includes a GPU verification step:
slurm/run_diffusion_cifar.slurm
echo "=== PyTorch CUDA sanity check ==="
srun --gpu-bind=single:1 python3 - << 'EOF'
import os, torch
print("CUDA visible:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.is_available:", torch.cuda.is_available())
print("device_count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
    x = torch.linspace(0, 1, 4, device="cuda")
    print("linspace on cuda ok:", x.tolist())
EOF
This validates GPU access before starting long training runs.

Expected training times

MNIST

  • H100: 8-10 minutes (50 epochs)
  • A100: 12-15 minutes (50 epochs)

CIFAR-10

  • H100: 15-18 hours (2000 epochs, batch_size=256)
  • A100: 18-22 hours (2000 epochs, batch_size=256)
For the fastest results, use H100 GPUs in the gpu-h100 queue. They provide ~1.5× speedup over A100.
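
To see how long your own runs actually took, and how close they came to the wall-clock limit, query the accounting database after the job ends:
sacct -j <JOBID> --format=JobName,Partition,Elapsed,Timelimit,State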

Troubleshooting

Job pending forever

Problem: Job stays in PD (pending) state.

Solutions:
# Check allocation is active
taccinfo

# Verify allocation name
taccinfo -p

# Check queue availability
sinfo -p gpu-a100
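
SLURM can also report why a specific job is still pending and, when possible, estimate its start time:
# Estimated start time for a pending job
squeue -j <JOBID> --start

# Pending reason (e.g. Resources, Priority)
scontrol show job <JOBID> | grep -i reason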

Module not found

Problem: ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile

Solution: Use correct module names for your HPC system:
# List available modules
module avail

# Search for specific module
module avail cuda
module avail python

Out of memory

Problem: CUDA out of memory error.

Solutions:
  1. Reduce batch size in src/training/train_diffusion_cifar.py:
    batch_size = 128  # or 64
    
  2. Use gradient accumulation (already enabled by default)
  3. Request more VRAM:
    #SBATCH -p gpu-h100  # 80GB vs 40GB
    

Wrong allocation name

Problem: sbatch: error: Batch job submission failed: Invalid account or account/partition combination

Solution: Find your allocations:
taccinfo -p
# Or
sacctmgr show user $USER

Python packages not found

Problem: ModuleNotFoundError: No module named 'torch'

Solution: Install packages or activate your virtual environment:
pip3 install --user torch torchvision torchaudio matplotlib tqdm
Or in SLURM script:
source ~/venv-diffusion/bin/activate

Interactive debugging

For testing before submitting long jobs:

Request interactive GPU session

idev -p gpu-a100 -N 1 -n 1 -t 2:00:00
This provides:
  • 1 A100 GPU
  • 2 hours
  • Interactive shell

Run training interactively

module load cuda/12.8
module load python/3.12.11
python3 src/training/train_diffusion.py
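
Before launching, it can help to confirm the GPU is actually visible inside the session:
nvidia-smi  # should list the allocated GPU
python3 -c "import torch; print(torch.cuda.is_available())"  # should print True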

Exit interactive session

exit
Interactive sessions are limited to 2 hours and should only be used for debugging, not full training runs.

Output files

Job logs

  • diffusion_cifar_<JOBID>.out - Training progress, loss, epoch info
  • diffusion_cifar_<JOBID>.err - Errors, warnings, stack traces

Training outputs

Saved to $WORK/stable-diffusion-cifar/:
$WORK/stable-diffusion-cifar/
├── checkpoints/
│   ├── checkpoint_latest.pt
│   ├── checkpoint_best.pt
│   └── checkpoint_epoch{N}.pt
├── cifar_samples/
│   ├── samples_epoch{N}.png
│   ├── noising_epoch{N}.png
│   ├── training_curve_cifar.png
│   ├── DDPM_CIFAR.png
│   └── DDIM_CIFAR.png
└── best_model_cifar.pt

View outputs

# List checkpoints
ls -lh $WORK/stable-diffusion-cifar/checkpoints/

# List samples
ls -lh $WORK/stable-diffusion-cifar/cifar_samples/

# Check model size
du -sh $WORK/stable-diffusion-cifar/best_model_cifar.pt

Next steps

  • Optimize hyperparameters for your dataset
  • Experiment with different model architectures
  • Try multi-GPU training with distributed data parallel (see the sketch below)
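
For the multi-GPU item, here is a rough sketch of the launch side. It assumes the training script has been updated to initialize torch.distributed and wrap the model in DistributedDataParallel (the current script is single-GPU), and that the node has 3 GPUs; verify the per-node GPU count for your queue:
#SBATCH -p gpu-a100
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --gpus-per-node=3  # assumed per-node GPU count; verify for your queue

module purge
module load cuda/12.8
module load python/3.12.11

# torchrun starts one process per GPU; the script must call
# torch.distributed.init_process_group and use DistributedDataParallel
torchrun --standalone --nproc_per_node=3 src/training/train_diffusion_cifar.py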
