
Overview

While this book focuses on Large Language Models, the principles underlying modern generative AI extend beyond text. Stable Diffusion, one of the most influential text-to-image models, shares many architectural components and training techniques with LLMs. Understanding how diffusion models work provides valuable insights into the broader landscape of generative AI and helps connect concepts across modalities.
This illustrated guide is part of the bonus material for Hands-On Large Language Models. While it covers image generation rather than text, many concepts translate directly to language models and multimodal systems.

Why Study Stable Diffusion?

Understanding Stable Diffusion enriches your knowledge of generative AI:
  • Shared Components: Uses transformers, attention mechanisms, and embeddings like LLMs
  • Diffusion Process: A different generative paradigm from autoregressive text generation
  • Multimodal Learning: Bridges text and images through cross-attention
  • Conditioning: Similar to prompting in LLMs, but for image generation
  • Latent Spaces: Similar conceptual foundations to LLM embeddings
Models like Stable Diffusion, DALL-E, Midjourney, and Imagen have revolutionized image generation.

How Diffusion Differs from LLMs

Language Model Generation
Token by token, left to right, autoregressive
"The" → "cat" → "sat" → "on" → "the" → "mat"
Diffusion Model Generation
Start with noise, gradually denoise to create image
Noise → Slightly less noisy → ... → Clear image
Both use transformers and learned representations, but the generation process is fundamentally different.

What You’ll Learn

The illustrated guide provides a visual walkthrough of Stable Diffusion:

  • Core Architecture: VAE, U-Net, text encoder, and how the components work together
  • Diffusion Process: How noise is progressively removed to generate images
  • Text Conditioning: How text prompts guide image generation through cross-attention
  • Latent Space: Operating in a compressed latent space for efficiency

Illustrated Guide

The Illustrated Stable Diffusion

Read the full illustrated guide by Jay Alammar, known for his exceptional visual explanations of complex AI systems.
While Stable Diffusion generates images, many concepts connect to the book:
  • Chapter 2: Tokens and Embeddings - Text encoding principles apply
  • Chapter 3: Looking Inside LLMs - Transformer architecture and attention
  • Chapter 6: Prompt Engineering - Conditioning with text prompts
  • Chapter 8: Customizing LLMs - Fine-tuning and adaptation techniques

Core Components

1. Variational Autoencoder (VAE)

The VAE compresses images to a latent space:
Encoder
  • Compresses 512×512 image to 64×64 latent representation
  • Reduces computation by ~8x in each dimension
  • Preserves semantic information
Decoder
  • Converts latent representation back to full image
  • Upsamples 64×64 to 512×512
  • Reconstructs fine details
Why it matters: Operating in latent space makes diffusion tractable and efficient.
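The efficiency gain can be checked with quick arithmetic. A minimal sketch, assuming the standard SD 1.x shapes: a 3-channel 512×512 image and the 4-channel 64×64 latent produced by the Stable Diffusion VAE:

```python
# Rough arithmetic for the efficiency gain of latent diffusion (SD 1.x shapes).
# The 8x-per-dimension downsampling comes from the text above; the 4-channel
# latent is the shape used by the Stable Diffusion VAE.

image_shape = (3, 512, 512)   # RGB pixels
latent_shape = (4, 64, 64)    # VAE latent

image_values = 3 * 512 * 512   # values the U-Net would see in pixel space
latent_values = 4 * 64 * 64    # values it actually sees in latent space
compression = image_values / latent_values

print(f"pixel values:  {image_values}")
print(f"latent values: {latent_values}")
print(f"the U-Net processes ~{compression:.0f}x fewer values per image")
```

The denoising network therefore runs on roughly 48x fewer values per image, which is what makes iterative denoising affordable on consumer hardware.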

2. U-Net (Denoising Network)

The core of the diffusion process:
  • Input: Noisy latent + timestep + text conditioning
  • Output: Predicted noise to remove
  • Architecture: Encoder-decoder with skip connections
  • Attention layers: Cross-attention with text embeddings
The U-Net is applied iteratively, progressively denoising the image.

3. Text Encoder (CLIP)

Converts text prompts to embeddings:
  • Uses CLIP’s text encoder, a Transformer trained jointly with an image encoder
  • Produces embeddings that capture semantic meaning
  • Trained so that text and image embeddings share a common space
  • Enables text-to-image alignment

4. Scheduler (Sampler)

Controls the denoising process:
  • Determines number of steps (typically 20-50)
  • Schedules noise levels over time
  • Different samplers: DDPM, DDIM, Euler, etc.
  • Trade-offs between quality and speed
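The step-count idea above can be sketched in a few lines: training typically uses on the order of 1,000 discrete noise levels, and a sampler visits only a small, evenly spaced subset of them. This is a simplified version of what DDIM-style schedulers do, not any scheduler's actual implementation:

```python
import numpy as np

# Toy version of how a sampler subsamples the training noise schedule.
# Training uses T discrete noise levels; at inference a sampler
# (DDIM, Euler, DPM++, ...) visits only a small evenly spaced subset.

T_train = 1000
num_inference_steps = 50

# Evenly spaced timesteps from high noise (t near T) down to 0,
# mirroring the spacing used by DDIM-style schedulers.
timesteps = np.linspace(T_train - 1, 0, num_inference_steps).round().astype(int)

print(timesteps[:5])   # starts near pure noise
print(timesteps[-5:])  # ends at the clean image
```

Fewer steps means fewer U-Net evaluations and faster generation; the different samplers differ mainly in how they step between these noise levels.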

The Diffusion Process

Forward Diffusion (Training)

Gradually add noise to images:
Clean Image → +noise → +more noise → ... → Pure Noise
The model learns to reverse this process.
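The forward process has a convenient closed form: any noise level can be reached in a single jump rather than by adding noise step by step. A minimal numpy sketch, assuming a DDPM-style linear beta schedule (the schedule endpoints below are commonly used defaults, not values from this guide):

```python
import numpy as np

# Forward diffusion in closed form (DDPM-style):
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # per-step noise variance
alpha_bar = np.cumprod(1.0 - betas)   # cumulative fraction of signal kept

x0 = rng.standard_normal((4, 64, 64))  # stand-in for a clean latent

def add_noise(x0, t, noise):
    """Jump directly to noise level t using the closed-form forward process."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

noise = rng.standard_normal(x0.shape)
x_early = add_noise(x0, t=10, noise=noise)   # still mostly image
x_late = add_noise(x0, t=999, noise=noise)   # almost pure noise

print(f"signal kept at t=10:  {np.sqrt(alpha_bar[10]):.3f}")
print(f"signal kept at t=999: {np.sqrt(alpha_bar[999]):.4f}")
```

During training, the model is shown `x_t` at a random `t` and learns to predict the `noise` that was mixed in.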

Reverse Diffusion (Generation)

Start with noise, progressively denoise:
Pure Noise → -predicted noise → -more predicted noise → ... → Generated Image
At each step:
  1. U-Net predicts noise in current latent
  2. Remove (some of) that noise
  3. Repeat for N steps
  4. Decode final latent to image
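The four steps above can be sketched as a toy loop. In the real pipeline, step 1 is a text-conditioned U-Net; here an "oracle" noise predictor stands in for it so the loop runs end to end, using the deterministic DDIM update rule:

```python
import numpy as np

# Toy reverse-diffusion loop with an oracle noise predictor standing in
# for the U-Net, so the loop is self-contained and runnable.

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

x0 = rng.standard_normal((4, 8, 8))   # the "clean" latent we hope to recover
x = rng.standard_normal(x0.shape)     # start from pure noise

timesteps = np.linspace(T - 1, 0, 50).round().astype(int)

for i, t in enumerate(timesteps):
    # 1. Predict the noise in the current latent (oracle replaces the U-Net).
    eps = (x - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])
    # 2. Estimate the clean latent from the noise prediction.
    x0_pred = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    # 3. Step to the next (lower) noise level: the deterministic DDIM update.
    abar_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
    x = np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps

# 4. In the real pipeline, the final latent is decoded by the VAE into pixels.
print(f"max error after denoising: {np.abs(x - x0).max():.2e}")
```

Because the oracle predicts the noise exactly, the loop recovers `x0` to numerical precision; a trained U-Net only approximates this, which is why generation needs many steps.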

Conditioning with Text

Text embeddings condition the denoising:
  • Cross-attention layers in U-Net
  • Text embeddings guide what to generate
  • Classifier-free guidance: strengthens conditioning
  • Negative prompts: guide generation away from unwanted concepts

Key Concepts

Latent Space Efficiency

Pixel-Space Diffusion (GLIDE, Imagen style)
  • Operate on full resolution pixels
  • Computationally expensive
  • Requires massive resources
Latent Diffusion (Stable Diffusion)
  • Compress to latent space with VAE
  • Denoise in low-dimensional space
  • Much more efficient
  • Enables consumer hardware generation
This efficiency innovation is similar to quantization in LLMs - making powerful models accessible.

Attention Mechanisms

Self-Attention
  • Within the image latent
  • Allows spatial coherence
  • Similar to self-attention in transformers
Cross-Attention
  • Between image latent and text embeddings
  • Aligns image features with text concepts
  • Key to text-guided generation
These are the same attention mechanisms used in LLMs!
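A toy version of the cross-attention computation, with illustrative shapes (77 is CLIP's text sequence length; the learned query/key/value projection matrices of a real layer are omitted for brevity):

```python
import numpy as np

# Cross-attention between image latent positions (queries) and text token
# embeddings (keys/values) - the mechanism that lets text guide the image.

rng = np.random.default_rng(0)
d = 64                                  # embedding dimension (illustrative)
img = rng.standard_normal((4096, d))    # 64x64 latent positions, flattened
txt = rng.standard_normal((77, d))      # 77 text tokens (CLIP's sequence length)

def cross_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (4096, 77)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over text tokens
    return weights @ v                                # (4096, 64)

out = cross_attention(img, txt, txt)
print(out.shape)
```

Each spatial position in the latent attends over all text tokens, so every region of the image can "look up" the relevant parts of the prompt. Swap `txt` for `img` in the keys and values and this same function computes self-attention.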

Classifier-Free Guidance

Controls adherence to prompt:
prediction = uncond_prediction + guidance_scale × (cond_prediction - uncond_prediction)
  • Low guidance (1.0-3.0): More creative, less prompt adherence
  • Medium guidance (5.0-10.0): Balanced
  • High guidance (15.0+): Strong prompt adherence, may reduce quality
Similar to temperature in LLM generation - controlling randomness vs. determinism.
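The guidance formula above, applied to toy noise predictions. In the real pipeline, the U-Net is run twice per step, once with the text embedding and once with an empty prompt:

```python
import numpy as np

# Classifier-free guidance: extrapolate from the unconditional prediction
# toward (and past) the conditional one. Toy arrays stand in for the two
# U-Net noise predictions.

rng = np.random.default_rng(0)
uncond_prediction = rng.standard_normal((4, 8, 8))
cond_prediction = rng.standard_normal((4, 8, 8))

def guide(uncond, cond, guidance_scale):
    # scale 0.0 ignores the prompt, 1.0 reproduces the conditional
    # prediction exactly, larger values push harder toward the prompt
    return uncond + guidance_scale * (cond - uncond)

guided = guide(uncond_prediction, cond_prediction, 7.5)
print(guided.shape)
```

The two extra forward passes are why raising the step count or enabling guidance roughly doubles compute per step.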

Prompt Engineering

Like LLMs, diffusion models respond to prompt engineering:
Effective Prompts
  • Specific descriptions: “a red apple on a wooden table”
  • Style modifiers: “digital art”, “oil painting”, “photorealistic”
  • Quality boosters: “high quality”, “detailed”, “sharp focus”
  • Artist names: “in the style of [artist]”
Negative Prompts
  • Guide away from unwanted features
  • Common: “blurry, low quality, distorted”
  • Similar to “avoid X” instructions in LLM prompts
Prompt engineering for Stable Diffusion follows similar principles to LLM prompting - being specific, using examples (reference images), and iterative refinement.

Architecture Connections to LLMs

Shared Components

Component              In Stable Diffusion                In LLMs
Transformer            Text encoder, U-Net attention      Core architecture
Embeddings             Text → vector, image → latent      Token → vector
Attention              Cross-attention for conditioning   Self-attention for context
Layer Norm             Stabilize training                 Stabilize training
Residual Connections   U-Net skip connections             Transformer blocks

Philosophical Connections

Latent Spaces
  • SD: VAE compresses images to latent space
  • LLMs: Embeddings represent tokens in latent space
Conditioning
  • SD: Text embeddings condition image generation
  • LLMs: Context/prompt conditions text generation
Iterative Refinement
  • SD: Denoising steps progressively refine image
  • Reasoning LLMs: Multiple passes refine reasoning

Variants and Extensions

Stable Diffusion Versions

  • SD 1.x: Original release, 512×512
  • SD 2.x: Improved text encoder, 768×768
  • SDXL: Enhanced quality, 1024×1024, dual text encoders
  • SD 3: Replaces the U-Net with a diffusion transformer, with improved prompt following and image quality

ControlNet

  • Add spatial conditioning
  • Guide with edge maps, poses, depth
  • Precise control over composition
  • Similar to structured prompting in LLMs

DreamBooth

  • Personalization with few images
  • Teach new concepts to the model
  • Similar to fine-tuning LLMs on specific domains

LoRA (Low-Rank Adaptation)

  • Efficient fine-tuning technique
  • Small parameter modifications
  • Same technique used for efficient LLM fine-tuning!
  • Enables community-created styles and subjects
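The low-rank idea can be shown directly. The layer width and rank below are illustrative, not taken from any particular model:

```python
import numpy as np

# LoRA sketch: instead of updating a large weight matrix W directly,
# learn a low-rank update B @ A and add it at inference time.

rng = np.random.default_rng(0)
d = 768   # layer width (e.g. an attention projection)
r = 8     # LoRA rank, much smaller than d

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, init to zero

alpha = 16.0                             # LoRA scaling hyperparameter
W_adapted = W + (alpha / r) * (B @ A)    # effective weight at inference

full_params = d * d
lora_params = d * r + r * d
print(f"full fine-tune: {full_params} params, LoRA: {lora_params} params "
      f"({full_params / lora_params:.0f}x fewer)")
```

Initializing `B` to zero means the adapted model starts out identical to the base model, and the small `A`/`B` pair is all that needs to be shared, which is why community LoRA files are tiny compared to full checkpoints.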

Practical Applications

Creative Applications

  • Art generation and exploration
  • Concept visualization
  • Style transfer
  • Image editing and inpainting

Professional Use Cases

  • Rapid prototyping and ideation
  • Marketing and advertising visuals
  • Game asset generation
  • Architectural visualization

Research Applications

  • Understanding visual perception
  • Studying bias in AI systems
  • Developing new generative techniques
  • Multimodal learning research

Implementation Tips

Prompt Crafting

  1. Start with core subject
  2. Add specific details
  3. Include style/medium
  4. Add quality descriptors
  5. Use negative prompts
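The five steps above can be wrapped in a small helper. The function name and modifier lists are illustrative, not part of any Stable Diffusion API:

```python
# Assemble a prompt following the five steps above:
# subject, details, style/medium, quality descriptors, plus a negative prompt.

def build_prompt(subject, details=(), style=None, quality=()):
    parts = [subject, *details]
    if style:
        parts.append(style)
    parts.extend(quality)
    return ", ".join(parts)

prompt = build_prompt(
    subject="a red apple on a wooden table",
    details=["morning light", "shallow depth of field"],
    style="photorealistic",
    quality=["high quality", "sharp focus"],
)
negative_prompt = "blurry, low quality, distorted"

print(prompt)
```

The resulting string and the negative prompt are what you would pass to a generation pipeline; iterate on each part separately to see which element changes the output.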

Parameter Tuning

  • Steps: 20-30 usually sufficient, 50+ for quality
  • Guidance: 7-8 balanced, adjust based on prompt
  • Sampler: Euler-A for speed, DPM++ for quality
  • Resolution: Higher = more detail, more compute

Hardware Requirements

  • Minimum: 4-6 GB VRAM for 512×512
  • Recommended: 8-12 GB VRAM for 768×768
  • SDXL: 12-16 GB VRAM for 1024×1024
  • CPU generation possible but much slower

Multimodal Future

Understanding both LLMs and diffusion models prepares you for multimodal AI:
Current Systems
  • GPT-4 with Vision: LLM + image understanding
  • DALL-E 3 + ChatGPT: Integrated text and image
  • Gemini: Native multimodal from ground up
Future Directions
  • Unified architectures for all modalities
  • Better text-image alignment
  • Video generation (Sora, Runway)
  • Audio and other modalities
The convergence of LLMs and diffusion models into unified multimodal systems is one of the most exciting frontiers in AI research.

Additional Resources

Connecting to Language Models

Key takeaways for LLM practitioners:
  1. Attention is universal: Powers both text and image generation
  2. Latent representations: Core to efficiency in both domains
  3. Conditioning techniques: Similar across modalities
  4. Fine-tuning approaches: LoRA works for both LLMs and SD
  5. Prompt engineering: Similar principles apply
  6. Multimodal future: Understanding both prepares you for convergence

See also:
  • Quantization: Efficiency techniques applicable to both LLMs and diffusion models
  • Mixture of Experts: Scaling techniques used in both text and image models

Conclusion

Stable Diffusion demonstrates that the core innovations in LLMs - transformers, attention, embeddings, efficient training - apply across modalities. Understanding how diffusion models generate images provides both practical skills for multimodal AI and deeper insight into the principles underlying all modern generative models. As the field moves toward unified multimodal systems, this knowledge becomes increasingly valuable.
