
Overview

While this book focuses on Large Language Models, the principles underlying modern generative AI extend beyond text. Stable Diffusion, one of the most influential text-to-image models, shares many architectural components and training techniques with LLMs. Understanding how diffusion models work provides valuable insights into the broader landscape of generative AI and helps connect concepts across modalities.
This illustrated guide is part of the bonus material for Hands-On Large Language Models. While it covers image generation rather than text, many concepts translate directly to language models and multimodal systems.

Why Study Stable Diffusion?

Understanding Stable Diffusion enriches your knowledge of generative AI:
  • Shared Components: Uses transformers, attention mechanisms, and embeddings like LLMs
  • Diffusion Process: A different generative paradigm from autoregressive text generation
  • Multimodal Learning: Bridges text and images through cross-attention
  • Conditioning: Similar to prompting in LLMs, but for image generation
  • Latent Spaces: Similar conceptual foundations to LLM embeddings
Models like Stable Diffusion, DALL-E, Midjourney, and Imagen have revolutionized image generation.

How Diffusion Differs from LLMs

Language Model Generation
Token by token, left to right, autoregressive
"The" → "cat" → "sat" → "on" → "the" → "mat"
Diffusion Model Generation
Start with noise, gradually denoise to create image
Noise → Slightly less noisy → ... → Clear image
Both use transformers and learned representations, but the generation process is fundamentally different.

What You’ll Learn

The illustrated guide provides a visual walkthrough of Stable Diffusion:

  • Core Architecture: VAE, U-Net, text encoder, and how the components work together
  • Diffusion Process: How noise is progressively removed to generate images
  • Text Conditioning: How text prompts guide image generation through cross-attention
  • Latent Space: Operating in a compressed latent space for efficiency

Illustrated Guide

The Illustrated Stable Diffusion

Read the full illustrated guide by Jay Alammar, known for his exceptional visual explanations of complex AI systems.
While Stable Diffusion generates images, many concepts connect to the book:
  • Chapter 2: Tokens and Embeddings - Text encoding principles apply
  • Chapter 3: Looking Inside LLMs - Transformer architecture and attention
  • Chapter 6: Prompt Engineering - Conditioning with text prompts
  • Chapter 8: Customizing LLMs - Fine-tuning and adaptation techniques

Core Components

1. Variational Autoencoder (VAE)

The VAE compresses images to a latent space:
Encoder
  • Compresses 512×512 image to 64×64 latent representation
  • Reduces computation by ~8x in each dimension
  • Preserves semantic information
Decoder
  • Converts latent representation back to full image
  • Upsamples 64×64 to 512×512
  • Reconstructs fine details
Why it matters: Operating in latent space makes diffusion tractable and efficient.
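The efficiency gain can be checked with quick arithmetic. A minimal sketch, assuming the standard SD 1.x shapes: a 3-channel 512×512 image and the 4-channel 64×64 latent produced by the Stable Diffusion VAE:

```python
# Rough arithmetic for the efficiency gain of latent diffusion (SD 1.x shapes).
# The 8x-per-dimension downsampling comes from the text above; the 4-channel
# latent is the shape used by the Stable Diffusion VAE.

image_shape = (3, 512, 512)   # RGB pixels
latent_shape = (4, 64, 64)    # VAE latent

image_values = 3 * 512 * 512   # values the U-Net would see in pixel space
latent_values = 4 * 64 * 64    # values it actually sees in latent space
compression = image_values / latent_values

print(f"pixel values:  {image_values}")
print(f"latent values: {latent_values}")
print(f"the U-Net processes ~{compression:.0f}x fewer values per image")
```

The denoising network therefore runs on roughly 48x fewer values per image, which is what makes iterative denoising affordable on consumer hardware.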

2. U-Net (Denoising Network)

The core of the diffusion process:
  • Input: Noisy latent + timestep + text conditioning
  • Output: Predicted noise to remove
  • Architecture: Encoder-decoder with skip connections
  • Attention layers: Cross-attention with text embeddings
The U-Net is applied iteratively, progressively denoising the image.

3. Text Encoder (CLIP)

Converts text prompts to embeddings:
  • Uses CLIP’s text encoder, a Transformer trained jointly with an image encoder
  • Produces embeddings that capture semantic meaning
  • Trained so that text and image embeddings share a common space
  • Enables text-to-image alignment

4. Scheduler (Sampler)

Controls the denoising process:
  • Determines number of steps (typically 20-50)
  • Schedules noise levels over time
  • Different samplers: DDPM, DDIM, Euler, etc.
  • Trade-offs between quality and speed
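The step-count idea above can be sketched in a few lines: training typically uses on the order of 1,000 discrete noise levels, and a sampler visits only a small, evenly spaced subset of them. This is a simplified version of what DDIM-style schedulers do, not any scheduler's actual implementation:

```python
import numpy as np

# Toy version of how a sampler subsamples the training noise schedule.
# Training uses T discrete noise levels; at inference a sampler
# (DDIM, Euler, DPM++, ...) visits only a small evenly spaced subset.

T_train = 1000
num_inference_steps = 50

# Evenly spaced timesteps from high noise (t near T) down to 0,
# mirroring the spacing used by DDIM-style schedulers.
timesteps = np.linspace(T_train - 1, 0, num_inference_steps).round().astype(int)

print(timesteps[:5])   # starts near pure noise
print(timesteps[-5:])  # ends at the clean image
```

Fewer steps means fewer U-Net evaluations and faster generation; the different samplers differ mainly in how they step between these noise levels.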

The Diffusion Process

Forward Diffusion (Training)

Gradually add noise to images:
Clean Image → +noise → +more noise → ... → Pure Noise
The model learns to reverse this process.
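The forward process has a convenient closed form: any noise level can be reached in a single jump rather than by adding noise step by step. A minimal numpy sketch, assuming a DDPM-style linear beta schedule (the schedule endpoints below are commonly used defaults, not values from this guide):

```python
import numpy as np

# Forward diffusion in closed form (DDPM-style):
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # per-step noise variance
alpha_bar = np.cumprod(1.0 - betas)   # cumulative fraction of signal kept

x0 = rng.standard_normal((4, 64, 64))  # stand-in for a clean latent

def add_noise(x0, t, noise):
    """Jump directly to noise level t using the closed-form forward process."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

noise = rng.standard_normal(x0.shape)
x_early = add_noise(x0, t=10, noise=noise)   # still mostly image
x_late = add_noise(x0, t=999, noise=noise)   # almost pure noise

print(f"signal kept at t=10:  {np.sqrt(alpha_bar[10]):.3f}")
print(f"signal kept at t=999: {np.sqrt(alpha_bar[999]):.4f}")
```

During training, the model is shown `x_t` at a random `t` and learns to predict the `noise` that was mixed in.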

Reverse Diffusion (Generation)

Start with noise, progressively denoise:
Pure Noise → -predicted noise → -more predicted noise → ... → Generated Image
At each step:
  1. U-Net predicts noise in current latent
  2. Remove (some of) that noise
  3. Repeat for N steps
  4. Decode final latent to image
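The four steps above can be sketched as a toy loop. In the real pipeline, step 1 is a text-conditioned U-Net; here an "oracle" noise predictor stands in for it so the loop runs end to end, using the deterministic DDIM update rule:

```python
import numpy as np

# Toy reverse-diffusion loop with an oracle noise predictor standing in
# for the U-Net, so the loop is self-contained and runnable.

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

x0 = rng.standard_normal((4, 8, 8))   # the "clean" latent we hope to recover
x = rng.standard_normal(x0.shape)     # start from pure noise

timesteps = np.linspace(T - 1, 0, 50).round().astype(int)

for i, t in enumerate(timesteps):
    # 1. Predict the noise in the current latent (oracle replaces the U-Net).
    eps = (x - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])
    # 2. Estimate the clean latent from the noise prediction.
    x0_pred = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
    # 3. Step to the next (lower) noise level: the deterministic DDIM update.
    abar_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
    x = np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps

# 4. In the real pipeline, the final latent is decoded by the VAE into pixels.
print(f"max error after denoising: {np.abs(x - x0).max():.2e}")
```

Because the oracle predicts the noise exactly, the loop recovers `x0` to numerical precision; a trained U-Net only approximates this, which is why generation needs many steps.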

Conditioning with Text

Text embeddings condition the denoising:
  • Cross-attention layers in U-Net
  • Text embeddings guide what to generate
  • Classifier-free guidance: strengthens conditioning
  • Negative prompts: guide generation away from unwanted concepts

Key Concepts

Latent Space Efficiency

Pixel-Space Diffusion (GLIDE, Imagen style)
  • Operate on full resolution pixels
  • Computationally expensive
  • Requires massive resources
Latent Diffusion (Stable Diffusion)
  • Compress to latent space with VAE
  • Denoise in low-dimensional space
  • Much more efficient
  • Enables consumer hardware generation
This efficiency innovation is similar to quantization in LLMs - making powerful models accessible.

Attention Mechanisms

Self-Attention
  • Within the image latent
  • Allows spatial coherence
  • Similar to self-attention in transformers
Cross-Attention
  • Between image latent and text embeddings
  • Aligns image features with text concepts
  • Key to text-guided generation
These are the same attention mechanisms used in LLMs!
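A toy version of the cross-attention computation, with illustrative shapes (77 is CLIP's text sequence length; the learned query/key/value projection matrices of a real layer are omitted for brevity):

```python
import numpy as np

# Cross-attention between image latent positions (queries) and text token
# embeddings (keys/values) - the mechanism that lets text guide the image.

rng = np.random.default_rng(0)
d = 64                                  # embedding dimension (illustrative)
img = rng.standard_normal((4096, d))    # 64x64 latent positions, flattened
txt = rng.standard_normal((77, d))      # 77 text tokens (CLIP's sequence length)

def cross_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (4096, 77)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over text tokens
    return weights @ v                                # (4096, 64)

out = cross_attention(img, txt, txt)
print(out.shape)
```

Each spatial position in the latent attends over all text tokens, so every region of the image can "look up" the relevant parts of the prompt. Swap `txt` for `img` in the keys and values and this same function computes self-attention.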

Classifier-Free Guidance

Controls adherence to prompt:
prediction = uncond_prediction + guidance_scale × (cond_prediction - uncond_prediction)
  • Low guidance (1.0-3.0): More creative, less prompt adherence
  • Medium guidance (5.0-10.0): Balanced
  • High guidance (15.0+): Strong prompt adherence, may reduce quality
Similar to temperature in LLM generation - controlling randomness vs. determinism.
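The guidance formula above, applied to toy noise predictions. In the real pipeline, the U-Net is run twice per step, once with the text embedding and once with an empty prompt:

```python
import numpy as np

# Classifier-free guidance: extrapolate from the unconditional prediction
# toward (and past) the conditional one. Toy arrays stand in for the two
# U-Net noise predictions.

rng = np.random.default_rng(0)
uncond_prediction = rng.standard_normal((4, 8, 8))
cond_prediction = rng.standard_normal((4, 8, 8))

def guide(uncond, cond, guidance_scale):
    # scale 0.0 ignores the prompt, 1.0 reproduces the conditional
    # prediction exactly, larger values push harder toward the prompt
    return uncond + guidance_scale * (cond - uncond)

guided = guide(uncond_prediction, cond_prediction, 7.5)
print(guided.shape)
```

The two extra forward passes are why raising the step count or enabling guidance roughly doubles compute per step.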

Prompt Engineering

Like LLMs, diffusion models respond to prompt engineering:
Effective Prompts
  • Specific descriptions: “a red apple on a wooden table”
  • Style modifiers: “digital art”, “oil painting”, “photorealistic”
  • Quality boosters: “high quality”, “detailed”, “sharp focus”
  • Artist names: “in the style of [artist]”
Negative Prompts
  • Guide away from unwanted features
  • Common: “blurry, low quality, distorted”
  • Similar to “avoid X” instructions in LLM prompts
Prompt engineering for Stable Diffusion follows similar principles to LLM prompting - being specific, using examples (reference images), and iterative refinement.

Architecture Connections to LLMs

Shared Components

Component              In Stable Diffusion                In LLMs
Transformer            Text encoder, U-Net attention      Core architecture
Embeddings             Text → vector, image → latent      Token → vector
Attention              Cross-attention for conditioning   Self-attention for context
Layer Norm             Stabilize training                 Stabilize training
Residual Connections   U-Net skip connections             Transformer blocks

Philosophical Connections

Latent Spaces
  • SD: VAE compresses images to latent space
  • LLMs: Embeddings represent tokens in latent space
Conditioning
  • SD: Text embeddings condition image generation
  • LLMs: Context/prompt conditions text generation
Iterative Refinement
  • SD: Denoising steps progressively refine image
  • Reasoning LLMs: Multiple passes refine reasoning

Variants and Extensions

Stable Diffusion Versions

  • SD 1.x: Original release, 512×512
  • SD 2.x: Improved text encoder, 768×768
  • SDXL: Enhanced quality, 1024×1024, dual text encoders
  • SD 3: Replaces the U-Net with a diffusion transformer, with improved prompt following and image quality

ControlNet

  • Add spatial conditioning
  • Guide with edge maps, poses, depth
  • Precise control over composition
  • Similar to structured prompting in LLMs

DreamBooth

  • Personalization with few images
  • Teach new concepts to the model
  • Similar to fine-tuning LLMs on specific domains

LoRA (Low-Rank Adaptation)

  • Efficient fine-tuning technique
  • Small parameter modifications
  • Same technique used for efficient LLM fine-tuning!
  • Enables community-created styles and subjects
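The low-rank idea can be shown directly. The layer width and rank below are illustrative, not taken from any particular model:

```python
import numpy as np

# LoRA sketch: instead of updating a large weight matrix W directly,
# learn a low-rank update B @ A and add it at inference time.

rng = np.random.default_rng(0)
d = 768   # layer width (e.g. an attention projection)
r = 8     # LoRA rank, much smaller than d

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, init to zero

alpha = 16.0                             # LoRA scaling hyperparameter
W_adapted = W + (alpha / r) * (B @ A)    # effective weight at inference

full_params = d * d
lora_params = d * r + r * d
print(f"full fine-tune: {full_params} params, LoRA: {lora_params} params "
      f"({full_params / lora_params:.0f}x fewer)")
```

Initializing `B` to zero means the adapted model starts out identical to the base model, and the small `A`/`B` pair is all that needs to be shared, which is why community LoRA files are tiny compared to full checkpoints.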

Practical Applications

Creative Applications

  • Art generation and exploration
  • Concept visualization
  • Style transfer
  • Image editing and inpainting

Professional Use Cases

  • Rapid prototyping and ideation
  • Marketing and advertising visuals
  • Game asset generation
  • Architectural visualization

Research Applications

  • Understanding visual perception
  • Studying bias in AI systems
  • Developing new generative techniques
  • Multimodal learning research

Implementation Tips

Prompt Crafting

  1. Start with core subject
  2. Add specific details
  3. Include style/medium
  4. Add quality descriptors
  5. Use negative prompts
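The five steps above can be wrapped in a small helper. The function name and modifier lists are illustrative, not part of any Stable Diffusion API:

```python
# Assemble a prompt following the five steps above:
# subject, details, style/medium, quality descriptors, plus a negative prompt.

def build_prompt(subject, details=(), style=None, quality=()):
    parts = [subject, *details]
    if style:
        parts.append(style)
    parts.extend(quality)
    return ", ".join(parts)

prompt = build_prompt(
    subject="a red apple on a wooden table",
    details=["morning light", "shallow depth of field"],
    style="photorealistic",
    quality=["high quality", "sharp focus"],
)
negative_prompt = "blurry, low quality, distorted"

print(prompt)
```

The resulting string and the negative prompt are what you would pass to a generation pipeline; iterate on each part separately to see which element changes the output.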

Parameter Tuning

  • Steps: 20-30 usually sufficient, 50+ for quality
  • Guidance: 7-8 balanced, adjust based on prompt
  • Sampler: Euler-A for speed, DPM++ for quality
  • Resolution: Higher = more detail, more compute

Hardware Requirements

  • Minimum: 4-6 GB VRAM for 512×512
  • Recommended: 8-12 GB VRAM for 768×768
  • SDXL: 12-16 GB VRAM for 1024×1024
  • CPU generation possible but much slower

Multimodal Future

Understanding both LLMs and diffusion models prepares you for multimodal AI:
Current Systems
  • GPT-4 with Vision: LLM + image understanding
  • DALL-E 3 + ChatGPT: Integrated text and image
  • Gemini: Native multimodal from ground up
Future Directions
  • Unified architectures for all modalities
  • Better text-image alignment
  • Video generation (Sora, Runway)
  • Audio and other modalities
The convergence of LLMs and diffusion models into unified multimodal systems is one of the most exciting frontiers in AI research.

Additional Resources

Connecting to Language Models

Key takeaways for LLM practitioners:
  1. Attention is universal: Powers both text and image generation
  2. Latent representations: Core to efficiency in both domains
  3. Conditioning techniques: Similar across modalities
  4. Fine-tuning approaches: LoRA works for both LLMs and SD
  5. Prompt engineering: Similar principles apply
  6. Multimodal future: Understanding both prepares you for convergence

See also:
  • Quantization: Efficiency techniques applicable to both LLMs and diffusion models
  • Mixture of Experts: Scaling techniques used in both text and image models

Conclusion

Stable Diffusion demonstrates that the core innovations in LLMs - transformers, attention, embeddings, efficient training - apply across modalities. Understanding how diffusion models generate images provides both practical skills for multimodal AI and deeper insight into the principles underlying all modern generative models. As the field moves toward unified multimodal systems, this knowledge becomes increasingly valuable.
