Training a shallow neural network is relatively straightforward, but stacking dozens or hundreds of layers introduces a set of practical challenges that can derail convergence entirely. Chapter 11 is a deep-dive into the engineering tricks that make deep networks trainable: careful weight initialisation, nonsaturating activation functions, batch normalisation, dropout regularisation, adaptive learning-rate optimizers, and gradient clipping. The chapter demonstrates each technique empirically on Fashion MNIST with a 100-layer network, letting you see the impact firsthand.
What you’ll learn
- Why deep networks suffer from vanishing and exploding gradients
- Glorot (Xavier) and He initialisation for different activation functions
- Nonsaturating activations: Leaky ReLU, ELU, SELU, GELU, Swish, Mish
- Batch Normalisation — how it works and where to insert it in a network
- Gradient clipping to prevent exploding gradients
- Reusing pretrained layers and transfer learning
- Dropout and other regularisation techniques (ℓ1/ℓ2, Max-Norm)
- Learning rate schedules: ExponentialDecay, ReduceLROnPlateau, 1-cycle policy
- Adaptive optimizers: Adam, Nadam, RMSProp, AdaGrad
Key concepts
Vanishing and exploding gradients
When gradients are backpropagated through many layers, they can shrink exponentially (vanish) or grow unboundedly (explode). Saturating activations like sigmoid and tanh are especially prone to vanishing gradients because their derivatives are essentially zero far from the origin. The sigmoid saturation plot in the notebook visualises the problem concretely: the function is nearly flat for large positive or negative inputs, so gradients stall.
Initialisation strategies
Glorot (Xavier) initialisation keeps the variance of activations and gradients roughly constant across layers when using sigmoid or tanh. He initialisation is the analogous strategy for ReLU-family activations: because ReLU zeroes roughly half of its inputs, He initialisation sets the weight variance to 2/fan_in to compensate. Choosing the wrong initialiser can cause training to diverge or stagnate from the very first iterations.
Batch Normalisation
Batch Normalisation (BN) zero-centres and normalises the inputs of each layer, then applies a learned per-feature scale and shift. During training this is computed over the current mini-batch; at inference time, running statistics accumulated during training are used. BN dramatically reduces sensitivity to initialisation, allows much higher learning rates, and acts as a mild regulariser. The typical placement is after the linear transformation and before the activation function, though placing it after activation also works and is common in practice.
Dropout
Dropout randomly sets a fraction of neuron outputs to zero at each training step, forcing the network to learn redundant representations and preventing co-adaptation. At inference time all neurons are active, so activations have to be rescaled to match: either outputs are multiplied by the keep probability at inference or, as Keras does, the surviving outputs are divided by the keep probability during training. Dropout is most effective in the fully connected layers of large networks; in convolutional layers, SpatialDropout2D is often preferred.
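The rescaling can be illustrated with a minimal NumPy sketch of the "inverted" variant, in which surviving activations are divided by the keep probability during training so that nothing needs to change at inference. The function name `inverted_dropout`, the 20% rate, and the array shape are illustrative assumptions, not code from the notebook.

```python
import numpy as np

def inverted_dropout(activations, rate, rng, training=True):
    """Hypothetical helper: zero a fraction `rate` of activations while training."""
    if not training or rate == 0.0:
        # Inference: every neuron stays active and no rescaling is needed.
        return activations
    keep_prob = 1.0 - rate
    # Each activation survives independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Divide survivors by keep_prob so the expected activation is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
a = np.ones((1000, 100))                      # dummy activations, all equal to 1
dropped = inverted_dropout(a, rate=0.2, rng=rng)
kept = inverted_dropout(a, rate=0.2, rng=rng, training=False)

print(np.unique(dropped))   # only two distinct values: 0 and 1 / keep_prob
print(dropped.mean())       # close to 1, the mean of the original activations
```

Because the expected value of each activation is preserved during training, the same weights can be used unchanged at inference.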
Learning rate scheduling and adaptive optimizers
A fixed learning rate is rarely optimal throughout training. Schedules like ExponentialDecay shrink the rate steadily, multiplying it by a constant factor over every fixed number of steps (typically a few thousand). ReduceLROnPlateau watches a metric (e.g. validation loss) and multiplies the rate by a chosen factor (for example 0.5, which halves it) when the metric stops improving. Adam combines momentum with per-parameter adaptive learning rates and typically converges faster than plain SGD with a fixed rate; Nadam adds Nesterov momentum on top of Adam.
Code examples
He initialisation
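A minimal sketch of how He initialisation might be requested in Keras, assuming TensorFlow 2.x is installed. The layer width (50) and the input size (300) are arbitrary values chosen for illustration; the final lines compare the empirical standard deviation of the weights with the target sqrt(2 / fan_in).

```python
import math
import tensorflow as tf

tf.random.set_seed(42)

# "he_normal" draws weights from a truncated normal with variance 2 / fan_in.
dense = tf.keras.layers.Dense(50, activation="relu",
                              kernel_initializer="he_normal")

# Build the layer for 300 input features (an assumed size) so the
# weights are actually created and can be inspected.
dense.build((None, 300))
kernel = dense.get_weights()[0]

target_std = math.sqrt(2 / 300)
print(kernel.shape)              # (300, 50)
print(kernel.std(), target_std)  # the two values should be close
```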
100-layer network with Batch Normalisation
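One way such a network could be assembled is sketched below, assuming TensorFlow 2.x. It follows the placement described above (BN after the linear transformation, before the activation), so each Dense layer omits its bias because BN supplies its own shift. The layer width of 100, the SGD optimizer, and the learning rate are assumptions for illustration, not the notebook's exact settings.

```python
import tensorflow as tf

tf.random.set_seed(42)

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=[28, 28]))     # Fashion MNIST image size
model.add(tf.keras.layers.Flatten())

for _ in range(100):                          # 100 hidden layers
    # Linear transformation without bias: BN adds a learned shift anyway.
    model.add(tf.keras.layers.Dense(100, kernel_initializer="he_normal",
                                    use_bias=False))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation("relu"))

model.add(tf.keras.layers.Dense(10, activation="softmax"))  # 10 classes

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])
```

To place BN after the activation instead, pass `activation="relu"` to each Dense layer and drop the separate Activation layer.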
Exponential learning rate decay
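A short sketch of an exponential schedule using `tf.keras.optimizers.schedules.ExponentialDecay`, assuming TensorFlow 2.x. The initial rate (0.01), `decay_steps` (20,000), and `decay_rate` (0.1) are illustrative values; with them, the learning rate falls by a factor of 10 every 20,000 steps.

```python
import tensorflow as tf

# lr(step) = initial_learning_rate * decay_rate ** (step / decay_steps)
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=20_000,
    decay_rate=0.1,
)

# The schedule is passed to the optimizer in place of a fixed rate.
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)

# Inspect the rate at a few training steps.
lr_start = float(lr_schedule(0))
lr_after_decay = float(lr_schedule(20_000))
print(lr_start, lr_after_decay)
```

By default the decay is continuous; passing `staircase=True` makes the rate drop in discrete steps instead.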
Dropout regularisation
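A minimal sketch of dropout in a small Fashion MNIST classifier, assuming TensorFlow 2.x. The 20% dropout rate, the layer sizes, and the choice of Nadam are assumptions for illustration. Keras applies dropout only while training (for example inside `fit()`), and disables it automatically in `evaluate()` and `predict()`.

```python
import tensorflow as tf

tf.random.set_seed(42)

model = tf.keras.Sequential([
    tf.keras.Input(shape=[28, 28]),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(rate=0.2),   # drop 20% of the inputs
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.Nadam(),
              metrics=["accuracy"])

dropout_rates = [layer.rate for layer in model.layers
                 if isinstance(layer, tf.keras.layers.Dropout)]
print(dropout_rates)
```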