Backpropagation is the algorithm that makes gradient descent practical for multi-layer networks. Starting from the loss value produced at the output, it applies the chain rule of calculus layer by layer in reverse, computing the gradient of the loss with respect to every weight and bias in the network. Neural Network Framework implements this entirely in NumPy: each non-input layer exposes a backward() method that reads the activations already computed during the forward pass and stores the resulting gradients as attributes on the layer object. The training functions then read those gradients to update the parameters.
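The pattern described above (cache values in forward(), store gradients as attributes in backward(), then let the training code read them) can be sketched with a single toy layer. ToyLayer and train_step are hypothetical names used only for illustration, not the framework's classes; the identity activation, the MSE normalisation, and the bias update from dLda are simplifying assumptions.

```python
import numpy as np

class ToyLayer:
    """Hypothetical stand-in for a non-input layer (not the framework's class)."""

    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(scale=0.1, size=(n_out, n_in))
        self.b = np.zeros((n_out, 1))

    def forward(self, prev_activations):
        # Cache everything backward() will need later.
        self.prev_activations = prev_activations
        self.pre_activations = self.W @ prev_activations + self.b
        self.activations = self.pre_activations  # identity activation (assumption)
        return self.activations

    def backward(self, dLdy):
        # Identity activation: the local derivative is 1, so dLda equals dLdy.
        self.dLda = dLdy
        self.dLdW = self.dLda @ self.prev_activations.T


def train_step(layer, x, y, lr=0.1):
    """One forward-backward-update cycle on a single sample."""
    y_hat = layer.forward(x)
    loss = float(np.mean((y_hat - y) ** 2))
    layer.backward(2.0 * (y_hat - y) / y.size)  # MSE derivative (assumed scaling)
    layer.W -= lr * layer.dLdW                  # read stored gradients to update
    layer.b -= lr * layer.dLda                  # bias gradient equals dLda here
    return loss


rng = np.random.default_rng(0)
layer = ToyLayer(3, 2, rng)
x = np.array([[0.5], [-1.0], [2.0]])
y = np.array([[1.0], [0.0]])
losses = [train_step(layer, x, y) for _ in range(50)]
```

The gradients live on the layer object between the backward() call and the update, which is the property the rest of this page relies on.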
The Forward-Backward-Update Cycle
Each training step for a single sample passes through three phases: a forward pass that computes predictions, a backward pass that computes gradients, and a parameter update step that applies those gradients.
Load the sample and run the forward pass
The input sample is loaded into the InputLayer using put_values, the expected output is registered on the OutputLayer with set_actual, and then every layer’s forward() method is called in order from input to output. After this step every layer holds pre_activations and activations.
Run the backward pass in reverse
Every layer except the InputLayer is visited in reverse order, from the output layer back through the hidden layers, and its backward() method is called. Each call reads gradients from the next layer (or from the loss function, in the output layer’s case) and deposits its own gradients on self.dLda and self.dLdW.
OutputLayer.backward()
The OutputLayer sits at the end of the network and is the only layer with direct access to the loss function. Its backward() method is therefore responsible for computing the very first gradient in the chain: the derivative of the loss with respect to the layer’s own pre-activation values.
Step-by-step derivation
1. Loss derivative dLdy
The derivative of the loss with respect to the output activations is computed by the loss function’s own derivative. For example, with Mean Squared Error:
dLdy measures how much the loss changes as each output value changes.
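A minimal sketch of the Mean Squared Error case follows. The function names mse and mse_derivative are illustrative, and the 2/n scaling is an assumption: the framework's own loss may normalise differently.

```python
import numpy as np

def mse(y_hat, y):
    # Mean of squared errors over the output neurons.
    return float(np.mean((y_hat - y) ** 2))

def mse_derivative(y_hat, y):
    # d/dy_hat of mean((y_hat - y)^2) is 2 * (y_hat - y) / n.
    return 2.0 * (y_hat - y) / y.size

y_hat = np.array([[0.8], [0.1]])  # output activations, shape (n_out, 1)
y = np.array([[1.0], [0.0]])      # expected output
dLdy = mse_derivative(y_hat, y)   # shape (n_out, 1)
```

A negative entry in dLdy means that raising that output would lower the loss; a positive entry means the opposite.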
2. Activation derivative dyda
The local gradient of the output activation function with respect to the pre-activation (the raw dot-product result before the activation is applied):
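As an illustration, the sketch below computes dyda for a sigmoid output. Sigmoid is chosen only as an example, and the helper names are hypothetical rather than the framework's.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_derivative(a):
    # For y = sigmoid(a), dy/da = y * (1 - y).
    s = sigmoid(a)
    return s * (1.0 - s)

pre_activations = np.array([[0.0], [2.0]])   # shape (n_out, 1)
dyda = sigmoid_derivative(pre_activations)   # same shape as pre_activations
```

The derivative peaks at 0.25 when the pre-activation is 0 and shrinks as the neuron saturates.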
3. Combined gradient dLda
For all activation functions except softmax, the two derivatives are multiplied element-wise via the chain rule:
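In NumPy the element-wise chain-rule product is a single `*` between two arrays of the same shape. The values below are made up for illustration.

```python
import numpy as np

dLdy = np.array([[-0.2], [0.1]])   # loss derivative w.r.t. output activations
dyda = np.array([[0.25], [0.1]])   # activation derivative w.r.t. pre-activations

# Chain rule, one neuron at a time: dL/da_i = dL/dy_i * dy_i/da_i
dLda = dLdy * dyda                 # shape (n_out, 1)
```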
4. Weight gradient dLdW
The gradient of the loss with respect to the weight matrix is the outer product of dLda and the activations from the previous layer:
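With column vectors, the outer product can be written as a matrix product with a transpose. This is a sketch with made-up values; prev_activations is an illustrative name for the previous layer's activations.

```python
import numpy as np

dLda = np.array([[-0.05], [0.01]])                   # shape (n_out, 1)
prev_activations = np.array([[1.0], [0.5], [-2.0]])  # shape (n_prev, 1)

# (n_out, 1) @ (1, n_prev) -> (n_out, n_prev), the same shape as the weight matrix
dLdW = dLda @ prev_activations.T
```

Entry (i, j) of dLdW is dLda[i] times the j-th activation of the previous layer, so a weight fed by a zero activation receives a zero gradient.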
After backward() returns, the following attributes are available on the OutputLayer object:
| Attribute | Shape | Meaning |
|---|---|---|
| self.dLdy | scalar or (n_out, 1) | Derivative of loss w.r.t. output activations |
| self.dyda | (n_out, 1) | Derivative of output activation w.r.t. pre-activations |
| self.dLda | (n_out, 1) | Combined gradient: dLdy * dyda |
| self.dLdW | (n_out, n_prev) | Gradient of loss w.r.t. this layer’s weight matrix |
Softmax special case
When the output activation is softmax, the standard element-wise multiplication does not apply because softmax couples all output neurons together. The framework uses the well-known simplification that is valid when softmax is paired with cross-entropy loss: the gradient collapses to the difference between the predicted probabilities and the one-hot target:
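The simplification can be checked numerically. The sketch below uses illustrative helper names and assumes a one-hot target with cross-entropy defined as -sum(y * log(y_hat)); it compares the closed form with a central finite difference.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))  # subtract the max for numerical stability
    return e / np.sum(e)

def cross_entropy(y_hat, y):
    return float(-np.sum(y * np.log(y_hat)))

a = np.array([[2.0], [1.0], [0.1]])   # pre-activations, shape (n_out, 1)
y = np.array([[1.0], [0.0], [0.0]])   # one-hot target

# Closed form: predicted probabilities minus the target.
dLda = softmax(a) - y

# Central finite-difference check of the same gradient.
eps = 1e-6
numeric = np.zeros_like(a)
for i in range(a.shape[0]):
    step = np.zeros_like(a)
    step[i, 0] = eps
    plus = cross_entropy(softmax(a + step), y)
    minus = cross_entropy(softmax(a - step), y)
    numeric[i, 0] = (plus - minus) / (2 * eps)
```

Because the softmax outputs and the one-hot target each sum to 1, the entries of dLda sum to zero.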
HiddenLayer.backward()
Hidden layers sit between the input and output and have no direct connection to the loss. They receive their incoming gradient signal by reading dLda and W from the layer immediately after them (the “next” layer), then compute their own gradients using the chain rule.
Step-by-step derivation
1. Gradient from the next layer dLdh
The gradient of the loss with respect to this layer’s activations is obtained by projecting the next layer’s dLda back through the next layer’s weight matrix:
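A sketch of this projection, with made-up values and illustrative variable names, assuming the next layer's weight matrix has shape (n_next, n_hidden):

```python
import numpy as np

W_next = np.array([[0.2, -0.5, 1.0],
                   [0.7, 0.1, -0.3]])     # shape (n_next, n_hidden)
dLda_next = np.array([[-0.05], [0.01]])   # shape (n_next, 1)

# Each hidden activation feeds every neuron in the next layer, so its gradient
# is the weighted sum of the next layer's dLda over those connections.
dLdh = W_next.T @ dLda_next               # shape (n_hidden, 1)
```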
2. Activation derivative dhda
The derivative of this layer’s activation function with respect to its own pre-activations:
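For example, with a tanh hidden activation (chosen here only for illustration; tanh_derivative is a hypothetical helper name):

```python
import numpy as np

def tanh_derivative(a):
    # For h = tanh(a), dh/da = 1 - tanh(a)^2.
    return 1.0 - np.tanh(a) ** 2

pre_activations = np.array([[0.0], [1.0], [-1.0]])  # shape (n_hidden, 1)
dhda = tanh_derivative(pre_activations)             # same shape
```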
3. Combined gradient dLda
Element-wise product of the back-propagated signal and the local derivative:
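The sketch below uses a ReLU-style local derivative (an assumption made for illustration) to show how the product gates the incoming signal: a neuron whose local derivative is zero passes no gradient further back.

```python
import numpy as np

dLdh = np.array([[-0.003], [0.026], [-0.053]])     # back-propagated signal
pre_activations = np.array([[1.5], [-0.4], [0.2]])

# ReLU derivative: 1 where the pre-activation is positive, 0 elsewhere.
dhda = (pre_activations > 0).astype(float)

dLda = dLdh * dhda                                 # shape (n_hidden, 1)
```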
4. Weight gradient dLdW
Outer product of dLda and this layer’s input activations (the previous layer’s activations):
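Putting the four steps together gives a compact picture of what a hidden layer's backward pass computes. hidden_backward is a hypothetical free function written for illustration, assuming a sigmoid activation; the framework's method instead reads these values from layer objects.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hidden_backward(W_next, dLda_next, pre_activations, prev_activations):
    """Illustrative four-step hidden-layer backward pass (sigmoid assumed)."""
    dLdh = W_next.T @ dLda_next           # 1. signal from the next layer
    h = sigmoid(pre_activations)
    dhda = h * (1.0 - h)                  # 2. local activation derivative
    dLda = dLdh * dhda                    # 3. element-wise chain rule
    dLdW = dLda @ prev_activations.T      # 4. outer product with the inputs
    return dLda, dLdW

rng = np.random.default_rng(0)
n_prev, n_hidden, n_next = 4, 3, 2
dLda, dLdW = hidden_backward(
    W_next=rng.normal(size=(n_next, n_hidden)),
    dLda_next=rng.normal(size=(n_next, 1)),
    pre_activations=rng.normal(size=(n_hidden, 1)),
    prev_activations=rng.normal(size=(n_prev, 1)),
)
```

For a single sample, dLdW is always a rank-one matrix because it is the outer product of two vectors.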
After backward() returns, the following attributes are available on the HiddenLayer object:
| Attribute | Shape | Meaning |
|---|---|---|
| self.dLda | (n_hidden, 1) | Gradient of loss w.r.t. this layer’s pre-activations |
| self.dLdW | (n_hidden, n_prev) | Gradient of loss w.r.t. this layer’s weight matrix |
self.dLda, together with this layer’s W, is then read by the backward() call on the preceding hidden layer.
InputLayer has no backward() method and is intentionally skipped during the backward pass. The input layer holds no learnable parameters; it simply stores the raw input values and passes them forward. The backward loop for i in range(len(ANN) - 1, 0, -1) stops at index 1: the stop argument 0 is exclusive, so index 0, which is always the InputLayer, is never visited.
Gradient Flow Summary
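For a network with one hidden layer, the loss gradient flows from the output back to the first hidden layer as follows: the loss function supplies dLdy to the OutputLayer, which computes dyda, dLda, and dLdW; the HiddenLayer then reads the OutputLayer’s dLda and W to compute dLdh, dhda, and its own dLda and dLdW. Each layer needs only what the next layer has already stored (dLda and W), so the backward pass never needs to revisit a layer it has already processed.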
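The whole flow for a one-hidden-layer network can be written out and verified against numerical gradients. This is a standalone sketch, not the framework's code: it assumes sigmoid activations in both layers, MSE loss with 2/n scaling, and no biases, and every name in it is illustrative. Following this page's notation, a denotes pre-activations.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss_fn(W1, W2, x, y):
    h = sigmoid(W1 @ x)
    y_hat = sigmoid(W2 @ h)
    return float(np.mean((y_hat - y) ** 2))

def numeric_grad(f, W, eps=1e-6):
    """Central finite differences over every entry of W (W is restored after)."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        plus = f()
        W[idx] = old - eps
        minus = f()
        W[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 1))
y = np.array([[1.0], [0.0]])
W1 = rng.normal(size=(4, 3))   # hidden layer weights
W2 = rng.normal(size=(2, 4))   # output layer weights

# Forward pass
a_hid = W1 @ x
h = sigmoid(a_hid)
a_out = W2 @ h
y_hat = sigmoid(a_out)

# Output layer: loss derivative, activation derivative, chain rule, outer product
dLdy = 2.0 * (y_hat - y) / y.size
dyda = y_hat * (1.0 - y_hat)
dLda_out = dLdy * dyda
dLdW_out = dLda_out @ h.T

# Hidden layer: reads only dLda and W from the output layer
dLdh = W2.T @ dLda_out
dhda = h * (1.0 - h)
dLda_hid = dLdh * dhda
dLdW_hid = dLda_hid @ x.T

# Numerical reference gradients
num_W1 = numeric_grad(lambda: loss_fn(W1, W2, x, y), W1)
num_W2 = numeric_grad(lambda: loss_fn(W1, W2, x, y), W2)
```

If the analytic and numerical gradients agree to within finite-difference error, the chain of four steps per layer has been applied correctly.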
Supported Activation Derivatives
The following derivative functions are available and are selected automatically by each layer’s backward() method based on the actname attribute set at construction time: