Differential calculus studies how functions change. In deep learning, it provides the mathematical machinery for training: given a loss function that measures how wrong a model is, calculus tells us in which direction to adjust every parameter to reduce that loss. This page covers the key concepts as presented in math_differential_calculus.ipynb and extra_autodiff.ipynb.
Slope of a line and the derivative
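The slope of a straight line between points A and B is rise over run:
$$
\text{slope} = \frac{\Delta y}{\Delta x} = \frac{y_B - y_A}{x_B - x_A}
$$
For a function f(x), the derivative f'(x) (also written df/dx) is the instantaneous rate of change at a point. It is defined as the limit of the slope between two points as the distance between them approaches zero:
$$
f'(x) = \lim_{\epsilon \to 0} \frac{f(x + \epsilon) - f(x)}{\epsilon}
$$
The derivative is itself a function of x. For example, if f(x) = x² then f'(x) = 2x.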
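The result f'(x) = 2x can be checked directly against the limit definition. The short derivation below is a worked example that expands the square and cancels the step ε before taking the limit:

$$
f'(x) = \lim_{\epsilon \to 0} \frac{(x + \epsilon)^2 - x^2}{\epsilon}
      = \lim_{\epsilon \to 0} \frac{2x\epsilon + \epsilon^2}{\epsilon}
      = \lim_{\epsilon \to 0} (2x + \epsilon)
      = 2x
$$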
Numerical differentiation
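When an analytic formula is not available, derivatives can be approximated numerically using the definition above: evaluate the difference quotient with a small but nonzero step instead of taking the limit. For example, for f(x, y) = x²y + y + 2 at x = 3, y = 4, the approximations should be close to the exact values ∂f/∂x = 2xy = 24 and ∂f/∂y = x² + 1 = 10.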
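As a minimal sketch of this idea, the snippet below estimates both partial derivatives of the example function with forward differences. The helper name `approximate_partials` and the step size `eps=1e-6` are illustrative assumptions, not names taken from the notebooks.

```python
def f(x, y):
    """Example function from the text: f(x, y) = x^2 * y + y + 2."""
    return x * x * y + y + 2


def approximate_partials(func, x, y, eps=1e-6):
    """Forward-difference estimates of df/dx and df/dy at (x, y).

    `eps` is an assumed step size; it must be small but not so small
    that floating-point rounding dominates the result.
    """
    base = func(x, y)                        # one baseline evaluation
    df_dx = (func(x + eps, y) - base) / eps  # nudge x only
    df_dy = (func(x, y + eps) - base) / eps  # nudge y only
    return df_dx, df_dy


df_dx, df_dy = approximate_partials(f, 3.0, 4.0)
print(df_dx, df_dy)  # close to the exact values 24 and 10
```

Note that two parameters already cost three function evaluations: one baseline plus one per parameter.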
Numerical differentiation is accurate enough for checking implementations but is too slow for production use in deep learning: computing gradients for n parameters requires at least n + 1 forward passes (one baseline plus one per perturbed parameter), which is prohibitive for networks with millions of parameters.
Partial derivatives
When a function depends on multiple inputs, the partial derivative ∂f/∂x measures the rate of change with respect to one variable while holding all others constant.
For f(x, y) = x²y + y + 2:
$$
\frac{\partial f}{\partial x} = 2xy \qquad \frac{\partial f}{\partial y} = x^2 + 1
$$
At x = 3, y = 4: ∂f/∂x = 24 and ∂f/∂y = 10.
Gradient
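The gradient of a scalar-valued function f is the vector of all its partial derivatives:
$$
\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right)
$$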
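To connect the gradient with the idea of adjusting parameters to reduce a loss, here is a small sketch that treats f(x, y) = x²y + y + 2 as a stand-in for a loss and takes one step against its gradient. The function `grad_f` and the `learning_rate` value are hypothetical choices made for illustration only.

```python
def f(x, y):
    """Stand-in for a loss: f(x, y) = x^2 * y + y + 2."""
    return x * x * y + y + 2


def grad_f(x, y):
    """Analytic gradient of f: (df/dx, df/dy) = (2xy, x^2 + 1)."""
    return (2 * x * y, x * x + 1)


x, y = 3.0, 4.0
learning_rate = 0.01  # hypothetical step size, chosen only for this demo

gx, gy = grad_f(x, y)
print(gx, gy)  # 24.0 10.0

# Move each input a small distance against its partial derivative.
new_x = x - learning_rate * gx
new_y = y - learning_rate * gy

before = f(x, y)
after = f(new_x, new_y)
print(before, after)  # the value after the step is lower than before
```

The gradient points in the direction of steepest increase, so stepping the opposite way lowers the function value for a sufficiently small step.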
The chain rule
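The chain rule is the cornerstone of backpropagation. If z = v(u(x)) (a composition of two functions), then:
$$
\frac{dz}{dx} = \frac{dz}{du} \cdot \frac{du}{dx}
$$
The same rule extends to a longer chain z = f_n(f_{n-1}(...f_1(x)...)):
$$
\frac{dz}{dx} = \frac{df_n}{df_{n-1}} \cdot \frac{df_{n-1}}{df_{n-2}} \cdots \frac{df_2}{df_1} \cdot \frac{df_1}{dx}
$$
Each factor is the derivative of one function with respect to its own input.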
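The chain rule for a two-function composition can be checked numerically. The sketch below assumes the illustrative choices u(x) = x² and v(u) = sin(u), which are not taken from the notebooks, and compares the chain-rule product with a central-difference estimate.

```python
import math


def u(x):
    return x * x          # inner function u(x) = x^2


def v(s):
    return math.sin(s)    # outer function v(u) = sin(u)


def z(x):
    return v(u(x))        # composition z = v(u(x)) = sin(x^2)


def dz_dx_chain(x):
    """Chain rule: dz/dx = dz/du * du/dx = cos(x^2) * 2x."""
    dz_du = math.cos(u(x))
    du_dx = 2 * x
    return dz_du * du_dx


def central_difference(func, x, h=1e-6):
    """Numerical estimate of func'(x), used only as a sanity check."""
    return (func(x + h) - func(x - h)) / (2 * h)


x0 = 1.5
analytic = dz_dx_chain(x0)
numeric = central_difference(z, x0)
print(analytic, numeric)  # the two values agree to several decimal places
```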
Second derivatives: Jacobians and Hessians
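The Jacobian of a vector-valued function is the matrix of all its first-order partial derivatives, with entry (i, j) equal to ∂fᵢ/∂xⱼ. The Hessian is the matrix of second-order partial derivatives of a scalar function:
$$
H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}
$$
For f(x, y) = x²y + y + 2, the Hessian has rows (2y, 2x) and (2x, 0).
Automatic differentiation with TensorFlow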
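TensorFlow’s tf.GradientTape records the operations that involve trainable tf.Variable objects inside its context (other tensors must be registered with tape.watch) and can then differentiate a result with respect to those variables.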
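A minimal sketch of this workflow, assuming TensorFlow 2.x in eager mode and reusing the example function f(x, y) = x²y + y + 2. The variable named `unused` is a hypothetical extra variable, added only to show what the tape returns for a variable the result never reads.

```python
import tensorflow as tf


def f(x, y):
    return x * x * y + y + 2


x = tf.Variable(3.0)
y = tf.Variable(4.0)
unused = tf.Variable(1.0)  # hypothetical variable that f never reads

# Operations on trainable variables inside the context are recorded.
with tf.GradientTape() as tape:
    z = f(x, y)

# Reverse-mode pass: one call returns the gradient for every source.
df_dx, df_dy, d_unused = tape.gradient(z, [x, y, unused])

print(float(df_dx), float(df_dy))  # 24.0 10.0
print(d_unused)                    # None: z does not depend on `unused`
```

If zeros are preferred over None, `tape.gradient` accepts `unconnected_gradients=tf.UnconnectedGradients.ZERO`.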
When a tensor does not depend on a variable at all, tape.gradient returns None by default rather than 0. Always handle None values when assembling Hessians or Jacobians in code.
Forward mode vs. reverse mode
Both modes implement the chain rule but traverse it in opposite directions. Here s₁, s₂, …, sₙ denote the successive intermediate values computed between the input x and the output z.
- Forward mode: compute ∂s₁/∂x first, then ∂s₂/∂s₁, and so on toward the output. Efficient when there are few inputs and many outputs.
- Reverse mode (backpropagation): compute ∂z/∂sₙ first, then work back toward the inputs. Efficient when there are many inputs (model parameters) and few outputs (the loss). This is why all major deep-learning frameworks rely on reverse mode to compute training gradients.
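The difference between the two traversal orders can be sketched in plain Python. The classes `Dual` and `Node` and the function `backward` below are hypothetical teaching helpers that support only addition and multiplication. They are not how TensorFlow is implemented, but they show why forward mode needs one pass per input while reverse mode gets every partial derivative from a single backward pass.

```python
class Dual:
    """Forward mode: a value paired with its derivative w.r.t. one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __radd__ = __add__
    __rmul__ = __mul__


class Node:
    """Reverse mode: a value that remembers its parents and local derivatives."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent_node, d(self)/d(parent))
        self.grad = 0.0         # will hold d(output)/d(self)

    @staticmethod
    def lift(other):
        return other if isinstance(other, Node) else Node(other)

    def __add__(self, other):
        other = Node.lift(other)
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = Node.lift(other)
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

    __radd__ = __add__
    __rmul__ = __mul__


def backward(output):
    """Propagate d(output)/d(node) from the output back to every input."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)  # parents are appended before their consumers

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # consumers first, then their parents
        for parent, local in node.parents:
            parent.grad += node.grad * local


def f(x, y):
    return x * x * y + y + 2


# Forward mode: one pass per input, seeding that input's derivative with 1.
forward_dx = f(Dual(3.0, 1.0), Dual(4.0, 0.0)).deriv
forward_dy = f(Dual(3.0, 0.0), Dual(4.0, 1.0)).deriv

# Reverse mode: one forward pass plus one backward pass covers all inputs.
x, y = Node(3.0), Node(4.0)
backward(f(x, y))

print(forward_dx, forward_dy)  # 24.0 10.0
print(x.grad, y.grad)          # 24.0 10.0
```

With two inputs the cost difference is negligible, but with millions of parameters and a single scalar loss, forward mode would need millions of passes where reverse mode needs one.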