Chapter 4 opens the black box and shows exactly how linear models are trained. You will implement gradient descent by hand, then use Scikit-Learn’s optimised implementations. The chapter covers the Normal Equation for linear regression, the three flavours of gradient descent, polynomial feature engineering, and regularisation techniques that prevent overfitting.
What you’ll learn
- Deriving and applying the Normal Equation for closed-form linear regression
- Implementing batch gradient descent, stochastic gradient descent (SGD), and mini-batch gradient descent
- Learning rate schedules and convergence criteria
- Creating polynomial features with PolynomialFeatures
- Diagnosing overfitting and underfitting with learning curves
- Regularised regression: Ridge (L2), Lasso (L1), and ElasticNet
- Early stopping as a regularisation technique
- Logistic regression for binary and multiclass problems (Softmax regression)
Key concepts
The Normal Equation. For linear regression, the optimal weights can be computed analytically: θ = (XᵀX)⁻¹ Xᵀy. The solution is exact, but inverting XᵀX costs roughly O(n³) in the number of features n, so gradient descent is preferable when there are many features or when the training set does not fit in memory.
Gradient descent. Gradient descent iteratively adjusts the model parameters by stepping in the direction opposite to the gradient, which is the direction of steepest decrease of the loss function. Batch gradient descent computes the gradient over the entire training set at every step: slow per step, but each step uses the exact gradient. Stochastic gradient descent (SGD) uses a single random instance per step: fast but noisy. Mini-batch gradient descent uses small random subsets and balances speed with accuracy, making it the most widely used variant in practice.
Polynomial features and regularisation. Adding polynomial features (PolynomialFeatures) lets a linear model fit non-linear data. However, high-degree polynomials overfit easily. Ridge regression adds an L2 penalty on the weights; Lasso adds an L1 penalty that can drive some weights to exactly zero, effectively performing feature selection. ElasticNet combines both penalties.
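The effect of the two penalties on a high-degree polynomial model can be sketched as follows. This is a minimal illustration, not the notebook's own code: the synthetic quadratic dataset, the degree of 10, and the alpha values are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumed synthetic data: a noisy quadratic, y = 0.5 x^2 + x + 2 + noise
rng = np.random.default_rng(42)
m = 100
X = 6 * rng.random((m, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=m)


def fit_poly(regressor, degree=10):
    """Expand X to polynomial features, scale them, then fit the regressor."""
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),  # penalties are sensitive to feature scale
        regressor,
    )
    return model.fit(X, y)


plain = fit_poly(LinearRegression())                  # no penalty
ridge = fit_poly(Ridge(alpha=1.0))                    # L2 penalty (alpha assumed)
lasso = fit_poly(Lasso(alpha=0.1, max_iter=100_000))  # L1 penalty (alpha assumed)

for name, model in [("plain", plain), ("ridge", ridge), ("lasso", lasso)]:
    coefs = model[-1].coef_  # weights of the final regressor in the pipeline
    n_zero = int(np.sum(coefs == 0))
    print(f"{name}: weight norm = {np.linalg.norm(coefs):.3f}, zero weights = {n_zero}")
```

Comparing the printed weight norms shows Ridge shrinking the weights relative to the unpenalised fit, while the count of zero weights shows Lasso discarding some polynomial terms entirely.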
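Logistic and Softmax regression. Logistic regression estimates the probability that an instance belongs to the positive class; the model is trained by minimising the log loss, and an instance is classified as positive when the estimated probability is at least 50%. Softmax regression generalises logistic regression to multiple classes: it computes one score per class and normalises the scores into probabilities with the softmax function.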
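A short sketch of both cases with Scikit-Learn's LogisticRegression is shown below. The three Gaussian clusters are an assumed synthetic dataset, and the example relies on recent Scikit-Learn versions, where the default solver fits a multinomial (softmax) model when there are more than two classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed synthetic data: three 2-D Gaussian clusters, 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])
y = np.repeat([0, 1, 2], 50)

# Binary logistic regression: "is this instance in class 0?"
binary_clf = LogisticRegression().fit(X, y == 0)
p_pos = binary_clf.predict_proba([[0.0, 0.0]])[0, 1]  # P(class 0) at the origin

# Softmax regression: one score per class, normalised into probabilities
softmax_clf = LogisticRegression().fit(X, y)
x_new = [[4.0, 0.0]]
proba = softmax_clf.predict_proba(x_new)

# Recompute the probabilities by hand from the per-class scores
scores = softmax_clf.decision_function(x_new)
exp_scores = np.exp(scores - scores.max())  # subtract max for numerical stability
manual_proba = exp_scores / exp_scores.sum()

print("P(positive) at origin:", p_pos)
print("class probabilities:", proba.round(3))
print("manual softmax:     ", manual_proba.round(3))
```

The manual computation is only there to make the softmax step visible; in practice predict_proba is all that is needed.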
Code examples
Fitting linear regression with the Normal Equation and with LinearRegression:
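The sketch below computes the linear regression weights three ways: with the Normal Equation, with Scikit-Learn's LinearRegression, and with a hand-written batch gradient descent loop. It is an illustration under stated assumptions rather than the notebook's own listing: the synthetic dataset (y = 4 + 3x plus Gaussian noise), the learning rate eta, and the number of epochs are all chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed synthetic data: y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=m)

# Prepend a column of ones so that theta[0] acts as the bias term
X_b = np.c_[np.ones(m), X]

# 1. Normal Equation: theta = (X^T X)^-1 X^T y
theta_ne = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# 2. Scikit-Learn (handles the bias term internally)
lin_reg = LinearRegression().fit(X, y)
theta_sk = np.r_[lin_reg.intercept_, lin_reg.coef_]

# 3. Batch gradient descent on the MSE loss
eta = 0.1        # learning rate (assumed value)
n_epochs = 1000  # number of full passes over the training set (assumed value)
theta_gd = rng.normal(size=2)  # random initialisation
for _ in range(n_epochs):
    # Gradient of the MSE over the whole training set
    gradients = 2 / m * X_b.T @ (X_b @ theta_gd - y)
    theta_gd -= eta * gradients

print("Normal Equation: ", theta_ne)
print("LinearRegression:", theta_sk)
print("Batch GD:        ", theta_gd)
```

All three estimates should agree closely, and land near the generating parameters (4, 3) up to noise. Replacing the full-batch gradient with the gradient of one random instance, or of a small random subset, would turn the loop into stochastic or mini-batch gradient descent respectively.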
Running this notebook
No data download required
All datasets in this notebook are generated synthetically with NumPy, so no external downloads are needed.