Chapter 2 walks you through every stage of a realistic machine learning project from start to finish, using the California Housing dataset. You will download and explore the data, handle missing values and categorical features, build transformation pipelines with Pipeline and ColumnTransformer, train and evaluate several regression models, and fine-tune the best one. By the end you will have a deployable model for predicting California district median house values.
What you’ll learn
- How to frame a regression problem and choose a performance measure (RMSE)
- Downloading and exploring data with pandas (info, describe, hist, value_counts)
- Stratified train/test splitting with StratifiedShuffleSplit and train_test_split
- Visualizing geographical data and computing correlation matrices
- Handling missing values with SimpleImputer (median strategy)
- Encoding categorical features with OneHotEncoder
- Building numerical and categorical preprocessing pipelines with Pipeline and make_pipeline
- Combining heterogeneous pipelines with ColumnTransformer
- Scaling features with StandardScaler
- Training and evaluating LinearRegression, DecisionTreeRegressor, and RandomForestRegressor
- Cross-validating models with cross_val_score
- Hyperparameter search with GridSearchCV and RandomizedSearchCV
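As a rough illustration of the last item in the list, the sketch below runs a small grid search over a random forest. The data comes from make_regression rather than the housing dataset, and the grid values, fold count, and scoring string are assumptions chosen to keep the example fast, not the notebook's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): not the California Housing dataset.
X, y = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=42)

# Hypothetical grid: 3 x 2 = 6 candidate combinations.
param_grid = {"max_features": [2, 4, 6], "n_estimators": [10, 30]}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,                                   # 3-fold cross-validation per candidate
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, so RMSE is negated
)
grid_search.fit(X, y)

best_rmse = -grid_search.best_score_  # flip the sign back to a plain RMSE
print(grid_search.best_params_, round(best_rmse, 2))
```

RandomizedSearchCV takes the same estimator and scoring arguments but samples a fixed number of combinations from parameter distributions, which tends to scale better when the search space is large.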
Key concepts
The California Housing dataset. The dataset contains 20,640 district records from the 1990 California census, each with features such as longitude, latitude, housing median age, total rooms, total bedrooms, population, households, and median income. The target variable is median_house_value. One column, ocean_proximity, is categorical, which motivates the use of ColumnTransformer to apply different transformations to numerical and categorical columns.
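To make the first-look calls concrete, here is a minimal sketch using a tiny hand-made DataFrame. The column names match the dataset described above, but the four rows are made-up stand-in values (an assumption for illustration only); the real data has 20,640 rows and more columns.

```python
import pandas as pd

# Hypothetical stand-in rows: real column names, made-up values.
housing = pd.DataFrame({
    "longitude": [-122.2, -118.3, -117.1, -121.9],
    "latitude": [37.9, 34.1, 32.7, 37.3],
    "total_rooms": [880.0, 1467.0, 5612.0, 1274.0],
    "total_bedrooms": [129.0, None, 1283.0, 235.0],  # one missing value on purpose
    "median_income": [8.3, 3.6, 1.5, 5.6],
    "median_house_value": [452600.0, 226700.0, 66900.0, 341300.0],
    "ocean_proximity": ["NEAR BAY", "<1H OCEAN", "NEAR OCEAN", "<1H OCEAN"],
})

housing.info()                # dtypes and non-null counts per column
summary = housing.describe()  # count, mean, std, quartiles (numeric columns only)
category_counts = housing["ocean_proximity"].value_counts()
missing_bedrooms = housing["total_bedrooms"].isna().sum()

print(category_counts)
print("missing total_bedrooms:", missing_bedrooms)
```

The non-null count from info() is usually the quickest way to spot a column that needs imputation, and value_counts() shows how many categories an encoder will have to handle.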
Train/test splitting. A naive random split can produce a biased test set if the dataset has important strata. The notebook demonstrates stratified sampling on an income category derived from median_income, so that each income stratum is proportionally represented in both the training and test sets.
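A stratified split along these lines can be sketched as follows. The income values are random stand-ins, and the bin edges and the income_cat column name are assumptions made for this example; the point is only that the test set's stratum proportions track the full dataset's.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for median_income (assumption): uniform random values.
rng = np.random.default_rng(42)
housing = pd.DataFrame({"median_income": rng.uniform(0.5, 10.0, size=1000)})

# Bucket the continuous income into five strata (bin edges are an assumption).
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# stratify= keeps each stratum's share the same in both subsets.
train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42
)

overall = housing["income_cat"].value_counts(normalize=True).sort_index()
in_test = test_set["income_cat"].value_counts(normalize=True).sort_index()
max_gap = np.abs(overall.to_numpy() - in_test.to_numpy()).max()
print("largest proportion gap between full data and test set:", max_gap)
```

StratifiedShuffleSplit offers the same guarantee when several different stratified splits of one dataset are needed.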
Transformation pipelines. Scikit-Learn’s Pipeline chains multiple transformers and an optional final estimator into a single object. ColumnTransformer applies different pipelines to different columns—useful when a dataset has both numerical and categorical features.
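The following is a minimal sketch of that combination, assuming a toy three-column frame with made-up values. The step choices (median imputation, standard scaling, one-hot encoding) follow the techniques listed for this chapter, but the exact pipeline in the notebook may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in rows with one missing numeric value.
housing = pd.DataFrame({
    "total_rooms": [880.0, 1467.0, 5612.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 1283.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "<1H OCEAN", "NEAR OCEAN", "<1H OCEAN"],
})

num_attribs = ["total_rooms", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

# Numeric columns: fill gaps with the median, then standardize.
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Categorical column: one binary column per category.
cat_pipeline = make_pipeline(OneHotEncoder(handle_unknown="ignore"))

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])

housing_prepared = preprocessing.fit_transform(housing)
print(housing_prepared.shape)  # 2 numeric columns + 3 one-hot columns
```

Because the whole preprocessing step is one object, it can be fitted on the training set and then reused unchanged on the test set or on new data.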
Cross-validation. cross_val_score performs k-fold cross-validation and returns an array of scores, one per fold. Computing the mean and standard deviation of these scores gives a more reliable estimate of model performance than a single train/test split.
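As an illustration, the sketch below cross-validates a decision tree on synthetic regression data. The dataset, the 10-fold setting, and the scoring string are assumptions for the example rather than the notebook's exact configuration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): not the housing dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

tree_reg = DecisionTreeRegressor(random_state=42)

# Scikit-learn scorers follow "greater is better", so RMSE comes back negated.
neg_rmses = cross_val_score(
    tree_reg, X, y, scoring="neg_root_mean_squared_error", cv=10
)
rmses = -neg_rmses  # one RMSE per fold

print(f"mean RMSE: {rmses.mean():.2f}, std: {rmses.std():.2f}")
```

The standard deviation across folds gives a sense of how much the estimate would move with a different split, which a single validation score cannot show.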
Code examples
The examples for this chapter cover loading the dataset and building a ColumnTransformer.
The notebook also demonstrates custom transformers that add derived features (rooms per household, bedrooms per room, population per household) to improve model accuracy.
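One way such derived features might be added is with a FunctionTransformer wrapping a plain function, as sketched below. The function name, the new column names, and the two sample rows are hypothetical; the notebook's own transformers may be structured differently.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_ratio_features(X):
    """Return a copy of X with three ratio columns appended (hypothetical helper)."""
    X = X.copy()
    X["rooms_per_household"] = X["total_rooms"] / X["households"]
    X["bedrooms_per_room"] = X["total_bedrooms"] / X["total_rooms"]
    X["population_per_household"] = X["population"] / X["households"]
    return X

# Wrapping the function lets it be used as a step inside a Pipeline.
ratio_adder = FunctionTransformer(add_ratio_features)

# Made-up rows for illustration only.
housing = pd.DataFrame({
    "total_rooms": [800.0, 1500.0],
    "total_bedrooms": [200.0, 300.0],
    "population": [400.0, 900.0],
    "households": [100.0, 300.0],
})

housing_extra = ratio_adder.fit_transform(housing)
print(housing_extra[["rooms_per_household", "bedrooms_per_room",
                     "population_per_household"]])
```

Ratios like these normalize raw totals by district size, which is why they can carry more signal about house values than the totals themselves.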
Running this notebook
Open in Colab
Install dependencies
The notebook requires Python ≥ 3.7 and Scikit-Learn ≥ 1.0.1. On Colab these are pre-installed. Locally, run:
pip install scikit-learn pandas matplotlib
Exercises
The chapter ends with two exercises: (1) try an SVR with various kernels and hyperparameters and compare results, and (2) add a transformer to the pipeline that selects only the most important attributes. Solutions are included in the notebook.