Pandas provides high-performance, easy-to-use data structures and analysis tools built on top of NumPy. The central object is the DataFrame, a 2D table with labelled columns and a row index, similar to a spreadsheet or a SQL table. In the ML notebooks, Pandas is used to load datasets, inspect them, clean missing values, engineer features, and prepare the data for scikit-learn or TensorFlow.
Always use .loc when accessing by label and .iloc when accessing by integer position. Bare [] works for Series, but the semantics change for DataFrames: a column name selects a column, while a slice selects rows. Building the explicit habit avoids surprises.
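The difference is easiest to see when the row labels are integers that do not match the positions. The following is a minimal sketch on a made-up DataFrame; the column names and values are hypothetical and not taken from the notebooks.

```python
import pandas as pd

# Hypothetical toy data; the integer row labels are deliberately not 0, 1, 2
df = pd.DataFrame(
    {"rooms": [3, 5, 4], "price": [200.0, 350.0, 275.0]},
    index=[10, 20, 30],
)

by_label = df.loc[10, "price"]    # row labelled 10, column "price"
by_position = df.iloc[0, 1]       # first row, second column
label_slice = df.loc[10:20]       # label slices include both endpoints
position_slice = df.iloc[0:2]     # position slices exclude the stop

column = df["price"]              # bare [] with a name selects a column
rows = df[0:2]                    # bare [] with a slice selects rows by position
```

Here df.loc[10:20] and df.iloc[0:2] happen to return the same two rows, but for different reasons: the first matches labels and includes the endpoint, the second counts positions and stops before index 2.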
Real datasets contain missing data. Pandas represents missing numeric values as NaN:
df.isnull()           # boolean mask of missing values
df.isnull().sum()     # count of missing values per column
df.fillna(0)          # fill NaN with 0
df.fillna(df.mean())  # fill each column with its mean
df.dropna()           # drop any row with at least one NaN
df.dropna(axis=1)     # drop any column with at least one NaN
df.dropna(thresh=2)   # keep rows with at least 2 non-NaN values
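To see what these calls do, it may help to run them on a small frame. This is a sketch on invented data (the columns rooms, price, and city are hypothetical). Because the frame contains a text column, the mean is computed with numeric_only=True; that is an adjustment for this example, not something the listing above requires when every column is numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both numeric and text columns
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 4.0, np.nan],
    "price": [200.0, 350.0, np.nan, np.nan],
    "city": ["A", "B", "C", None],
})

missing_per_column = df.isnull().sum()           # missing count per column

# Fill numeric columns with their means; the text column is left untouched
filled = df.fillna(df.mean(numeric_only=True))

complete_rows = df.dropna()                      # only rows with no missing values
at_least_two = df.dropna(thresh=2)               # rows with 2 or more non-missing values
```

Each of these methods returns a new object, so df itself still contains its original gaps afterwards.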
The housing dataset used in Chapter 2 of the book is loaded with pd.read_csv. After loading, df.info() and df.describe() are the first steps to understanding what you have before any preprocessing.
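A minimal sketch of that first look follows. The source does not give the file path, so this example reads a small in-memory CSV instead; the column names and rows are hypothetical stand-ins, not the real housing data. In the notebook you would pass the path of the actual file to pd.read_csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for the housing CSV (invented rows and columns)
csv_text = """longitude,latitude,total_bedrooms,median_house_value,ocean_proximity
-122.23,37.88,129.0,452600.0,NEAR BAY
-122.22,37.86,,358500.0,NEAR BAY
-118.30,34.05,310.0,210000.0,INLAND
"""
df = pd.read_csv(io.StringIO(csv_text))

df.info()                # prints column dtypes and non-null counts
summary = df.describe()  # count, mean, std, min, quartiles, max for numeric columns
```

The non-null counts from df.info() show which columns have gaps (here total_bedrooms), and df.describe() skips the text column by default, so the two calls together indicate which columns need cleaning or encoding.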