Pandas tutorial: DataFrames and data preparation for ML

Pandas provides high-performance, easy-to-use data structures and analysis tools built on top of NumPy. The central object is the DataFrame — a 2D table with labelled columns and row indexes, similar to a spreadsheet or a SQL table. In the ML notebooks, Pandas is used to load datasets, inspect them, clean missing values, engineer features, and prepare them for scikit-learn or TensorFlow.

Series

A Series is a 1D labelled array. It behaves much like a NumPy ndarray but carries an index:

import pandas as pd
import numpy as np

s = pd.Series([2, -1, 3, 5])
# 0    2
# 1   -1
# 2    3
# 3    5
# dtype: int64

NumPy functions work directly on Series:

np.exp(s)
# 0      7.389056
# 1      0.367879
# 2     20.085537
# 3    148.413159
# dtype: float64

Arithmetic operations are element-wise, and a scalar is broadcast to every element (same as NumPy):

s + 1000
# 0    1002
# 1     999
# 2    1003
# 3    1005

s < 0
# 0    False
# 1     True
# 2    False
# 3    False

Custom index labels

Pass an index to use string (or any hashable) keys:

s2 = pd.Series([68, 83, 112, 68],
               index=["alice", "bob", "charles", "darwin"])

s2["bob"]      # 83
s2.loc["bob"]  # 83  — explicit label access (recommended)
s2.iloc[1]     # 83  — position access (0-based)

Always use .loc when accessing by label and .iloc when accessing by integer position. Bare [] works for a Series, but on a DataFrame [] selects columns rather than rows, so building the explicit habit avoids surprises.
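The distinction matters most when the index labels are themselves integers. The following is a minimal sketch; the Series `s3` and its out-of-order labels are made up for illustration and do not appear elsewhere in this tutorial.

```python
import pandas as pd

# Hypothetical Series whose integer labels are deliberately out of order.
s3 = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s3.loc[0]      # value stored under the label 0
by_position = s3.iloc[0]  # value stored in the first position

print(by_label, by_position)
```

Here the label 0 and the position 0 point at different elements, which is why the explicit accessors are safer than relying on bare [].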

DataFrame

A DataFrame is a 2D table where each column is a Series sharing a common row index.

Creating a DataFrame

people_dict = {
    "weight": pd.Series([68, 83, 112, 68],
                        index=["alice", "bob", "charles", "darwin"]),
    "birthyear": pd.Series([1984, 1985, 1992, 1996],
                           index=["bob", "alice", "darwin", "charles"]),
    "children": pd.Series([0, 3], index=["alice", "bob"]),
    "hobby": pd.Series(["Biking", "Dancing"],
                       index=["alice", "bob"]),
}
people = pd.DataFrame(people_dict)
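Because the Series above have different indexes, the constructor aligns them by label and fills the gaps with NaN. A short sketch of what that alignment looks like, rebuilding the same data so the block runs on its own (the variable names `alice_birthyear`, `missing_children`, and `children_dtype` are introduced here only for illustration):

```python
import pandas as pd

people = pd.DataFrame({
    "weight": pd.Series([68, 83, 112, 68],
                        index=["alice", "bob", "charles", "darwin"]),
    "birthyear": pd.Series([1984, 1985, 1992, 1996],
                           index=["bob", "alice", "darwin", "charles"]),
    "children": pd.Series([0, 3], index=["alice", "bob"]),
    "hobby": pd.Series(["Biking", "Dancing"], index=["alice", "bob"]),
})

# Values are matched by label, not by position in the original lists.
alice_birthyear = people.loc["alice", "birthyear"]

# charles and darwin have no "children" entry, so those cells are NaN.
missing_children = people["children"].isna().sum()

# NaN forces the integer column to become floating point.
children_dtype = people["children"].dtype

print(people)
```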

Inspection

These four methods are usually the first things you call after loading a dataset:

df.head()       # first 5 rows
df.tail(3)      # last 3 rows
df.info()       # column names, dtypes, non-null counts
df.describe()   # count, mean, std, min, quartiles, max
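As a rough illustration of what these calls return, here is a sketch on a tiny made-up DataFrame (the columns `age` and `score` and their values are assumptions chosen for the example):

```python
import pandas as pd

# Hypothetical data, just large enough to inspect.
df = pd.DataFrame({"age": [22, 35, 58], "score": [0.5, 0.95, 0.8]})

first_two = df.head(2)   # a DataFrame holding the first 2 rows
df.info()                # prints a summary to stdout and returns None
summary = df.describe()  # a DataFrame of summary statistics per column

mean_age = summary.loc["mean", "age"]
print(summary)
```

Note that describe() returns an ordinary DataFrame, so individual statistics can be read back with .loc.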

Indexing DataFrames

Access a single column as a Series, or multiple columns as a new DataFrame:

df["age"]             # Series
df[["name", "age"]]   # DataFrame

Row selection with loc (label-based) and iloc (position-based):

df.loc[0]          # row with index label 0
df.loc[0:2]        # rows 0, 1, 2 (inclusive both ends)
df.iloc[0:2]       # rows 0 and 1 (exclusive end, like Python slices)

Boolean selection (very common in ML for filtering outliers or subsets):

df[df["age"] > 25]                         # all rows where age > 25
df[(df["age"] > 25) & (df["score"] > 0.9)] # combined condition
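The selections above can be tried on a small made-up table. This is a sketch only: the rows are invented, and the column names simply mirror the snippets in this section.

```python
import pandas as pd

# Hypothetical data with a default integer index (0, 1, 2, 3).
df = pd.DataFrame({
    "name": ["ann", "ben", "cal", "dee"],
    "age": [22, 31, 45, 28],
    "score": [0.95, 0.97, 0.85, 0.91],
})

by_label = df.loc[0:2]      # labels 0, 1, 2 -> 3 rows (end included)
by_position = df.iloc[0:2]  # positions 0, 1 -> 2 rows (end excluded)

# A boolean Series used as a mask keeps only the rows marked True.
older = df[df["age"] > 25]

# Each condition needs its own parentheses because & binds more
# tightly than the comparison operators.
older_high = df[(df["age"] > 25) & (df["score"] > 0.9)]

print(older_high)
```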

Handling missing values

Real datasets contain missing data. Pandas represents missing numeric values as NaN:

df.isnull()          # boolean mask of missing values
df.isnull().sum()    # count of missing values per column

df.fillna(0)                   # fill NaN with 0
df.fillna(df.mean(numeric_only=True))  # fill each numeric column with its mean
df.dropna()                    # drop any row with at least one NaN
df.dropna(axis=1)              # drop any column with at least one NaN
df.dropna(thresh=2)            # keep rows with at least 2 non-NaN values
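To see how these calls differ, the following sketch applies them to a small made-up DataFrame (the column names `a`, `b`, `c` and the values are assumptions for illustration). Each call returns a new DataFrame and leaves `df` unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few holes in it.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

missing_per_column = df.isnull().sum()

# numeric_only=True skips any non-numeric columns when computing means.
filled = df.fillna(df.mean(numeric_only=True))

rows_complete = df.dropna()         # only rows without any NaN
cols_complete = df.dropna(axis=1)   # only columns without any NaN
rows_thresh = df.dropna(thresh=2)   # rows with at least 2 non-NaN values

print(filled)
```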

`groupby`

groupby splits the DataFrame, applies an aggregation function, and combines the results — a common pattern when computing per-class statistics:

# here df is the people DataFrame created earlier
df.groupby("hobby")["weight"].mean()
# Biking     68.0
# Dancing    83.0

df.groupby("hobby").agg({"weight": "mean", "birthyear": "min"})
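With more than one row per group the split-apply-combine pattern becomes easier to see. A minimal sketch on invented data (the rows are hypothetical; only the column names are borrowed from the example above):

```python
import pandas as pd

# Hypothetical data with two rows per hobby.
df = pd.DataFrame({
    "hobby": ["Biking", "Dancing", "Biking", "Dancing"],
    "weight": [68, 83, 72, 61],
    "birthyear": [1985, 1984, 1990, 1979],
})

# One aggregation on one column: the result is a Series indexed by group.
mean_weight = df.groupby("hobby")["weight"].mean()

# Different aggregations per column: the result is a DataFrame.
stats = df.groupby("hobby").agg({"weight": "mean", "birthyear": "min"})

print(stats)
```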

Merge and join

Merge two DataFrames on a common key, much like a SQL join:

left  = pd.DataFrame({"key": ["A", "B", "C"], "val_left":  [1, 2, 3]})
right = pd.DataFrame({"key": ["B", "C", "D"], "val_right": [4, 5, 6]})

pd.merge(left, right, on="key", how="inner")  # only matching keys
pd.merge(left, right, on="key", how="left")   # all left keys
pd.merge(left, right, on="key", how="outer")  # all keys from both
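The three join types differ in which keys survive and where NaN appears. The sketch below repeats the two frames so it runs on its own; the result names `inner`, `left_join`, and `outer` are introduced here for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["A", "B", "C"], "val_left": [1, 2, 3]})
right = pd.DataFrame({"key": ["B", "C", "D"], "val_right": [4, 5, 6]})

inner = pd.merge(left, right, on="key", how="inner")
left_join = pd.merge(left, right, on="key", how="left")
outer = pd.merge(left, right, on="key", how="outer")

# "A" has no partner in right, so its val_right is NaN in the left join.
# "D" has no partner in left, so its val_left is NaN in the outer join.
print(inner)
print(left_join)
print(outer)
```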

`value_counts`

Count the occurrences of each unique value — useful for checking class balance:

# here df is the people DataFrame created earlier; NaN entries are not counted
df["hobby"].value_counts()
# Biking     1
# Dancing    1
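For class balance it is often more convenient to look at proportions than raw counts. A small sketch, assuming a made-up label column (the Series `labels` and its values are hypothetical), using the `normalize` parameter of value_counts:

```python
import pandas as pd

# Hypothetical class labels for four samples.
labels = pd.Series(["cat", "dog", "cat", "cat"])

counts = labels.value_counts()                 # absolute counts
shares = labels.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```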

Converting to NumPy

To hand a DataFrame or Series to scikit-learn or TensorFlow as a plain NumPy array, call .to_numpy():

X = df[["age", "score"]].to_numpy()   # shape (n_samples, n_features)
y = df["label"].to_numpy()
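A quick way to check the conversion is to look at the resulting shapes and dtype. The sketch below uses invented data; the column names follow the snippet above.

```python
import pandas as pd

# Hypothetical dataset with two features and one label column.
df = pd.DataFrame({
    "age": [22, 31, 45],
    "score": [0.95, 0.97, 0.85],
    "label": [0, 1, 1],
})

X = df[["age", "score"]].to_numpy()  # 2D array: one row per sample
y = df["label"].to_numpy()           # 1D array: one entry per sample

# Mixing an integer column with a float column yields a float array.
print(X.shape, X.dtype, y.shape)
```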

Reading and writing files

df = pd.read_csv("housing.csv")
df.to_csv("output.csv", index=False)

df = pd.read_excel("data.xlsx")   # needs an Excel engine such as openpyxl installed

The housing dataset used in Chapter 2 of the book is loaded with pd.read_csv. After loading, df.info() and df.describe() are the first steps to understanding what you have before any preprocessing.
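A CSV round trip can be sketched without any external file by writing to a temporary directory first. The data below is invented and is not the housing dataset; the column names `rooms` and `price` are assumptions for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for a real dataset.
df = pd.DataFrame({"rooms": [3, 5], "price": [250000.0, 410000.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.csv")
    # index=False keeps the row index out of the file.
    df.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored)
```

Without index=False the row index would be written as an extra unnamed column and would reappear as data when the file is read back.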
