
The Titanic Survival Prediction project applies binary classification to one of the most widely studied datasets in machine learning. The goal is to predict whether a passenger survived the sinking of the RMS Titanic based on features such as age, sex, passenger class, and ticket fare. This project serves as an introduction to binary classification, feature engineering on mixed-type data, and handling missing values in real-world datasets.

Overview

Survival on the Titanic was not random — socioeconomic status, sex, and age strongly influenced who made it onto lifeboats. This makes the dataset a compelling case study for classification: even simple models can achieve high accuracy by capturing these structural patterns.

Problem type: Binary classification
Target variable: Survived (0 = did not survive, 1 = survived)
Dataset: Titanic dataset (Kaggle)
Model: Logistic Regression

Dataset

The Titanic dataset contains demographic and ticketing information for 891 passengers in the training set.
Feature  | Type                  | Description
-------- | --------------------- | ---------------------------------------------
Pclass   | Categorical (1, 2, 3) | Ticket class — proxy for socioeconomic status
Sex      | Categorical           | Passenger sex (male / female)
Age      | Numeric               | Age in years (contains missing values)
SibSp    | Numeric               | Number of siblings or spouses aboard
Parch    | Numeric               | Number of parents or children aboard
Fare     | Numeric               | Ticket fare paid
Embarked | Categorical           | Port of embarkation (C, Q, S)
Survived | Binary (target)       | Survival outcome

Missing value handling

  • Age: Imputed with the median age (approximately 28 years)
  • Embarked: Imputed with the most frequent value (S)
  • Cabin: Dropped due to high missing rate (~77%)
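These three rules can be applied directly with pandas. The frame below is an invented toy sample, not the real data — it only illustrates the imputation and drop steps:

```python
import pandas as pd

# Toy frame mirroring the relevant Titanic columns (values are illustrative)
df = pd.DataFrame({
    "Age": [22.0, None, 35.0, None, 28.0],
    "Embarked": ["S", "C", None, "S", "Q"],
    "Cabin": [None, "C85", None, None, None],
})

# Age: impute missing values with the median
df["Age"] = df["Age"].fillna(df["Age"].median())

# Embarked: impute with the most frequent value
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Cabin: drop the column entirely due to the high missing rate
df = df.drop(columns=["Cabin"])
```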

Feature engineering

Beyond the raw features, the following transformations improve model performance:
  • FamilySize = SibSp + Parch + 1 — captures whether a passenger traveled alone or in a group
  • IsAlone — binary flag derived from FamilySize == 1
  • Title — extracted from passenger name (Mr, Mrs, Miss, Master, Rare) — captures social status and age group more precisely than raw age
  • Sex and Embarked are label-encoded or one-hot encoded before training
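A sketch of these transformations in pandas, using two invented rows that follow the Kaggle column names (the regex for Title extraction is an assumption about the name format, e.g. "Braund, Mr. Owen Harris"):

```python
import pandas as pd

# Illustrative rows; column names follow the Kaggle Titanic schema
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    "Sex": ["male", "female"],
    "SibSp": [1, 1],
    "Parch": [0, 0],
    "Embarked": ["S", "C"],
})

# FamilySize counts the passenger plus siblings/spouses and parents/children
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# IsAlone flags passengers travelling without family
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Title: the honorific sits between the comma and the period in Name
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# One-hot encode the categorical columns before training
df = pd.get_dummies(df, columns=["Sex", "Embarked", "Title"])
```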

Preprocessing pipeline
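The notebook implements the imputation, feature engineering, and encoding steps described above. One way to sketch them as a single scikit-learn pipeline — the exact column lists and transformer choices here are assumptions, not taken from the project code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["Age", "Fare", "SibSp", "Parch"]
categorical = ["Sex", "Embarked", "Pclass"]

preprocess = ColumnTransformer([
    # Numeric columns: median imputation, then scaling for logistic regression
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    # Categorical columns: most-frequent imputation, then one-hot encoding
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
```

Bundling preprocessing and model into one Pipeline means the same imputation and scaling are applied automatically at inference time.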

Model

The project uses Logistic Regression as the primary classifier. Logistic regression is well-suited to this task because:
  • The outcome is binary (survived / did not survive)
  • Coefficients are interpretable — you can quantify the effect of being female vs. male, or first class vs. third class
  • The dataset is small enough that regularization and complex models offer limited advantage
The trained model is serialized to model/lr_model.pkl for inference.
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Standardize features so coefficients are on a comparable scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# max_iter raised so the solver converges reliably
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Serialize the trained model for inference
joblib.dump(model, "model/lr_model.pkl")

Key findings

Logistic regression on the Titanic dataset consistently reveals the following patterns:
  • Sex is the strongest predictor: Women had a much higher survival rate (~74%) than men (~19%)
  • Passenger class matters: First-class passengers survived at higher rates than third-class
  • Children had a slight survival advantage relative to adult men
  • Traveling alone was associated with lower survival probability
The “women and children first” loading protocol is statistically visible in the dataset. A logistic regression model recovers this pattern from data alone without domain knowledge being encoded manually.
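The per-group rates quoted above come from a simple groupby on the training data. A sketch on an invented subsample (the real ~74% / ~19% figures require the full train.csv):

```python
import pandas as pd

# Illustrative subsample; real rates come from the full Kaggle training set
df = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 0, 1],
})

# Mean of a 0/1 column per group is exactly the survival rate per group
rates = df.groupby("Sex")["Survived"].mean()
print(rates)
```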

Evaluation metrics

Binary classification models for Titanic are evaluated using:
Metric    | Description
--------- | -------------------------------------------------------
Accuracy  | Overall fraction of correct predictions
Precision | Of predicted survivors, how many actually survived
Recall    | Of actual survivors, how many were correctly identified
F1 Score  | Harmonic mean of precision and recall
ROC-AUC   | Area under the receiver operating characteristic curve
A logistic regression baseline typically achieves ~79–82% accuracy on the Kaggle test set.
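All five metrics are available in sklearn.metrics. A self-contained sketch on hypothetical labels and scores, purely to show the calls:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical ground truth, hard predictions, and predicted probabilities
y_true = [0, 1, 1, 0, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.9, 0.4, 0.2, 0.8, 0.6, 0.7, 0.3]

# Threshold-based metrics use the hard predictions
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# ROC-AUC is computed from the continuous scores, not the hard labels
auc = roc_auc_score(y_true, y_score)
```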

Running the project

1. Install dependencies

cd ML_To_Train/05_Titanic_Survival
pip install -r requirements.txt
2. Open the notebook

The project is implemented in Titanic_Survival.ipynb. Open it to run the full pipeline:
jupyter notebook Titanic_Survival.ipynb
3. Run all cells

Execute all notebook cells in order. The final cell saves the trained model to model/lr_model.pkl.
4. Load the model for inference

import joblib
import numpy as np

model = joblib.load("model/lr_model.pkl")

# Example: female, age 29, first class, no siblings, fare 100
# Note: inputs must be preprocessed the same way as the training data
# (encoded and scaled with the fitted StandardScaler) before predicting.
features = np.array([[1, 29, 1, 0, 0, 100, 0]])
prediction = model.predict(features)
print("Survived" if prediction[0] == 1 else "Did not survive")

Project structure

05_Titanic_Survival/
├── model/
│   └── lr_model.pkl          # Trained logistic regression model
├── Titanic_Survival.ipynb    # Full EDA, preprocessing, and training notebook
├── requirements.txt
└── readme.md
This project is notebook-only and does not include a Flask API. To serve predictions as an API, load lr_model.pkl in a Flask app and expose a POST /predict endpoint following the pattern used in the House Price Prediction project.
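A minimal sketch of such a Flask wrapper, assuming a POST /predict endpoint and a JSON feature-vector payload (both assumptions following the pattern described above, not actual project code). A stand-in model trained on random data replaces joblib.load so the snippet runs without the .pkl file:

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.linear_model import LogisticRegression

app = Flask(__name__)

# In the real app, load the serialized model instead:
#   model = joblib.load("model/lr_model.pkl")
# Stand-in model so this sketch is runnable end-to-end:
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 7))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [1, 29, 1, 0, 0, 100, 0]}
    payload = request.get_json(force=True)
    features = np.array(payload["features"], dtype=float).reshape(1, -1)
    return jsonify({"survived": int(model.predict(features)[0])})
```

The endpoint can be exercised locally with Flask's built-in test client before deploying.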
