
The Titanic Survival Prediction project applies binary classification to one of the most widely studied datasets in machine learning. The goal is to predict whether a passenger survived the sinking of the RMS Titanic based on features such as age, sex, passenger class, and ticket fare. This project serves as an introduction to binary classification, feature engineering on mixed-type data, and handling missing values in real-world datasets.

Overview

Survival on the Titanic was not random — socioeconomic status, sex, and age strongly influenced who made it onto lifeboats. This makes the dataset a compelling case study for classification: even simple models can achieve high accuracy by capturing these structural patterns.

Problem type: Binary classification
Target variable: Survived (0 = did not survive, 1 = survived)
Dataset: Titanic dataset (Kaggle)
Model: Logistic Regression

Dataset

The Titanic dataset contains demographic and ticketing information for 891 passengers in the training set.
Feature  | Type                  | Description
-------- | --------------------- | ---------------------------------------------
Pclass   | Categorical (1, 2, 3) | Ticket class — proxy for socioeconomic status
Sex      | Categorical           | Passenger sex (male / female)
Age      | Numeric               | Age in years (contains missing values)
SibSp    | Numeric               | Number of siblings or spouses aboard
Parch    | Numeric               | Number of parents or children aboard
Fare     | Numeric               | Ticket fare paid
Embarked | Categorical           | Port of embarkation (C, Q, S)
Survived | Binary (target)       | Survival outcome

Missing value handling

  • Age: Imputed with the median age (approximately 28 years)
  • Embarked: Imputed with the most frequent value (S)
  • Cabin: Dropped due to high missing rate (~77%)
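These three rules can be applied directly with pandas. The frame below is an invented toy sample, not the real data — it only illustrates the imputation and drop steps:

```python
import pandas as pd

# Toy frame mirroring the relevant Titanic columns (values are illustrative)
df = pd.DataFrame({
    "Age": [22.0, None, 35.0, None, 28.0],
    "Embarked": ["S", "C", None, "S", "Q"],
    "Cabin": [None, "C85", None, None, None],
})

# Age: impute missing values with the median
df["Age"] = df["Age"].fillna(df["Age"].median())

# Embarked: impute with the most frequent value
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Cabin: drop the column entirely due to the high missing rate
df = df.drop(columns=["Cabin"])
```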

Feature engineering

Beyond the raw features, the following transformations improve model performance:
  • FamilySize = SibSp + Parch + 1 — captures whether a passenger traveled alone or in a group
  • IsAlone — binary flag derived from FamilySize == 1
  • Title — extracted from passenger name (Mr, Mrs, Miss, Master, Rare) — captures social status and age group more precisely than raw age
  • Sex and Embarked are label-encoded or one-hot encoded before training
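A sketch of these transformations in pandas, using two invented rows that follow the Kaggle column names (the regex for Title extraction is an assumption about the name format, e.g. "Braund, Mr. Owen Harris"):

```python
import pandas as pd

# Illustrative rows; column names follow the Kaggle Titanic schema
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    "Sex": ["male", "female"],
    "SibSp": [1, 1],
    "Parch": [0, 0],
    "Embarked": ["S", "C"],
})

# FamilySize counts the passenger plus siblings/spouses and parents/children
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# IsAlone flags passengers travelling without family
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Title: the honorific sits between the comma and the period in Name
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# One-hot encode the categorical columns before training
df = pd.get_dummies(df, columns=["Sex", "Embarked", "Title"])
```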

Preprocessing pipeline
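The notebook implements the imputation, feature engineering, and encoding steps described above. One way to sketch them as a single scikit-learn pipeline — the exact column lists and transformer choices here are assumptions, not taken from the project code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["Age", "Fare", "SibSp", "Parch"]
categorical = ["Sex", "Embarked", "Pclass"]

preprocess = ColumnTransformer([
    # Numeric columns: median imputation, then scaling for logistic regression
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    # Categorical columns: most-frequent imputation, then one-hot encoding
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
```

Bundling preprocessing and model into one Pipeline means the same imputation and scaling are applied automatically at inference time.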

Model

The project uses Logistic Regression as the primary classifier. Logistic regression is well-suited to this task because:
  • The outcome is binary (survived / did not survive)
  • Coefficients are interpretable — you can quantify the effect of being female vs. male, or first class vs. third class
  • The dataset is small enough that regularization and complex models offer limited advantage
The trained model is serialized to model/lr_model.pkl for inference.
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Standardize features so coefficients are on a comparable scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# max_iter raised so the solver converges reliably
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Serialize the trained model for inference
joblib.dump(model, "model/lr_model.pkl")

Key findings

Logistic regression on the Titanic dataset consistently reveals the following patterns:
  • Sex is the strongest predictor: Women had a much higher survival rate (~74%) than men (~19%)
  • Passenger class matters: First-class passengers survived at higher rates than third-class
  • Children had a slight survival advantage relative to adult men
  • Traveling alone was associated with lower survival probability
The “women and children first” loading protocol is statistically visible in the dataset. A logistic regression model recovers this pattern from data alone without domain knowledge being encoded manually.
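The per-group rates quoted above come from a simple groupby on the training data. A sketch on an invented subsample (the real ~74% / ~19% figures require the full train.csv):

```python
import pandas as pd

# Illustrative subsample; real rates come from the full Kaggle training set
df = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 0, 1],
})

# Mean of a 0/1 column per group is exactly the survival rate per group
rates = df.groupby("Sex")["Survived"].mean()
print(rates)
```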

Evaluation metrics

Binary classification models for Titanic are evaluated using:
Metric    | Description
--------- | -------------------------------------------------------
Accuracy  | Overall fraction of correct predictions
Precision | Of predicted survivors, how many actually survived
Recall    | Of actual survivors, how many were correctly identified
F1 Score  | Harmonic mean of precision and recall
ROC-AUC   | Area under the receiver operating characteristic curve
A logistic regression baseline typically achieves ~79–82% accuracy on the Kaggle test set.
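All five metrics are available in sklearn.metrics. A self-contained sketch on hypothetical labels and scores, purely to show the calls:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical ground truth, hard predictions, and predicted probabilities
y_true = [0, 1, 1, 0, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.9, 0.4, 0.2, 0.8, 0.6, 0.7, 0.3]

# Threshold-based metrics use the hard predictions
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# ROC-AUC is computed from the continuous scores, not the hard labels
auc = roc_auc_score(y_true, y_score)
```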

Running the project

1. Install dependencies

cd ML_To_Train/05_Titanic_Survival
pip install -r requirements.txt
2. Open the notebook

The project is implemented in Titanic_Survival.ipynb. Open it to run the full pipeline:
jupyter notebook Titanic_Survival.ipynb
3. Run all cells

Execute all notebook cells in order. The final cell saves the trained model to model/lr_model.pkl.
4. Load the model for inference

import joblib
import numpy as np

model = joblib.load("model/lr_model.pkl")

# Example: female, age 29, first class, no siblings, fare 100
# Note: inputs must be preprocessed the same way as the training data
# (encoded and scaled with the fitted StandardScaler) before predicting.
features = np.array([[1, 29, 1, 0, 0, 100, 0]])
prediction = model.predict(features)
print("Survived" if prediction[0] == 1 else "Did not survive")

Project structure

05_Titanic_Survival/
├── model/
│   └── lr_model.pkl          # Trained logistic regression model
├── Titanic_Survival.ipynb    # Full EDA, preprocessing, and training notebook
├── requirements.txt
└── readme.md
This project is notebook-only and does not include a Flask API. To serve predictions as an API, load lr_model.pkl in a Flask app and expose a POST /predict endpoint following the pattern used in the House Price Prediction project.
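A minimal sketch of such a Flask wrapper, assuming a POST /predict endpoint and a JSON feature-vector payload (both assumptions following the pattern described above, not actual project code). A stand-in model trained on random data replaces joblib.load so the snippet runs without the .pkl file:

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.linear_model import LogisticRegression

app = Flask(__name__)

# In the real app, load the serialized model instead:
#   model = joblib.load("model/lr_model.pkl")
# Stand-in model so this sketch is runnable end-to-end:
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 7))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [1, 29, 1, 0, 0, 100, 0]}
    payload = request.get_json(force=True)
    features = np.array(payload["features"], dtype=float).reshape(1, -1)
    return jsonify({"survived": int(model.predict(features)[0])})
```

The endpoint can be exercised locally with Flask's built-in test client before deploying.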
