The Titanic Survival Prediction project applies binary classification to one of the most widely studied datasets in machine learning. The goal is to predict whether a passenger survived the sinking of the RMS Titanic based on features such as age, sex, passenger class, and ticket fare. This project serves as an introduction to binary classification, feature engineering on mixed-type data, and handling missing values in real-world datasets.

Documentation Index
Fetch the complete documentation index at: https://mintlify.com/dronabopche/100-ML-AI-Project/llms.txt
Use this file to discover all available pages before exploring further.
Overview
Survival on the Titanic was not random — socioeconomic status, sex, and age strongly influenced who made it onto lifeboats. This makes the dataset a compelling case study for classification: even simple models can achieve high accuracy by capturing these structural patterns.

- Problem type: Binary classification
- Target variable: `Survived` (0 = did not survive, 1 = survived)
- Dataset: Titanic dataset (Kaggle)
- Model: Logistic Regression
Dataset
The Titanic dataset contains demographic and ticketing information for 891 passengers in the training set.

| Feature | Type | Description |
|---|---|---|
| Pclass | Categorical (1, 2, 3) | Ticket class — proxy for socioeconomic status |
| Sex | Categorical | Passenger sex (male / female) |
| Age | Numeric | Age in years (contains missing values) |
| SibSp | Numeric | Number of siblings or spouses aboard |
| Parch | Numeric | Number of parents or children aboard |
| Fare | Numeric | Ticket fare paid |
| Embarked | Categorical | Port of embarkation (C, Q, S) |
| Survived | Binary (target) | Survival outcome |
Missing value handling
- Age: Imputed with the median age (approximately 28 years)
- Embarked: Imputed with the most frequent value (S)
- Cabin: Dropped due to its high missing rate (~77%)
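A minimal pandas sketch of these imputation rules (using a hypothetical mini-frame in place of the Kaggle CSV) might look like:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle training set.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 28.0, 35.0, np.nan],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
})

# Age: impute with the median age
df["Age"] = df["Age"].fillna(df["Age"].median())

# Embarked: impute with the most frequent value
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Cabin: drop due to its high missing rate
df = df.drop(columns=["Cabin"])
```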
Feature engineering
Beyond the raw features, the following transformations improve model performance:

- `FamilySize` = `SibSp` + `Parch` + 1 — captures whether a passenger traveled alone or in a group
- `IsAlone` — binary flag derived from `FamilySize == 1`
- `Title` — extracted from the passenger name (Mr, Mrs, Miss, Master, Rare) — captures social status and age group more precisely than raw age
- `Sex` and `Embarked` are label-encoded or one-hot encoded before training
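These transformations can be sketched in pandas as follows; the toy `Name` values and the regex used to pull out the title are illustrative, not the notebook's exact code:

```python
import pandas as pd

# Toy rows mimicking the Titanic name format.
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
             "Heikkinen, Miss. Laina"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})

# FamilySize = SibSp + Parch + 1
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# IsAlone: binary flag from FamilySize == 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Title: the token between the comma and the first period; rare titles collapse to "Rare"
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
common = {"Mr", "Mrs", "Miss", "Master"}
df["Title"] = df["Title"].where(df["Title"].isin(common), "Rare")
```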
Preprocessing pipeline
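The pipeline code for this section did not survive extraction; a minimal scikit-learn sketch consistent with the imputation and encoding choices above (the exact column lists are assumptions) could look like:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["Age", "Fare", "SibSp", "Parch"]
categorical = ["Pclass", "Sex", "Embarked"]

preprocess = ColumnTransformer([
    # Numeric columns: median imputation, then standardization.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    # Categorical columns: most-frequent imputation, then one-hot encoding.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])
```

Wrapping the classifier and this transformer in a single `Pipeline` keeps the imputation statistics fitted on training data only, which avoids leakage during cross-validation.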
Model
The project uses Logistic Regression as the primary classifier. Logistic regression is well-suited to this task because:

- The outcome is binary (survived / did not survive)
- Coefficients are interpretable — you can quantify the effect of being female vs. male, or first class vs. third class
- The dataset is small enough that regularization and complex models offer limited advantage
The trained model is saved to `model/lr_model.pkl` for inference.
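A minimal sketch of training and serializing the model (the toy feature matrix is a stand-in; the notebook trains on the engineered features described above):

```python
import os
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the preprocessed training data.
X_train = np.array([[0, 3, 22], [1, 1, 38], [1, 3, 26], [0, 1, 35]], dtype=float)
y_train = np.array([0, 1, 1, 0])

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Serialize the fitted model for later inference.
os.makedirs("model", exist_ok=True)
with open("model/lr_model.pkl", "wb") as f:
    pickle.dump(model, f)
```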
Key findings
Logistic regression on the Titanic dataset consistently reveals the following patterns:

- Sex is the strongest predictor: Women had a much higher survival rate (~74%) than men (~19%)
- Passenger class matters: First-class passengers survived at higher rates than third-class
- Children had a slight survival advantage relative to adult men
- Traveling alone was associated with lower survival probability
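Findings like the survival-rate gap come from simple group-wise means over the training set; for example (tiny illustrative sample, not the real data, where the rates are roughly 74% vs. 19%):

```python
import pandas as pd

# Illustrative sample of passengers and outcomes.
df = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group is the group's survival rate.
rates = df.groupby("Sex")["Survived"].mean()
```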
Evaluation metrics
Binary classification models for Titanic are evaluated using:

| Metric | Description |
|---|---|
| Accuracy | Overall fraction of correct predictions |
| Precision | Of predicted survivors, how many actually survived |
| Recall | Of actual survivors, how many were correctly identified |
| F1 Score | Harmonic mean of precision and recall |
| ROC-AUC | Area under the receiver operating characteristic curve |
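All five metrics are available in scikit-learn; the labels and scores below are hypothetical values for illustration (ROC-AUC needs predicted probabilities, the others need hard labels):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical ground truth, hard predictions, and predicted probabilities.
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),
}
```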
Running the project
Open the notebook

The project is implemented in `Titanic_Survival.ipynb`. Open it to run the full pipeline.

Run all cells

Execute all notebook cells in order. The final cell saves the trained model to `model/lr_model.pkl`.

Project structure

This project is notebook-only and does not include a Flask API. To serve predictions as an API, load `lr_model.pkl` in a Flask app and expose a `POST /predict` endpoint following the pattern used in the House Price Prediction project.
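A minimal Flask sketch of that pattern; the `features` payload key and the lazy-loading helper are assumptions, and each feature vector must match the order of features used at training time:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
MODEL_PATH = "model/lr_model.pkl"
_model = None


def get_model():
    # Lazily load the serialized model on first request.
    global _model
    if _model is None:
        with open(MODEL_PATH, "rb") as f:
            _model = pickle.load(f)
    return _model


@app.route("/predict", methods=["POST"])
def predict():
    # Expects e.g. {"features": [[...], [...]]} in the training feature order.
    payload = request.get_json()
    preds = get_model().predict(payload["features"]).tolist()
    return jsonify({"survived": preds})


if __name__ == "__main__":
    app.run(debug=True)
```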