Credit loan approval with binary classification models

CreditWiseLoan is a binary classification project that predicts whether a loan application should be approved or rejected based on the applicant’s financial and credit profile. The system ingests structured applicant data — including income, credit history, loan amount, and employment details — and outputs a binary approval decision alongside a confidence score. This project demonstrates how machine learning can assist credit risk assessment workflows by identifying patterns in historical lending decisions.

Overview

Loan approval decisions involve balancing risk (probability of default) against opportunity (approving creditworthy applicants). A binary classifier trained on historical approval data can surface the strongest predictors of creditworthiness and help standardize decision-making.

Problem type: Binary classification
Target variable: Loan_Status (0 = rejected, 1 = approved)
Dataset: Credit loan application dataset (Kaggle)
Project name: CreditWiseLoan Approval

Dataset

The dataset contains structured records of loan applications with the following features:

Feature	Type	Description
`Gender`	Categorical	Applicant gender (`Male` / `Female`)
`Married`	Categorical	Marital status (`Yes` / `No`)
`Dependents`	Categorical	Number of dependents (0, 1, 2, 3+)
`Education`	Categorical	Education level (`Graduate` / `Not Graduate`)
`Self_Employed`	Categorical	Self-employment status (`Yes` / `No`)
`ApplicantIncome`	Numeric	Monthly income of the primary applicant (USD)
`CoapplicantIncome`	Numeric	Monthly income of the co-applicant (USD)
`LoanAmount`	Numeric	Loan amount requested (thousands USD)
`Loan_Amount_Term`	Numeric	Repayment term in months
`Credit_History`	Binary	Credit history meets guidelines (1 = yes, 0 = no)
`Property_Area`	Categorical	Property location (`Urban` / `Semiurban` / `Rural`)
`Loan_Status`	Binary (target)	Approval decision (1 = approved, 0 = rejected)

Missing value handling

Several columns contain missing values that must be imputed before training:

LoanAmount: Impute with median
Loan_Amount_Term: Impute with mode (360 months)
Credit_History: Impute with mode (1.0)
Categorical columns (Gender, Married, Dependents, Self_Employed): Impute with mode
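
The imputation rules above can be sketched with pandas. This is a minimal illustration rather than the notebook's actual code: the helper name `impute_missing` and the toy rows are made up for the example, and the column names follow the dataset table.

```python
import numpy as np
import pandas as pd

def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values using the strategies listed above (hypothetical helper)."""
    out = df.copy()
    # Numeric column: the median is less affected by very large loan amounts.
    out["LoanAmount"] = out["LoanAmount"].fillna(out["LoanAmount"].median())
    # Remaining columns: fill with the most frequent value (mode).
    mode_cols = ["Loan_Amount_Term", "Credit_History",
                 "Gender", "Married", "Dependents", "Self_Employed"]
    for col in mode_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Made-up rows purely for illustration (not taken from the real dataset).
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 200.0, 120.0],
    "Loan_Amount_Term": [360.0, 360.0, np.nan, 180.0],
    "Credit_History": [1.0, np.nan, 1.0, 0.0],
    "Gender": ["Male", None, "Male", "Female"],
    "Married": ["Yes", "No", None, "Yes"],
    "Dependents": ["0", "0", "3+", None],
    "Self_Employed": ["No", "No", None, "Yes"],
})
clean = impute_missing(toy)
print(clean)
```

In a real workflow the median and modes would normally be computed on the training split only and then reused for validation and test data, so that no information leaks across splits.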

Feature engineering

Two derived features improve model performance:

TotalIncome = ApplicantIncome + CoapplicantIncome — combined household income
LoanAmountLog = log(LoanAmount) — log-transforms the right-skewed loan amount distribution to reduce the influence of outliers
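
A short sketch of the two derived features, assuming the natural logarithm (the base is not stated above) and that `LoanAmount` has already been imputed and is strictly positive. The function name `add_derived_features` and the sample values are illustrative only.

```python
import numpy as np
import pandas as pd

def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add TotalIncome and LoanAmountLog columns (hypothetical helper)."""
    out = df.copy()
    # Combined household income.
    out["TotalIncome"] = out["ApplicantIncome"] + out["CoapplicantIncome"]
    # Natural log compresses the long right tail of loan amounts (assumed base e).
    out["LoanAmountLog"] = np.log(out["LoanAmount"])
    return out

# Illustrative values, not real dataset rows.
toy = pd.DataFrame({
    "ApplicantIncome": [5000, 4500],
    "CoapplicantIncome": [1500, 0],
    "LoanAmount": [150.0, 100.0],
})
features = add_derived_features(toy)
print(features[["TotalIncome", "LoanAmountLog"]])
```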

Preprocessing pipeline
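
The details of this step are not specified here. One plausible arrangement, offered only as a sketch, assumes one-hot encoding for the categorical columns and standardization for the numeric ones (the logistic regression example refers to `X_train_scaled`, which suggests that some scaling is applied). The column groupings and the toy rows below are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column groupings; adjust to match the features actually used.
CATEGORICAL = ["Gender", "Married", "Dependents", "Education",
               "Self_Employed", "Property_Area"]
NUMERIC = ["TotalIncome", "LoanAmountLog", "Loan_Amount_Term", "Credit_History"]

preprocessor = ColumnTransformer([
    # Unknown categories at prediction time are encoded as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    # Zero mean, unit variance; mainly matters for logistic regression.
    ("num", StandardScaler(), NUMERIC),
])

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Married": ["Yes", "No", "Yes"],
    "Dependents": ["0", "1", "3+"],
    "Education": ["Graduate", "Not Graduate", "Graduate"],
    "Self_Employed": ["No", "No", "Yes"],
    "Property_Area": ["Urban", "Semiurban", "Rural"],
    "TotalIncome": [6500.0, 4500.0, 3000.0],
    "LoanAmountLog": [5.01, 4.61, 4.25],
    "Loan_Amount_Term": [360.0, 360.0, 180.0],
    "Credit_History": [1.0, 1.0, 0.0],
})
X = preprocessor.fit_transform(toy)
print(X.shape)
```

Fitting the transformer on the training split and calling only `transform` on new data keeps the encoding and scaling consistent between training and the API.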

Models

Three classifiers are trained and compared:

Logistic Regression
Random Forest
XGBoost

Logistic regression establishes the baseline. It is interpretable and fast to train, making it suitable for understanding which features drive approval decisions.

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train_scaled, y_train)

Random Forest is an ensemble of decision trees that captures non-linear feature interactions (e.g., the combined effect of income level and credit history). It does not require feature scaling, because tree splits depend only on the ordering of feature values, and it is often less sensitive to class imbalance than an unweighted logistic regression.

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

XGBoost (Extreme Gradient Boosting) is a gradient boosting algorithm in which each new tree is fit to correct the errors of the trees before it. It is often among the most accurate models on tabular data such as loan records, and it provides feature importance scores.

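from xgboost import XGBClassifier
# eval_metric is set on the estimator: recent XGBoost releases no longer accept it in fit(),
# and the deprecated use_label_encoder flag is dropped.
model = XGBClassifier(n_estimators=100, learning_rate=0.1, eval_metric='logloss')
model.fit(X_train, y_train)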
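
To compare the classifiers on equal footing, one option is to score each with the same cross-validation folds. The sketch below uses synthetic data from `make_classification` as a stand-in for the preprocessed loan features, so the printed numbers say nothing about the real dataset. XGBoost is left out only to keep the example dependency-light; an `XGBClassifier` could be added to the same dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with roughly 31% negatives / 69% positives (not real loan data).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           weights=[0.31, 0.69], random_state=42)

candidates = {
    # Scaling lives inside the pipeline so it is refit on each training fold.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

mean_auc = {}
for name, model in candidates.items():
    # 5-fold cross-validated ROC AUC for each candidate.
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    mean_auc[name] = scores.mean()
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```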

Key predictors

Credit history is the single most important feature in most trained models — applicants without a positive credit history are very rarely approved. Other high-importance features include:

Credit_History — strongest predictor of approval
TotalIncome — higher combined income increases approval probability
LoanAmount — very high loan amounts relative to income reduce approval likelihood
Property_Area — semiurban properties show higher approval rates in some datasets
Education — graduate applicants are approved at slightly higher rates

Model performance

Model	Accuracy	Precision	Recall	AUC
Logistic Regression	~80%	~82%	~88%	0.82
Random Forest	~82%	~84%	~89%	0.85
XGBoost	~83%	~85%	~90%	0.87

The dataset is moderately imbalanced (~69% approved, ~31% rejected). Evaluate using AUC and F1 in addition to accuracy. Consider raising the decision threshold if minimizing false approvals (false positives, i.e. type I errors when "approved" is the positive class) is a priority.
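
The effect of the decision threshold can be illustrated with a few hand-written scores. The labels and probabilities below are hypothetical and chosen only to show the trade-off; the helper `evaluate` is not part of the project.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Hypothetical labels (1 = approved) and predicted approval probabilities.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
y_prob = np.array([0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.55, 0.75, 0.60, 0.20])

def evaluate(threshold: float) -> dict:
    """Compute metrics when approving every application with score >= threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    # False approvals: predicted approved although the true label is rejected.
    false_approvals = int(((y_pred == 1) & (y_true == 0)).sum())
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "false_approvals": false_approvals,
    }

auc = roc_auc_score(y_true, y_prob)  # threshold-independent ranking quality
default = evaluate(0.5)
strict = evaluate(0.8)
print(f"AUC = {auc:.3f}")
print("threshold 0.5:", default)
print("threshold 0.8:", strict)
```

In this toy example the stricter threshold removes both false approvals but also rejects three applicants who should have been approved, which is the precision/recall trade-off to weigh when choosing a threshold.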

API design

POST /predict

Accepts a structured loan application and returns a binary approval decision.

Request body

Field	Type	Required	Description
`gender`	string	Yes	Applicant gender. Accepted values: "Male", "Female".
`married`	string	Yes	Marital status. Accepted values: "Yes", "No".
`dependents`	string	Yes	Number of dependents. Accepted values: "0", "1", "2", "3+".
`education`	string	Yes	Education level. Accepted values: "Graduate", "Not Graduate".
`self_employed`	string	Yes	Self-employment status. Accepted values: "Yes", "No".
`applicant_income`	number	Yes	Monthly income of the primary applicant in USD.
`coapplicant_income`	number	Yes	Monthly income of the co-applicant in USD. Use 0 if no co-applicant.
`loan_amount`	number	Yes	Requested loan amount in thousands of USD.
`loan_amount_term`	integer	Yes	Repayment term in months (e.g., 360 for 30 years).
`credit_history`	integer	Yes	Whether the applicant’s credit history meets lender guidelines. 1 = yes, 0 = no.
`property_area`	string	Yes	Property location. Accepted values: "Urban", "Semiurban", "Rural".

Example request

curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "gender": "Male",
    "married": "Yes",
    "dependents": "1",
    "education": "Graduate",
    "self_employed": "No",
    "applicant_income": 5000,
    "coapplicant_income": 1500,
    "loan_amount": 150,
    "loan_amount_term": 360,
    "credit_history": 1,
    "property_area": "Semiurban"
  }'

Example response

{
  "loan_status": 1,
  "decision": "Approved",
  "confidence": 0.87
}

Response fields

Field	Type	Description
`loan_status`	integer	Binary approval decision. 1 = approved, 0 = rejected.
`decision`	string	Human-readable decision label: "Approved" or "Rejected".
`confidence`	number	Model confidence score (0.0–1.0) for the predicted outcome.
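
The request and response contracts above could be enforced with two small helpers. This is a framework-independent sketch, not the code in `app.py`: the names `validate_request` and `build_response`, the 0.5 default threshold, and the choice to report the probability of the predicted class as `confidence` are all assumptions.

```python
# Accepted values for the string fields, taken from the request body description.
ALLOWED = {
    "gender": {"Male", "Female"},
    "married": {"Yes", "No"},
    "dependents": {"0", "1", "2", "3+"},
    "education": {"Graduate", "Not Graduate"},
    "self_employed": {"Yes", "No"},
    "property_area": {"Urban", "Semiurban", "Rural"},
}
NUMERIC_FIELDS = ["applicant_income", "coapplicant_income", "loan_amount",
                  "loan_amount_term", "credit_history"]

def validate_request(payload: dict) -> list:
    """Return a list of validation error messages (empty if the payload is valid)."""
    errors = []
    for field, allowed in ALLOWED.items():
        value = payload.get(field)
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(value, str) or value not in allowed:
            errors.append(f"invalid value for {field}: {value!r}")
    for field in NUMERIC_FIELDS:
        value = payload.get(field)
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{field} must be a number")
        elif field == "credit_history" and value not in (0, 1):
            errors.append("credit_history must be 0 or 1")
    return errors

def build_response(prob_approved: float, threshold: float = 0.5) -> dict:
    """Turn a predicted approval probability into the documented response shape."""
    approved = prob_approved >= threshold
    # Assumption: confidence is the probability of whichever class was predicted.
    confidence = prob_approved if approved else 1.0 - prob_approved
    return {
        "loan_status": int(approved),
        "decision": "Approved" if approved else "Rejected",
        "confidence": round(confidence, 2),
    }

example = {
    "gender": "Male", "married": "Yes", "dependents": "1",
    "education": "Graduate", "self_employed": "No",
    "applicant_income": 5000, "coapplicant_income": 1500,
    "loan_amount": 150, "loan_amount_term": 360,
    "credit_history": 1, "property_area": "Semiurban",
}
print(validate_request(example))
print(build_response(0.87))
```

Returning a list of errors rather than raising on the first problem lets the API report every invalid field in a single response.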

Running the project

Install dependencies

cd ML_To_Train/26_CreditWiseLoan_appoval
pip install -r requirements.txt

Explore data and train models

jupyter notebook CreditWiseLoan_approval.ipynb

The notebook covers EDA, feature engineering, model training, and evaluation.

Start the API

cd src
python app.py

Submit a loan application

curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "gender": "Female",
    "married": "No",
    "dependents": "0",
    "education": "Graduate",
    "self_employed": "No",
    "applicant_income": 4500,
    "coapplicant_income": 0,
    "loan_amount": 100,
    "loan_amount_term": 360,
    "credit_history": 1,
    "property_area": "Urban"
  }'
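
The same request can be sent from Python using only the standard library. This sketch assumes the API is reachable at `http://localhost:5000` as in the curl example; the helper names `build_predict_request` and `submit` are illustrative.

```python
import json
import urllib.request

def build_predict_request(application: dict,
                          base_url: str = "http://localhost:5000") -> urllib.request.Request:
    """Build (but do not send) a POST request for the /predict endpoint."""
    body = json.dumps(application).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def submit(application: dict, base_url: str = "http://localhost:5000") -> dict:
    """Send the application and return the decoded JSON response."""
    request = build_predict_request(application, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))

application = {
    "gender": "Female", "married": "No", "dependents": "0",
    "education": "Graduate", "self_employed": "No",
    "applicant_income": 4500, "coapplicant_income": 0,
    "loan_amount": 100, "loan_amount_term": 360,
    "credit_history": 1, "property_area": "Urban",
}
request = build_predict_request(application)
print(request.get_method(), request.full_url)
# result = submit(application)  # requires the API to be running locally
```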

Project structure

26_CreditWiseLoan_appoval/
│
├── src/
│   └── app.py                        # Flask API entry point
│
├── CreditWiseLoan_approval.ipynb     # Full analysis and training notebook
└── readme.md

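Feature importance from the Random Forest or XGBoost model can help explain which inputs drive approvals and rejections overall. Extract model.feature_importances_ and map the values to feature names to rank the features. These scores are global (one value per feature for the whole model), so explaining an individual application additionally requires a local attribution method such as SHAP values.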
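
A minimal sketch of mapping `feature_importances_` to feature names with a Random Forest. The data is synthetic and the column names are attached only for illustration, so the resulting ranking does not reflect the real dataset. The scores describe the model as a whole rather than any single application.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Names borrowed from the dataset section; the data itself is synthetic.
feature_names = ["Credit_History", "TotalIncome", "LoanAmountLog", "Loan_Amount_Term"]
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Pair each importance with its feature name and sort from most to least important.
importances = (
    pd.Series(model.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```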
