
This project predicts whether an employee is at risk of leaving an organization using three logistic regression models trained on HR analytics data. The system exposes a REST API that accepts a feature vector describing an employee’s profile and returns independent predictions and confidence scores from each model, giving HR teams a multi-perspective view of retention risk.

Overview

Employee turnover is a classification problem: given a set of features describing an employee, predict whether they will leave (1) or stay (0). The project uses three logistic regression variants to balance interpretability, regularization, and feature sparsity:
  • Baseline Logistic Regression — unregularized; establishes a performance floor
  • L1 Regularized (Lasso) — promotes feature sparsity; identifies the smallest predictive feature set
  • L2 Regularized (Ridge) — stabilizes coefficients; handles correlated features
Problem type: Binary classification
Target variable: Employee turnover (0 = stays, 1 = leaves)
Dataset: employee_turnover.csv
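The three-variant setup can be sketched with scikit-learn. This is a minimal sketch, not the notebook's actual code: synthetic data stands in for employee_turnover.csv, and the hyperparameters are illustrative assumptions.

```python
# Sketch of the three logistic regression variants with scikit-learn.
# Synthetic data stands in for employee_turnover.csv; hyperparameters
# are illustrative, not the notebook's actual settings.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One shared scaler, fit on training data only
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    # A very large C effectively disables regularization ("baseline")
    "baseline": LogisticRegression(C=1e6, max_iter=1000),
    "l1": LogisticRegression(penalty="l1", solver="liblinear"),
    "l2": LogisticRegression(penalty="l2", max_iter=1000),
}
for model in models.values():
    model.fit(X_train_s, y_train)
```

Note that scikit-learn requires the liblinear (or saga) solver for an L1 penalty; the default lbfgs solver only supports L2.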

Key features

The model uses 15 input features covering multiple dimensions of an employee’s profile:
Feature category   Examples
Performance        Satisfaction score, last evaluation score, average monthly hours
Work history       Number of projects, time at company, years since last promotion
Compensation       Salary band, department code
Engagement         Work accident indicator, promotion in last 5 years
Satisfaction and evaluation scores should already fall in the [0, 1] range before being sent to the API. The API applies its saved StandardScaler internally, so other numeric features can be sent on their raw scales, but categorical fields such as department and salary band must be integer-encoded by the caller.
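A minimal client-side encoding helper might look like the following. The category-to-integer mappings here are illustrative assumptions, not the project's actual code tables.

```python
# Illustrative integer encodings -- NOT the project's actual code tables.
DEPARTMENT_CODES = {"sales": 0, "technical": 1, "support": 2, "hr": 3}
SALARY_BANDS = {"low": 0, "medium": 1, "high": 2}

def encode_categoricals(department: str, salary: str) -> tuple[int, int]:
    """Map raw categorical strings to the integer codes the API expects."""
    return DEPARTMENT_CODES[department], SALARY_BANDS[salary]
```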

Model performance

All three models achieve strong predictive performance on the test set:
Model        Accuracy   F1 score   AUC
Baseline     ~0.89      High       0.94
L1 (Lasso)   ~0.89+     Best       0.94
L2 (Ridge)   ~0.89      Stable     0.94
L1 achieves slightly better F1 due to implicit feature selection, which removes noisy dimensions before classification. All three models share an AUC of 0.94, indicating strong discriminative power across the full probability threshold range.
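These three metrics can be computed with scikit-learn's standard scorers. The `evaluate` helper below is hypothetical, not code from the project:

```python
# Hypothetical helper showing how accuracy, F1, and AUC are computed.
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(model, X_test, y_test):
    preds = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]  # probability of class 1 (leaves)
    return {
        "accuracy": accuracy_score(y_test, preds),
        "f1": f1_score(y_test, preds),
        # AUC is computed from probabilities, so it is threshold-independent
        "auc": roc_auc_score(y_test, probs),
    }
```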

API design

Prediction pipeline

Internal model pipeline

POST /predict

Accepts a 15-element feature vector and returns predictions from all three models with confidence scores.

Request body

features (array, required)
A 15-element numeric array representing the employee’s profile. Elements correspond to: satisfaction level, last evaluation, number of projects, average monthly hours, time at company, work accident, left, promotion last 5 years, department (encoded), salary band (encoded), and additional HR metrics.
Example request
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "features": [0.56, 0.14, 0.12, 0.78, 0.33, 50000, 3, 35, 2, 1, 10000, 20, 2, 100000000, 200000]
  }'
Example response
{
  "baseline_prediction": 0,
  "l1_prediction": 0,
  "l2_prediction": 0,
  "confidence": {
    "baseline": 0.89,
    "l1": 0.90,
    "l2": 0.89
  }
}
Response fields

baseline_prediction (integer)
Binary turnover prediction from the unregularized logistic regression model. 0 = stays, 1 = leaves.

l1_prediction (integer)
Binary turnover prediction from the L1-regularized (Lasso) logistic regression model.

l2_prediction (integer)
Binary turnover prediction from the L2-regularized (Ridge) logistic regression model.

confidence (object)
Per-model probability scores (0.0–1.0) representing each model’s confidence that its prediction is correct.
Error responses
Status   Condition
400      Feature array is missing or has the wrong length
405      Non-POST request to /predict
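A minimal sketch of this validation behavior, assuming Flask (which answers disallowed methods with 405 automatically); the real src/app.py may differ, and model inference is stubbed out here:

```python
# Sketch of /predict input validation; model inference is stubbed out.
from flask import Flask, jsonify, request

app = Flask(__name__)
N_FEATURES = 15

@app.route("/predict", methods=["POST"])  # Flask itself returns 405 for GET etc.
def predict():
    body = request.get_json(silent=True) or {}
    features = body.get("features")
    if not isinstance(features, list) or len(features) != N_FEATURES:
        return jsonify({"error": f"'features' must be a {N_FEATURES}-element array"}), 400
    # ... apply scaler.pkl and run the three models here ...
    return jsonify({"baseline_prediction": 0, "l1_prediction": 0, "l2_prediction": 0})
```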

Request handling logic

Running the project

1. Install dependencies

cd ML_To_Train/02_Employee_Retention_Prediction
pip install -r requirements.txt
2. Train or verify model files

Open the notebook to retrain if needed:
jupyter notebook Emloyee_retention_pred.ipynb
Trained models are saved to models/baseline_model.pkl, l1_model.pkl, l2_model.pkl, and scaler.pkl.
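Loading the saved artifacts and fanning a request out to all three models can be sketched as below; `load_artifacts` and `predict_all` are hypothetical helper names, not functions from the project.

```python
# Sketch: load the pickled artifacts and fan a request out to all three models.
import pickle

def load_artifacts(model_dir="models"):
    """Load the shared StandardScaler and the three pickled models."""
    def load(name):
        with open(f"{model_dir}/{name}.pkl", "rb") as f:
            return pickle.load(f)
    models = {name: load(f"{name}_model") for name in ("baseline", "l1", "l2")}
    return load("scaler"), models

def predict_all(features, scaler, models):
    """Scale once, then return each model's 0/1 prediction."""
    x = scaler.transform([features])  # the scaler must run before any model
    return {f"{name}_prediction": int(m.predict(x)[0]) for name, m in models.items()}
```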
3. Start the API server

python src/app.py
4. Send a prediction request

Send a feature vector matching the format documented in the README:
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [0.56, 0.14, 0.12, 0.78, 0.33, 50000, 3, 35, 2, 1, 10000, 20, 2, 100000000, 200000]}'

Design considerations

  • Standardization is mandatory. All three models are sensitive to feature scale. The scaler.pkl StandardScaler must be applied before any model receives input.
  • L1 provides feature sparsity. If interpretability is the goal, the L1 model’s non-zero coefficients identify the most predictive features.
  • L2 ensures coefficient stability. Use the L2 model when feature collinearity is a concern.
  • The API is designed for low-latency synchronous inference — no queuing layer is required at typical HR analytics volumes.
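For the interpretability point above, the L1 model's surviving features can be read off its coefficients. `selected_features` is an illustrative helper, assuming the model exposes scikit-learn's `coef_` attribute:

```python
# Illustrative helper: list the features the L1 penalty did not zero out.
import numpy as np

def selected_features(l1_model, feature_names):
    """Return (name, coefficient) pairs with non-zero L1 coefficients."""
    coefs = np.ravel(l1_model.coef_)
    return [(name, float(c)) for name, c in zip(feature_names, coefs) if c != 0.0]
```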

Project structure

02_Employee_Retention_Prediction/
├── resources/
├── src/
│   └── app.py            # Flask API entry point
├── Emloyee_retention_pred.ipynb
├── requirements.txt
└── readme.md
