
This project predicts whether an employee is at risk of leaving an organization using three logistic regression models trained on HR analytics data. The system exposes a REST API that accepts a feature vector describing an employee’s profile and returns independent predictions and confidence scores from each model, giving HR teams a multi-perspective view of retention risk.

Overview

Employee turnover is a classification problem: given a set of features describing an employee, predict whether they will leave (1) or stay (0). The project uses three logistic regression variants to balance interpretability, regularization, and feature sparsity:
  • Baseline Logistic Regression — unregularized; establishes a performance floor
  • L1 Regularized (Lasso) — promotes feature sparsity; identifies the smallest predictive feature set
  • L2 Regularized (Ridge) — stabilizes coefficients; handles correlated features
Problem type: Binary classification
Target variable: Employee turnover (0 = stays, 1 = leaves)
Dataset: employee_turnover.csv
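The three-variant setup can be sketched with scikit-learn. This is a minimal sketch, not the notebook's actual code: synthetic data stands in for employee_turnover.csv, and the hyperparameters are illustrative assumptions.

```python
# Sketch of the three logistic regression variants with scikit-learn.
# Synthetic data stands in for employee_turnover.csv; hyperparameters
# are illustrative, not the notebook's actual settings.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One shared scaler, fit on training data only
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    # A very large C effectively disables regularization ("baseline")
    "baseline": LogisticRegression(C=1e6, max_iter=1000),
    "l1": LogisticRegression(penalty="l1", solver="liblinear"),
    "l2": LogisticRegression(penalty="l2", max_iter=1000),
}
for model in models.values():
    model.fit(X_train_s, y_train)
```

Note that scikit-learn requires the liblinear (or saga) solver for an L1 penalty; the default lbfgs solver only supports L2.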

Key features

The model uses 15 input features covering multiple dimensions of an employee’s profile:
Feature category   Examples
Performance        Satisfaction score, last evaluation score, average monthly hours
Work history       Number of projects, time at company, years since last promotion
Compensation       Salary band, department code
Engagement         Work accident indicator, promotion in last 5 years
Satisfaction and evaluation scores should already fall in the [0, 1] range before being sent to the API. The API applies its saved StandardScaler internally, so other numeric features can be sent on their raw scales, but categorical fields such as department and salary band must be integer-encoded by the caller.
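A minimal client-side encoding helper might look like the following. The category-to-integer mappings here are illustrative assumptions, not the project's actual code tables.

```python
# Illustrative integer encodings -- NOT the project's actual code tables.
DEPARTMENT_CODES = {"sales": 0, "technical": 1, "support": 2, "hr": 3}
SALARY_BANDS = {"low": 0, "medium": 1, "high": 2}

def encode_categoricals(department: str, salary: str) -> tuple[int, int]:
    """Map raw categorical strings to the integer codes the API expects."""
    return DEPARTMENT_CODES[department], SALARY_BANDS[salary]
```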

Model performance

All three models achieve strong predictive performance on the test set:
Model        Accuracy   F1 score   AUC
Baseline     ~0.89      High       0.94
L1 (Lasso)   ~0.89+     Best       0.94
L2 (Ridge)   ~0.89      Stable     0.94
L1 achieves slightly better F1 due to implicit feature selection, which removes noisy dimensions before classification. All three models share an AUC of 0.94, indicating strong discriminative power across the full probability threshold range.
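These three metrics can be computed with scikit-learn's standard scorers. The `evaluate` helper below is hypothetical, not code from the project:

```python
# Hypothetical helper showing how accuracy, F1, and AUC are computed.
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(model, X_test, y_test):
    preds = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]  # probability of class 1 (leaves)
    return {
        "accuracy": accuracy_score(y_test, preds),
        "f1": f1_score(y_test, preds),
        # AUC is computed from probabilities, so it is threshold-independent
        "auc": roc_auc_score(y_test, probs),
    }
```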

API design

Prediction pipeline

Internal model pipeline

POST /predict

Accepts a 15-element feature vector and returns predictions from all three models with confidence scores.

Request body

features (array, required)
A 15-element numeric array representing the employee’s profile. Elements correspond to: satisfaction level, last evaluation, number of projects, average monthly hours, time at company, work accident, left, promotion last 5 years, department (encoded), salary band (encoded), and additional HR metrics.
Example request
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "features": [0.56, 0.14, 0.12, 0.78, 0.33, 50000, 3, 35, 2, 1, 10000, 20, 2, 100000000, 200000]
  }'
Example response
{
  "baseline_prediction": 0,
  "l1_prediction": 0,
  "l2_prediction": 0,
  "confidence": {
    "baseline": 0.89,
    "l1": 0.90,
    "l2": 0.89
  }
}
Response fields

baseline_prediction (integer)
Binary turnover prediction from the unregularized logistic regression model. 0 = stays, 1 = leaves.

l1_prediction (integer)
Binary turnover prediction from the L1-regularized (Lasso) logistic regression model.

l2_prediction (integer)
Binary turnover prediction from the L2-regularized (Ridge) logistic regression model.

confidence (object)
Per-model probability scores (0.0–1.0) representing each model’s confidence that its prediction is correct.
Error responses
Status   Condition
400      Feature array is missing or has the wrong length
405      Non-POST request to /predict
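A minimal sketch of this validation behavior, assuming Flask (which answers disallowed methods with 405 automatically); the real src/app.py may differ, and model inference is stubbed out here:

```python
# Sketch of /predict input validation; model inference is stubbed out.
from flask import Flask, jsonify, request

app = Flask(__name__)
N_FEATURES = 15

@app.route("/predict", methods=["POST"])  # Flask itself returns 405 for GET etc.
def predict():
    body = request.get_json(silent=True) or {}
    features = body.get("features")
    if not isinstance(features, list) or len(features) != N_FEATURES:
        return jsonify({"error": f"'features' must be a {N_FEATURES}-element array"}), 400
    # ... apply scaler.pkl and run the three models here ...
    return jsonify({"baseline_prediction": 0, "l1_prediction": 0, "l2_prediction": 0})
```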

Request handling logic

Running the project

1. Install dependencies

cd ML_To_Train/02_Employee_Retention_Prediction
pip install -r requirements.txt
2. Train or verify model files

Open the notebook to retrain if needed:
jupyter notebook Emloyee_retention_pred.ipynb
Trained models are saved to models/baseline_model.pkl, l1_model.pkl, l2_model.pkl, and scaler.pkl.
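Loading the saved artifacts and fanning a request out to all three models can be sketched as below; `load_artifacts` and `predict_all` are hypothetical helper names, not functions from the project.

```python
# Sketch: load the pickled artifacts and fan a request out to all three models.
import pickle

def load_artifacts(model_dir="models"):
    """Load the shared StandardScaler and the three pickled models."""
    def load(name):
        with open(f"{model_dir}/{name}.pkl", "rb") as f:
            return pickle.load(f)
    models = {name: load(f"{name}_model") for name in ("baseline", "l1", "l2")}
    return load("scaler"), models

def predict_all(features, scaler, models):
    """Scale once, then return each model's 0/1 prediction."""
    x = scaler.transform([features])  # the scaler must run before any model
    return {f"{name}_prediction": int(m.predict(x)[0]) for name, m in models.items()}
```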
3. Start the API server

python src/app.py
4. Send a prediction request

Send a feature vector matching the format documented in the README:
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [0.56, 0.14, 0.12, 0.78, 0.33, 50000, 3, 35, 2, 1, 10000, 20, 2, 100000000, 200000]}'

Design considerations

  • Standardization is mandatory. All three models are sensitive to feature scale. The scaler.pkl StandardScaler must be applied before any model receives input.
  • L1 provides feature sparsity. If interpretability is the goal, the L1 model’s non-zero coefficients identify the most predictive features.
  • L2 ensures coefficient stability. Use the L2 model when feature collinearity is a concern.
  • The API is designed for low-latency synchronous inference — no queuing layer is required at typical HR analytics volumes.
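For the interpretability point above, the L1 model's surviving features can be read off its coefficients. `selected_features` is an illustrative helper, assuming the model exposes scikit-learn's `coef_` attribute:

```python
# Illustrative helper: list the features the L1 penalty did not zero out.
import numpy as np

def selected_features(l1_model, feature_names):
    """Return (name, coefficient) pairs with non-zero L1 coefficients."""
    coefs = np.ravel(l1_model.coef_)
    return [(name, float(c)) for name, c in zip(feature_names, coefs) if c != 0.0]
```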

Project structure

02_Employee_Retention_Prediction/
├── resources/
├── src/
│   └── app.py            # Flask API entry point
├── Emloyee_retention_pred.ipynb
├── requirements.txt
└── readme.md
