HRIA: LinkedIn Job Market & Bias Analysis for Data Roles

HRIA (HR Intelligence & Bias Analysis) is an end-to-end exploratory data analysis project built by DataTalent Solutions S.L. It processes 123,849 LinkedIn job postings to produce evidence-backed findings about the job market for data roles. By joining 11 interrelated CSV files (postings, companies, salaries, skills, and industries), HRIA turns raw hiring data into structured findings for HR analysts, data scientists, and workforce researchers. The project addresses a recurring gap in HR decision-making: organisations often benchmark compensation, rank candidate skills, or forecast hiring demand from incomplete or biased datasets, which leads to costly misjudgements. HRIA quantifies where that incompleteness occurs and shows how to work around it.

Business Questions

Every notebook in the project is designed to answer one or more of five foundational business questions:

#	Question
1	What are the most in-demand skills in data roles? Which technical competencies appear most frequently across job descriptions, and how does demand shift by role family?
2	Are there salary differences by geography, contract type, or company size? Where are the highest-paying markets, and do full-time versus contract positions command meaningfully different compensation?
3	Which industries hire the most data professionals? Which sectors post the highest volume of data roles, and which combine volume with above-average pay?
4	How do experience level and skills correlate with salary? Can seniority or a specific skill set reliably predict compensation band?
5	What impact does missing or incomplete data have on hiring decisions? How would conclusions change — and how badly would they mislead — if analysts ignored the dataset’s structural gaps?

Project Structure

The project is organised as a linear pipeline of five Jupyter notebooks. Each phase produces cleaned or enriched output that feeds directly into the next.

Notebook	Phase	Purpose	Key Output
`Fase1_Exploracion_Inicial.ipynb`	Phase 1	Initial exploration: load all 11 CSVs, audit shape, dtypes, and null distribution	—
`Fase2_Limpieza_Preparacion.ipynb`	Phase 2	Data cleaning, type coercion, salary normalisation, and master dataset assembly	`data_maestro_completo.csv`, `data_roles_completo.csv`, `data_roles_salario.csv`
`Fase3_Analisis_Estadistico_Sesgos.ipynb`	Phase 3	Statistical analysis: hypothesis testing, correlation matrices, bias quantification	—
`Fase3_1_Informe_de_Sesgos.ipynb`	Phase 3.1	Comprehensive bias report: eight distinct bias categories identified and visualised	—
`Phase4_Visualization.ipynb`	Phase 4	Executive-ready charts: 11 visualisations covering salary, skills, industries, and ROI	—

Run the notebooks in the order listed above. Phases 3, 3.1, and 4 each load CSV files written by Phase 2. Starting out of sequence will raise a FileNotFoundError.
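The ordering requirement above can be checked before a downstream notebook starts. The following is a minimal sketch, not code taken from the project: the three file names come from the Phase 2 row of the table, while the helper `missing_phase2_outputs` and the assumption that the files sit in the current working directory are illustrative only.

```python
from pathlib import Path

# File names are taken from the Phase 2 row of the table above.
# Where Phase 2 writes them is an assumption (current directory here).
PHASE2_OUTPUTS = (
    "data_maestro_completo.csv",
    "data_roles_completo.csv",
    "data_roles_salario.csv",
)


def missing_phase2_outputs(base_dir="."):
    """Return the Phase 2 output files that are not present in base_dir."""
    base = Path(base_dir)
    return [name for name in PHASE2_OUTPUTS if not (base / name).is_file()]


missing = missing_phase2_outputs(".")
if missing:
    # A readable message instead of a bare FileNotFoundError later on.
    print("Run Fase2_Limpieza_Preparacion.ipynb first; missing:", ", ".join(missing))
else:
    print("All Phase 2 outputs found.")
```

Running a check like this in the first cell of Phases 3, 3.1, and 4 would surface a skipped Phase 2 run immediately rather than partway through the analysis.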

Output CSVs

Phase 2 writes three enriched datasets used by all downstream notebooks:

data_maestro_completo.csv — master join of postings, companies, industries, and skills
data_roles_completo.csv — filtered to data-specific roles with normalised salary columns
data_roles_salario.csv — further filtered to postings that contain at least one non-null salary field
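The filter behind the third file (keep postings with at least one non-null salary field) can be sketched with pandas as follows. This is an illustration under stated assumptions, not the project's actual cleaning code: the column names `min_salary`, `med_salary`, and `max_salary` are assumed, and the small DataFrame is made-up data standing in for `data_roles_completo.csv`.

```python
import numpy as np
import pandas as pd

# Assumed salary column names; the real Phase 2 columns may differ.
SALARY_COLS = ["min_salary", "med_salary", "max_salary"]

# Tiny made-up frame standing in for data_roles_completo.csv.
roles = pd.DataFrame(
    {
        "job_id": [1, 2, 3, 4],
        "min_salary": [50000.0, np.nan, np.nan, 70000.0],
        "med_salary": [np.nan, np.nan, 65000.0, np.nan],
        "max_salary": [80000.0, np.nan, np.nan, 90000.0],
    }
)

# how="all" drops a row only when every salary column is null,
# so rows with at least one populated salary field are kept.
roles_with_salary = roles.dropna(subset=SALARY_COLS, how="all")

# Share of postings removed by the filter.
dropped_share = 1 - len(roles_with_salary) / len(roles)

print(roles_with_salary["job_id"].tolist())  # [1, 3, 4]
print(f"{dropped_share:.0%} of postings have no salary field")  # 25%
```

Tracking the dropped share alongside the filtered frame keeps the size of the salary gap visible to anyone who later works only with the salary subset.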

Tech Stack

Language & Runtime

Python 3.11 — all notebooks declare "version": "3.11.0" in their kernel metadata. No other Python version is tested.

Core Libraries

Library	Version	Role
pandas	2.2.2	DataFrame operations, CSV I/O, groupby aggregations
NumPy	2.0.2	Numerical computation, array operations
SciPy	latest stable	Hypothesis testing (Mann-Whitney U, Kruskal-Wallis)
seaborn	latest stable	Statistical visualisations (boxplots, heatmaps, KDE)
matplotlib	installed as a seaborn dependency	Figure rendering and export
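The two SciPy tests named in the table are rank-based, so they do not assume normally distributed salaries. The sketch below shows how each is called; the three salary samples are synthetic, and the group names are assumptions chosen for illustration rather than columns from the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, made-up salary samples; not drawn from the LinkedIn data.
full_time = rng.lognormal(mean=11.5, sigma=0.3, size=200)
contract = rng.lognormal(mean=11.3, sigma=0.4, size=150)
part_time = rng.lognormal(mean=10.9, sigma=0.4, size=80)

# Mann-Whitney U: compares two independent groups.
u_stat, u_p = stats.mannwhitneyu(full_time, contract, alternative="two-sided")

# Kruskal-Wallis: extends the comparison to three or more groups.
h_stat, h_p = stats.kruskal(full_time, contract, part_time)

print(f"Mann-Whitney U = {u_stat:.1f}, p = {u_p:.4g}")
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {h_p:.4g}")
```

Mann-Whitney U suits a two-way comparison such as full-time versus contract, while Kruskal-Wallis covers questions with several groups, for example salary across experience levels.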

Notebook Environments

Every notebook ships with a Google Colab badge and includes a dedicated setup cell that mounts Google Drive and installs seaborn and scipy automatically. The same notebooks run without modification in a local Jupyter environment once the archive/ directory is in place.
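A setup cell that works in both environments might look like the sketch below. It is an approximation, not the cell shipped with the notebooks: the helper names and the Drive subfolder `HRIA/archive` are hypothetical, and the package installation step is omitted.

```python
from pathlib import Path


def running_in_colab():
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401
    except ImportError:
        return False
    return True


def resolve_data_dir(drive_subdir="HRIA/archive", local_dir="archive"):
    """Pick the folder that holds the CSV files.

    drive_subdir is a hypothetical location inside Google Drive;
    local_dir is the archive/ directory used for local Jupyter runs.
    """
    if running_in_colab():
        from google.colab import drive

        # Mount Google Drive at the usual Colab mount point.
        drive.mount("/content/drive")
        return Path("/content/drive/MyDrive") / drive_subdir
    return Path(local_dir)


DATA_DIR = resolve_data_dir()
print("Reading CSV files from:", DATA_DIR)
```

Resolving the data directory once and reusing `DATA_DIR` for every read keeps the rest of the notebook identical between Colab and local runs.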

Google Colab is the fastest way to get started — no local installation required. Open any notebook via its “Open in Colab” badge, run the first cell to mount your Drive, and you are ready to go within minutes.

Explore the Documentation

Dataset Overview

Understand the 11 CSV files, their schemas, row counts, and how they relate to each other before you run a single cell.

Quickstart

Clone the repo, install dependencies, download the Kaggle dataset, and execute your first notebook in under 10 minutes.

Phase 1 — Initial Exploration

Deep-dive into what Fase1_Exploracion_Inicial.ipynb does: loading all 11 CSVs, auditing 31 columns, and mapping the full null landscape.

Bias Analysis Overview

Learn how HRIA identifies and quantifies eight structural biases — from MNAR salary fields to survivorship bias — and why they matter for HR decisions.

The underlying dataset is the LinkedIn Job Postings collection published on Kaggle. It covers postings scraped primarily from the United States market. Salary figures, contract norms, and industry concentrations reflect US labour-market conditions; findings should be interpreted with that geographic scope in mind before applying them to other markets.
