HRIA (HR Intelligence & Bias Analysis) is an end-to-end exploratory data analysis project built by DataTalent Solutions S.L. that processes 123,849 LinkedIn job postings to surface concrete, evidence-backed intelligence about the data-role job market. By joining 11 interrelated CSV files (spanning postings, companies, salaries, skills, and industries), HRIA transforms raw, messy hiring data into structured findings that HR analysts, data scientists, and workforce researchers can act on immediately. The project directly addresses a critical gap that plagues HR decision-making: organisations routinely rely on incomplete or biased datasets when benchmarking compensation, ranking candidate skills, or forecasting hiring demand, leading to costly misjudgements. HRIA quantifies exactly where that incompleteness lives and shows how to work around it.
Business Questions
Every notebook in the project is designed to answer one or more of five foundational business questions:

| # | Question |
|---|---|
| 1 | What are the most in-demand skills in data roles? Which technical competencies appear most frequently across job descriptions, and how does demand shift by role family? |
| 2 | Are there salary differences by geography, contract type, or company size? Where are the highest-paying markets, and do full-time versus contract positions command meaningfully different compensation? |
| 3 | Which industries hire the most data professionals? Which sectors post the highest volume of data roles, and which combine volume with above-average pay? |
| 4 | How do experience level and skills correlate with salary? Can seniority or a specific skill set reliably predict compensation band? |
| 5 | What impact does missing or incomplete data have on hiring decisions? How would conclusions change — and how badly would they mislead — if analysts ignored the dataset’s structural gaps? |
Project Structure
The project is organised as a linear pipeline of five Jupyter notebooks. Each phase produces cleaned or enriched output that feeds directly into the next.

| Notebook | Phase | Purpose | Key Output |
|---|---|---|---|
| `Fase1_Exploracion_Inicial.ipynb` | Phase 1 | Initial exploration: load all 11 CSVs, audit shape, dtypes, and null distribution | — |
| `Fase2_Limpieza_Preparacion.ipynb` | Phase 2 | Data cleaning, type coercion, salary normalisation, and master dataset assembly | `data_maestro_completo.csv`, `data_roles_completo.csv`, `data_roles_salario.csv` |
| `Fase3_Analisis_Estadistico_Sesgos.ipynb` | Phase 3 | Statistical analysis: hypothesis testing, correlation matrices, bias quantification | — |
| `Fase3_1_Informe_de_Sesgos.ipynb` | Phase 3.1 | Comprehensive bias report: eight distinct bias categories identified and visualised | — |
| `Phase4_Visualization.ipynb` | Phase 4 | Executive-ready charts: 11 visualisations covering salary, skills, industries, and ROI | — |
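The Phase 1 audit (shape, dtypes, null distribution) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the helper name `audit_frame` and the toy columns are hypothetical, and in practice each of the 11 CSVs would be loaded with `pd.read_csv` instead of being built inline.

```python
import pandas as pd


def audit_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, null count, and null percentage per column."""
    summary = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "null_count": df.isna().sum(),
            "null_pct": (df.isna().mean() * 100).round(2),
        }
    )
    # Columns with the most missing values come first.
    return summary.sort_values("null_pct", ascending=False)


# Toy stand-in for one of the CSVs; the column names are hypothetical.
postings = pd.DataFrame(
    {
        "job_id": [1, 2, 3, 4],
        "title": ["Data Analyst", "Data Engineer", None, "ML Engineer"],
        "max_salary": [None, 120000.0, None, None],
    }
)

print(postings.shape)
report = audit_frame(postings)
print(report)
```

Sorting by null percentage puts the sparsest columns at the top of the report, which is usually where an audit of salary-style fields needs to start.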
Run the notebooks in the order listed above. Phases 3, 3.1, and 4 each load CSV files written by Phase 2. Starting out of sequence will raise a `FileNotFoundError`.

Output CSVs
Phase 2 writes three enriched datasets used by all downstream notebooks:

- `data_maestro_completo.csv`: master join of postings, companies, industries, and skills
- `data_roles_completo.csv`: filtered to data-specific roles with normalised salary columns
- `data_roles_salario.csv`: further filtered to postings that contain at least one non-null salary field
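Because the downstream notebooks depend on these three files, a small pre-flight check can make the ordering requirement explicit before any analysis cell runs. The sketch below is not part of the project; the helper name `missing_phase2_outputs` and the assumption that the CSVs sit in the current working directory are illustrative only.

```python
from pathlib import Path

# File names as listed above; the directory they live in is an assumption.
PHASE2_OUTPUTS = (
    "data_maestro_completo.csv",
    "data_roles_completo.csv",
    "data_roles_salario.csv",
)


def missing_phase2_outputs(data_dir="."):
    """Return the Phase 2 output files that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in PHASE2_OUTPUTS if not (base / name).is_file()]


missing = missing_phase2_outputs(".")
if missing:
    print("Run Fase2_Limpieza_Preparacion.ipynb first; missing:", missing)
else:
    print("All Phase 2 outputs found.")
```

Running a check like this at the top of a Phase 3 or Phase 4 notebook replaces a late `FileNotFoundError` with a message that names the missing files.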
Tech Stack
Language & Runtime
Python 3.11: all notebooks declare `"version": "3.11.0"` in their kernel metadata. No other Python version is tested.

Core Libraries
| Library | Version | Role |
|---|---|---|
| pandas | 2.2.2 | DataFrame operations, CSV I/O, groupby aggregations |
| NumPy | 2.0.2 | Numerical computation, array operations |
| SciPy | latest stable | Hypothesis testing (Mann-Whitney U, Kruskal-Wallis) |
| seaborn | latest stable | Statistical visualisations (boxplots, heatmaps, KDE) |
| matplotlib | installed as a seaborn dependency | Figure rendering and export |
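Since only pandas and NumPy are pinned to exact versions, it can help to compare a local environment against those pins before running the notebooks. The following is a hedged sketch using the standard-library `importlib.metadata`; the helper name `version_mismatches` and the `PINNED` mapping are assumptions made for illustration, not part of the project.

```python
from importlib import metadata

# Pins copied from the table above; "latest stable" libraries are omitted.
PINNED = {"pandas": "2.2.2", "numpy": "2.0.2"}


def version_mismatches(pins, get_version=metadata.version):
    """Map package -> (expected, installed or None) for each pin that differs."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


print(version_mismatches(PINNED) or "Environment matches the pinned versions.")
```

The `get_version` parameter exists only so the lookup can be swapped out in tests; in normal use the default `metadata.version` queries the installed distributions.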
Explore the Documentation
Dataset Overview
Understand the 11 CSV files, their schemas, row counts, and how they relate to each other before you run a single cell.
Quickstart
Clone the repo, install dependencies, download the Kaggle dataset, and execute your first notebook in under 10 minutes.
Phase 1 — Initial Exploration
Deep-dive into what `Fase1_Exploracion_Inicial.ipynb` does: loading all 11 CSVs, auditing 31 columns, and mapping the full null landscape.
Bias Analysis Overview
Learn how HRIA identifies and quantifies eight structural biases — from MNAR salary fields to survivorship bias — and why they matter for HR decisions.
The underlying dataset is the LinkedIn Job Postings collection published on Kaggle. It covers postings scraped primarily from the United States market. Salary figures, contract norms, and industry concentrations reflect US labour-market conditions; findings should be interpreted with that geographic scope in mind before applying them to other markets.