HRIA (HR Intelligence & Bias Analysis) is an end-to-end exploratory data analysis project built by DataTalent Solutions S.L. that processes 123,849 LinkedIn job postings to surface concrete, evidence-backed intelligence about the data-role job market. By joining 11 interrelated CSV files — spanning postings, companies, salaries, skills, and industries — HRIA transforms raw, messy hiring data into structured findings that HR analysts, data scientists, and workforce researchers can act on immediately. The project directly addresses a critical gap that plagues HR decision-making: organisations routinely rely on incomplete or biased datasets when benchmarking compensation, ranking candidate skills, or forecasting hiring demand, leading to costly misjudgements. HRIA quantifies exactly where that incompleteness lives and shows how to work around it.

Business Questions

Every notebook in the project is designed to answer one or more of five foundational business questions:
| # | Question |
|---|---|
| 1 | What are the most in-demand skills in data roles? Which technical competencies appear most frequently across job descriptions, and how does demand shift by role family? |
| 2 | Are there salary differences by geography, contract type, or company size? Where are the highest-paying markets, and do full-time versus contract positions command meaningfully different compensation? |
| 3 | Which industries hire the most data professionals? Which sectors post the highest volume of data roles, and which combine volume with above-average pay? |
| 4 | How do experience level and skills correlate with salary? Can seniority or a specific skill set reliably predict compensation band? |
| 5 | What impact does missing or incomplete data have on hiring decisions? How would conclusions change, and how badly would they mislead, if analysts ignored the dataset’s structural gaps? |
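Question 5 can be made concrete with a small missingness audit. The following is a minimal sketch on toy data, assuming a pandas DataFrame with hypothetical columns `experience_level` and `med_salary`; neither the values nor the column names are taken from the HRIA dataset.

```python
import pandas as pd

# Toy data for illustration only; values and column names are assumptions,
# not taken from the HRIA dataset.
postings = pd.DataFrame(
    {
        "experience_level": ["Entry", "Entry", "Entry", "Senior", "Senior", "Senior"],
        "med_salary": [None, None, 60000.0, 120000.0, 130000.0, None],
    }
)

# Share of postings per group that report no salary.
missing_share = (
    postings["med_salary"].isna().groupby(postings["experience_level"]).mean()
)

# Median computed only from the rows that do report a salary.
observed_median = postings.groupby("experience_level")["med_salary"].median()

summary = pd.DataFrame(
    {"missing_share": missing_share, "observed_median": observed_median}
)
print(summary)
```

Reading the two columns side by side shows how much of each group a salary statistic actually rests on. A median backed by a third of the rows deserves less trust than one backed by most of them.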

Project Structure

The project is organised as a linear pipeline of five Jupyter notebooks. Each phase produces cleaned or enriched output that feeds directly into the next.
| Notebook | Phase | Purpose | Key Output |
|---|---|---|---|
| Fase1_Exploracion_Inicial.ipynb | Phase 1 | Initial exploration: load all 11 CSVs, audit shape, dtypes, and null distribution | |
| Fase2_Limpieza_Preparacion.ipynb | Phase 2 | Data cleaning, type coercion, salary normalisation, and master dataset assembly | data_maestro_completo.csv, data_roles_completo.csv, data_roles_salario.csv |
| Fase3_Analisis_Estadistico_Sesgos.ipynb | Phase 3 | Statistical analysis: hypothesis testing, correlation matrices, bias quantification | |
| Fase3_1_Informe_de_Sesgos.ipynb | Phase 3.1 | Comprehensive bias report: eight distinct bias categories identified and visualised | |
| Phase4_Visualization.ipynb | Phase 4 | Executive-ready charts: 11 visualisations covering salary, skills, industries, and ROI | |

Run the notebooks in the order listed above. Phases 3, 3.1, and 4 each load CSV files written by Phase 2. Starting out of sequence will raise a FileNotFoundError.
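A quick pre-flight check can catch an out-of-order run before a downstream notebook fails. This is a minimal sketch, not part of the project: the three file names come from the Phase 2 description, while the helper `missing_phase2_outputs` and the assumption that the CSVs sit in the current directory are hypothetical.

```python
from pathlib import Path

# File names come from the Phase 2 description; the directory is an assumption.
PHASE2_OUTPUTS = (
    "data_maestro_completo.csv",
    "data_roles_completo.csv",
    "data_roles_salario.csv",
)


def missing_phase2_outputs(output_dir="."):
    """Return the Phase 2 CSV names that are not present in output_dir."""
    base = Path(output_dir)
    return [name for name in PHASE2_OUTPUTS if not (base / name).is_file()]


missing = missing_phase2_outputs(".")
if missing:
    print("Run Fase2_Limpieza_Preparacion.ipynb first; missing:", ", ".join(missing))
else:
    print("All Phase 2 outputs found.")
```

Placing a check like this in the first cell of Phases 3, 3.1, and 4 would turn a traceback into a readable instruction.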

Output CSVs

Phase 2 writes three enriched datasets used by all downstream notebooks:
  • data_maestro_completo.csv — master join of postings, companies, industries, and skills
  • data_roles_completo.csv — filtered to data-specific roles with normalised salary columns
  • data_roles_salario.csv — further filtered to postings that contain at least one non-null salary field

Tech Stack

1. Language & Runtime

Python 3.11 — all notebooks declare "version": "3.11.0" in their kernel metadata. No other Python version is tested.
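To confirm which Python version a notebook declares, the notebook file can be read as plain JSON. This sketch assumes the version is stored under `metadata.language_info.version`, the usual location in the Jupyter notebook format; the helper name `notebook_python_version` is hypothetical.

```python
import json
from pathlib import Path


def notebook_python_version(notebook_path):
    """Return metadata.language_info.version from a notebook file, or None."""
    with Path(notebook_path).open(encoding="utf-8") as handle:
        notebook = json.load(handle)
    return notebook.get("metadata", {}).get("language_info", {}).get("version")


# Report the declared version for every notebook in the current directory.
for path in sorted(Path(".").glob("*.ipynb")):
    print(path.name, notebook_python_version(path))
```

A notebook that prints `None` here carries no version declaration in that metadata field.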
2. Core Libraries

| Library | Version | Role |
|---|---|---|
| pandas | 2.2.2 | DataFrame operations, CSV I/O, groupby aggregations |
| NumPy | 2.0.2 | Numerical computation, array operations |
| SciPy | latest stable | Hypothesis testing (Mann-Whitney U, Kruskal-Wallis) |
| seaborn | latest stable | Statistical visualisations (boxplots, heatmaps, KDE) |
| matplotlib | installed as a seaborn dependency | Figure rendering and export |
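The two hypothesis tests named in the table are both available in `scipy.stats`. The sketch below runs them on synthetic salary samples; the group names, means, and spreads are invented for illustration and are not results from the project.

```python
import numpy as np
from scipy import stats

# Synthetic salaries for illustration only; not drawn from the dataset.
rng = np.random.default_rng(seed=0)
full_time = rng.normal(loc=110_000, scale=15_000, size=200)
contract = rng.normal(loc=95_000, scale=15_000, size=200)
internship = rng.normal(loc=60_000, scale=10_000, size=200)

# Mann-Whitney U: do values in one of two independent groups tend to be larger?
u_result = stats.mannwhitneyu(full_time, contract, alternative="two-sided")

# Kruskal-Wallis: the rank-based analogue for three or more groups.
h_result = stats.kruskal(full_time, contract, internship)

print(f"Mann-Whitney U p-value: {u_result.pvalue:.3g}")
print(f"Kruskal-Wallis p-value: {h_result.pvalue:.3g}")
```

Both tests compare ranks rather than raw values, which suits salary data because it is typically skewed and rarely close to normal.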
3. Notebook Environments

Every notebook ships with a Google Colab badge and includes a dedicated setup cell that mounts Google Drive and installs seaborn and scipy automatically. The same notebooks run without modification in a local Jupyter environment once the archive/ directory is in place.
Google Colab is the fastest way to get started — no local installation required. Open any notebook via its “Open in Colab” badge, run the first cell to mount your Drive, and you are ready to go within minutes.
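The Colab-or-local behaviour described above could be handled by a setup helper along these lines. This is a sketch, not the project's actual setup cell: `resolve_data_dir` and the Drive subfolder `MyDrive/HRIA/archive` are hypothetical, and the local branch assumes `archive/` sits next to the notebook.

```python
from pathlib import Path


def resolve_data_dir(drive_subdir="MyDrive/HRIA/archive"):
    """Return the dataset folder for Colab or for a local Jupyter session.

    drive_subdir is a hypothetical location inside Google Drive.
    """
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        # Local run: assumes archive/ sits next to the notebook.
        return Path("archive")
    drive.mount("/content/drive")
    return Path("/content/drive") / drive_subdir


data_dir = resolve_data_dir()
print("Reading CSVs from:", data_dir)
```

Resolving the data directory once and building every CSV path from `data_dir` keeps the rest of the notebook identical in both environments.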

Explore the Documentation

Dataset Overview

Understand the 11 CSV files, their schemas, row counts, and how they relate to each other before you run a single cell.

Quickstart

Clone the repo, install dependencies, download the Kaggle dataset, and execute your first notebook in under 10 minutes.

Phase 1 — Initial Exploration

Deep-dive into what Fase1_Exploracion_Inicial.ipynb does: loading all 11 CSVs, auditing 31 columns, and mapping the full null landscape.

Bias Analysis Overview

Learn how HRIA identifies and quantifies eight structural biases — from MNAR salary fields to survivorship bias — and why they matter for HR decisions.
The underlying dataset is the LinkedIn Job Postings collection published on Kaggle. It covers postings scraped primarily from the United States market. Salary figures, contract norms, and industry concentrations reflect US labour-market conditions; findings should be interpreted with that geographic scope in mind before applying them to other markets.
