Get Started with HRIA: Setup and First Notebook Run

By the end of this guide you will have the HRIA repository on your machine, all Python dependencies installed, the 11 LinkedIn Job Postings CSV files in the expected directory structure, and Phase 1 (Fase1_Exploracion_Inicial.ipynb) running successfully — printing the dataset shape, a full null-value audit across 31 columns, and key descriptive statistics for 123,849 job postings.

Prerequisites

Before you begin, make sure you have the following:

Python 3.11+ — the notebooks declare kernel version 3.11.0. Other 3.x releases may work but are untested.
Jupyter Notebook or JupyterLab — for local execution. Alternatively, a free Google Colab account removes the need for any local setup.
Kaggle account — required to download the LinkedIn Job Postings dataset (free registration at kaggle.com).
Git — to clone the repository.
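
The checklist above can be verified from Python before going further. This is a minimal sketch, not part of the HRIA repository: the function name check_prerequisites and the choice to look for the git and jupyter executables on PATH are assumptions made for illustration.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)  # version declared in the notebooks' kernel metadata
REQUIRED_TOOLS = ("git", "jupyter")  # executables this guide calls from a terminal


def check_prerequisites(version=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all clear."""
    problems = []
    if tuple(version[:2]) < MIN_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} found, 3.11+ recommended")
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems


for line in check_prerequisites() or ["All prerequisites look fine."]:
    print(line)
```

If you plan to work only in Google Colab, a missing local jupyter executable is not a problem.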

Clone the Repository

Open a terminal and run:

git clone https://github.com/MajoRodri/HRIA.git && cd HRIA

This creates an HRIA/ directory containing the five notebooks, a charts_phase4/ folder with pre-rendered visualisations, and a docs/ folder with the HTML bias report.

HRIA/
├── Fase1_Exploracion_Inicial.ipynb
├── Fase2_Limpieza_Preparacion.ipynb
├── Fase3_Analisis_Estadistico_Sesgos.ipynb
├── Fase3_1_Informe_de_Sesgos.ipynb
├── Phase4_Visualization.ipynb
├── charts_phase4/
└── docs/

Install Python Dependencies

Install the required packages into your active Python 3.11 environment:

pip install pandas numpy scipy seaborn jupyter

Verify the key library versions after installation:

import pandas as pd
import numpy as np

print(f"Pandas  version: {pd.__version__}")   # expected: 2.2.2
print(f"NumPy   version: {np.__version__}")    # expected: 2.0.2
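
To confirm that every library from the pip install command is present without importing each one, a sketch along these lines may help. It relies on importlib.metadata from the standard library; the helper name installed_versions is hypothetical, and jupyter is left out because the notebooks do not import it.

```python
from importlib import metadata

# Distributions named in the pip install command above (jupyter omitted:
# it is a launcher, not something the notebooks import).
REQUIRED = ("pandas", "numpy", "scipy", "seaborn")


def installed_versions(names=REQUIRED):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name:<8} {version or 'NOT INSTALLED'}")
```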

Download the LinkedIn Job Postings Dataset from Kaggle

The dataset is the LinkedIn Job Postings collection by Arsh Kon, available at: kaggle.com/datasets/arshkon/linkedin-job-postings

Option A (Kaggle CLI, recommended):

# Install the Kaggle CLI if you don't have it
pip install kaggle

# Place your kaggle.json API token at ~/.kaggle/kaggle.json, then download
# and unzip into archive/ (-p sets the target folder):
kaggle datasets download -d arshkon/linkedin-job-postings -p archive --unzip

Option B (manual download):

Log in to Kaggle, navigate to the dataset page above, click Download, and unzip the resulting archive locally.

Both options produce the same archive/ directory containing all 11 CSV files. The Kaggle CLI is faster for automation or Colab environments where you can upload kaggle.json to /root/.kaggle/.
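
Placing the API token can also be scripted, which is convenient in Colab after uploading kaggle.json. The sketch below is an illustration only: install_kaggle_token is a hypothetical helper, and it assumes the token should live at ~/.kaggle/kaggle.json (which resolves to /root/.kaggle/kaggle.json in Colab) with owner-only permissions, since the Kaggle CLI warns when the file is readable by other users.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(source, home=None):
    """Copy a downloaded kaggle.json into <home>/.kaggle/ with owner-only access."""
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(source, target)
    # Owner read/write only (0o600); the Kaggle CLI warns about looser permissions.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

For example, install_kaggle_token("kaggle.json") copies a token from the current directory into your home folder.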

Place the CSV Files in the Expected Directory Structure

Move or unzip the downloaded files so that archive/ sits inside your HRIA/ project folder. The notebooks resolve all paths relative to this location.

HRIA/
└── archive/
    ├── postings.csv                  ← main table (123,849 rows × 31 columns)
    ├── companies/
    │   ├── companies.csv
    │   ├── company_industries.csv
    │   ├── company_specialities.csv
    │   └── employee_counts.csv
    ├── jobs/
    │   ├── benefits.csv
    │   ├── job_industries.csv
    │   ├── job_skills.csv
    │   └── salaries.csv
    └── mappings/
        ├── industries.csv
        └── skills.csv

If you are using Google Colab, upload the entire archive/ folder to your Google Drive (e.g. at MyDrive/archive/). The first code cell in each notebook mounts Drive automatically and sets the working directory to /content/drive/MyDrive/archive.

Verify that postings.csv is at the root of archive/ and not nested one level deeper inside a subdirectory. A common mistake after unzipping is ending up with archive/archive/postings.csv.
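
The layout check described above can be automated. The following is a sketch under the assumption that you run it from the HRIA/ folder; check_archive is a hypothetical helper, and the file list is copied from the directory tree in this section.

```python
from pathlib import Path

# Relative paths of the 11 CSV files, taken from the tree above.
EXPECTED_FILES = (
    "postings.csv",
    "companies/companies.csv",
    "companies/company_industries.csv",
    "companies/company_specialities.csv",
    "companies/employee_counts.csv",
    "jobs/benefits.csv",
    "jobs/job_industries.csv",
    "jobs/job_skills.csv",
    "jobs/salaries.csv",
    "mappings/industries.csv",
    "mappings/skills.csv",
)


def check_archive(archive_dir="archive"):
    """Return (missing_files, nested) for the given archive directory."""
    root = Path(archive_dir)
    missing = [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
    # True when the unzip produced archive/archive/postings.csv by mistake.
    nested = (root / "archive" / "postings.csv").is_file()
    return missing, nested


missing, nested = check_archive()
if nested:
    print("Found archive/archive/: move the inner folder's contents up one level.")
for rel in missing:
    print(f"Missing: archive/{rel}")
if not missing:
    print("All 11 CSV files are in place.")
```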

Open and Run Phase 1

Launch Jupyter and open the first notebook:

jupyter notebook Fase1_Exploracion_Inicial.ipynb

Then run all cells in order (Cell → Run All in the classic Notebook interface, Run → Run All Cells in JupyterLab). Phase 1 will:

Detect the runtime (Colab or local) and configure paths accordingly
Load all 11 CSV files into separate DataFrames
Print the shape of the main postings table
Display dtypes and non-null counts for all 31 columns
Generate a ranked null-value summary table
Produce descriptive statistics (df.describe()) for numeric columns

What to expect after a successful run:

# Dataset shape
(123849, 31)

# Null-value summary (top columns by missing rate)
Column                        Nulls     Missing %
─────────────────────────────────────────────────
closed_time                  122,776      99.1 %
skills_desc                  121,410      98.0 %
med_salary                   117,569      94.9 %
remote_allowed               108,603      87.7 %
applies                      100,529      81.2 %
min_salary                    94,056      75.9 %
max_salary                    94,056      75.9 %
pay_period / currency /
  compensation_type /
  normalized_salary            87,776      70.9 %
posting_domain                39,968      32.3 %
application_url               36,665      29.6 %
formatted_experience_level    29,409      23.7 %

Key takeaways from Phase 1:

med_salary is ~95 % null — effectively unusable without imputation
min_salary / max_salary are available for only ~24 % of postings
formatted_experience_level is missing for ~24 % of rows, limiting experience-salary analysis
Fields like job_id, title, location, and work_type are 100 % complete
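
A ranked null-value summary like the one shown earlier can be approximated with a few lines of pandas. This is a minimal sketch and may differ from what the notebook does; null_summary is a hypothetical name, and the demo frame uses invented values rather than real postings.

```python
import pandas as pd


def null_summary(df):
    """Rank columns by missing values: count and percentage of rows."""
    nulls = df.isna().sum()
    summary = pd.DataFrame({
        "nulls": nulls,
        "missing_pct": (nulls / len(df) * 100).round(1),
    })
    # Keep only columns with at least one null, worst first.
    return summary[summary["nulls"] > 0].sort_values("nulls", ascending=False)


# Tiny stand-in for the postings table; values are invented for illustration.
demo = pd.DataFrame({
    "job_id": [1, 2, 3, 4],
    "med_salary": [None, None, None, 50000.0],
    "applies": [3.0, None, None, 7.0],
})
print(null_summary(demo))
```

Fully populated columns such as job_id drop out of the summary, which keeps the table focused on the fields that need attention in Phase 2.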

(Optional) Run on Google Colab

Each notebook includes an Open in Colab badge at the top. Click the badge to open the notebook directly in Colab without any local setup.

Once open in Colab:

Mount Google Drive — the first code cell handles this automatically when it detects the Colab environment:

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    import os
    RUTA_DRIVE = '/content/drive/MyDrive/archive'
    os.chdir(RUTA_DRIVE)
    import subprocess
    subprocess.run(['pip', 'install', '-q', 'seaborn', 'scipy'])

Confirm that your archive/ folder is at MyDrive/archive in Drive (matching the RUTA_DRIVE variable above). Adjust the path in the cell if your folder is named differently.
Run all cells with Runtime → Run all.

Google Colab sessions are ephemeral — Drive must be remounted each session. The install cell re-installs seaborn and scipy on every run, which takes ~30 seconds.

Running the Full Pipeline

Notebooks must be executed in order: Fase1 → Fase2 → Fase3 → Fase3_1 → Phase4. Phases 1 and 2 read the 11 raw CSVs; Phases 3, 3.1, and 4 depend on the intermediate files that Phase 2 writes (data_maestro_completo.csv, data_roles_completo.csv, and data_roles_salario.csv). These files exist only after Phase 2 has completed successfully, so skipping ahead will raise a FileNotFoundError.

Order	Notebook	Reads	Writes
1	`Fase1_Exploracion_Inicial.ipynb`	11 raw CSVs	—
2	`Fase2_Limpieza_Preparacion.ipynb`	11 raw CSVs	`data_maestro_completo.csv`, `data_roles_completo.csv`, `data_roles_salario.csv`
3	`Fase3_Analisis_Estadistico_Sesgos.ipynb`	`data_roles_completo.csv`, `data_roles_salario.csv`	—
4	`Fase3_1_Informe_de_Sesgos.ipynb`	`data_roles_completo.csv`, `data_roles_salario.csv`, `data_maestro_completo.csv`	—
5	`Phase4_Visualization.ipynb`	`data_roles_completo.csv`, `data_roles_salario.csv`, `data_maestro_completo.csv`	11 chart PNGs in `charts_phase4/`
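
Before opening Phase 3 or later, it can be worth checking that the Phase 2 outputs exist. The sketch below assumes the files sit in the current working directory; the guide does not state where Phase 2 writes them, so adjust data_dir to match your setup. The helper name missing_phase2_outputs is hypothetical.

```python
from pathlib import Path

# Files the table lists as Phase 2 outputs.
PHASE2_OUTPUTS = (
    "data_maestro_completo.csv",
    "data_roles_completo.csv",
    "data_roles_salario.csv",
)


def missing_phase2_outputs(data_dir="."):
    """Return the Phase 2 output files that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in PHASE2_OUTPUTS if not (root / name).is_file()]


missing = missing_phase2_outputs()
if missing:
    print("Run Fase2_Limpieza_Preparacion.ipynb first; missing:", ", ".join(missing))
else:
    print("Phase 2 outputs found; Phases 3, 3.1, and 4 can run.")
```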

Fase3_1_Informe_de_Sesgos.ipynb is approximately 8.5 MB on disk because its plot outputs are embedded in the notebook JSON. It may take 15–30 seconds to open in Jupyter or VS Code, and GitHub will warn that it is too large to render in the browser. If the notebook appears blank or unresponsive, use File → Close and Halt, wait a moment, and reopen. Alternatively, create an output-free copy and open that instead: jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to notebook Fase3_1_Informe_de_Sesgos.ipynb writes the cleared notebook to Fase3_1_Informe_de_Sesgos.nbconvert.ipynb and leaves the original untouched.
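
Because a notebook is plain JSON, its embedded outputs can also be removed with the standard library alone. This is a sketch that assumes the nbformat 4 layout (a top-level "cells" list); strip_outputs is a hypothetical helper, and it writes to a separate file so the original report with its plots is preserved.

```python
import json
from pathlib import Path


def strip_outputs(src, dst):
    """Write a copy of the notebook at src to dst with code-cell outputs removed."""
    notebook = json.loads(Path(src).read_text(encoding="utf-8"))
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            # Embedded plots live in "outputs"; clearing it shrinks the file.
            cell["outputs"] = []
            cell["execution_count"] = None
    Path(dst).write_text(json.dumps(notebook, indent=1), encoding="utf-8")
    return Path(dst).stat().st_size  # size of the stripped copy in bytes
```

For example, strip_outputs("Fase3_1_Informe_de_Sesgos.ipynb", "Fase3_1_sin_outputs.ipynb") would produce a lightweight copy; the second file name is only an illustration.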

Next Steps

Once Phase 1 is running, explore the rest of the documentation to understand what each subsequent phase does before you execute it:

Dataset Overview — detailed schema for all 11 CSV files
Phase 1 — Initial Exploration — full walkthrough of the exploration notebook
Phase 2 — Cleaning & Preparation — how salary normalisation and master joins are constructed
Bias Analysis Overview — the eight structural biases uncovered in Phase 3.1
