By the end of this guide you will have the HRIA repository on your machine, all Python dependencies installed, the 11 LinkedIn Job Postings CSV files in the expected directory structure, and Phase 1 (Fase1_Exploracion_Inicial.ipynb) running successfully: printing the dataset shape, a full null-value audit across 31 columns, and key descriptive statistics for 123,849 job postings.
Prerequisites
Before you begin, make sure you have the following:
- Python 3.11+: the notebooks declare kernel version 3.11.0. Other 3.x releases may work but are untested.
- Jupyter Notebook or JupyterLab: for local execution. Alternatively, a free Google Colab account removes the need for any local setup.
- Kaggle account: required to download the LinkedIn Job Postings dataset (free registration at kaggle.com).
- Git: to clone the repository.
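
The prerequisites above can be checked from Python before going further. This is a minimal sketch and not part of the HRIA repository: the function name check_prerequisites and the command names it looks for (git, jupyter) are assumptions for illustration, and a tool installed under a different name would be reported as missing.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 11), tools=("git", "jupyter")):
    """Return a dict mapping each prerequisite to True (found) or False."""
    # Compare only (major, minor), e.g. (3, 11) for any Python 3.11.x.
    report = {"python": sys.version_info[:2] >= tuple(min_python)}
    for tool in tools:
        # shutil.which returns None when the command is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If you plan to work only in Google Colab, a missing local jupyter command is expected and can be ignored.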
Clone the Repository
Open a terminal and clone the HRIA repository with git clone. This creates an HRIA/ directory containing the five notebooks, a charts_phase4/ folder with pre-rendered visualisations, and a docs/ folder with the HTML bias report.
Install Python Dependencies
Install the required packages into your active Python 3.11 environment, then verify the key library versions after installation.
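
One way to verify the installed versions is with importlib.metadata from the standard library. The sketch below is an illustration, not the guide's original command: seaborn and scipy are named elsewhere in this guide, while pandas and numpy are assumed here because the notebooks work with DataFrames. Adjust the list to match what you actually installed.

```python
from importlib import metadata

# seaborn and scipy are named in this guide; pandas and numpy are assumptions.
PACKAGES = ("pandas", "numpy", "seaborn", "scipy")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED should be installed into the same environment that runs the Jupyter kernel.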
Download the LinkedIn Job Postings Dataset from Kaggle
The dataset is the LinkedIn Job Postings collection by Arsh Kon, available at:
kaggle.com/datasets/arshkon/linkedin-job-postings
Option A, Kaggle CLI (recommended): download the dataset with the Kaggle command-line tool and unzip it locally.
Option B, manual download: log in to Kaggle, navigate to the dataset page above, click Download, and unzip the resulting archive locally.
Both options produce the same archive/ directory containing all 11 CSV files. The Kaggle CLI is faster for automation or Colab environments where you can upload kaggle.json to /root/.kaggle/.
Place the CSV Files in the Expected Directory Structure
Move or unzip the downloaded files so that archive/ sits inside your HRIA/ project folder. The notebooks resolve all paths relative to this location.
If you are using Google Colab, upload the entire archive/ folder to your Google Drive (e.g. at MyDrive/archive/). The first code cell in each notebook mounts Drive automatically and sets the working directory to /content/drive/MyDrive/archive.
Open and Run Phase 1
Launch Jupyter and open the first notebook, Fase1_Exploracion_Inicial.ipynb, then run all cells in order (Cell → Run All). Phase 1 will:
- Detect the runtime (Colab or local) and configure paths accordingly
- Load all 11 CSV files into separate DataFrames
- Print the shape of the main postings table
- Display dtypes and non-null counts for all 31 columns
- Generate a ranked null-value summary table
- Produce descriptive statistics (df.describe()) for numeric columns
Key takeaways from Phase 1:
- med_salary is ~95% null, effectively unusable without imputation
- min_salary/max_salary are available for only ~24% of postings
- formatted_experience_level is missing for ~24% of rows, limiting experience-salary analysis
- Fields like job_id, title, location, and work_type are 100% complete
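
The null-value audit and descriptive statistics described above can be sketched with pandas as follows. This is not the notebook's actual code: the helper null_summary is a hypothetical name, and the four-row DataFrame is invented sample data that only borrows a few column names from the postings table.

```python
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Rank columns by missing values, with counts and percentages."""
    null_count = df.isna().sum()
    summary = pd.DataFrame({
        "null_count": null_count,
        "null_pct": (null_count / len(df) * 100).round(2),
    })
    return summary.sort_values("null_count", ascending=False)


# Invented sample rows standing in for the real postings table.
sample = pd.DataFrame({
    "job_id": [1, 2, 3, 4],
    "title": ["Analyst", "Engineer", "Recruiter", "Designer"],
    "med_salary": [None, None, None, 52000.0],
    "min_salary": [40000.0, None, None, 45000.0],
})

summary = null_summary(sample)
print(sample.shape)       # (rows, columns)
sample.info()             # dtypes and non-null counts per column
print(summary)            # ranked null-value table
print(sample.describe())  # descriptive statistics for numeric columns
```

On the real data you would pass the DataFrame loaded from the main postings CSV instead of the sample.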
(Optional) Run on Google Colab
Each notebook includes an Open in Colab badge at the top. Click the badge to open the notebook directly in Colab without any local setup.
Once open in Colab:
- Mount Google Drive: the first code cell handles this automatically when it detects the Colab environment.
- Confirm that your archive/ folder is at MyDrive/archive in Drive (matching the RUTA_DRIVE variable in that first cell). Adjust the path in the cell if your folder is named differently.
- Run all cells with Runtime → Run all.
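
The runtime detection and path check in these steps might look roughly like the sketch below. It is an approximation, not the notebooks' actual first cell: RUTA_DRIVE reuses the Drive path given in this guide, while LOCAL_DIR, the function names, and the recursive search (in case the archive contains subfolders) are assumptions. The drive.mount call is only reached inside Colab.

```python
from pathlib import Path

RUTA_DRIVE = "/content/drive/MyDrive/archive"  # Colab path given in this guide
LOCAL_DIR = "archive"  # assumption: archive/ sits inside the HRIA/ project folder
EXPECTED_CSV_COUNT = 11


def in_colab() -> bool:
    """True when the google.colab module can be imported (i.e. inside Colab)."""
    try:
        import google.colab  # noqa: F401
    except ImportError:
        return False
    return True


def resolve_data_dir() -> Path:
    """Mount Drive on Colab and return the folder that should hold the CSVs."""
    if in_colab():
        from google.colab import drive

        drive.mount("/content/drive")
        return Path(RUTA_DRIVE)
    return Path(LOCAL_DIR)


def find_csv_files(data_dir) -> list:
    """List CSV files under data_dir, searching subfolders as well."""
    return sorted(Path(data_dir).rglob("*.csv"))


if __name__ == "__main__":
    data_dir = resolve_data_dir()
    found = find_csv_files(data_dir)
    print(f"{data_dir}: {len(found)} of {EXPECTED_CSV_COUNT} CSV files found")
```

A count below 11 usually means the folder is in a different Drive location or the archive was only partially unzipped.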
Google Colab sessions are ephemeral, so Drive must be remounted each session. The install cell re-installs seaborn and scipy on every run, which takes ~30 seconds.
Running the Full Pipeline
| Order | Notebook | Reads | Writes |
|---|---|---|---|
| 1 | Fase1_Exploracion_Inicial.ipynb | 11 raw CSVs | — |
| 2 | Fase2_Limpieza_Preparacion.ipynb | 11 raw CSVs | data_maestro_completo.csv, data_roles_completo.csv, data_roles_salario.csv |
| 3 | Fase3_Analisis_Estadistico_Sesgos.ipynb | data_roles_completo.csv, data_roles_salario.csv | — |
| 4 | Fase3_1_Informe_de_Sesgos.ipynb | data_roles_completo.csv, data_roles_salario.csv, data_maestro_completo.csv | — |
| 5 | Phase4_Visualization.ipynb | data_roles_completo.csv, data_roles_salario.csv, data_maestro_completo.csv | 11 chart PNGs in charts_phase4/ |
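
The dependencies in the table above can be expressed as data and checked before running a notebook. This is a hedged sketch: the file names are transcribed from the table, but the helper names and the output_dir parameter are hypothetical, since this guide does not state where Phase 2 writes its CSV files. Raw CSV inputs and chart PNG outputs are left out of the check.

```python
from pathlib import Path

# (notebook, derived files read, files written), transcribed from the table.
PIPELINE = [
    ("Fase1_Exploracion_Inicial.ipynb", [], []),
    ("Fase2_Limpieza_Preparacion.ipynb", [],
     ["data_maestro_completo.csv", "data_roles_completo.csv",
      "data_roles_salario.csv"]),
    ("Fase3_Analisis_Estadistico_Sesgos.ipynb",
     ["data_roles_completo.csv", "data_roles_salario.csv"], []),
    ("Fase3_1_Informe_de_Sesgos.ipynb",
     ["data_roles_completo.csv", "data_roles_salario.csv",
      "data_maestro_completo.csv"], []),
    ("Phase4_Visualization.ipynb",
     ["data_roles_completo.csv", "data_roles_salario.csv",
      "data_maestro_completo.csv"], []),
]


def unsatisfied_reads(pipeline=PIPELINE):
    """Return (notebook, file) pairs read before any earlier step writes them."""
    produced, problems = set(), []
    for notebook, reads, writes in pipeline:
        problems += [(notebook, f) for f in reads if f not in produced]
        produced.update(writes)
    return problems


def missing_inputs(notebook, output_dir, pipeline=PIPELINE):
    """Return the derived inputs of `notebook` not yet present in output_dir."""
    reads = next(r for name, r, _ in pipeline if name == notebook)
    return [f for f in reads if not (Path(output_dir) / f).is_file()]
```

For example, calling missing_inputs("Phase4_Visualization.ipynb", ".") before Phase 2 has run would list all three derived CSV files, which signals that the earlier notebooks need to be executed first.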
Next Steps
Once Phase 1 is running, explore the rest of the documentation to understand what each subsequent phase does before you execute it:
- Dataset Overview — detailed schema for all 11 CSV files
- Phase 1 — Initial Exploration — full walkthrough of the exploration notebook
- Phase 2 — Cleaning & Preparation — how salary normalisation and master joins are constructed
- Bias Analysis Overview — the eight structural biases uncovered in Phase 3.1