By the end of this guide you will have the HRIA repository on your machine, all Python dependencies installed, the 11 LinkedIn Job Postings CSV files in the expected directory structure, and Phase 1 (Fase1_Exploracion_Inicial.ipynb) running successfully — printing the dataset shape, a full null-value audit across 31 columns, and key descriptive statistics for 123,849 job postings.

Prerequisites

Before you begin, make sure you have the following:
  • Python 3.11+ — the notebooks declare kernel version 3.11.0. Other 3.x releases may work but are untested.
  • Jupyter Notebook or JupyterLab — for local execution. Alternatively, a free Google Colab account removes the need for any local setup.
  • Kaggle account — required to download the LinkedIn Job Postings dataset (free registration at kaggle.com).
  • Git — to clone the repository.
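The checklist above can be verified from Python before going further. This is a minimal sketch: check_prerequisites is a hypothetical helper that is not part of the HRIA repository, and it only tests that the interpreter is 3.11 or newer and that git and jupyter executables are on PATH.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 11)):
    """Return a dict mapping each prerequisite to True/False (hypothetical helper)."""
    return {
        "python": sys.version_info[:2] >= min_python,
        "git": shutil.which("git") is not None,
        "jupyter": shutil.which("jupyter") is not None,
    }


# Print one status line per prerequisite.
for name, ok in check_prerequisites().items():
    print(f"{name:8s} {'OK' if ok else 'MISSING'}")
```

The Kaggle account cannot be checked this way; it is only needed at the download step.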

Step 1: Clone the Repository

Open a terminal and run:
git clone https://github.com/MajoRodri/HRIA.git && cd HRIA
This creates an HRIA/ directory containing the five notebooks, a charts_phase4/ folder with pre-rendered visualisations, and a docs/ folder with the HTML bias report.
HRIA/
├── Fase1_Exploracion_Inicial.ipynb
├── Fase2_Limpieza_Preparacion.ipynb
├── Fase3_Analisis_Estadistico_Sesgos.ipynb
├── Fase3_1_Informe_de_Sesgos.ipynb
├── Phase4_Visualization.ipynb
├── charts_phase4/
└── docs/

Step 2: Install Python Dependencies

Install the required packages into your active Python 3.11 environment:
pip install pandas numpy scipy seaborn jupyter
Verify the key library versions after installation:
import pandas as pd
import numpy as np

print(f"Pandas  version: {pd.__version__}")   # expected: 2.2.2
print(f"NumPy   version: {np.__version__}")    # expected: 2.0.2
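To check every package from the install command at once, the standard library's importlib.metadata can report installed versions without importing the packages. A minimal sketch; installed_versions is a hypothetical helper name, and the REQUIRED list simply mirrors the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the pip install command above.
REQUIRED = ["pandas", "numpy", "scipy", "seaborn", "jupyter"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:8s} {ver or 'not installed'}")
```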

Step 3: Download the LinkedIn Job Postings Dataset from Kaggle

The dataset is the LinkedIn Job Postings collection by Arsh Kon, available at kaggle.com/datasets/arshkon/linkedin-job-postings.
Option A (Kaggle CLI, recommended):
# Install the Kaggle CLI if you don't have it
pip install kaggle

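# Place your kaggle.json API token at ~/.kaggle/kaggle.json, then run this from inside HRIA/.
# -p archive extracts the files into an archive/ folder rather than the current directory.
kaggle datasets download -d arshkon/linkedin-job-postings -p archive --unzip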
Option B (manual download):
Log in to Kaggle, navigate to the dataset page above, click Download, and unzip the resulting archive locally.
Both options produce the same archive/ directory containing all 11 CSV files. The Kaggle CLI is faster for automation or Colab environments where you can upload kaggle.json to /root/.kaggle/.
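Placing the API token can also be scripted, which is convenient in Colab. The following is a minimal sketch, assuming kaggle.json has already been uploaded to the working directory; install_kaggle_token is a hypothetical helper name. It restricts the file to owner read/write, since the Kaggle CLI warns when the token is readable by other users.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(src="kaggle.json", dest_dir=None):
    """Copy the token into ~/.kaggle (or dest_dir) and make it owner-only."""
    dest_dir = Path(dest_dir) if dest_dir else Path.home() / ".kaggle"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / "kaggle.json"
    shutil.copyfile(src, dest)
    os.chmod(dest, 0o600)  # owner read/write only
    return dest


# Example call (commented out because it needs a real token file):
# install_kaggle_token("kaggle.json")
```

In a Colab session running as root, the default destination should resolve to /root/.kaggle/kaggle.json, the location mentioned above.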

Step 4: Place the CSV Files in the Expected Directory Structure

Move or unzip the downloaded files so that archive/ sits inside your HRIA/ project folder. The notebooks resolve all paths relative to this location.
HRIA/
└── archive/
    ├── postings.csv                  ← main table (123,849 rows × 31 columns)
    ├── companies/
    │   ├── companies.csv
    │   ├── company_industries.csv
    │   ├── company_specialities.csv
    │   └── employee_counts.csv
    ├── jobs/
    │   ├── benefits.csv
    │   ├── job_industries.csv
    │   ├── job_skills.csv
    │   └── salaries.csv
    └── mappings/
        ├── industries.csv
        └── skills.csv
If you are using Google Colab, upload the entire archive/ folder to your Google Drive (e.g. at MyDrive/archive/). The first code cell in each notebook mounts Drive automatically and sets the working directory to /content/drive/MyDrive/archive.
Verify that postings.csv is at the root of archive/ and not nested one level deeper inside a subdirectory. A common mistake after unzipping is ending up with archive/archive/postings.csv.
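The layout can be checked programmatically before opening any notebook. This is a sketch under the assumption that it is run from the HRIA/ project folder; check_archive is a hypothetical helper, and the file list is copied from the tree above.

```python
from pathlib import Path

# The 11 CSV files, relative to archive/, as listed in the tree above.
EXPECTED_FILES = [
    "postings.csv",
    "companies/companies.csv",
    "companies/company_industries.csv",
    "companies/company_specialities.csv",
    "companies/employee_counts.csv",
    "jobs/benefits.csv",
    "jobs/job_industries.csv",
    "jobs/job_skills.csv",
    "jobs/salaries.csv",
    "mappings/industries.csv",
    "mappings/skills.csv",
]


def check_archive(archive_dir="archive"):
    """Return (missing_files, nested) for the given archive/ folder."""
    root = Path(archive_dir)
    missing = [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
    # Detect the archive/archive/postings.csv mistake described above.
    nested = (root / "archive" / "postings.csv").is_file()
    return missing, nested


missing, nested = check_archive("archive")
if nested:
    print("Found archive/archive/: move its contents up one level.")
print("All 11 files present." if not missing else f"Missing: {missing}")
```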

Step 5: Open and Run Phase 1

Launch Jupyter and open the first notebook:
jupyter notebook Fase1_Exploracion_Inicial.ipynb
Then run all cells in order (Cell → Run All). Phase 1 will:
  1. Detect the runtime (Colab or local) and configure paths accordingly
  2. Load all 11 CSV files into separate DataFrames
  3. Print the shape of the main postings table
  4. Display dtypes and non-null counts for all 31 columns
  5. Generate a ranked null-value summary table
  6. Produce descriptive statistics (df.describe()) for numeric columns
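Steps 3 to 6 can be sketched with pandas as follows. This is an illustration rather than the notebook's actual code: null_summary is a hypothetical helper, and the tiny demo DataFrame stands in for postings.csv so the snippet runs without the dataset.

```python
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Rank columns by number of missing values (hypothetical helper)."""
    nulls = df.isna().sum()
    summary = pd.DataFrame({
        "nulls": nulls,
        "missing_pct": (nulls / len(df) * 100).round(1),
    })
    # Keep only columns with at least one null, largest first.
    return summary[summary["nulls"] > 0].sort_values("nulls", ascending=False)


# Tiny stand-in for postings.csv; column names are borrowed from the real table.
demo = pd.DataFrame({
    "job_id": [1, 2, 3, 4],
    "title": ["a", "b", "c", "d"],
    "med_salary": [None, None, None, 50000.0],
    "applies": [3.0, None, None, 7.0],
})

print(demo.shape)           # shape of the table
demo.info()                 # dtypes and non-null counts
print(null_summary(demo))   # ranked null-value summary
print(demo.describe())      # descriptive statistics for numeric columns
```

With the real data, demo would be replaced by something like pd.read_csv("archive/postings.csv"), assuming the directory layout from the previous step.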
What to expect after a successful run:
# Dataset shape
(123849, 31)

# Null-value summary (top columns by missing rate)
Column                        Nulls     Missing %
─────────────────────────────────────────────────
closed_time                  122,776      99.1 %
skills_desc                  121,410      98.0 %
med_salary                   117,569      94.9 %
remote_allowed               108,603      87.7 %
applies                      100,529      81.2 %
min_salary                    94,056      75.9 %
max_salary                    94,056      75.9 %
pay_period / currency /
  compensation_type /
  normalized_salary            87,776      70.9 %
posting_domain                39,968      32.3 %
application_url               36,665      29.6 %
formatted_experience_level    29,409      23.7 %
Key takeaways from Phase 1:
  • med_salary is ~95 % null — effectively unusable without imputation
  • min_salary / max_salary are available for only ~24 % of postings
  • formatted_experience_level is missing for ~24 % of rows, limiting experience-salary analysis
  • Fields like job_id, title, location, and work_type are 100 % complete
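The percentages in the summary follow directly from the null counts and the 123,849-row total. The short check below reproduces a few of them; missing_pct is a hypothetical helper, and the counts are copied from the table above.

```python
TOTAL_ROWS = 123_849

# Null counts copied from the Phase 1 summary above.
null_counts = {
    "closed_time": 122_776,
    "med_salary": 117_569,
    "min_salary": 94_056,
    "formatted_experience_level": 29_409,
}


def missing_pct(nulls, total=TOTAL_ROWS):
    """Share of rows with a null value, as a percentage rounded to one decimal."""
    return round(nulls / total * 100, 1)


for column, nulls in null_counts.items():
    populated = TOTAL_ROWS - nulls
    print(f"{column:28s} {missing_pct(nulls):5.1f} % missing, {populated:,} rows populated")
```

For min_salary this leaves 29,793 populated rows, which is the "~24 %" figure in the takeaways.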

Step 6: (Optional) Run on Google Colab

Each notebook includes an Open in Colab badge at the top. Click the badge to open the notebook directly in Colab without any local setup.
Once open in Colab:
  1. Mount Google Drive — the first code cell handles this automatically when it detects the Colab environment:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    import os
    RUTA_DRIVE = '/content/drive/MyDrive/archive'
    os.chdir(RUTA_DRIVE)
    import subprocess
    subprocess.run(['pip', 'install', '-q', 'seaborn', 'scipy'])
  2. Confirm that your archive/ folder is at MyDrive/archive in Drive (matching the RUTA_DRIVE variable above). Adjust the path in the cell if your folder is named differently.
  3. Run all cells with Runtime → Run all.
Google Colab sessions are ephemeral — Drive must be remounted each session. The install cell re-installs seaborn and scipy on every run, which takes ~30 seconds.
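The runtime detection can be reduced to a single function that returns the data directory for either environment. This is a sketch, not the notebooks' code: resolve_data_dir is a hypothetical helper, the Colab path is the RUTA_DRIVE value shown above, and the local branch assumes archive/ sits inside the project folder as described in Step 4.

```python
import sys
from pathlib import Path

# Same location as RUTA_DRIVE in the notebook cell above.
COLAB_DATA_DIR = Path("/content/drive/MyDrive/archive")


def resolve_data_dir(project_root=".", in_colab=None):
    """Return the folder holding the raw CSVs for the current runtime."""
    if in_colab is None:
        in_colab = "google.colab" in sys.modules
    if in_colab:
        return COLAB_DATA_DIR
    # Local run: assume archive/ sits inside the project folder.
    return Path(project_root) / "archive"


print(resolve_data_dir())
```

Passing in_colab explicitly makes the function easy to exercise outside Colab; Drive still has to be mounted separately before the Colab path exists.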

Running the Full Pipeline

Notebooks must be executed in order: Fase1 → Fase2 → Fase3 → Fase3_1 → Phase4. Phase 2 writes data_maestro_completo.csv, data_roles_completo.csv, and data_roles_salario.csv, and Phases 3, 3.1, and 4 read those files, so they exist only after Phase 2 has completed successfully. Skipping ahead will raise a FileNotFoundError.
Order  Notebook                                 Reads                       Writes
───────────────────────────────────────────────────────────────────────────────────────────────────────────
1      Fase1_Exploracion_Inicial.ipynb          11 raw CSVs
2      Fase2_Limpieza_Preparacion.ipynb         11 raw CSVs                 data_maestro_completo.csv,
                                                                            data_roles_completo.csv,
                                                                            data_roles_salario.csv
3      Fase3_Analisis_Estadistico_Sesgos.ipynb  data_roles_completo.csv,
                                                data_roles_salario.csv
4      Fase3_1_Informe_de_Sesgos.ipynb          data_roles_completo.csv,
                                                data_roles_salario.csv,
                                                data_maestro_completo.csv
5      Phase4_Visualization.ipynb               data_roles_completo.csv,    11 chart PNGs in charts_phase4/
                                                data_roles_salario.csv,
                                                data_maestro_completo.csv
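A quick pre-flight check avoids the FileNotFoundError when starting from Phase 3 or later. This is a minimal sketch: missing_phase2_outputs is a hypothetical helper, and data_dir is an assumption, since the folder Phase 2 writes to depends on the working directory the notebook was run from.

```python
from pathlib import Path

# Files written by Fase2_Limpieza_Preparacion.ipynb, per the table above.
PHASE2_OUTPUTS = [
    "data_maestro_completo.csv",
    "data_roles_completo.csv",
    "data_roles_salario.csv",
]


def missing_phase2_outputs(data_dir="."):
    """List the Phase 2 output files that are not present in data_dir."""
    return [name for name in PHASE2_OUTPUTS if not (Path(data_dir) / name).is_file()]


missing = missing_phase2_outputs(".")
if missing:
    print("Run Fase2_Limpieza_Preparacion.ipynb first; missing:", ", ".join(missing))
else:
    print("Phase 2 outputs found.")
```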
Fase3_1_Informe_de_Sesgos.ipynb is approximately 8.5 MB on disk due to embedded plot outputs stored inside the notebook JSON. It may take 15–30 seconds to open in Jupyter or VS Code, and GitHub will warn that it is too large to render in the browser. If the notebook appears blank or unresponsive, use File → Close and Halt, wait a moment, and reopen. Alternatively, clear all outputs in place before opening (jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to notebook --inplace Fase3_1_Informe_de_Sesgos.ipynb). Without --inplace, nbconvert writes the cleared copy to a separate .nbconvert.ipynb file and leaves the original untouched.
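Because a notebook is a JSON file, the stored outputs can also be stripped with the standard library when nbconvert is not available. This is a sketch assuming the nbformat 4 layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count"); both helper names are hypothetical.

```python
import json
from pathlib import Path


def clear_outputs(notebook: dict) -> dict:
    """Remove outputs and execution counts from every code cell (nbformat 4 layout)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_notebook_file(src, dest):
    """Write a copy of src with outputs stripped, leaving the original untouched."""
    nb = json.loads(Path(src).read_text(encoding="utf-8"))
    Path(dest).write_text(json.dumps(clear_outputs(nb), indent=1), encoding="utf-8")


# Example call (the output file name is arbitrary):
# clear_notebook_file("Fase3_1_Informe_de_Sesgos.ipynb", "Fase3_1_sin_outputs.ipynb")
```

Writing to a separate file keeps the rendered plots in the original notebook available for reference.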

Next Steps

Once Phase 1 is running, explore the rest of the documentation to understand what each subsequent phase does before you execute it.