Phase 1: Initial Data Exploration of LinkedIn Postings

Phase 1 is the foundation of the entire HRIA pipeline. Before any transformation or modeling can take place, this phase loads all 11 raw LinkedIn CSV files, inspects their structure, and documents every dimension that will inform downstream cleaning and analysis decisions. By establishing baseline counts, cardinality, and missing-value rates here, the team ensures that every subsequent step is grounded in empirical evidence rather than assumptions about data quality.

Notebook

Fase1_Exploracion_Inicial.ipynb

Libraries

Library	Version	Purpose
`pandas`	2.2.2	DataFrame loading, profiling, and aggregation
`numpy`	2.0.2	Numerical operations and null checks
`warnings`	stdlib	Suppress non-critical deprecation warnings

Dataset Files

All 11 CSV files are sourced from the archive/ directory. Together they form a relational schema with one central fact table (postings.csv) and ten dimension/bridge tables.

File	Rows	Columns	Description
`postings.csv`	123,849	31	Core job postings — the fact table
`companies.csv`	24,428	14	Company profiles
`company_industries.csv`	36,853	2	Company ↔ industry mapping
`company_specialities.csv`	34,904	2	Company ↔ speciality mapping
`job_industries.csv`	148,668	2	Job posting ↔ industry mapping
`job_skills.csv`	993,400	2	Job posting ↔ skill mapping
`benefits.csv`	254,126	4	Benefits offered per posting
`salaries.csv`	9,008	8	Structured salary ranges
`skills.csv`	35,916	2	Skill ID → skill name lookup
`industries.csv`	149	2	Industry ID → industry name lookup
`employee_counts.csv`	54,262	5	Company headcount snapshots
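
The row and column counts in the table above can be regenerated rather than typed by hand. The following is a minimal sketch, assuming all CSV files sit directly inside one directory such as archive/; the helper names `load_tables` and `summarize_shapes` are illustrative and do not come from the notebook.

```python
from pathlib import Path

import pandas as pd


def load_tables(directory):
    """Read every CSV in `directory` into a dict keyed by file stem."""
    tables = {}
    for path in sorted(Path(directory).glob("*.csv")):
        tables[path.stem] = pd.read_csv(path)
    return tables


def summarize_shapes(tables):
    """Return one row per table with its row and column counts."""
    rows = [
        {"file": f"{name}.csv", "rows": df.shape[0], "columns": df.shape[1]}
        for name, df in tables.items()
    ]
    return pd.DataFrame(rows, columns=["file", "rows", "columns"])
```

Calling `summarize_shapes(load_tables("archive"))` would then yield one line per file, which can be compared against the table to catch a truncated download or a changed export.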

Key Profiling Performed

For every file, Phase 1 applies the same systematic profiling sequence:

.shape — confirm row and column counts before any joins
.dtypes — identify numeric vs. object fields and flag mistyped columns (e.g., salary stored as string)
.head() — visual spot-check of raw values and formatting inconsistencies
.isnull().sum() / .isnull().mean() — absolute and percentage missing-value rates per column
.nunique() — cardinality analysis to distinguish identifier columns from categorical ones
.value_counts() — frequency distributions for categorical fields such as work_type, formatted_experience_level, and formatted_work_type
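
The sequence above can be bundled into one reusable helper so that every file is profiled the same way. This is a sketch under stated assumptions: the function name `profile` and the toy `sample` frame are made up for illustration, and only the pandas calls listed above are used.

```python
import pandas as pd


def profile(df):
    """Per-column summary: dtype, null count, null share, and cardinality."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_count": df.isnull().sum(),
        "null_pct": df.isnull().mean().round(4),
        "n_unique": df.nunique(),
    })


# Tiny stand-in for a raw table; the values are invented for illustration.
sample = pd.DataFrame({
    "job_id": [1, 2, 3, 4],
    "work_type": ["FULL_TIME", "FULL_TIME", "CONTRACT", None],
    "max_salary": ["90000", None, None, None],  # numbers stored as strings
})

print(sample.shape)                                    # row and column counts
print(profile(sample))                                 # dtypes, nulls, cardinality
print(sample["work_type"].value_counts(dropna=False))  # frequencies incl. nulls
```

A column whose `n_unique` equals the row count (here `job_id`) behaves like an identifier, while a low `n_unique` suggests a categorical field. A salary column reported with a non-numeric dtype is the kind of mistyped field the `.dtypes` step is meant to flag.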

Critical Findings

Volume and Breadth

123,849 total job postings spanning 24,428 unique companies
72,521 unique job titles; even the most frequent, Sales Manager, appears only 673 times (about 0.5 % of postings), which points to an extreme long-tail distribution in job title naming
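
How long the tail is can be measured directly from the title column. The sketch below is illustrative only: `title_concentration` is a hypothetical helper and the titles are toy values, not rows from postings.csv.

```python
import pandas as pd


def title_concentration(titles, top_n=10):
    """Summarize how concentrated a column of job titles is."""
    counts = titles.value_counts()  # sorted descending, nulls excluded
    return {
        "unique_titles": int(counts.size),
        "top_title": counts.index[0],
        "top_title_count": int(counts.iloc[0]),
        # Share of all non-null rows covered by the top_n most common titles
        "top_n_share": float(counts.head(top_n).sum() / counts.sum()),
        # Share of distinct titles that occur exactly once
        "singleton_share": float((counts == 1).mean()),
    }


titles = pd.Series(
    ["Sales Manager"] * 3
    + ["Data Analyst"] * 2
    + ["Night Baker", "Zoo Curator", "Tax Clerk"]
)
print(title_concentration(titles, top_n=2))
```

A low `top_n_share` combined with a high `singleton_share` is the signature of a long tail, and it is a reason to normalize or group titles before aggregating on them.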

Work Type and Experience

About 80 % of postings are Full-time (98,814 of 123,849 rows), so Part-time and Contract segments carry little weight in any aggregate analysis and need to be examined separately
Mid-Senior level is the dominant experience band with 41,489 occurrences, followed by Associate (23,904) and Entry Level (14,157)
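
The shares behind these counts follow from simple division by the total row count. The snippet below is a worked check that uses only the figures quoted in this section; the variable names are illustrative.

```python
import pandas as pd

TOTAL_POSTINGS = 123_849  # row count of postings.csv reported above

# Counts quoted in this section (one work type, three experience bands)
reported = pd.Series({
    "Full-time": 98_814,
    "Mid-Senior level": 41_489,
    "Associate": 23_904,
    "Entry Level": 14_157,
})

# Share of all postings, in percent, rounded to one decimal place
share_pct = (reported / TOTAL_POSTINGS * 100).round(1)
print(share_pct)
```

On the real frame the same shares would come from `value_counts(normalize=True)` on the relevant column. Note that `value_counts` drops nulls by default, so `dropna=False` is needed for the denominator to be the full row count.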

Salary Landscape

Median normalized salary ≈ $81,500/year; the mean is heavily inflated by extreme outliers reaching $535M+, confirming the need for IQR-based outlier removal in Phase 2
Salary data is sparse: only ~7 % of postings carry any salary information, and the pattern of missingness appears to be non-random
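
A small example shows why the median is the safer summary here and what an IQR rule would flag. This is a sketch, not the Phase 2 implementation: the salaries are invented, and the 1.5 multiplier is the conventional Tukey choice, assumed here rather than taken from the notebook.

```python
import pandas as pd

# Made-up annual salaries with one extreme value, standing in for the
# normalized salary column; these are not figures from the dataset.
salaries = pd.Series(
    [52_000, 64_000, 75_000, 81_500, 90_000, 110_000, 535_000_000]
)

print(f"mean:   {salaries.mean():,.0f}")    # dragged upward by the outlier
print(f"median: {salaries.median():,.0f}")  # unaffected by the outlier

# Tukey fences: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = salaries.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inliers = salaries[salaries.between(lower, upper)]

print(f"fences: [{lower:,.0f}, {upper:,.0f}]")
print(f"kept {len(inliers)} of {len(salaries)} values")
```

A single extreme value moves the mean by orders of magnitude while leaving the median and the quartiles untouched, which is why quartile-based fences remain stable even when the outliers themselves are very large.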

Missing-Value Pattern

The salary columns (min_salary, max_salary, med_salary) exhibit the highest null rates in the dataset. Crucially, the companies with missing salary data do not look like a random sample: large, well-known employers tend to disclose salaries more readily, while smaller or less competitive employers omit this information more often.

Phase 1 is where the Missing Not At Random (MNAR) hypothesis is first observed: companies with less competitive compensation appear to systematically omit salary data. This pattern is documented here as an empirical observation and is formally tested and quantified in Phase 3 — Statistical Analysis.
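
A first descriptive look at this pattern is to compare the share of missing salaries across groups of postings. The sketch below uses a toy frame; `company_size` is a hypothetical grouping column introduced only for illustration, and this comparison is not the formal test carried out in Phase 3.

```python
import pandas as pd

# Toy frame; `company_size` is a hypothetical grouping column and the
# salary values are invented.
df = pd.DataFrame({
    "company_size": ["large"] * 4 + ["small"] * 4,
    "med_salary": [95_000, 88_000, None, 102_000, None, None, 61_000, None],
})

# Indicator for "salary not disclosed", then its rate per group
df["salary_missing"] = df["med_salary"].isnull()
missing_by_group = (
    df.groupby("company_size")["salary_missing"].agg(["mean", "sum", "count"])
)
print(missing_by_group)
```

A gap between groups shows that missingness depends on an observed attribute, which rules out a completely random mechanism. On its own it cannot separate MAR from MNAR, because MNAR concerns dependence on the undisclosed salary itself.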

Initial Load — Sample Code

import pandas as pd
import numpy as np
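import warnings

# Silence only deprecation-style warnings so that data-quality warnings
# (for example mixed-dtype notices from read_csv) stay visible
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)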

# Load the main fact table
postings = pd.read_csv('archive/postings.csv')
print(f"postings shape: {postings.shape}")  # (123849, 31)

# Profile missing values
null_pct = postings.isnull().mean().sort_values(ascending=False)
print(null_pct[null_pct > 0].to_string())

Running the snippet above produces a list of columns ranked by missingness rate. Salary-related fields (med_salary, min_salary, max_salary) consistently appear at the top of this list. A high null rate shows how sparse the salary data is, but it does not by itself reveal why values are missing, which is why the MNAR hypothesis is tested formally in Phase 3.

Next Step

With the raw schema fully understood and baseline statistics documented, Phase 2 merges all 11 files into a single master DataFrame, normalizes salary figures to annual USD, and removes outliers in preparation for statistical analysis.

Phase 2: Data Cleaning and Master Dataset Preparation

Merge all 11 CSVs, normalize salaries, filter to data roles, and produce three clean output files.
