Phase 1 is the foundation of the entire HRIA pipeline. Before any transformation or modeling can take place, this phase loads all 11 raw LinkedIn CSV files, inspects their structure, and documents every dimension that will inform downstream cleaning and analysis decisions. By establishing baseline counts, cardinality, and missing-value rates here, the team ensures that every subsequent step is grounded in empirical evidence rather than assumptions about data quality.

Notebook

Fase1_Exploracion_Inicial.ipynb

Libraries

| Library | Version | Purpose |
| --- | --- | --- |
| pandas | 2.2.2 | DataFrame loading, profiling, and aggregation |
| NumPy | 2.0.2 | Numerical operations and null checks |
| warnings | stdlib | Suppress non-critical deprecation warnings |

Dataset Files

All 11 CSV files are sourced from the archive/ directory. Together they form a relational schema with one central fact table (postings.csv) and ten dimension/bridge tables.

| File | Rows | Columns | Description |
| --- | --- | --- | --- |
| postings.csv | 123,849 | 31 | Core job postings (the fact table) |
| companies.csv | 24,428 | 14 | Company profiles |
| company_industries.csv | 36,853 | 2 | Company ↔ industry mapping |
| company_specialities.csv | 34,904 | 2 | Company ↔ speciality mapping |
| job_industries.csv | 148,668 | 2 | Job posting ↔ industry mapping |
| job_skills.csv | 993,400 | 2 | Job posting ↔ skill mapping |
| benefits.csv | 254,126 | 4 | Benefits offered per posting |
| salaries.csv | 9,008 | 8 | Structured salary ranges |
| skills.csv | 35,916 | 2 | Skill ID → skill name lookup |
| industries.csv | 149 | 2 | Industry ID → industry name lookup |
| employee_counts.csv | 54,262 | 5 | Company headcount snapshots |
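
The row and column counts above can double as a load-time sanity check. The sketch below is illustrative only: `EXPECTED_SHAPES`, `load_all`, and `check_shapes` are hypothetical names, the demo column names are made up, and it assumes all 11 files sit directly under archive/ (the actual directory layout may differ).

```python
from pathlib import Path

import pandas as pd

# Expected (rows, columns) per file, copied from the Dataset Files table.
EXPECTED_SHAPES = {
    "postings.csv": (123_849, 31),
    "companies.csv": (24_428, 14),
    "company_industries.csv": (36_853, 2),
    "company_specialities.csv": (34_904, 2),
    "job_industries.csv": (148_668, 2),
    "job_skills.csv": (993_400, 2),
    "benefits.csv": (254_126, 4),
    "salaries.csv": (9_008, 8),
    "skills.csv": (35_916, 2),
    "industries.csv": (149, 2),
    "employee_counts.csv": (54_262, 5),
}


def load_all(data_dir="archive"):
    """Read every expected CSV from data_dir (assumed flat) into a dict of DataFrames."""
    return {name: pd.read_csv(Path(data_dir) / name) for name in EXPECTED_SHAPES}


def check_shapes(frames, expected):
    """Return {filename: (actual_shape_or_None, expected_shape)} for each mismatch."""
    mismatches = {}
    for name, shape in expected.items():
        if name not in frames:
            mismatches[name] = (None, shape)  # file was never loaded
        elif frames[name].shape != shape:
            mismatches[name] = (frames[name].shape, shape)
    return mismatches


# Tiny in-memory demonstration so the sketch runs without the real files.
demo_frames = {
    "industries.csv": pd.DataFrame({"industry_id": [1, 2], "industry_name": ["a", "b"]})
}
demo_mismatches = check_shapes(
    demo_frames, {"industries.csv": (2, 2), "skills.csv": (3, 2)}
)
print(demo_mismatches)
```

Against the real data, the equivalent call would be `check_shapes(load_all("archive"), EXPECTED_SHAPES)`; an empty dict means every file matches the documented counts.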

Key Profiling Performed

For every file, Phase 1 applies the same systematic profiling sequence:
  • .shape — confirm row and column counts before any joins
  • .dtypes — identify numeric vs. object fields and flag mistyped columns (e.g., salary stored as string)
  • .head() — visual spot-check of raw values and formatting inconsistencies
  • .isnull().sum() / .isnull().mean() — absolute and percentage missing-value rates per column
  • .nunique() — cardinality analysis to distinguish identifier columns from categorical ones
  • .value_counts() — frequency distributions for categorical fields such as work_type, formatted_experience_level, and formatted_work_type
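
The profiling sequence above could be wrapped in a single helper so that every file is inspected the same way. This is a minimal sketch, not the notebook's actual code: the `profile` function, its `max_categories` threshold, and the demo data are assumptions made for illustration.

```python
import pandas as pd


def profile(df, name, max_categories=20):
    """Run the shape / dtypes / nulls / cardinality / frequency checks on one DataFrame."""
    print(f"{name}: shape={df.shape}")
    print(df.head())

    # One row per column: dtype, absolute and percentage nulls, cardinality.
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_null": df.isnull().sum(),
        "pct_null": (df.isnull().mean() * 100).round(2),
        "n_unique": df.nunique(),
    })

    # Treat low-cardinality columns as categorical candidates (threshold is arbitrary).
    categorical = summary.index[summary["n_unique"] <= max_categories].tolist()
    frequencies = {col: df[col].value_counts(dropna=False) for col in categorical}
    return summary, frequencies


# Hypothetical miniature of postings.csv, used only to exercise the helper.
demo = pd.DataFrame({
    "job_id": [1, 2, 3, 4],
    "work_type": ["FULL_TIME", "FULL_TIME", "CONTRACT", None],
    "max_salary": [None, 90000.0, None, None],
})
summary, frequencies = profile(demo, "demo", max_categories=3)
print(summary)
```

Columns whose `n_unique` is close to the row count (such as `job_id` here) behave like identifiers, while low-cardinality columns are the ones worth passing to `.value_counts()`.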

Critical Findings

Volume and Breadth

  • 123,849 total job postings spanning 24,428 unique companies
  • 72,521 unique job titles; the most frequent is Sales Manager with 673 occurrences, confirming an extreme long-tail distribution in job title naming

Work Type and Experience

  • About 80% of postings are Full-time (98,814 of 123,849 rows), so Part-time and Contract segments are underrepresented in any aggregate analysis
  • Mid-Senior level is the dominant experience band with 41,489 occurrences, followed by Associate (23,904) and Entry Level (14,157)
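
Shares like the ones above come from frequency counts divided by the total row count. The sketch below shows one way to do that with pandas; `share_table` is a hypothetical helper and the demo series stands in for a column such as formatted_work_type.

```python
import pandas as pd


def share_table(series):
    """Return counts and percentage shares per category, keeping nulls visible."""
    counts = series.value_counts(dropna=False)
    pct = (counts / len(series) * 100).round(1)
    return pd.DataFrame({"count": counts, "pct": pct})


# Arithmetic check of the Full-time share quoted above.
full_time_pct = round(98_814 / 123_849 * 100, 1)
print(full_time_pct)  # 79.8

# Hypothetical stand-in for postings["formatted_work_type"].
demo = pd.Series(["Full-time", "Full-time", "Full-time", "Full-time", "Contract"])
table = share_table(demo)
print(table)
```

Using `dropna=False` keeps missing values in the denominator and in the table, so the shares add up to 100% of the rows rather than 100% of the non-null rows.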

Salary Landscape

  • Median normalized salary ≈ $81,500/year; the mean is significantly inflated by extreme outliers reaching $535M+, confirming the need for IQR-based outlier removal in Phase 2
  • Salary data is sparse: only ~7% of postings carry any salary information, and the pattern of missingness is non-random
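
To see why the median is the safer summary here, and what an IQR-based filter does, consider the sketch below. It applies the conventional 1.5 × IQR fences; the multiplier, the `iqr_bounds` helper, and the sample values are all assumptions for illustration, since the actual Phase 2 thresholds are not stated on this page.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the first and third quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up annual salaries with one extreme outlier.
salaries = pd.Series(
    [52_000, 68_000, 81_500, 95_000, 120_000, 535_000_000], dtype="float64"
)

print(f"mean:   {salaries.mean():,.0f}")    # dominated by the outlier
print(f"median: {salaries.median():,.0f}")  # barely affected by it

low, high = iqr_bounds(salaries)
kept = salaries[salaries.between(low, high)]
print(f"kept {len(kept)} of {len(salaries)} values")
```

A single extreme value is enough to pull the mean far above every typical salary, while the median and the quartile-based fences stay anchored to the bulk of the distribution.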

Missing-Value Pattern

The salary columns (min_salary, max_salary, med_salary) exhibit the highest null rates in the dataset. Crucially, the postings that lack salary data do not look like a random sample: large, well-known employers tend to disclose salaries more readily, while smaller or less competitive employers appear to omit this information.

Phase 1 is where the Missing Not At Random (MNAR) hypothesis is first raised: companies with less competitive compensation appear to systematically omit salary data. At this stage the pattern is documented as an empirical observation only; it is formally tested and quantified in Phase 3: Statistical Analysis.
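
One simple way to look for this kind of pattern is to compare salary disclosure rates across groups of employers. The sketch below is an assumption-laden illustration: `disclosure_rate` is a hypothetical helper and `size_band` is a made-up grouping column, not a field confirmed to exist in the dataset. Only the three salary column names come from the text above.

```python
import pandas as pd

SALARY_COLS = ["min_salary", "max_salary", "med_salary"]


def disclosure_rate(df, group_col):
    """Share of rows per group that have at least one non-null salary field."""
    has_salary = df[SALARY_COLS].notna().any(axis=1)
    return has_salary.groupby(df[group_col]).mean().sort_values(ascending=False)


# Made-up postings; `size_band` is a hypothetical employer grouping.
demo = pd.DataFrame({
    "size_band": ["large", "large", "small", "small", "small"],
    "min_salary": [70_000, None, None, None, None],
    "max_salary": [90_000, None, None, None, 60_000],
    "med_salary": [None, 85_000, None, None, None],
})
rates = disclosure_rate(demo, "size_band")
print(rates)
```

A gap in disclosure rates between groups shows that missingness depends on an observed variable. It is consistent with the MNAR hypothesis but does not prove it, which is why the formal test is deferred to Phase 3.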

Initial Load — Sample Code

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Load the main fact table
postings = pd.read_csv('archive/postings.csv')
print(f"postings shape: {postings.shape}")  # (123849, 31)

# Profile missing values
null_pct = postings.isnull().mean().sort_values(ascending=False)
print(null_pct[null_pct > 0].to_string())

Running the snippet above prints every column that has missing values, ranked by missingness rate. Salary-related fields (med_salary, min_salary, max_salary) appear at the top of this list. High null rates alone do not establish that the data is MNAR, but they are what prompts the hypothesis before any statistical test is applied.

Next Step

With the raw schema fully understood and baseline statistics documented, Phase 2 merges all 11 files into a single master DataFrame, normalizes salary figures to annual USD, and removes outliers in preparation for statistical analysis.

Phase 2: Data Cleaning and Master Dataset Preparation

Merge all 11 CSVs, normalize salaries, filter to data roles, and produce three clean output files.
