Bias analysis is the practice of systematically identifying, naming, and quantifying the ways a dataset deviates from a true representation of the population it is meant to describe. In HR analytics, bias is not merely a statistical inconvenience — it directly determines whether salary benchmarks are fair, whether skill demand signals are trustworthy, and whether hiring recommendation models reinforce or correct existing inequalities. HRIA formally identifies 8 structural biases present in the 124K LinkedIn Job Postings dataset, each with a documented mechanism, a quantified impact, and a recommended mitigation strategy. No analysis in this project should be interpreted without understanding these limitations.

The 8 Structural Biases

| # | Bias Name | Type | Affected Column(s) | Impact Level |
|---|-----------|------|--------------------|--------------|
| 1 | MNAR Salary | Missing Not At Random | salary_annual, min_salary, max_salary | 🔴 Critical |
| 2 | Geographic | Representation | location, comp_country | 🟠 High |
| 3 | Selection | Sampling | All columns | 🟠 High |
| 4 | Gender Proxy | Undisclosed Attribute | Role title (proxy) | 🟡 Medium |
| 5 | Temporal | Time-based | listed_time, original_listed_time | 🟡 Medium |
| 6 | Skill Aggregation | Granularity | job_skills_list, skill_abr | 🟡 Medium |
| 7 | Survivorship | Sampling | postings.csv (active only) | 🟡 Medium |
| 8 | Applies Undercounting | Measurement | applies | 🟢 Low–Medium |

Why This Matters

Any predictive model trained on this dataset without explicit bias correction will produce results that are misleading at best and actively harmful at worst. Concretely:
  • Salary prediction models trained on the full dataset will be biased toward the 24% of postings that disclose pay, which typically come from larger, higher-paying firms, causing salary estimates to be systematically overstated.
  • Skill demand rankings aggregated at the global level will reflect US market priorities, not Spanish labor market realities — directly misaligning hiring recommendations for DataTalent Solutions S.L.’s clients.
  • Application volume metrics built on the applies column will systematically undercount demand for roles that use external application links, producing a distorted picture of which roles are competitive.
  • Hiring fairness models that use role titles as implicit proxies for gender will encode and amplify historical occupational segregation patterns.
The goal of this bias documentation is not to disqualify the dataset — it remains rich and analytically valuable — but to ensure every downstream conclusion is scoped to the population it actually represents, not the population it is assumed to represent.
Models trained on this dataset without bias mitigation may perpetuate geographic and gender-based salary inequalities. Salary benchmarks derived from this data reflect a US-dominant, high-disclosure subset of the labor market and should not be applied directly to Spanish compensation analysis without explicit geographic filtering and disclosure-rate adjustment.
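
The scoping described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes postings are available as dictionaries keyed by the `salary_annual` and `comp_country` columns listed in the table, that a missing salary is represented as `None`, and that Spain is encoded as `"ES"`. The helper names and the sample rows are hypothetical.

```python
from statistics import median


def disclosure_rate(rows):
    """Share of postings whose salary_annual is not missing."""
    if not rows:
        return 0.0
    disclosed = sum(1 for r in rows if r.get("salary_annual") is not None)
    return disclosed / len(rows)


def scoped_salary_summary(rows, country):
    """Median disclosed salary for one country, reported with its coverage.

    Returning the disclosure rate next to the median keeps the benchmark
    tied to the subset of postings it was actually computed from.
    """
    subset = [r for r in rows if r.get("comp_country") == country]
    salaries = [
        r["salary_annual"] for r in subset if r.get("salary_annual") is not None
    ]
    return {
        "country": country,
        "n_postings": len(subset),
        "n_disclosed": len(salaries),
        "disclosure_rate": disclosure_rate(subset),
        "median_salary": median(salaries) if salaries else None,
    }


# Hypothetical rows for illustration only; not taken from the dataset.
sample = [
    {"comp_country": "US", "salary_annual": 120000.0},
    {"comp_country": "US", "salary_annual": None},
    {"comp_country": "US", "salary_annual": 90000.0},
    {"comp_country": "ES", "salary_annual": None},
    {"comp_country": "ES", "salary_annual": 38000.0},
    {"comp_country": "ES", "salary_annual": None},
]

summary_es = scoped_salary_summary(sample, "ES")
summary_us = scoped_salary_summary(sample, "US")
```

Reporting `n_disclosed` and `disclosure_rate` alongside the median makes it visible when a country-level benchmark rests on a small, self-selected set of postings.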

Individual Bias Pages

MNAR Salary Bias

76% of postings hide salary data — and it’s not random. Less competitive employers strategically omit pay.

Geographic Bias

The dataset is overwhelmingly US-centric. Spanish salary and skill benchmarks require significant adjustment.

Selection Bias

LinkedIn captures only publicly posted roles, excluding referrals, internal promotions, and agency placements.

Gender Proxy Bias

No gender field exists. Role-title proxies are an imperfect and ethically constrained substitute.

Temporal Bias

The dataset’s time window shapes which industries and salary levels appear most common.

Skill Aggregation Bias

35 broad categories flatten Python vs SQL, PyTorch vs TensorFlow, and every nuance in between.

Survivorship Bias

Only active postings are captured. Quickly filled and never-posted jobs are invisible.

Applies Undercounting

The applies column only counts Easy Apply submissions — external-link jobs record zero.
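
One possible way to handle this, sketched under stated assumptions: treat the recorded zero on external-link postings as unobserved rather than as a true count. The `application_type` field and its `"OffsiteApply"` value are hypothetical stand-ins for whatever flag identifies external-link postings; only the `applies` column is named in this documentation.

```python
def clean_applies(row):
    """Return the applies count, or None when it cannot be observed.

    ASSUMPTION: "application_type" == "OffsiteApply" marks a posting that
    sends candidates to an external link. Both names are hypothetical.
    """
    if row.get("application_type") == "OffsiteApply":
        return None  # the recorded zero is a measurement gap, not zero demand
    return row.get("applies")


def mean_observed_applies(rows):
    """Average applies over postings where the count is observable."""
    observed = [a for a in (clean_applies(r) for r in rows) if a is not None]
    return sum(observed) / len(observed) if observed else None


# Hypothetical rows for illustration only; not taken from the dataset.
sample = [
    {"application_type": "OnsiteApply", "applies": 10},
    {"application_type": "OffsiteApply", "applies": 0},
    {"application_type": "OnsiteApply", "applies": 30},
    {"application_type": "OffsiteApply", "applies": 0},
]

naive_mean = sum(r["applies"] for r in sample) / len(sample)  # 10.0
observed_mean = mean_observed_applies(sample)                 # 20.0
```

In this toy sample the naive mean is half the observed-only mean, because the external-link zeros are averaged in as if they were real counts.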
