One of the most important questions in HR analytics is whether compensation and opportunity are distributed equitably across gender. It is also one of the questions this dataset is structurally least equipped to answer. LinkedIn job postings contain no gender field — not for the hiring company, not for the role, and not for any applicant. Gender is entirely absent as an explicit variable. This absence is not a minor gap; it is the defining limitation for any fairness or equity analysis built on this data. Understanding why gender is missing, what imperfect proxies exist, and why those proxies carry serious ethical risks is essential before any analysis touches on occupational equity.

Why Gender Is Absent

LinkedIn's job postings data does not include applicant gender. Several factors contribute:
  • Privacy regulation: EU GDPR and US equal employment opportunity law restrict the collection and use of protected attributes in hiring contexts
  • Platform design: LinkedIn’s job posting product is designed for employers to describe roles, not to capture demographic data about applicants
  • API and data licensing constraints: even where LinkedIn collects demographic signals internally, this information is not exposed in dataset exports or through the API that produced this dataset
The result is a dataset of 124K job postings in which the single most important protected attribute for salary equity analysis is completely unobservable.

The Proxy Approach

In the absence of direct gender data, some analysts use occupational gender coding as a proxy: mapping job titles to historically male-coded or female-coded occupations based on workforce composition data from labor surveys. This approach, explored in the Phase 3 and Phase 4 analyses of HRIA, works as follows.
Female-coded role proxies (occupations with historically high female workforce representation):
  • HR Coordinator, HR Generalist, HR Manager
  • Administrative Assistant, Executive Assistant
  • Recruiter, Talent Acquisition Specialist
  • Marketing Coordinator, Content Writer
  • Nurse, Healthcare Coordinator
Male-coded role proxies (occupations with historically high male workforce representation):
  • Software Engineer, Data Scientist, Data Engineer
  • DevOps Engineer, Machine Learning Engineer
  • Financial Analyst, Investment Banker
  • Operations Manager, Supply Chain Manager
By grouping postings into these categories, salary distributions can be compared across occupational gender lines — a proxy for a gender pay gap analysis.
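The grouping step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the HRIA implementation: the lookup table `OCCUPATIONAL_CODING`, the helper names `code_title` and `median_salary_by_coding`, and the sample salaries are all hypothetical, and a real mapping would be derived from labor-survey workforce composition data rather than a hand-written dictionary.

```python
from statistics import median

# Hypothetical title-to-coding lookup built from a few of the example titles.
# Keys are lowercased; a real mapping would come from labor-survey data.
OCCUPATIONAL_CODING = {
    "hr coordinator": "female_coded",
    "administrative assistant": "female_coded",
    "recruiter": "female_coded",
    "nurse": "female_coded",
    "software engineer": "male_coded",
    "data engineer": "male_coded",
    "devops engineer": "male_coded",
    "financial analyst": "male_coded",
}


def code_title(title):
    """Return the occupational coding for a title, or 'uncoded' if unknown."""
    return OCCUPATIONAL_CODING.get(title.strip().lower(), "uncoded")


def median_salary_by_coding(postings):
    """Group (title, salary) pairs by occupational coding and take medians.

    Postings without a salary (None) are skipped rather than imputed.
    """
    groups = {}
    for title, salary in postings:
        if salary is None:
            continue
        groups.setdefault(code_title(title), []).append(salary)
    return {coding: median(values) for coding, values in groups.items()}


# Made-up illustrative postings, not values from the dataset.
sample_postings = [
    ("HR Coordinator", 52000),
    ("Recruiter", 60000),
    ("Nurse", 70000),
    ("Software Engineer", 120000),
    ("Data Engineer", 110000),
    ("Financial Analyst", 85000),
    ("Forklift Operator", 40000),  # not in the lookup, so "uncoded"
    ("DevOps Engineer", None),     # no salary listed, so skipped
]

result = median_salary_by_coding(sample_postings)
print(result)
```

The output describes occupations, not people: each median summarizes postings whose titles fall in a historically coded category. Keeping an explicit "uncoded" bucket makes it visible how much of the data the proxy does not cover at all.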

Limitations of Proxy Inference

Gender proxies derived from job titles must never be used as features in hiring, compensation, or candidate-screening models. Doing so risks algorithmic discrimination and may conflict with EU Directive 2023/970 on pay transparency and equivalent anti-discrimination frameworks.
The proxy approach is analytically useful only for descriptive occupational analysis and carries critical limitations:
  1. Proxies perpetuate stereotypes: labeling “HR Coordinator” as female-coded encodes existing occupational segregation as a fact of nature rather than a historical artifact of discrimination. Using this proxy in a model amplifies rather than corrects the underlying inequality.
  2. Proxies do not capture individual gender: a job title tells you the historical gender composition of an occupation, not the gender of the person who holds or applies for a specific role. An individual male HR Coordinator is misclassified; a non-binary Software Engineer is erased entirely.
  3. Occupational gender coding shifts over time: the proportion of women in data science has grown significantly in recent years. Proxies calibrated on historical data become stale and will produce incorrect classifications for recently integrated occupations.
  4. Non-binary and gender-diverse identities are invisible: even a perfect proxy system that correctly classified male and female occupations would entirely fail to represent the growing share of the workforce that identifies outside the binary.
  5. Interaction effects are lost: salary disparities at the intersection of gender and race, gender and disability, or gender and immigration status cannot be detected from title-based gender proxies alone.
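The rule that title-derived gender proxies must stay out of hiring, compensation, and screening models can be backed by a mechanical check in a modeling pipeline. The sketch below is one possible guard, not part of HRIA: the column names in `PROXY_COLUMNS` and the function `assert_no_proxy_features` are hypothetical and would need to match whatever schema the analysis actually uses.

```python
# Hypothetical names for proxy-derived columns; adjust to the real schema.
PROXY_COLUMNS = {"occupational_coding", "gender_proxy", "title_gender_code"}


def assert_no_proxy_features(feature_names):
    """Raise ValueError if any proxy-derived column is in a model's feature list."""
    leaked = sorted(PROXY_COLUMNS.intersection(feature_names))
    if leaked:
        raise ValueError(
            "Gender-proxy columns must not be model features: " + ", ".join(leaked)
        )
    return list(feature_names)


# A feature list without proxy columns passes through unchanged.
safe = assert_no_proxy_features(["seniority", "location", "years_experience"])

# A feature list containing a proxy column is rejected.
try:
    assert_no_proxy_features(["seniority", "gender_proxy"])
    blocked = False
except ValueError:
    blocked = True
```

A name-based check like this only catches explicitly labeled proxy columns. The raw job title can carry the same signal, so the guard complements, rather than replaces, a review of which features a model is allowed to see.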

Ethical Implications for Salary Gap Analysis

A direct gender pay gap analysis, one that compares salary distributions for male-identified and female-identified workers, cannot be performed from this dataset alone. Performing it with title-based proxies would produce a figure that:
  • Measures occupational segregation (which roles pay differently), not individual-level pay discrimination
  • Is usable as exploratory evidence but not as a compliance or auditing conclusion
  • Is liable to misinterpretation if presented to clients as a “gender pay gap” figure
Any robust gender pay gap analysis for DataTalent Solutions’ Spanish clients requires pairing LinkedIn data with gender-disaggregated labor market surveys — such as the INE Encuesta de Estructura Salarial, Eurostat gender pay gap statistics, or the Spanish Registro Retributivo data required under Real Decreto 902/2020.
| Use Case | Recommended Data Source |
| --- | --- |
| Descriptive occupational gender analysis | LinkedIn data + occupational gender coding (proxy only, clearly labeled) |
| Gender pay gap quantification | INE Encuesta de Estructura Salarial, Eurostat |
| Pay equity audit for a specific company | Internal HR data with self-reported gender, paired with role and compensation data |
| Trend analysis (gender in tech) | LinkedIn + Stack Overflow Developer Survey + Eurostat STEM data |
When presenting occupational gender analysis to clients, always label it explicitly as “occupational gender composition analysis” rather than “gender pay gap analysis.” The distinction matters legally and ethically.
