MNAR Salary Bias: Why 76% of Postings Hide Salary Data

Missing data is not all created equal. Statisticians distinguish three categories: MCAR (Missing Completely At Random: missingness is unrelated to any variable, observed or not), MAR (Missing At Random: missingness depends only on observed variables), and MNAR (Missing Not At Random: missingness depends on the value of the missing variable itself). The salary fields in this dataset are MNAR. The probability that a posting omits salary rises as the salary falls: companies with below-market compensation have the strongest incentive to withhold pay information and the most to lose from transparency. This is not a data collection failure; it is a structural feature of how labor markets operate under information asymmetry.

The Numbers

The scale of salary missingness in this dataset is substantial:

75–81% of all postings are missing min_salary and max_salary values
Only 6,108 of 19,725 data-role postings (roles tagged as data-relevant) contain clean, usable salary data
This means salary analyses in HRIA are grounded in 32.2% of the data-role population at best
The remaining 67.8% — the majority — have deliberately or incidentally concealed their compensation

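The overall disclosure rate of roughly 24% (the complement of the 76% headline figure, and within the 19–25% range implied by the missingness figures above) is not uniform across the dataset. Disclosure rates vary significantly by company size, industry, and geography. Treat the “salary-disclosed subset” as a distinct population from the full posting universe.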
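The missingness figures above can be recomputed directly from the postings table. The following is a minimal sketch, assuming the postings are loaded into a pandas DataFrame with the `min_salary` and `max_salary` columns named in this section; the helper name `salary_missingness` and the toy values are illustrative only and are not part of the dataset.

```python
import pandas as pd

def salary_missingness(df, cols=("min_salary", "max_salary")):
    """Share of missing values per salary column, plus share of rows with all present."""
    summary = {f"missing_{c}": float(df[c].isna().mean()) for c in cols}
    # A posting counts as "disclosed" here only if every salary column is filled
    summary["all_present"] = float(df[list(cols)].notna().all(axis=1).mean())
    return summary

# Toy frame with made-up values, only to show the call
toy = pd.DataFrame({
    "min_salary": [30000, None, None, None],
    "max_salary": [40000, None, 50000, None],
})
summary = salary_missingness(toy)
print(summary)
```

Running the same helper on a filtered frame (for example, data-role postings only) gives the subset-specific rate, which is why the overall and data-role percentages can differ.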

The MNAR Hypothesis

If salary missingness were random (MCAR), we would expect disclosed salaries to look like a representative sample of the full compensation distribution. They do not. The MNAR mechanism operates as follows:

Below-market employers know that disclosing salary early in the funnel will reduce application volume from qualified candidates
Withholding pay forces candidates further into the interview process before learning compensation — reducing early drop-off
Above-market employers face no such incentive and frequently disclose salary as a competitive differentiator to attract candidates faster
The result: disclosed salaries are systematically skewed toward higher-paying roles and companies

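This would not occur if missingness were random. The pattern is partly detectable: if salary disclosure rates correlate with company size, industry prestige, or geographic cost-of-living, MCAR can be ruled out and the result is consistent with the MNAR mechanism. Because hidden salaries are never observed, such correlations cannot by themselves separate MNAR from MAR; the MNAR classification also rests on the incentive argument above.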
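The direction of the bias described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of the actual dataset: salaries are drawn from a hypothetical log-normal distribution, and the probability of disclosure is an assumed logistic function that rises with salary. The function name and every parameter value are invented for illustration.

```python
import math
import random
import statistics

def simulate_mnar_disclosure(n=20_000, seed=0):
    """Simulate postings whose chance of disclosing salary rises with the salary itself."""
    rng = random.Random(seed)
    # Hypothetical salary population (log-normal, arbitrary currency units)
    salaries = [rng.lognormvariate(math.log(50_000), 0.4) for _ in range(n)]
    median_all = statistics.median(salaries)
    disclosed = []
    for s in salaries:
        # Standardised log-distance from the market median
        z = (math.log(s) - math.log(median_all)) / 0.4
        # Assumed logistic disclosure probability, increasing in salary
        p_disclose = 1.0 / (1.0 + math.exp(-(z - 1.2)))
        if rng.random() < p_disclose:
            disclosed.append(s)
    return median_all, statistics.median(disclosed), len(disclosed) / n

true_median, disclosed_median, rate = simulate_mnar_disclosure()
print(f"disclosure rate:  {rate:.1%}")
print(f"true median:      {true_median:,.0f}")
print(f"disclosed median: {disclosed_median:,.0f}")
```

Under these assumptions the disclosed median sits above the true median, which is the upward skew the hypothesis predicts. The size of the gap depends entirely on the assumed disclosure curve and should not be read as an estimate for this dataset.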

Quantified Impact

Metric	Value
Data-role postings with salary	6,108 / 19,725 (32.2%)
Missing `min_salary`	~75–81% of all postings
Missing `max_salary`	~75–81% of all postings
Analyses that rely on salary	Represent ≤ 32.2% of data-role population
Risk of overstated salary estimates	High — disclosed salaries skew toward larger firms

Any conclusion drawn from salary figures in this dataset applies to a subpopulation of disclosing companies — which are disproportionately larger, better-funded, and higher-paying. Conclusions do not generalize to the 67.8% of postings with hidden salaries.

Implications for Modeling

Do not use salary as a primary feature in any model trained on the full dataset. The disclosed-salary subset is not a representative sample of the salary distribution.

Specific modeling risks introduced by MNAR salary data:

Salary prediction models trained on disclosing companies will learn patterns from larger, higher-paying firms and will systematically overestimate salaries for smaller companies or cost-sensitive industries
Average salary estimates reported without disclosure-rate adjustment will be overstated if the MNAR hypothesis holds: the true population median is then lower than the observed disclosed median
Geographic salary comparisons are unreliable for regions where pay disclosure is culturally less common (e.g., parts of Europe and Asia), as those regions will be underrepresented even within the disclosed subset
Clustering and segmentation models that include salary as a feature will form clusters that reflect “disclosing vs non-disclosing” behavior as much as actual salary variation

Detection Approach

The MNAR pattern can be checked empirically by testing whether disclosure behavior correlates with observable proxies for salary competitiveness:

By company size — do larger companies (which typically pay more) disclose at higher rates?
By industry — do prestige industries (finance, big tech) disclose more than commodity sectors?
By location — do high cost-of-living markets (San Francisco, New York) disclose at higher rates than lower-cost markets?

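If disclosure correlates positively with any of these proxies for salary level, MCAR is ruled out and the data are consistent with the MNAR mechanism, although the hidden salaries themselves would be needed to separate MNAR from MAR conclusively. The absence of any such correlation would weaken the case for MNAR and point toward MCAR.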

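```python
import pandas as pd

# df: postings DataFrame with 'salary_annual' and 'company_size' columns
# Flag postings that disclose a salary (1) versus those that do not (0)
df['has_salary'] = df['salary_annual'].notna().astype(int)
# The mean of the 0/1 flag per group is that group's disclosure rate
disclosure_by_size = df.groupby('company_size')['has_salary'].mean()
print(disclosure_by_size.sort_values(ascending=False))
# If large companies disclose more, the pattern is consistent with MNAR
```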
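Group-level disclosure rates show direction but not strength. The sketch below adds a chi-square statistic and Cramér's V for the association between disclosure and a grouping column, computed by hand from a contingency table. It assumes the same `salary_annual` and `company_size` columns as the snippet above; the helper name `disclosure_association` and the toy data are hypothetical. A strong association indicates that missingness depends on an observed variable; it does not by itself prove dependence on the hidden salary value.

```python
import pandas as pd

def disclosure_association(df, group_col, salary_col="salary_annual"):
    """Chi-square statistic and Cramér's V between a grouping column and salary disclosure."""
    has_salary = df[salary_col].notna()
    observed = pd.crosstab(df[group_col], has_salary).to_numpy(dtype=float)
    total = observed.sum()
    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    # Expected counts if disclosure were independent of the grouping column
    expected = row_totals[:, None] * col_totals[None, :] / total
    chi2 = float(((observed - expected) ** 2 / expected).sum())
    k = min(observed.shape) - 1
    # Cramér's V rescales chi-square to the 0..1 range
    cramers_v = (chi2 / (total * k)) ** 0.5 if k > 0 else 0.0
    return chi2, cramers_v

# Toy data with made-up values: only large companies disclose
nan = float("nan")
toy = pd.DataFrame({
    "company_size": ["small"] * 4 + ["large"] * 4,
    "salary_annual": [nan, nan, nan, nan, 60000, 65000, 70000, 62000],
})
chi2, v = disclosure_association(toy, "company_size")
print(chi2, v)
```

The same helper can be pointed at an industry or location column to cover the other two checks listed above.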

Mitigation Strategies

The most robust mitigation is transparency: always clearly label which analyses apply to the salary-disclosed subset and what percentage of the full population that represents.
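One way to make scope labeling hard to forget is to have the summary function return the label together with the statistic. The following is a minimal sketch using only the standard library; the function name `scoped_median` and the input values are illustrative assumptions, with `None` standing in for a posting that hides its salary.

```python
import statistics

def scoped_median(salaries):
    """Median of disclosed salaries, returned with an explicit coverage label."""
    disclosed = [s for s in salaries if s is not None]
    if not disclosed:
        raise ValueError("no disclosed salaries to summarise")
    coverage = len(disclosed) / len(salaries)
    return {
        "median": statistics.median(disclosed),
        "n_disclosed": len(disclosed),
        "n_total": len(salaries),
        "label": f"salary-disclosed subset ({coverage:.1%} of postings)",
    }

# Made-up values: 3 of 8 postings disclose a salary
result = scoped_median([30000, None, 50000, None, None, 40000, None, None])
print(result["median"], "|", result["label"])
```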

Strategy	Description	When to Use
Scope labeling	Explicitly label all salary analyses as “salary-disclosed subset (32.2%)”	All salary analyses
Secondary signal	Use salary as a supporting signal, not a primary outcome variable	Exploratory analysis
Disclosure-rate weighting	Weight salary observations inversely by disclosure probability	Statistical modeling
Multiple imputation	Impute missing salaries using observed predictors (company size, location, title)	Predictive models requiring salary
External benchmarking	Supplement with government wage surveys or third-party compensation databases	Spain-specific analysis

Multiple imputation is technically valid for MNAR data only when the imputation model explicitly accounts for the MNAR mechanism — standard imputation methods assume MAR and will reproduce the same upward bias if not corrected.
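The disclosure-rate weighting row in the table above can be sketched as follows. This is an illustrative simplification, not the project's method: the disclosure probability is estimated as the observed disclosure rate within each stratum (here `company_size`), and each disclosed salary is weighted by the inverse of that rate. The helper name `ipw_mean_salary` and the toy data are hypothetical. Like standard imputation, this corrects only for disclosure differences explained by the chosen strata, so any residual MNAR bias within a stratum remains, and strata with no disclosed salaries contribute nothing.

```python
import pandas as pd

def ipw_mean_salary(df, strata_col, salary_col="salary_annual"):
    """Naive and inverse-probability-weighted mean of disclosed salaries."""
    has_salary = df[salary_col].notna()
    # Estimated disclosure probability: per-stratum share of postings with a salary
    p_disclose = has_salary.astype(float).groupby(df[strata_col]).transform("mean")
    disclosed = df.loc[has_salary, salary_col]
    # Postings from low-disclosure strata get larger weights
    weights = 1.0 / p_disclose[has_salary]
    naive = float(disclosed.mean())
    weighted = float((disclosed * weights).sum() / weights.sum())
    return naive, weighted

# Toy data with made-up values: small firms disclose rarely and pay less
nan = float("nan")
toy = pd.DataFrame({
    "company_size": ["small"] * 4 + ["large"] * 2,
    "salary_annual": [30000, nan, nan, nan, 60000, 60000],
})
naive, weighted = ipw_mean_salary(toy, "company_size")
print(naive, weighted)
```

In the toy data the weighted mean falls below the naive mean because the single disclosed small-firm salary stands in for the three hidden ones.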
