Missing data is not all created equal. Statisticians distinguish three categories: MCAR (Missing Completely At Random — missingness is pure chance), MAR (Missing At Random — missingness depends on observed variables), and MNAR (Missing Not At Random — missingness depends on the value of the missing variable itself). The salary fields in this dataset are MNAR. The probability that a posting omits salary is directly related to how low that salary is — companies with below-market compensation have the strongest incentive to withhold pay information and the most to lose from transparency. This is not a data collection failure; it is a structural feature of how labor markets operate under information asymmetry.

The Numbers

The scale of salary missingness in this dataset is substantial:
  • 75–81% of all postings are missing min_salary and max_salary values
  • Only 6,108 of 19,725 data-role postings (roles tagged as data-relevant) contain clean, usable salary data
  • This means salary analyses in HRIA are grounded in 32.2% of the data-role population at best
  • The remaining 67.8% — the majority — have deliberately or incidentally concealed their compensation
Disclosure is not uniform across the dataset. Across all postings, the 75–81% missingness above implies an overall disclosure rate of roughly 19–25%, and that rate varies significantly by company size, industry, and geography. Treat the “salary-disclosed subset” as a distinct population from the full posting universe.
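The missingness figures above can be recomputed with a few lines of pandas. This is a minimal sketch, assuming the postings are loaded into a DataFrame with the `min_salary` and `max_salary` columns named above; the helper name `salary_missingness` and the toy values are illustrative and are not taken from the HRIA data.

```python
import pandas as pd

def salary_missingness(df: pd.DataFrame, cols=("min_salary", "max_salary")) -> pd.Series:
    """Return the share of rows with a missing value in each salary column."""
    return df[list(cols)].isna().mean()

# Toy frame with made-up values (not HRIA data)
postings = pd.DataFrame({
    "min_salary": [50000, None, None, None],
    "max_salary": [70000, None, 90000, None],
})

rates = salary_missingness(postings)
print(rates)      # share of postings missing each field
print(1 - rates)  # share of postings disclosing each field
```

Running the same helper per company size, industry, or location (for example inside a `groupby`) gives the disclosure-rate breakdowns this page refers to.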

The MNAR Hypothesis

If salary missingness were random (MCAR), we would expect disclosed salaries to look like a representative sample of the full compensation distribution. They do not. The MNAR mechanism operates as follows:
  1. Below-market employers know that disclosing salary early in the funnel will reduce application volume from qualified candidates
  2. Withholding pay forces candidates further into the interview process before learning compensation — reducing early drop-off
  3. Above-market employers face no such incentive and frequently disclose salary as a competitive differentiator to attract candidates faster
  4. The result: disclosed salaries are systematically skewed toward higher-paying roles and companies
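This skew would not occur if missingness were random. The pattern is partly detectable: if salary disclosure rates correlate with company size, industry prestige, or geographic cost of living, the data are consistent with the MNAR mechanism. Such correlations cannot prove MNAR by themselves, because dependence on observed variables is also what MAR predicts; they rule out MCAR and support MNAR only to the extent that those variables are proxies for the undisclosed salary.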
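The effect of this mechanism on summary statistics can be illustrated with a small simulation. The sketch below is purely synthetic: the log-normal salary distribution, the 25% average disclosure rate, and the linear link between salary percentile and disclosure probability are all assumptions chosen for illustration, not estimates from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical salary population (illustrative parameters only)
salary = rng.lognormal(mean=11.0, sigma=0.4, size=n)

# MCAR: every posting discloses with the same probability
mcar_disclosed = rng.random(n) < 0.25

# MNAR: disclosure probability rises with the salary itself
rank = salary.argsort().argsort() / (n - 1)          # salary percentile in [0, 1]
mnar_disclosed = rng.random(n) < 0.05 + 0.40 * rank  # averages about 0.25

true_median = np.median(salary)
mcar_median = np.median(salary[mcar_disclosed])
mnar_median = np.median(salary[mnar_disclosed])

print(f"true median: {true_median:,.0f}")
print(f"MCAR sample: {mcar_median:,.0f}")
print(f"MNAR sample: {mnar_median:,.0f}")
```

Under MCAR the disclosed median tracks the population median closely, while under the assumed MNAR rule it sits above it, which is the direction of bias described in the list above.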

Quantified Impact

| Metric | Value |
| --- | --- |
| Data-role postings with salary | 6,108 / 19,725 (32.2%) |
| Missing min_salary | ~75–81% of all postings |
| Missing max_salary | ~75–81% of all postings |
| Analyses that rely on salary | Represent ≤ 32.2% of data-role population |
| Risk of overstated salary estimates | High: disclosed salaries skew toward larger firms |
Any conclusion drawn from salary figures in this dataset applies to a subpopulation of disclosing companies — which are disproportionately larger, better-funded, and higher-paying. Conclusions do not generalize to the 67.8% of postings with hidden salaries.

Implications for Modeling

Do not use salary as a primary feature in any model trained on the full dataset. The disclosed-salary subset is not a representative sample of the salary distribution.
Specific modeling risks introduced by MNAR salary data:
  • Salary prediction models trained on disclosing companies will learn patterns from larger, higher-paying firms and will systematically overestimate salaries for smaller companies or cost-sensitive industries
  • Average salary estimates reported without disclosure-rate adjustment will be overstated if the MNAR hypothesis holds: the true population median is then lower than the observed disclosed median
  • Geographic salary comparisons are unreliable for regions where pay disclosure is culturally less common (e.g., parts of Europe and Asia), as those regions will be underrepresented even within the disclosed subset
  • Clustering and segmentation models that include salary as a feature will form clusters that reflect “disclosing vs non-disclosing” behavior as much as actual salary variation

Detection Approach

The MNAR hypothesis can be probed empirically by testing whether disclosure behavior correlates with observable proxies for salary competitiveness:
  1. By company size — do larger companies (which typically pay more) disclose at higher rates?
  2. By industry — do prestige industries (finance, big tech) disclose more than commodity sectors?
  3. By location — do high cost-of-living markets (San Francisco, New York) disclose at higher rates than lower-cost markets?
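If disclosure correlates positively with any of these proxies for salary level, the pattern is consistent with the MNAR mechanism, although this test alone cannot separate MNAR from MAR conditional on those variables. The absence of any such correlation would weaken the hypothesis and point toward MCAR.
# Check whether salary disclosure correlates with company size.
# Assumes df is a pandas DataFrame of postings with 'salary_annual' and 'company_size' columns.
df['has_salary'] = df['salary_annual'].notna().astype(int)  # 1 = salary disclosed
disclosure_by_size = df.groupby('company_size')['has_salary'].mean()  # disclosure rate per size band
print(disclosure_by_size.sort_values(ascending=False))
# If larger companies disclose more, the pattern is consistent with the MNAR hypothesis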
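A difference in group means says little about how strong the association is. One option is to compute a chi-square statistic and Cramér's V on the size-by-disclosure contingency table. The sketch below is an illustration rather than part of the HRIA pipeline: the helper name `disclosure_association` and the toy data are hypothetical, and the statistic is computed by hand with NumPy so that no extra dependency is assumed.

```python
import numpy as np
import pandas as pd

def disclosure_association(df, group_col, flag_col="has_salary"):
    """Chi-square statistic and Cramér's V between a grouping column and a 0/1 disclosure flag."""
    observed = pd.crosstab(df[group_col], df[flag_col]).to_numpy(dtype=float)
    total = observed.sum()
    # Expected counts if disclosure were independent of the grouping column
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / total
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    if k == 0:  # only one group or one flag value: association is undefined
        return chi2, float("nan")
    cramers_v = np.sqrt(chi2 / (total * k))
    return chi2, cramers_v

# Toy data (made up): large firms disclose 8 of 10 postings, small firms 2 of 10
toy = pd.DataFrame({
    "company_size": ["large"] * 10 + ["small"] * 10,
    "has_salary": [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8,
})
chi2, v = disclosure_association(toy, "company_size")
print(f"chi2 = {chi2:.2f}, Cramér's V = {v:.2f}")
```

With tens of thousands of postings almost any difference is statistically significant, so the effect size (V, between 0 and 1) is usually the more informative number. The same helper can be pointed at an industry or location column.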

Mitigation Strategies

The most robust mitigation is transparency: always clearly label which analyses apply to the salary-disclosed subset and what percentage of the full population that represents.
| Strategy | Description | When to Use |
| --- | --- | --- |
| Scope labeling | Explicitly label all salary analyses as “salary-disclosed subset (32.2%)” | All salary analyses |
| Secondary signal | Use salary as a supporting signal, not a primary outcome variable | Exploratory analysis |
| Disclosure-rate weighting | Weight salary observations inversely by disclosure probability | Statistical modeling |
| Multiple imputation | Impute missing salaries using observed predictors (company size, location, title) | Predictive models requiring salary |
| External benchmarking | Supplement with government wage surveys or third-party compensation databases | Spain-specific analysis |
Multiple imputation is technically valid for MNAR data only when the imputation model explicitly accounts for the MNAR mechanism — standard imputation methods assume MAR and will reproduce the same upward bias if not corrected.
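Two of the strategies above can be sketched in a few lines of pandas. The first function weights each disclosed salary by the inverse of its stratum's disclosure rate; the second is a simplified delta-adjustment sensitivity check that fills missing salaries with a discounted stratum mean. Both are illustrations under stated assumptions: the function names, the choice of `company_size` as the only stratum, and the toy values are hypothetical, and the second function uses a single mean fill rather than full multiple imputation.

```python
import pandas as pd

def ipw_mean(df, value_col, strata_col):
    """Mean of disclosed values, each weighted by 1 / disclosure rate of its stratum."""
    disclosed = df[value_col].notna()
    rate = disclosed.astype(float).groupby(df[strata_col]).transform("mean")
    weights = 1.0 / rate[disclosed]
    return (df.loc[disclosed, value_col] * weights).sum() / weights.sum()

def delta_adjusted_mean(df, value_col, strata_col, delta):
    """Fill missing values with the stratum mean scaled by (1 - delta), then average all rows.

    delta is an assumed fractional shortfall of hidden salaries; it cannot be
    estimated from the disclosed data and should be varied over a range.
    Strata with no disclosed value stay missing and are skipped by the mean.
    """
    stratum_mean = df.groupby(strata_col)[value_col].transform("mean")
    filled = df[value_col].fillna(stratum_mean * (1.0 - delta))
    return filled.mean()

# Toy data (made up): large firms always disclose, small firms disclose 1 in 4
toy = pd.DataFrame({
    "company_size": ["large"] * 4 + ["small"] * 4,
    "salary_annual": [100.0, 100.0, 100.0, 100.0, 50.0, None, None, None],
})

naive = toy["salary_annual"].mean()
weighted = ipw_mean(toy, "salary_annual", "company_size")
shifted = delta_adjusted_mean(toy, "salary_annual", "company_size", delta=0.10)
print(naive, weighted, shifted)
```

The weighting step corrects only for missingness explained by the stratum, which is a MAR assumption; the delta step is where an explicit MNAR assumption enters. Reporting results for several values of `delta` shows how sensitive a conclusion is to the hidden salaries being lower than the disclosed ones.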
