Missing data is not all created equal. Statisticians distinguish three categories: MCAR (Missing Completely At Random: missingness is pure chance), MAR (Missing At Random: missingness depends only on observed variables), and MNAR (Missing Not At Random: missingness depends on the value of the missing variable itself). The salary fields in this dataset are MNAR. The probability that a posting omits salary is directly related to how low that salary is: companies with below-market compensation have the strongest incentive to withhold pay information and the most to lose from transparency. This is not a data collection failure; it is a structural feature of how labor markets operate under information asymmetry.
The Numbers
The scale of salary missingness in this dataset is substantial:
- 75–81% of all postings are missing `min_salary` and `max_salary` values
- Only 6,108 of 19,725 data-role postings (roles tagged as data-relevant) contain clean, usable salary data
- This means salary analyses in HRIA are grounded in 32.2% of the data-role population at best
- The remaining 67.8% — the majority — have deliberately or incidentally concealed their compensation
The overall 24% disclosure rate is not uniform: it varies significantly by company size, industry, and geography. Treat the “salary-disclosed subset” as a distinct population from the full posting universe.
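Missingness rates like those above can be recomputed directly from the posting records. The following is a minimal sketch, assuming each posting is a dict whose `min_salary` and `max_salary` keys are `None` when the value is absent. The `postings` list and the `missing_rate` helper are illustrative and not part of HRIA.

```python
from typing import Iterable, Mapping, Optional


def missing_rate(rows: Iterable[Mapping[str, Optional[float]]], field: str) -> float:
    """Return the share of rows where `field` is absent or None."""
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to summarise")
    missing = sum(1 for row in rows if row.get(field) is None)
    return missing / len(rows)


# Hypothetical toy postings; real rows would carry many more columns.
postings = [
    {"min_salary": 50000.0, "max_salary": 70000.0},
    {"min_salary": None, "max_salary": None},
    {"min_salary": None, "max_salary": 90000.0},
    {"min_salary": None, "max_salary": None},
]

for field in ("min_salary", "max_salary"):
    print(f"{field}: {missing_rate(postings, field):.0%} missing")
```

Reporting the rate per field, rather than one combined figure, keeps it visible when a posting discloses only one end of the salary range.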
The MNAR Hypothesis
If salary missingness were random (MCAR), we would expect disclosed salaries to look like a representative sample of the full compensation distribution. They do not. The MNAR mechanism operates as follows:
- Below-market employers know that disclosing salary early in the funnel will reduce application volume from qualified candidates
- Withholding pay forces candidates further into the interview process before learning compensation — reducing early drop-off
- Above-market employers face no such incentive and frequently disclose salary as a competitive differentiator to attract candidates faster
- The result: disclosed salaries are systematically skewed toward higher-paying roles and companies
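The selection effect described above can be illustrated with a small simulation. The sketch below assumes a log-normal salary distribution and a logistic disclosure probability that rises with salary. Both choices, and every parameter value, are illustrative assumptions rather than estimates from this dataset.

```python
import math
import random
import statistics


def simulate_disclosure(n: int = 20000, seed: int = 0):
    """Return (true_median, disclosed_median, disclosure_rate) for a toy market."""
    rng = random.Random(seed)
    # Assumption: log-normal salaries centred near 50,000.
    salaries = [rng.lognormvariate(math.log(50000), 0.4) for _ in range(n)]
    market_median = statistics.median(salaries)

    disclosed = []
    for salary in salaries:
        # Assumption: logistic link, so the further a salary sits above
        # the market median, the more likely the posting discloses it.
        p_disclose = 1 / (1 + math.exp(-3 * (salary / market_median - 1.3)))
        if rng.random() < p_disclose:
            disclosed.append(salary)

    return market_median, statistics.median(disclosed), len(disclosed) / n


true_median, disclosed_median, rate = simulate_disclosure()
print(f"disclosure rate:  {rate:.1%}")
print(f"true median:      {true_median:,.0f}")
print(f"disclosed median: {disclosed_median:,.0f}")
```

Because the disclosure probability depends on the salary itself, the disclosed median lands above the true median in this toy setup. Salaries are never removed at random here, only as a function of their own value, which is what distinguishes MNAR from MCAR.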
Quantified Impact
| Metric | Value |
|---|---|
| Data-role postings with salary | 6,108 / 19,725 (32.2%) |
| Missing `min_salary` | ~75–81% of all postings |
| Missing `max_salary` | ~75–81% of all postings |
| Analyses that rely on salary | Represent ≤ 32.2% of data-role population |
| Risk of overstated salary estimates | High — disclosed salaries skew toward larger firms |
Implications for Modeling
Specific modeling risks introduced by MNAR salary data:
- Salary prediction models trained on disclosing companies will learn patterns from larger, higher-paying firms and will systematically overestimate salaries for smaller companies or cost-sensitive industries
- Average salary estimates reported without disclosure-rate adjustment will be overstated — the true population median is lower than the observed disclosed median
- Geographic salary comparisons are unreliable for regions where pay disclosure is culturally less common (e.g., parts of Europe and Asia), as those regions will be underrepresented even within the disclosed subset
- Clustering and segmentation models that include salary as a feature will form clusters that reflect “disclosing vs non-disclosing” behavior as much as actual salary variation
Detection Approach
Missingness that depends on the undisclosed salary itself cannot be proven from observed data alone, but the MNAR pattern can be supported empirically by testing whether disclosure behavior correlates with observable proxies for salary competitiveness:
- By company size: do larger companies (which typically pay more) disclose at higher rates?
- By industry: do prestige industries (finance, big tech) disclose more than commodity sectors?
- By location: do high cost-of-living markets (San Francisco, New York) disclose at higher rates than lower-cost markets?
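The first of these checks might look like the sketch below. It assumes postings are dicts with a hypothetical `company_size` field and a `min_salary` that is `None` when undisclosed. Both the field names and the toy rows are illustrative. Differing rates across groups indicate that missingness is not completely random. They are consistent with the MNAR hypothesis but do not establish it on their own.

```python
from collections import defaultdict


def disclosure_rate_by(rows, key):
    """Map each value of `key` to the share of rows with a non-missing min_salary."""
    totals = defaultdict(int)
    disclosed = defaultdict(int)
    for row in rows:
        group = row[key]
        totals[group] += 1
        if row.get("min_salary") is not None:
            disclosed[group] += 1
    return {group: disclosed[group] / totals[group] for group in totals}


# Hypothetical toy rows; `company_size` is an assumed column name.
rows = [
    {"company_size": "large", "min_salary": 60000},
    {"company_size": "large", "min_salary": 75000},
    {"company_size": "large", "min_salary": None},
    {"company_size": "small", "min_salary": None},
    {"company_size": "small", "min_salary": None},
    {"company_size": "small", "min_salary": 40000},
    {"company_size": "small", "min_salary": None},
]

rates = disclosure_rate_by(rows, "company_size")
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} disclosed")
```

The same helper can be called with an industry or location key to run the other two checks.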
Mitigation Strategies
| Strategy | Description | When to Use |
|---|---|---|
| Scope labeling | Explicitly label all salary analyses as “salary-disclosed subset (32.2%)” | All salary analyses |
| Secondary signal | Use salary as a supporting signal, not a primary outcome variable | Exploratory analysis |
| Disclosure-rate weighting | Weight salary observations inversely by disclosure probability | Statistical modeling |
| Multiple imputation | Impute missing salaries using observed predictors (company size, location, title) | Predictive models requiring salary |
| External benchmarking | Supplement with government wage surveys or third-party compensation databases | Spain-specific analysis |
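As a rough illustration of the disclosure-rate weighting row, the sketch below weights each disclosed salary by the inverse of its group's disclosure rate. The `company_size` and `salary` field names, the rates, and the rows are all hypothetical. This kind of weighting corrects only for differences in disclosure between observed groups. If low-paying employers also withhold salary more often within a group, some bias remains.

```python
def weighted_mean_salary(disclosed_rows, disclosure_rate):
    """Mean salary with each row weighted by 1 / its group's disclosure rate."""
    numerator = 0.0
    denominator = 0.0
    for row in disclosed_rows:
        weight = 1.0 / disclosure_rate[row["company_size"]]
        numerator += weight * row["salary"]
        denominator += weight
    return numerator / denominator


# Hypothetical inputs: large firms disclose half the time, small firms rarely.
disclosure_rate = {"large": 0.5, "small": 0.1}
disclosed_rows = [
    {"company_size": "large", "salary": 80000.0},
    {"company_size": "large", "salary": 80000.0},
    {"company_size": "small", "salary": 40000.0},
]

unweighted = sum(r["salary"] for r in disclosed_rows) / len(disclosed_rows)
weighted = weighted_mean_salary(disclosed_rows, disclosure_rate)
print(f"unweighted mean: {unweighted:,.0f}")
print(f"weighted mean:   {weighted:,.0f}")
```

In this toy example the weighted mean falls below the unweighted one, because the rarely disclosing small firms each stand in for more unseen postings.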