The third notebook in the series (analisis_03_informe_sesgos.ipynb) is a dedicated bias audit: the analytical counterpart to the descriptive and inferential work in Notebooks 1 and 2. Every dataset carries the fingerprint of how it was collected, and TinderJob’s datasets are no exception. This notebook documents three distinct categories of bias affecting the Tecnoempleo scraped data and the DS Salaries reference dataset, quantifies their impact on analytical reliability, and provides concrete recommendations to prevent flawed data from producing flawed, or actively harmful, business decisions. Transparency about limitations is not a weakness in data analysis; it is the foundation of responsible product development.
Bias Summary Table
The three biases identified span both datasets and affect different analytical dimensions. Their combined effect means that any salary figures or hiring recommendations derived from these datasets must be communicated with explicit uncertainty bounds.

| Bias Type | Dataset | Category | Impact |
|---|---|---|---|
| MNAR — Missing Salaries | Tecnoempleo | Missing Not At Random | 80.7% null salaries — salary analysis infeasible |
| Search Term Bias | Tecnoempleo | Selection Bias | 24 fixed terms exclude all unlisted role profiles |
| Geographic Underrepresentation | DS Salaries | Geographic Bias | Spain = 2.3% of records (14/607) — unreliable Spanish stats |
1. MNAR Salary Analysis (Missing Not At Random)
Missing data is not a monolithic problem. Its statistical treatment depends critically on why the data is missing. The three standard classifications are MCAR (Missing Completely At Random), MAR (Missing At Random, conditional on observed variables), and MNAR (Missing Not At Random, where the missingness is correlated with the unobserved value itself). The Tecnoempleo salary field exhibits a clear MNAR pattern: companies are systematically less likely to publish a salary when the salary is either very high (negotiable senior roles) or very low (roles where publishing would deter candidates). The missingness is not random. It is a function of the salary itself, which means it cannot be addressed by standard imputation techniques without introducing systematic bias.

Overall Null Rate
80.7% of Tecnoempleo salario fields are null: fewer than 1 in 5 postings include any salary information.

Analytical Consequence
Direct Spanish job market salary analysis using Tecnoempleo is infeasible. The observed salaries represent a non-representative, self-selected subset of postings.
MNAR bias is the most dangerous of the three missingness types. MAR and MCAR can be addressed with imputation — techniques such as multiple imputation by chained equations (MICE) or KNN imputation. MNAR cannot: any imputation model trained on observed salaries will reproduce the selection bias embedded in those observations. Addressing MNAR requires domain knowledge and auxiliary data sources such as salary surveys, LinkedIn compensation data, or regulatory filings.
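As a minimal sketch of how the null rate might be measured, the snippet below computes the overall missingness of the `salario` field and breaks it down by an observed grouping column. The `nivel` column, the helper names, and the toy rows are all assumptions made for illustration; they are not taken from the notebook or from Tecnoempleo. Note the limit of this check: a difference in missingness across observed groups is evidence against MCAR, but it cannot by itself separate MAR from MNAR, because MNAR depends on the unobserved salary values.

```python
from collections import defaultdict


def null_rate(rows, field="salario"):
    """Share of rows whose `field` is missing (None or empty string)."""
    missing = sum(1 for r in rows if r.get(field) in (None, ""))
    return missing / len(rows)


def null_rate_by_group(rows, group_field, field="salario"):
    """Missingness rate of `field` within each value of `group_field`."""
    totals, missing = defaultdict(int), defaultdict(int)
    for r in rows:
        key = r.get(group_field)
        totals[key] += 1
        if r.get(field) in (None, ""):
            missing[key] += 1
    return {k: missing[k] / totals[k] for k in totals}


# Toy rows invented for illustration (NOT Tecnoempleo data).
# "nivel" is a hypothetical seniority column.
postings = [
    {"nivel": "junior", "salario": 24000},
    {"nivel": "junior", "salario": None},
    {"nivel": "junior", "salario": 26000},
    {"nivel": "senior", "salario": None},
    {"nivel": "senior", "salario": None},
    {"nivel": "senior", "salario": None},
]

overall = null_rate(postings)
by_level = null_rate_by_group(postings, "nivel")

print(f"overall null rate: {overall:.1%}")
for level, rate in by_level.items():
    print(f"{level}: {rate:.1%}")
```

The same two functions could be pointed at the real scraped rows to reproduce the overall null rate and to see whether missingness concentrates in particular role types.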
2. Scraper Selection Bias
The Tecnoempleo scraper retrieves job listings by submitting a fixed list of search terms to the platform. This design choice, pragmatic from an engineering standpoint, introduces a structural selection bias: only roles that match one of the 24 predefined terms are captured. Any tech role not represented in this list is invisible to the entire analysis pipeline.

The 24 search terms used by the scraper are:

To reduce this bias:
- Expand the search term list through periodic review against job board trending terms
- Implement keyword-based discovery: scrape category pages rather than individual search queries
- Cross-validate coverage against a secondary source (e.g., LinkedIn job counts by role) to estimate the proportion of the market being captured
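
The cross-validation idea in the last bullet can be sketched roughly as follows. Everything here is hypothetical: the function name `estimate_coverage`, the two placeholder search terms, and the role counts are invented for illustration and do not reflect the scraper's actual term list or any real secondary source. The sketch assumes a role counts as "captured" when its title contains at least one search term as a case-insensitive substring, which is only an approximation of how a job board's search behaves.

```python
def estimate_coverage(search_terms, reference_counts):
    """Share of reference postings whose role title contains at least
    one search term (case-insensitive substring match)."""
    terms = [t.lower() for t in search_terms]
    total = sum(reference_counts.values())
    if total == 0:
        raise ValueError("reference_counts contains no postings")
    captured = sum(
        count
        for role, count in reference_counts.items()
        if any(term in role.lower() for term in terms)
    )
    return captured / total


# Placeholder inputs for illustration only: these are NOT the scraper's
# real search terms, and the counts are not real market data.
search_terms = ["data scientist", "devops"]
reference_counts = {
    "Senior Data Scientist": 120,
    "DevOps Engineer": 80,
    "Prompt Engineer": 50,
}

coverage = estimate_coverage(search_terms, reference_counts)
print(f"estimated coverage: {coverage:.0%}")
```

Roles that match no term (here, the third entry) are the ones the fixed-term design would miss, so tracking this ratio over time gives a rough signal of when the term list needs review.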
3. Geographic Underrepresentation (DS Salaries)
The DS Salaries dataset is widely used as a reference benchmark for data science and ML compensation. However, its geographic composition renders it unreliable for Spanish market analysis. The United States dominates the dataset, and Spain’s representation is statistically negligible.

| Country | Records | Share |
|---|---|---|
| United States (US) | 355 | 58.5% |
| United Kingdom (GB) | 44 | 7.2% |
| India (IN) | 30 | 4.9% |
| Canada (CA) | 21 | 3.5% |
| Spain (ES) | 14 | 2.3% |
The DS Salaries dataset is best used for global trend analysis and relative comparisons (e.g., how does experience level affect salary across markets?) rather than as a source of absolute salary benchmarks for the Spanish market. Country-level filtering to ES produces a sample too small for inferential analysis.
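To make the sample-size problem concrete, the short sketch below recomputes the shares in the table from the record counts and the 607-row total, then compares how noisy a Spanish average would be relative to a US one. The comparison relies on the standard result that the standard error of a sample mean scales as 1/sqrt(n), and it assumes a similar salary spread within each country, which is a simplification rather than something the notebook states.

```python
import math

TOTAL_RECORDS = 607  # total rows, from the 14/607 figure in the summary table
records = {"US": 355, "GB": 44, "IN": 30, "CA": 21, "ES": 14}

# Share of the full dataset held by each country, in percent.
shares = {country: 100 * n / TOTAL_RECORDS for country, n in records.items()}
for country, share in shares.items():
    print(f"{country}: {share:.1f}%")

# Standard error of a mean scales as 1/sqrt(n). Assuming equal salary
# spread in both countries (a simplification), this ratio says how much
# noisier an ES average is than a US average.
se_ratio = math.sqrt(records["US"] / records["ES"])
print(f"ES standard error is about {se_ratio:.1f}x the US standard error")
```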
Recommendations
The following recommendations translate the bias findings into concrete actions for TinderJob’s product, analytics, and communication teams:

- Always report median salary (€93,444), never the mean. The right-skewed distribution makes the mean (€103,314) a misleading reference point that inflates candidate expectations.
- Do not use Tecnoempleo as a salary source. With 80.7% of salary fields null and a confirmed MNAR pattern, any salary statistics derived from this dataset are unreliable and potentially misleading.
- Expand scraper search terms to reduce selection bias. Review the 24 fixed terms quarterly against job board trends. Consider keyword-based discovery as a complement to reduce structural gaps.
- Complement with Spanish-specific salary sources. For reliable Spanish market salary data, integrate sources such as InfoJobs Salary Report, LinkedIn Salary Insights Spain, or Adecco/Randstad annual compensation surveys — all of which have meaningful Spanish sample sizes.
- Do not train selection or matching models on these datasets without debiasing techniques. Models trained on biased data will encode and amplify those biases in their predictions. At minimum, apply reweighting or resampling techniques before any model training.
- Communicate uncertainty to stakeholders and management. All figures presented should be accompanied by confidence intervals or explicit caveats. Salary estimates derived from this analysis are directional indicators, not guarantees — and they should be framed as such in any product copy, dashboard, or report.
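
One way to act on the first and last recommendations together is to report the median with a bootstrap confidence interval. The snippet below is a sketch under stated assumptions: the salary values are invented to be right-skewed and are not drawn from either dataset, `bootstrap_median_ci` is a hypothetical helper, and the percentile bootstrap is just one of several reasonable interval methods.

```python
import random
import statistics


def bootstrap_median_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n))
        for _ in range(n_resamples)
    )
    lower = medians[int((alpha / 2) * n_resamples)]
    upper = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented right-skewed salaries for illustration (NOT real data).
salaries = [
    30_000, 32_000, 35_000, 38_000, 40_000, 42_000,
    45_000, 50_000, 60_000, 80_000, 120_000, 200_000,
]

mean = statistics.mean(salaries)
median = statistics.median(salaries)
low, high = bootstrap_median_ci(salaries)

print(f"mean:   {mean:,.0f}")
print(f"median: {median:,.0f} (95% CI {low:,.0f} to {high:,.0f})")
```

In this toy sample the two largest salaries pull the mean well above the median, which is the same effect the first recommendation warns about. The interval gives stakeholders a range to read alongside the point estimate.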