The third notebook in the series (analisis_03_informe_sesgos.ipynb) is a dedicated bias audit — the analytical counterpart to the descriptive and inferential work in Notebooks 1 and 2. Every dataset carries the fingerprint of how it was collected, and TinderJob’s datasets are no exception. This notebook documents three distinct categories of bias affecting the Tecnoempleo scraped data and the DS Salaries reference dataset, quantifies their impact on analytical reliability, and provides concrete recommendations to prevent flawed data from producing flawed — or actively harmful — business decisions. Transparency about limitations is not a weakness in data analysis; it is the foundation of responsible product development.

Bias Summary Table

The three biases identified span both datasets and affect different analytical dimensions. Their combined effect means that any salary figures or hiring recommendations derived from these datasets must be communicated with explicit uncertainty bounds.

| Bias Type | Dataset | Category | Impact |
|---|---|---|---|
| MNAR: Missing Salaries | Tecnoempleo | Missing Not At Random | 80.7% null salaries; salary analysis infeasible |
| Search Term Bias | Tecnoempleo | Selection Bias | 24 fixed terms exclude all unlisted role profiles |
| Geographic Underrepresentation | DS Salaries | Geographic Bias | Spain = 2.3% of records (14/607); unreliable Spanish stats |

1. MNAR Salary Analysis (Missing Not At Random)

Missing data is not a monolithic problem. Its statistical treatment depends critically on why the data is missing. The three standard classifications are MCAR (Missing Completely At Random), MAR (Missing At Random, conditional on observed variables), and MNAR (Missing Not At Random — where the missingness is correlated with the unobserved value itself). The Tecnoempleo salary field exhibits a clear MNAR pattern: companies are systematically less likely to publish a salary when the salary is either very high (negotiable senior roles) or very low (roles where publishing would deter candidates). The missingness is not random — it is a function of the salary itself, which means it cannot be addressed by standard imputation techniques without introducing systematic bias.
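
```python
# df: the Tecnoempleo DataFrame, assumed to be loaded earlier in the notebook
# Overall share of postings with no salary, as a percentage
null_rate = df['salario'].isnull().mean() * 100
print(f'Salary null rate: {null_rate:.1f}%')

# Check if nulls correlate with profile (the search term that returned the posting)
null_by_profile = df.groupby('busqueda')['salario'].apply(
    lambda x: x.isnull().mean() * 100
).sort_values(ascending=False)
print(null_by_profile)
```
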
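Running this analysis shows that the null rate varies substantially across search term categories: senior and specialised profiles exhibit higher null rates than junior or commodity roles. Because seniority is a proxy for salary level, this variation rules out MCAR and is consistent with the MNAR pattern described above, although observed variables alone cannot prove that missingness depends on the unpublished salary itself.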
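Whether the differences in null rate between profiles exceed what chance alone would produce can be checked with a chi-square test of independence. The sketch below computes the statistic by hand using only the standard library. The three profiles and their counts are hypothetical placeholders, not figures from the Tecnoempleo data.

```python
# Hypothetical counts per search term: (postings without salary, postings with salary).
# These numbers are illustrative only, not taken from the Tecnoempleo data.
counts = {
    "data scientist": (90, 10),
    "frontend developer": (80, 20),
    "qa engineer": (70, 30),
}

total_null = sum(null for null, _ in counts.values())
total_with = sum(with_salary for _, with_salary in counts.values())
grand_total = total_null + total_with

chi2 = 0.0
for null, with_salary in counts.values():
    row_total = null + with_salary
    # Expected counts if missingness were independent of the search term
    expected_null = row_total * total_null / grand_total
    expected_with = row_total * total_with / grand_total
    chi2 += (null - expected_null) ** 2 / expected_null
    chi2 += (with_salary - expected_with) ** 2 / expected_with

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(counts) - 1) * (2 - 1)
CRITICAL_5PCT_DOF2 = 5.991  # chi-square critical value for 2 degrees of freedom, alpha = 0.05

print(f"chi2 = {chi2:.2f}, dof = {dof}")
print("null rate depends on search term:", chi2 > CRITICAL_5PCT_DOF2)
```

With real data, `scipy.stats.chi2_contingency` performs the same test on the full contingency table and also returns a p-value. A significant result indicates that missingness is not completely random, but it cannot by itself separate MAR from MNAR.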

Overall Null Rate

80.7% of Tecnoempleo salario fields are null: fewer than 1 in 5 postings include any salary information.

Analytical Consequence

Direct Spanish job market salary analysis using Tecnoempleo is infeasible. The observed salaries represent a non-representative, self-selected subset of postings.
MNAR bias is the most dangerous of the three missingness types. MAR and MCAR can be addressed with imputation — techniques such as multiple imputation by chained equations (MICE) or KNN imputation. MNAR cannot: any imputation model trained on observed salaries will reproduce the selection bias embedded in those observations. Addressing MNAR requires domain knowledge and auxiliary data sources such as salary surveys, LinkedIn compensation data, or regulatory filings.
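A small simulation can illustrate why imputation fails under MNAR. It is a sketch under stated assumptions: the log-normal salary distribution, the 20th and 80th percentile cut-offs, the hiding probabilities (chosen so that roughly 80% of values go missing), and the helper `is_hidden` are all hypothetical. None of them is estimated from the Tecnoempleo data.

```python
import random
import statistics

random.seed(42)

# Hypothetical "true" salaries: right-skewed, log-normal (illustrative parameters)
true_salaries = [random.lognormvariate(10.5, 0.4) for _ in range(20_000)]

# Cut-offs marking the lowest and highest 20% of salaries
ranked = sorted(true_salaries)
low_cut = ranked[int(0.20 * len(ranked))]
high_cut = ranked[int(0.80 * len(ranked))]

def is_hidden(salary):
    """MNAR mechanism: the chance of hiding a salary depends on the salary itself."""
    p_hidden = 0.95 if (salary < low_cut or salary > high_cut) else 0.70
    return random.random() < p_hidden

observed = [s for s in true_salaries if not is_hidden(s)]
null_rate = 1 - len(observed) / len(true_salaries)

# Naive fix: fill every missing value with the mean of the observed values
observed_mean = statistics.mean(observed)
n_missing = len(true_salaries) - len(observed)
imputed = observed + [observed_mean] * n_missing

true_sd = statistics.stdev(true_salaries)
observed_sd = statistics.stdev(observed)
imputed_sd = statistics.stdev(imputed)

print(f"null rate:            {null_rate:.1%}")
print(f"true std dev:         {true_sd:,.0f}")
print(f"observed std dev:     {observed_sd:,.0f}")
print(f"mean-imputed std dev: {imputed_sd:,.0f}")
```

In this setup the observed salaries understate the true spread because the tails are mostly hidden, and mean imputation shrinks the spread further. A more sophisticated imputation model would still learn only from the surviving middle of the distribution.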

2. Scraper Selection Bias

The Tecnoempleo scraper retrieves job listings by submitting a fixed list of search terms to the platform. This design choice, pragmatic from an engineering standpoint, introduces a structural selection bias: only roles that match one of the 24 predefined terms are captured. Any tech role not represented in this list is invisible to the entire analysis pipeline.

The 24 search terms used by the scraper are:
data scientist
data analyst
data engineer
machine learning
python developer
backend developer
frontend developer
full stack developer
devops
cloud engineer
software engineer
java developer
javascript developer
react developer
angular developer
node developer
mobile developer
ios developer
android developer
cybersecurity
qa engineer
product manager
scrum master
ux designer
Notable absences include roles such as Blockchain Developer, AR/VR Engineer, Embedded Systems Engineer, Quantum Computing Researcher, and AI Ethics Specialist, all active hiring areas that are structurally excluded from the dataset. As the tech job market diversifies and new role categories emerge, the gap between the scraper’s coverage and the true market will widen unless the search term list is actively maintained.

Mitigation strategies:
  • Expand the search term list through periodic review against job board trending terms
  • Implement keyword-based discovery: scrape category pages rather than individual search queries
  • Cross-validate coverage against a secondary source (e.g., LinkedIn job counts by role) to estimate the proportion of the market being captured
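The cross-validation idea in the last strategy can be sketched as a simple coverage check. The example assumes that a title counts as covered when it contains one of the 24 search terms as a substring, which may not match how the Tecnoempleo search actually behaves. The `secondary_titles` list and the `is_covered` helper are hypothetical.

```python
# The 24 fixed search terms used by the scraper
SEARCH_TERMS = [
    "data scientist", "data analyst", "data engineer", "machine learning",
    "python developer", "backend developer", "frontend developer",
    "full stack developer", "devops", "cloud engineer", "software engineer",
    "java developer", "javascript developer", "react developer",
    "angular developer", "node developer", "mobile developer",
    "ios developer", "android developer", "cybersecurity", "qa engineer",
    "product manager", "scrum master", "ux designer",
]

def is_covered(title, terms=SEARCH_TERMS):
    """Assumption: a title is covered if it contains any search term (case-insensitive)."""
    lowered = title.lower()
    return any(term in lowered for term in terms)

# Hypothetical job titles from a secondary source (illustrative only)
secondary_titles = [
    "Senior Data Scientist",
    "Blockchain Developer",
    "Embedded Systems Engineer",
    "DevOps Engineer (AWS)",
    "AR/VR Engineer",
    "React Developer",
]

covered = [t for t in secondary_titles if is_covered(t)]
missed = [t for t in secondary_titles if not is_covered(t)]
coverage = len(covered) / len(secondary_titles)

print(f"coverage: {coverage:.0%}")
print("missed titles:", missed)
```

The missed titles are natural candidates for new search terms during a periodic review.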

3. Geographic Underrepresentation (DS Salaries)

The DS Salaries dataset is widely used as a reference benchmark for data science and ML compensation. However, its geographic composition renders it unreliable for Spanish market analysis. The United States dominates the dataset, and Spain’s representation is statistically negligible.
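
```python
# df_sal: the DS Salaries DataFrame, assumed to be loaded earlier in the notebook
# Number of records per company location (ISO country code)
country_dist = df_sal['company_location'].value_counts()
# Share of records located in Spain, as a percentage
es_pct = country_dist.get('ES', 0) / len(df_sal) * 100
print(f'Spain representation: {es_pct:.1f}%')
```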

| Country | Records | Share |
|---|---|---|
| United States (US) | 355 | 58.5% |
| Great Britain (GB) | 44 | 7.2% |
| India (IN) | 30 | 4.9% |
| Canada (CA) | 21 | 3.5% |
| Spain (ES) | 14 | 2.3% |

With only 14 Spanish records out of 607, any sub-group analysis of Spanish salaries by experience level, company size, or role type will be statistically unreliable — most cells will contain fewer than 5 observations, making any percentage-based claim meaningless. Using these figures to benchmark Spanish candidate salaries would be misleading.
The DS Salaries dataset is best used for global trend analysis and relative comparisons (e.g., how does experience level affect salary across markets?) rather than as a source of absolute salary benchmarks for the Spanish market. Country-level filtering to ES produces a sample too small for inferential analysis.
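The effect of sample size on reliability can be made concrete with a percentile bootstrap of the median. The sketch below uses synthetic log-normal salaries, not the DS Salaries records, and only borrows the two sample sizes from the table (14 for Spain, 355 for the United States). The function `bootstrap_median_ci` and the distribution parameters are illustrative assumptions.

```python
import random
import statistics

def bootstrap_median_ci(sample, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    # Resample with replacement and record the median of each resample
    medians = sorted(
        statistics.median(rng.choices(sample, k=len(sample)))
        for _ in range(n_boot)
    )
    lower = medians[int(n_boot * alpha / 2)]
    upper = medians[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

# Synthetic right-skewed salaries (log-normal); NOT the DS Salaries data
rng = random.Random(1)
population = [rng.lognormvariate(11.4, 0.5) for _ in range(607)]
sample_es = population[:14]      # same size as the Spanish subset
sample_us = population[14:369]   # same size as the US subset (355 records)

lo_es, hi_es = bootstrap_median_ci(sample_es)
lo_us, hi_us = bootstrap_median_ci(sample_us)
width_es = hi_es - lo_es
width_us = hi_us - lo_us

print(f"n=14  median CI width: {width_es:,.0f}")
print(f"n=355 median CI width: {width_us:,.0f}")
```

Both samples come from the same distribution, so any difference in interval width is due to sample size alone. Splitting the 14 records further by experience level or company size would widen the intervals even more.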

Recommendations

The following recommendations translate the bias findings into concrete actions for TinderJob’s product, analytics, and communication teams:
  1. Always report median salary (€93,444), never the mean. The right-skewed distribution makes the mean (€103,314) a misleading reference point that inflates candidate expectations.
  2. Do not use Tecnoempleo as a salary source. With 80.7% of salary fields null and a confirmed MNAR pattern, any salary statistics derived from this dataset are unreliable and potentially misleading.
  3. Expand scraper search terms to reduce selection bias. Review the 24 fixed terms quarterly against job board trends. Consider keyword-based discovery as a complement to reduce structural gaps.
  4. Complement with Spanish-specific salary sources. For reliable Spanish market salary data, integrate sources such as InfoJobs Salary Report, LinkedIn Salary Insights Spain, or Adecco/Randstad annual compensation surveys — all of which have meaningful Spanish sample sizes.
  5. Do not train selection or matching models on these datasets without debiasing techniques. Models trained on biased data will encode and amplify those biases in their predictions. At minimum, apply reweighting or resampling techniques before any model training.
  6. Communicate uncertainty to stakeholders and management. All figures presented should be accompanied by confidence intervals or explicit caveats. Salary estimates derived from this analysis are directional indicators, not guarantees — and they should be framed as such in any product copy, dashboard, or report.
Taking strategic hiring or compensation decisions based on biased or incomplete data carries real-world consequences beyond analytical error. Hiring algorithms trained on non-representative data can perpetuate and amplify structural inequalities — systematically disadvantaging candidates from underrepresented geographies, seniority levels, or demographic groups. Bias documentation is not bureaucratic compliance; it is an ethical obligation.
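Recommendation 5 mentions reweighting as a minimum debiasing step. One simple variant, shown here as a sketch rather than a prescribed method, is inverse-frequency weighting by country so that each country carries the same total weight. The counts are the five countries listed in the DS Salaries table above; restricting the calculation to those five is an assumption made for brevity.

```python
# Record counts for the five countries listed in the DS Salaries table
records = {"US": 355, "GB": 44, "IN": 30, "CA": 21, "ES": 14}

n_total = sum(records.values())
n_groups = len(records)

# Inverse-frequency ("balanced") weight per record: n_total / (n_groups * n_country)
weights = {country: n_total / (n_groups * n) for country, n in records.items()}

for country, w in weights.items():
    total_weight = w * records[country]
    print(f"{country}: weight per record = {w:.3f}, total weight = {total_weight:.1f}")
```

This is the same formula scikit-learn uses for `class_weight="balanced"`. Reweighting changes how much each record counts but adds no information: 14 Spanish records remain 14 observations, so the uncertainty described in section 3 still applies.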
