Every data analysis project operates within constraints set by its sources, methods, and scope. This page documents the known limitations and biases of the EDA Roles de Datos en España project clearly and systematically. Understanding these constraints is essential for interpreting any finding from this analysis responsibly: the aim is not to dismiss the results but to calibrate confidence appropriately.

Notebook 05 (05_estadistica_avanzada.ipynb) formally identifies 10 bias types that affect the dataset. These are documented below together with additional methodological and coverage limitations. The limitations are grouped into three categories: Data Quality, Methodological, and Coverage.
Data Quality Limitations
These limitations concern the completeness, consistency, and reliability of the underlying data before any analysis is applied.
1. Source Bias (Representation Bias)
The dataset combines three sources (df_jobs, df_tecno, and df_scraping) that differ in geographic scope, time window, schema, salary reporting conventions, and language. The distribution of records across sources is unequal: df_jobs alone contributes 942 of the 1,542 records (61%), yet it is the least representative of the Spanish labour market. This source imbalance means the overall dataset over-represents the internationally visible segment of data roles and under-represents the Spanish SME market.

Implication: Aggregate statistics across the full unified dataset blend populations that are not strictly comparable. Stratifying by source_dataset is strongly recommended before drawing conclusions.
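As a rough illustration of why stratifying by source_dataset matters, the sketch below compares a pooled median with per-source medians. It uses plain Python on a handful of invented rows. Only the column names source_dataset and salary_clean and the counts 942 and 1,542 come from this page; the salary values and the helper `median_by_source` are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical miniature records: the column names come from the project,
# the salary values are invented purely for illustration.
records = [
    {"source_dataset": "df_jobs", "salary_clean": 90000.0},
    {"source_dataset": "df_jobs", "salary_clean": 110000.0},
    {"source_dataset": "df_jobs", "salary_clean": 70000.0},
    {"source_dataset": "df_tecno", "salary_clean": 38000.0},
    {"source_dataset": "df_scraping", "salary_clean": 42000.0},
]


def median_by_source(rows):
    """Group salary_clean by source_dataset and return each group's median."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["source_dataset"]].append(row["salary_clean"])
    return {source: median(values) for source, values in groups.items()}


# Pooled statistic: blends all three sources into one number.
pooled = median([row["salary_clean"] for row in records])

# Stratified statistic: one number per source.
stratified = median_by_source(records)

# Share of the real dataset contributed by df_jobs, from the counts quoted on this page.
df_jobs_share = 942 / 1542

print(f"pooled median: {pooled}")
print(f"per-source medians: {stratified}")
print(f"df_jobs share of records: {df_jobs_share:.1%}")
```

In this toy data the pooled median sits inside the df_jobs range and says little about the other two sources, which is the pattern the stratification recommendation is meant to guard against.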
2. Salary Information Bias
The salary_clean column is derived by taking the midpoint of advertised salary ranges and converting to EUR. This is an estimate, not an exact compensation figure. Additionally, 30.54% of offers have no salary data at all, and df_tecno alone has a null salary rate of approximately 78%. Because df_jobs (the international source) has near-complete salary coverage, any salary aggregate over the full dataset is weighted towards international compensation levels, particularly those of the US market, which significantly inflates apparent salaries relative to what the Spanish market offers.

Implication: Salary statistics are biased towards sources with higher reporting rates (principally df_jobs). Spanish-market salary patterns are under-represented. Any salary figure should be understood as an approximation with meaningful uncertainty. See the Salary Analysis page for recommended filtering workflows.
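One way to make this coverage imbalance visible before aggregating is to compute the salary null rate per source, alongside each source's share of the rows that do carry a salary. The sketch below is a minimal illustration on invented rows, with `None` standing in for a missing salary; the helper name `salary_null_rate_by_source` is hypothetical and the numbers are not the project's.

```python
from collections import Counter

# Hypothetical rows; None marks an offer with no salary data.
records = [
    {"source_dataset": "df_jobs", "salary_clean": 95000.0},
    {"source_dataset": "df_jobs", "salary_clean": 80000.0},
    {"source_dataset": "df_tecno", "salary_clean": None},
    {"source_dataset": "df_tecno", "salary_clean": None},
    {"source_dataset": "df_tecno", "salary_clean": None},
    {"source_dataset": "df_tecno", "salary_clean": 36000.0},
    {"source_dataset": "df_scraping", "salary_clean": 40000.0},
    {"source_dataset": "df_scraping", "salary_clean": None},
]


def salary_null_rate_by_source(rows):
    """Return the fraction of rows with a missing salary for each source."""
    total = Counter(row["source_dataset"] for row in rows)
    missing = Counter(
        row["source_dataset"] for row in rows if row["salary_clean"] is None
    )
    return {source: missing[source] / total[source] for source in total}


null_rates = salary_null_rate_by_source(records)

# Share of the salaried rows that each source contributes.
salaried = Counter(
    row["source_dataset"] for row in records if row["salary_clean"] is not None
)
n_salaried = sum(salaried.values())
salaried_share = {source: count / n_salaried for source, count in salaried.items()}

print(null_rates)
print(salaried_share)
```

Here df_jobs is a quarter of the toy rows but half of the rows with a salary, so any pooled salary statistic leans towards it.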
3. Work Modality Bias
A large share of the work_modality column is unknown. This does not mean the role is on-site; it means the job posting did not state the modality clearly enough to be classified. Many Spanish job ads describe remote or hybrid arrangements in free-text prose that the classification rules did not capture. The true prevalence of remote and hybrid work in the Spanish data market is therefore not measurable from this dataset as structured.

Implication: Analysis of remote/hybrid/on-site distribution is unreliable. Statements about the prevalence of remote work in the Spanish data market cannot be supported by this dataset as structured.
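A small sketch of how unknown can be reported as its own quantity instead of being silently folded into on-site. The label unknown comes from this page; the other category names (`remote`, `hybrid`, `on_site`), the sample values, and the helper `modality_summary` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical values; only the "unknown" label is taken from this page.
modalities = [
    "unknown", "remote", "unknown", "hybrid",
    "on_site", "unknown", "remote", "unknown",
]


def modality_summary(values):
    """Return (share of unknown, shares among the classified values only)."""
    counts = Counter(values)
    total = len(values)
    unknown_share = counts["unknown"] / total if total else 0.0
    known_total = total - counts["unknown"]
    if known_total == 0:
        return unknown_share, {}
    known_shares = {
        modality: count / known_total
        for modality, count in counts.items()
        if modality != "unknown"
    }
    return unknown_share, known_shares


unknown_share, known_shares = modality_summary(modalities)
print(f"unknown: {unknown_share:.0%}")
print(known_shares)
```

The shares among classified postings describe only the subset that stated a modality clearly, so they should be quoted together with the unknown share and not read as market prevalence.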
4. Seniority Bias
Seniority (seniority_level) is not uniformly available across all three sources. Where it is populated, senior-level roles appear to be over-represented relative to the actual market distribution, partly because senior roles are more likely to be formally advertised and described with explicit seniority language. Junior and graduate-level positions in smaller companies often lack clear seniority labels in their postings.

Implication: Salary-by-seniority and skill-by-seniority analyses should be treated cautiously. The dataset does not provide a reliable picture of the full seniority spectrum.
5. Contract Type Bias
6. Skills and Description Bias
Skills are identified by keyword extraction from the posting text (the skills field and free-text descriptions). This approach has systematic biases: skills mentioned in standardised bullet-point lists are captured more reliably than skills described in prose; skills named differently across offers (e.g., “ML”, “machine learning”, “machine-learning”) may be counted separately or inconsistently normalised; and offers with no structured skills section contribute zero skill observations regardless of what the role actually requires.

Implication: Skill frequency rankings reflect how often skills are explicitly named in job postings, not their true importance or prevalence. Skills that employers assume without stating (e.g., basic Python in a senior role) are systematically under-counted.

Methodological Limitations
These limitations concern decisions made during data collection, cleaning, classification, and analysis.
7. Location Bias
The city_clean field represents the best available classification, but it may group geographically distinct locations or miss fine-grained distinctions. Additionally, offers without a stated location default to an unknown city, so roles that are fully remote but unlocated are not separately identifiable.

Implication: City-level geographic analysis is approximate. National-level aggregates are reliable; finer-grained geographic comparisons carry more uncertainty. Fully remote roles without a stated location are not counted under any city, leading to under-counting of remote work relative to on-site roles.
8. Rule-Based Classification (Regex)
Job family (job_family) and work modality (work_modality) classifications are assigned using regex pattern matching against job titles, descriptions, and other text fields. This is a deterministic, transparent approach, but it has known failure modes: ambiguous titles may be misclassified, new role labels that don’t match existing patterns are left unclassified, and the rules encode the assumptions of whoever designed the regex patterns.

Implication: Classification is not ground truth. Edge cases exist. Any finding that depends on the job_family or work_modality categories inherits the errors and assumptions of the classification rules.
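The failure modes described above can be reproduced with a very small rule-based classifier. The sketch below is not the project's rule set: the patterns, the family names `data_engineering` and `data_analysis`, the `unclassified` fallback label, and the function `classify_job_family` are all hypothetical. Only data_science_ai is a family name that appears on this page.

```python
import re

# Hypothetical, ordered rules: the first matching pattern wins.
FAMILY_PATTERNS = [
    (
        "data_science_ai",
        re.compile(r"\b(data scientist|machine learning|ml engineer|ai)\b", re.IGNORECASE),
    ),
    ("data_engineering", re.compile(r"\bdata engineer\b", re.IGNORECASE)),
    ("data_analysis", re.compile(r"\b(data analyst|bi analyst)\b", re.IGNORECASE)),
]


def classify_job_family(title):
    """Return the first family whose pattern matches the title, else 'unclassified'."""
    for family, pattern in FAMILY_PATTERNS:
        if pattern.search(title):
            return family
    return "unclassified"


for title in [
    "Senior Data Scientist",
    "Data Engineer",
    "Data Engineer (AI platform)",  # ambiguous: the earlier "ai" rule wins
    "Analytics Engineer",           # label not covered by any rule
]:
    print(f"{title!r} -> {classify_job_family(title)}")
```

The third title is assigned to data_science_ai only because of rule order, and the fourth falls through entirely. Both outcomes follow from choices made by whoever wrote the patterns, not from the postings themselves.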
9. Salary Midpoint Convention
salary_clean is set to €47,500, the arithmetic midpoint of the advertised range. In practice, most hiring outcomes cluster towards the lower end of published ranges. The midpoint assumption may therefore slightly overestimate typical compensation.

Implication: salary_clean is an approximation. For range analysis (e.g., comparing role families), it is fit for purpose. For precise compensation benchmarking, it should not be relied upon.
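The midpoint convention amounts to a one-line calculation. In the sketch below, the range €40,000 to €55,000 is a hypothetical input chosen only because its midpoint equals the €47,500 figure quoted above. The functions `salary_midpoint` and `point_in_range`, and the 0.25 fraction used to represent an outcome near the lower end of the range, are illustrative assumptions as well.

```python
def salary_midpoint(low, high):
    """Arithmetic midpoint of an advertised salary range."""
    if low > high:
        raise ValueError("low must not exceed high")
    return (low + high) / 2


def point_in_range(low, high, fraction):
    """Salary located at a given fraction of the way from low to high."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return low + fraction * (high - low)


# Hypothetical range whose midpoint matches the figure quoted on this page.
low, high = 40_000, 55_000

midpoint = salary_midpoint(low, high)
lower_end_outcome = point_in_range(low, high, 0.25)  # assumed, for comparison only

print(midpoint)                      # 47500.0
print(lower_end_outcome)             # 43750.0
print(midpoint - lower_end_outcome)  # 3750.0
```

Under these assumed numbers the midpoint overstates the lower-end outcome by €3,750, which gives a sense of the scale of error the convention can introduce for a single posting.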
10. Temporal Bias
df_jobs covers early-to-mid 2025, df_tecno covers early 2026, and df_scraping represents a point-in-time Adzuna snapshot. These windows are close but not identical. Market conditions, hiring volumes, and technology demand can shift significantly over the span of months in the fast-moving data roles sector.

Implication: The dataset captures a snapshot, not a trend. Year-over-year comparisons or trend analysis are not supported by this data. Seasonality effects (e.g., Q1 hiring peaks) cannot be controlled for, and the slight temporal misalignment between sources introduces a small additional confound.

Coverage Limitations
These limitations concern which parts of the market are included in the data and which are absent.
11. Stack Overflow Survey Bias
Technology usage figures in df_stack are global community signals, not Spanish market demand metrics. High usage of a technology on Stack Overflow does not mean Spanish employers are hiring for it at the same rate.
12. Geographic Concentration — Madrid and Barcelona
13. Anti-Bot Scraping Limitations — InfoJobs and Indeed Excluded
InfoJobs and Indeed are excluded because of their anti-bot defences, so the dataset relies on offers available through the international source (df_jobs) or on platforms with lower scraping defences (Adzuna). The findings are more representative of the tech-forward, internationally visible segment of the Spanish data job market than of the full market.
14. International Salary Inflation — US/International Offers in df_jobs
df_jobs contains 942 records with international scope; only approximately 143 are explicitly Spanish. Many of the international records include US-denominated salaries, which, even after EUR conversion, are substantially higher than Spanish-market equivalents. These records remain in the unified dataset and contribute to salary statistics unless explicitly filtered out.

Implication: Any salary aggregate that includes the full df_jobs source without filtering to Spanish records will significantly overstate what Spanish employers are paying. See the Salary Analysis page for recommended filtering workflows.
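To illustrate the effect of filtering, the sketch below compares a mean salary over all rows with a mean over Spanish rows only. It is not the workflow from the Salary Analysis page: the `country` column, the `ES` and `US` codes, the salary values, and the helper `spanish_only` are all hypothetical. The counts 143 and 942 are the ones quoted above.

```python
from statistics import mean

# Hypothetical rows; "country" is an assumed column name, not a documented one.
records = [
    {"source_dataset": "df_jobs", "country": "US", "salary_clean": 120000.0},
    {"source_dataset": "df_jobs", "country": "US", "salary_clean": 100000.0},
    {"source_dataset": "df_jobs", "country": "ES", "salary_clean": 45000.0},
    {"source_dataset": "df_tecno", "country": "ES", "salary_clean": 35000.0},
    {"source_dataset": "df_scraping", "country": "ES", "salary_clean": 40000.0},
]


def spanish_only(rows, country_key="country", spanish_code="ES"):
    """Keep only the rows explicitly marked as Spanish."""
    return [row for row in rows if row[country_key] == spanish_code]


unfiltered_mean = mean([row["salary_clean"] for row in records])
spanish_mean = mean([row["salary_clean"] for row in spanish_only(records)])

# Fraction of df_jobs that is explicitly Spanish, from the counts quoted on this page.
spanish_fraction_of_df_jobs = 143 / 942

print(f"unfiltered mean: {unfiltered_mean}")
print(f"Spanish-only mean: {spanish_mean}")
print(f"explicitly Spanish share of df_jobs: {spanish_fraction_of_df_jobs:.1%}")
```

With roughly 15% of df_jobs explicitly Spanish, an unfiltered aggregate is dominated by non-Spanish records, which is why the filter belongs before any salary statistic is computed.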
15. Sector Over-representation — Tech Sector
Notebook 05 — 10 Bias Types Summary
Notebook 05 (05_estadistica_avanzada.ipynb) formally catalogues the following 10 bias types affecting this dataset. The table maps each to the detailed limitation accordion above:
| # | Bias Type (Notebook 05) | Detailed In |
|---|---|---|
| 1 | Representation bias | Accordion 1 — Source Bias |
| 2 | Location bias | Accordion 7 — Location Bias |
| 3 | Salary information bias | Accordions 2 & 9 — Salary Information Bias / Midpoint Convention |
| 4 | Seniority bias | Accordion 4 — Seniority Bias |
| 5 | Work modality bias | Accordion 3 — Work Modality Bias |
| 6 | Contract type bias | Accordion 5 — Contract Type Bias |
| 7 | Sector bias | Accordion 15 — Sector Over-representation |
| 8 | Temporal bias | Accordion 10 — Temporal Bias |
| 9 | Source bias | Accordions 1 & 11 — Source Bias / Stack Overflow |
| 10 | Skills/description bias | Accordion 6 — Skills and Description Bias |
Summary of Bias Direction
The table below summarises the direction of the most significant biases in the dataset:

| Dimension | Direction of Bias | Primary Cause |
|---|---|---|
| Job family distribution | Overweights data_science_ai | Source selection and InfoJobs/Indeed exclusion |
| Geographic distribution | Overweights Madrid and Barcelona | Platform coverage and scraping constraints |
| Salary level | Inflated upward | International offers in df_jobs |
| Salary coverage | Biased to df_jobs | High null rate in df_tecno |
| Work modality | unknown inflated | Free-text modality descriptions not captured |
| Contract type | Under-reported | Inconsistent structured fields across sources |
| Seniority representation | Senior roles over-represented | Formal JD conventions; junior roles often informal |
| Sector | Tech sector over-represented | Platform and source selection |
| Skills/description | Explicit mentions over-counted | Keyword extraction misses implicit requirements |
| Temporal coverage | Snapshot only, slight misalignment | Different collection windows per source |
| Data completeness | English-language offers better populated | Multinational JD template conventions |