Salary analysis is one of the most sought-after insights from any job market EDA, and one of the most methodologically fraught. This page documents how salary_clean was constructed, what coverage looks like across sources, and how salary distributions vary by role, geography, and data source. It also explains the key caveats to keep in mind when interpreting any numeric salary figure from this dataset.
Coverage Summary
Salary data is not universally available across the three sources that make up the unified dataset. After cleaning and midpoint conversion, 69.46% of the 1,542 unified offers include a usable salary_clean value. The remaining 30.54% have no salary information.
- 69.46% with salary: offers where salary_clean was successfully extracted and converted to a comparable numeric EUR value.
- 30.54% without salary: offers where the salary was absent, described qualitatively (e.g., “competitive”), or could not be reliably parsed.
The salary_clean Methodology
Raw salary data arrives in a variety of formats across the three sources: annual ranges in EUR, hourly rates, ambiguous “up to X” figures, and sometimes salaries in USD or GBP. The cleaning pipeline harmonises these into a single salary_clean column using the following logic:
1. Parse the raw salary string: Extract numeric values from the salary text. The parser handles formats such as €40,000 - €55,000/year, $100k+, up to 60.000€, and similar patterns.
2. Convert to annual EUR: Hourly and monthly rates are annualised using standard multipliers. Non-EUR currencies are converted using approximate exchange rates at the time of data collection.
3. Compute the midpoint for ranges: For offers stating a salary range (e.g., €100,472 - €200,938), salary_clean is set to the arithmetic midpoint (in this example, €150,705). This is an approximation of expected compensation, not a guaranteed figure.
Note: salary_clean is an estimate, not an exact salary. For range-based offers, the actual salary offered to a candidate could be anywhere within the published range. The midpoint convention is a reasonable central estimate but should not be treated as a precise figure.
Source Differences
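Returning to the salary_clean methodology, its three steps (parse, annualise and convert, take the midpoint) can be sketched in a few lines. This is a minimal illustration, not the project's cleaning code: the function names `parse_amounts` and `to_salary_clean`, the exchange rates, the hours-per-year multiplier, and the treatment of single "up to X" figures are all assumptions made for the example.

```python
import re
from typing import List, Optional

# Illustrative assumptions: the source does not state the exchange rates
# or annualisation multipliers used by the real pipeline.
FX_TO_EUR = {"EUR": 1.0, "USD": 0.92, "GBP": 1.17}
PERIODS_PER_YEAR = {"year": 1, "month": 12, "hour": 1760}  # 1760 = 40 h x 44 weeks

TOKEN = re.compile(r"(\d[\d.,]*)\s*(k\b)?", re.IGNORECASE)
THOUSANDS = re.compile(r"\d{1,3}(?:[.,]\d{3})+")


def parse_amounts(raw: str) -> List[float]:
    """Step 1: pull every numeric amount out of a raw salary string."""
    amounts = []
    for token, kilo in TOKEN.findall(raw):
        token = token.rstrip(".,")
        if THOUSANDS.fullmatch(token):
            # "40,000" and "60.000" are both read as thousands separators.
            token = re.sub(r"[.,]", "", token)
        else:
            token = token.replace(",", ".")
        try:
            value = float(token)
        except ValueError:
            continue  # skip tokens that are not clean numbers
        amounts.append(value * 1000 if kilo else value)
    return amounts


def to_salary_clean(raw: Optional[str]) -> Optional[float]:
    """Steps 2 and 3: annualise, convert to EUR, take the range midpoint."""
    if not raw:
        return None
    amounts = parse_amounts(raw)
    if not amounts:
        return None  # qualitative text such as "competitive"
    text = raw.lower()
    if "$" in raw or "usd" in text:
        currency = "USD"
    elif "£" in raw or "gbp" in text:
        currency = "GBP"
    else:
        currency = "EUR"
    period = "hour" if "hour" in text else "month" if "month" in text else "year"
    # A single figure ("up to X", "100k+") is used as-is: an assumption.
    midpoint = (min(amounts) + max(amounts)) / 2
    return round(midpoint * PERIODS_PER_YEAR[period] * FX_TO_EUR[currency], 2)


# The range quoted in the methodology: (100,472 + 200,938) / 2 = 150,705
assert to_salary_clean("€100,472 - €200,938") == 150705.0

for raw in ["€40,000 - €55,000/year", "$100k+", "up to 60.000€", "competitive"]:
    print(f"{raw!r} -> {to_salary_clean(raw)}")
```

Note the ambiguity the sketch glosses over: a token such as 1.500 is read as fifteen hundred, not one and a half, which is the usual convention for Spanish listings but may be wrong for hourly rates.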
The three data sources have structurally different salary profiles, and this difference must be understood before making any cross-source comparisons.
df_jobs — International source (data_science_job_posts_2025.csv)
- 942 records total; approximately 143 explicitly Spanish entries.
- Near-complete salary coverage — the majority of records include a parseable salary field.
- International scope means many salaries are from the US, UK, or other high-wage markets.
- USD-denominated salaries converted to EUR will appear inflated relative to Spanish-market offers.
- Effect on aggregates: This source pulls the overall salary distribution significantly upward.
df_tecno — Spanish market source (tecnoempleo_spain_2026.csv)
- Approximately 600 records, exclusively Spanish market.
- ~78% null salary rate — most Tecnoempleo listings omit salary information entirely.
- Where salary is present, figures reflect the actual Spanish labour market and are generally lower than df_jobs equivalents.
- Effect on aggregates: Under-represented in salary analyses due to high null rate, despite being the most representative Spanish market source.
df_scraping — Adzuna scrape
- Additional recent Spanish job offers, scraped from Adzuna.
- Salary coverage varies; more complete than df_tecno but less so than df_jobs.
- Focuses on current Spanish market offers, complementing the other two sources.
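One way to quantify these coverage differences is to compute the share of missing salary_clean values per source. The sketch below assumes the unified DataFrame carries a `source` column holding the labels df_jobs, df_tecno and df_scraping; that column name, the helper name, and the toy rows are hypothetical and not taken from the project.

```python
import pandas as pd


def salary_null_rate_by_source(df: pd.DataFrame, source_col: str = "source") -> pd.Series:
    """Percentage of offers per source with no usable salary_clean value."""
    return (
        df["salary_clean"].isna()
        .groupby(df[source_col])
        .mean()
        .mul(100)
        .round(2)
        .rename("null_salary_pct")
    )


# Toy rows for illustration only (not real figures from the dataset).
toy = pd.DataFrame({
    "source": ["df_jobs", "df_jobs", "df_tecno", "df_tecno",
               "df_tecno", "df_tecno", "df_scraping", "df_scraping"],
    "salary_clean": [150705.0, 98000.0, None, None, None, 32000.0, 41000.0, None],
})
print(salary_null_rate_by_source(toy))
```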
Geographic Salary Variation
Salary distributions differ meaningfully by location. Madrid, as the dominant city in the dataset, shows a broad salary range reflecting the concentration of both local SMEs and large multinational employers. Initial analysis from Block 2 of Notebook 04 (02b_salario_rol.png) indicates:
- Madrid: Highest median salary among Spanish cities, driven by multinational presence.
- Barcelona: Second highest, with a notable startup and tech company premium.
- Other cities: Generally lower median salaries, with smaller sample sizes making statistical inference less reliable.
Geographic salary analysis is most meaningful when restricted to offers from Spanish-market sources (df_tecno, df_scraping) and when the salary_clean_outlier flag is used to exclude extreme values. Using the full unified dataset without filtering will inflate apparent salaries for all cities due to international offers.
Salary by Role Family
The Block 2 visualisation (02b_salario_rol.png) also disaggregates salary by job_family. Preliminary patterns from the EDA include:
- data_science_ai: Widest salary range; highest upper quartile, reflecting both junior ML roles and senior AI research positions.
- data_engineering: Consistently strong median salary; less variance than data science.
- analytics / BI: Lower median than engineering and science roles; compressed range.
- data_management / governance: Present in smaller numbers; salary patterns less reliable due to sample size.
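The per-family statistics behind a chart of this kind (median, quartiles, range) can be reproduced with a groupby. A minimal sketch, assuming the job_family and salary_clean columns named above and a salary_clean_outlier flag that is True for extreme values; the function name, the family labels used as data values, and the toy salaries are hypothetical.

```python
import pandas as pd


def salary_by_family(df: pd.DataFrame) -> pd.DataFrame:
    """Median, quartiles and range of salary_clean per job_family."""
    # Keep rows that have a salary and are not flagged as outliers.
    keep = df["salary_clean"].notna() & ~df["salary_clean_outlier"].eq(True)
    grouped = df.loc[keep].groupby("job_family")["salary_clean"]
    summary = grouped.agg(n="count", minimum="min", median="median", maximum="max")
    summary["q1"] = grouped.quantile(0.25)
    summary["q3"] = grouped.quantile(0.75)
    summary["iqr"] = summary["q3"] - summary["q1"]
    return summary.sort_values("median", ascending=False)


# Toy rows for illustration only (not real figures from the dataset).
toy_offers = pd.DataFrame({
    "job_family": ["data_science_ai"] * 4 + ["data_engineering"] * 3 + ["analytics"] * 3,
    "salary_clean": [40000, 70000, 100000, 400000,
                     50000, 55000, 60000,
                     35000, 40000, 45000],
    "salary_clean_outlier": [False, False, False, True,
                             False, False, False,
                             False, False, False],
})
summary = salary_by_family(toy_offers)
print(summary)
```

Reporting `n` next to each median makes it easier to spot families, such as the smaller governance group, where the sample is too thin to read much into the figures.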
Working with Salary Data
The following code demonstrates how to load the cleaned dataset, check salary coverage, remove outliers, and produce a source-stratified salary summary:
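The sketch below is one way to write that workflow, offered as an illustration under assumptions and not as the notebook's own code. The file path in the commented load step, the `source` column name, the helper name, and the toy rows are hypothetical; salary_clean and salary_clean_outlier are the columns described on this page.

```python
import pandas as pd


def salary_summary(df: pd.DataFrame, source_col: str = "source") -> pd.DataFrame:
    """Describe salary_clean per source after dropping nulls and flagged outliers."""
    with_salary = df[df["salary_clean"].notna()]
    no_outliers = with_salary[~with_salary["salary_clean_outlier"].eq(True)]
    return no_outliers.groupby(source_col)["salary_clean"].describe()


# 1. Load the cleaned dataset (hypothetical path; adjust to the real file).
# df = pd.read_csv("data/processed/unified_offers.csv")
# Toy rows stand in for the real file here (not real figures).
df = pd.DataFrame({
    "source": ["df_jobs", "df_jobs", "df_jobs", "df_tecno", "df_tecno", "df_scraping"],
    "salary_clean": [150705.0, 120000.0, 900000.0, 32000.0, None, None],
    "salary_clean_outlier": [False, False, True, False, False, False],
})

# 2. Check salary coverage across all offers.
coverage = df["salary_clean"].notna().mean() * 100
print(f"Offers with salary_clean: {coverage:.2f}%")

# 3-4. Remove outliers and summarise by source.
by_source = salary_summary(df)
print(by_source[["count", "mean", "50%"]])
```

Sources with no remaining salaried offers simply drop out of the summary, which is itself a useful signal of how thin a source's salary data is.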
Recommended filtering workflow for salary analysis
For the most interpretable salary analysis of the Spanish market specifically, apply the following filters before aggregating:
Note that after applying all three filters to Spanish-market sources with high null salary rates, the resulting sample may be small. Interpret results accordingly.
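A sketch of such a filter chain follows, under the assumption that the three filters are (1) Spanish-market sources only, (2) a non-null salary_clean, and (3) salary_clean_outlier not set, which is how the caveats on this page read. The `source` column name, the helper name, the `min_n` threshold, and the toy rows are hypothetical.

```python
import pandas as pd

SPANISH_SOURCES = ["df_tecno", "df_scraping"]


def spanish_salary_sample(df: pd.DataFrame, source_col: str = "source",
                          min_n: int = 30) -> pd.DataFrame:
    """Apply the three filters and warn when few offers remain."""
    mask = (
        df[source_col].isin(SPANISH_SOURCES)          # 1. Spanish-market sources
        & df["salary_clean"].notna()                  # 2. salary present
        & ~df["salary_clean_outlier"].eq(True)        # 3. not flagged as outlier
    )
    sample = df.loc[mask]
    if len(sample) < min_n:
        # min_n is an arbitrary illustrative threshold, not a project rule.
        print(f"Warning: only {len(sample)} offers left after filtering.")
    return sample


# Toy rows for illustration only (not real figures from the dataset).
offers = pd.DataFrame({
    "source": ["df_jobs", "df_tecno", "df_tecno", "df_scraping", "df_scraping"],
    "salary_clean": [150705.0, 32000.0, None, 41000.0, 250000.0],
    "salary_clean_outlier": [False, False, False, False, True],
})
sample = spanish_salary_sample(offers)
print(sample["salary_clean"].median())
```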
Salary Visualisation Reference
Notebook 04 produces two salary-specific visualisations:
- 02_analisis_salarial.png: Overall salary distribution histogram and box plot for salary_clean, with and without outlier exclusion.
- 02b_salario_rol.png: Salary distribution broken down by job_family, showing median, IQR, and range for each role category.
Apply the salary_clean_outlier filter to produce readable distributions. The raw (unfiltered) distribution is extremely right-skewed due to international USD-denominated offers at the high end.