This project is a collaborative exploratory data analysis (EDA) of the Spanish data job market, built as part of a data bootcamp. It aggregates and cleans real job offer data from multiple sources to answer practical questions: which skills employers actually demand, where data jobs are concentrated geographically, how salaries compare across roles, and what biases may exist in the available data. The goal is to give aspiring data professionals a grounded, evidence-based picture of what the job market looks like in Spain in 2025–2026.
Why This Project Matters
The data job market in Spain is growing fast, but published salary figures are sparse and role definitions vary widely. By combining local job boards, international datasets, and a developer survey, this project builds a more complete picture than any single source could provide, and it surfaces the gaps and biases that practitioners should know about before drawing conclusions from raw listings.
Data Sources
The analysis starts from three raw datasets, each with distinct characteristics and limitations.
Tecnoempleo Spain 2026
600 local job offers scraped from Tecnoempleo, a Spanish tech job board. Covers Spanish-market roles with local context, but 78% of salary fields are null — making direct salary analysis on this source unreliable without augmentation.
Data Science Job Posts 2025
944 international offers, of which 143 are based in Spain. Provides broader role taxonomy and cross-market comparison, though international listings may not reflect local hiring conditions accurately.
Stack Overflow Survey 2025
Developer survey responses capturing technology preferences and adoption trends. Used primarily for skills and tooling analysis rather than job offer specifics. Most desired technology: OpenAI GPT (chatbot models).
How the Unified Dataset Was Built
The three sources were merged, cleaned, and deduplicated into a single file, jobs_all_clean.csv, containing 1,542 offers and 17 columns. Salary data is present in 69.46% of records, a significant improvement over the raw Tecnoempleo source.
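The merge-and-deduplicate step can be illustrated with a minimal, standard-library sketch. It is not the project's actual cleaning code: the column names (`title`, `company`, `location`, `salary`) and the choice of deduplication key are assumptions made for illustration, and the sample records are invented.

```python
def dedup_offers(*sources):
    """Concatenate several lists of offers and drop repeated ones.

    Two offers count as duplicates when title, company and location match
    after trimming whitespace and lowercasing (an assumed key, not
    necessarily the one used in the project).
    """
    seen = set()
    unified = []
    for source in sources:
        for offer in source:
            key = tuple(
                (offer.get(field) or "").strip().lower()
                for field in ("title", "company", "location")
            )
            if key in seen:
                continue  # keep only the first occurrence
            seen.add(key)
            unified.append(offer)
    return unified


def salary_coverage(offers):
    """Return the fraction of offers that carry a salary value."""
    if not offers:
        return 0.0
    with_salary = sum(1 for offer in offers if offer.get("salary") is not None)
    return with_salary / len(offers)


# Invented toy records, only to show the behaviour.
local_offers = [
    {"title": "Data Analyst", "company": "Acme", "location": "Madrid", "salary": None},
]
international_offers = [
    {"title": "data analyst ", "company": "ACME", "location": "madrid", "salary": 30000},
    {"title": "Data Engineer", "company": "Beta", "location": "Barcelona", "salary": 40000},
]

unified = dedup_offers(local_offers, international_offers)
coverage = salary_coverage(unified)
```

Because the first occurrence wins, the order in which sources are passed decides which copy of a duplicated offer survives. Here the copy without a salary is kept, which shows how deduplication order can affect salary coverage.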
Selecting the Adzuna API
To supplement the initial datasets, especially to improve salary coverage, the team evaluated several data collection approaches.
Why not InfoJobs or Indeed?
Both InfoJobs and Indeed were explored as potential data sources. InfoJobs requires authenticated API access with a lengthy approval process. Indeed actively blocks automated data collection with CAPTCHA challenges and Cloudflare protections, making reliable scraping infeasible within the bootcamp timeline. Both paths were abandoned after initial attempts were blocked.
The Adzuna API integration is implemented in 01_data_collection.ipynb.
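A request to the Adzuna job search endpoint might be assembled roughly as follows. This is a hedged sketch, not the notebook's code: the environment variable names `ADZUNA_APP_ID` and `ADZUNA_APP_KEY`, the helper names, and the default page size are assumptions, and the endpoint and parameters should be checked against Adzuna's current API documentation.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen

ADZUNA_BASE = "https://api.adzuna.com/v1/api/jobs"


def credentials_from_env():
    """Read credentials from the environment (variable names are assumed)."""
    return os.environ["ADZUNA_APP_ID"], os.environ["ADZUNA_APP_KEY"]


def build_search_url(app_id, app_key, what, country="es", page=1,
                     results_per_page=50):
    """Build a search URL for one page of results in the given country."""
    query = urlencode({
        "app_id": app_id,
        "app_key": app_key,
        "what": what,  # free-text search terms
        "results_per_page": results_per_page,
    })
    return f"{ADZUNA_BASE}/{country}/search/{page}?{query}"


def fetch_page(url, timeout=30):
    """Perform the HTTP GET and decode the JSON body (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Building the URL is side-effect free; nothing is requested here.
example_url = build_search_url("my-id", "my-key", "data analyst")
```

Keeping URL construction separate from the network call makes it possible to inspect or log the request without spending API quota. Note that the URL contains the credentials, so any logging should redact them.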
Key Findings at a Glance
Most Demanded Skill
Python is the top skill requested across all data job families in Spain, appearing consistently in listings for data science, data engineering, and analytics roles.
Dominant Location
Madrid is the dominant city for data roles, concentrating the largest share of listings by a significant margin over Barcelona and other Spanish cities.
Top Job Family
data_science_ai is the most frequent job family in the unified dataset, reflecting strong employer demand for machine learning and AI-adjacent roles.
Salary Coverage
Salary data is available for ~69.46% of offers in the final dataset — better than the raw sources, but still incomplete enough to warrant caution in salary-based comparisons.
The 78% null salary rate in the Tecnoempleo dataset is a known limitation. Conclusions drawn from salary analysis should account for this missingness, which is explored in depth in the bias analysis notebook.
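One way to check how salary missingness is distributed is to compute the null rate per source. The sketch below assumes each record carries a `source` field and a `salary` field; both names are hypothetical, and the sample rows are invented rather than taken from the dataset.

```python
from collections import defaultdict


def null_salary_rate_by_source(offers):
    """Return {source: fraction of offers whose salary is missing}."""
    totals = defaultdict(int)
    nulls = defaultdict(int)
    for offer in offers:
        source = offer["source"]
        totals[source] += 1
        # Treat both None and empty strings as missing.
        if offer.get("salary") in (None, ""):
            nulls[source] += 1
    return {source: nulls[source] / totals[source] for source in totals}


# Invented toy rows, only to show the shape of the result.
sample = [
    {"source": "tecnoempleo", "salary": None},
    {"source": "tecnoempleo", "salary": ""},
    {"source": "tecnoempleo", "salary": None},
    {"source": "tecnoempleo", "salary": 32000},
    {"source": "international", "salary": 45000},
    {"source": "international", "salary": 51000},
]

rates = null_salary_rate_by_source(sample)
```

A large gap between sources suggests that any salary comparison is effectively a comparison of the sources with better coverage, which is the kind of bias the caution above refers to.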
The Five-Notebook Pipeline
The project is structured as a sequential pipeline of five Jupyter notebooks. Each notebook has a focused scope and passes its outputs to the next stage.
Data Collection
01_data_collection.ipynb: Calls the Adzuna API to retrieve job listings and saves raw responses. Requires valid API credentials in a .env file.
Cleaning & Preparation
02_cleaning.ipynb: Merges the three raw datasets with the Adzuna data, standardises column names, normalises job families, handles nulls, and exports the unified jobs_all_clean.csv to data/clean/.
Exploratory Data Analysis
03_eda.ipynb: Computes summary statistics, correlation matrices, skill frequency counts, and geographic distributions. Produces the core quantitative findings.
Visualizations
04-visualizations.ipynb: Generates publication-quality charts using matplotlib, seaborn, and plotly. Outputs are saved to the images/ directory and support the final report.
Notebook Reference
Data Collection
Adzuna API integration and raw data retrieval
Cleaning
Dataset merging, normalisation, and export
EDA
Summary statistics, skill counts, and geographic analysis
Visualizations
matplotlib, seaborn, and plotly chart generation
Bias Analysis
Salary missingness, geographic bias, and dataset caveats
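As an illustration of the skill frequency counts listed for the EDA notebook, the following sketch counts how many offers mention each skill. It assumes a hypothetical `skills` field holding a comma-separated string. The real column name and format may differ, and the sample offers are invented.

```python
from collections import Counter


def count_skills(offers, sep=","):
    """Count the number of offers mentioning each skill.

    Skills are lowercased and de-duplicated within an offer, so an offer
    that lists "Python" twice still contributes a single count.
    """
    counts = Counter()
    for offer in offers:
        raw = offer.get("skills") or ""
        skills = {part.strip().lower() for part in raw.split(sep) if part.strip()}
        counts.update(skills)
    return counts


# Invented toy offers, only to show the behaviour.
toy_offers = [
    {"skills": "Python, SQL"},
    {"skills": "python, Spark, Python"},
    {"skills": ""},
]

skill_counts = count_skills(toy_offers)
top_skill = skill_counts.most_common(1)[0]
```

Counting offers rather than raw mentions keeps a single verbose listing from inflating a skill's ranking.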
Team
This project was built as a bootcamp group assignment. David serves as the Data Analyst responsible for data collection, including the Adzuna API integration and the raw ingestion pipeline in 01_data_collection.ipynb. The broader team contributed to cleaning, analysis, and visualisation across the remaining notebooks.