The project is built around a linear, reproducible pipeline that transforms raw job listing data from three static datasets and one live API into cleaned CSVs, analytical summaries, and chart images. Each stage is implemented as a self-contained Jupyter notebook, and every notebook’s outputs become the next notebook’s inputs. Understanding the full flow makes it straightforward to re-run individual stages, substitute data sources, or trace any result back to its origin.
Pipeline overview
Skipping Stage 2 (notebook 01): Notebooks 02 through 05 run without API access, working from the files in the data/raw/ and data/clean/ directories. If you do not have Adzuna API credentials, you can skip notebook 01 and work entirely with the three static datasets. See the Setup guide for details.
Stage 1 — Data sources
The pipeline ingests data from four sources. Three are static CSV files included with the project; one is a live REST API.
data_science_job_posts_2025.csv
Job postings dataset focused on data science roles. Static file included in data/raw/. Contains salary information and role descriptions.
tecnoempleo_spain_2026.csv
Spanish tech job listings scraped from TecnoEmpleo. Static file included in data/raw/. Provides Spain-specific location and technology data.
stackoverflow_2025_results.csv
Stack Overflow Developer Survey 2025 results. Static file included in data/raw/. Used for technology ranking and skill demand analysis.
Adzuna REST API
Live job listing API providing real-time data for Spain. Requires registration at developer.adzuna.com. Fetched by notebook 01 only.
Stage 2 — Notebook 01: Data collection
File: notebooks/01_data_collection.ipynb
Notebook 01 authenticates against the Adzuna API and performs paginated requests to collect job listings for data roles in Spain. The raw results are deduplicated and exported as a single CSV.
| Property | Value |
|---|---|
| Input | ADZUNA_APP_ID and ADZUNA_APP_KEY from .env |
| Output | data/raw/scraping_jobs_raw.csv |
| Key library | requests 2.34.2 |
Authentication
Load API credentials from .env using python-dotenv. Construct the base request URL for the Adzuna Jobs API v1 search endpoint for Spain (/v1/api/jobs/es/search/).
Paginated API calls
Iterate through result pages, collecting job listings until all available results are exhausted. Each page returns up to 50 results.
Deduplication
Remove duplicate listings using Adzuna job IDs. Jobs with identical IDs appearing across pages are collapsed to a single row.
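The three steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it uses only the standard library instead of requests and python-dotenv, reads the credentials from environment variables, and takes the endpoint as a `base_url` argument. The function names, the `what` search term, and the `max_pages` safety cap are assumptions made for the example.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen

RESULTS_PER_PAGE = 50  # page size stated above


def build_url(base_url, page, app_id, app_key, what="data"):
    """Build the URL for one results page (the page number goes in the path)."""
    query = urlencode({
        "app_id": app_id,
        "app_key": app_key,
        "results_per_page": RESULTS_PER_PAGE,
        "what": what,  # assumed search term, not taken from the notebook
    })
    return f"{base_url.rstrip('/')}/{page}?{query}"


def make_fetcher(base_url):
    """Return a function that downloads one page of results over HTTP."""
    app_id = os.environ["ADZUNA_APP_ID"]
    app_key = os.environ["ADZUNA_APP_KEY"]

    def fetch(page):
        url = build_url(base_url, page, app_id, app_key)
        with urlopen(url, timeout=30) as resp:
            return json.load(resp).get("results", [])

    return fetch


def collect_listings(fetch, max_pages=100):
    """Walk pages until one comes back empty, keeping one row per job ID."""
    by_id = {}
    for page in range(1, max_pages + 1):
        results = fetch(page)
        if not results:
            break
        for job in results:
            by_id.setdefault(job["id"], job)  # first occurrence wins
    return list(by_id.values())
```

Because `collect_listings` only needs a callable that maps a page number to a list of dicts, the pagination and deduplication logic can be exercised with a stub instead of a live API call.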
Stage 3 — Notebook 02: Cleaning
File: notebooks/02_cleaning.ipynb
Notebook 02 is the most transformation-intensive stage. It ingests all four source files, normalises each independently, then unifies them under a common schema.
| Property | Value |
|---|---|
| Input | data/raw/*.csv (all four source files) |
| Output | data/clean/*.csv — 8+ files |
| Key libraries | pandas 3.0.3, numpy 2.4.6 |
Column normalisation
Rename all columns to English snake_case. Map heterogeneous field names (e.g., puesto, job_title, JobTitle) to a single canonical schema.
Deduplication
Remove exact duplicates within each dataset and cross-dataset duplicates in the unified file. Matching is performed on title + company + location combinations.
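As a rough sketch of the two steps above, the following plain-Python example renames columns to a canonical snake_case schema and then drops rows that share the same title, company, and location. The notebook itself works with pandas; the alias table, the helper names, and the choice to compare keys case-insensitively are all assumptions made for illustration.

```python
import re

# Hypothetical alias table: snake_case source name -> canonical name.
COLUMN_ALIASES = {"puesto": "job_title", "empresa": "company", "ubicacion": "location"}
KEY_COLUMNS = ("job_title", "company", "location")


def to_snake_case(name):
    """Convert names such as 'JobTitle' or 'Job Title' to 'job_title'."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


def normalise_columns(row):
    """Return a copy of the row with canonical column names."""
    renamed = {}
    for column, value in row.items():
        snake = to_snake_case(column)
        renamed[COLUMN_ALIASES.get(snake, snake)] = value
    return renamed


def dedup_key(row):
    """Build a comparison key that ignores case and extra whitespace (assumed)."""
    return tuple(" ".join(str(row.get(c, "")).split()).casefold() for c in KEY_COLUMNS)


def drop_duplicates(rows):
    """Keep the first row seen for each title + company + location key."""
    seen, unique = set(), []
    for row in rows:
        key = dedup_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Running `drop_duplicates` once per dataset and once more on the concatenated rows would mirror the within-dataset and cross-dataset passes described above.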
Skill extraction
Parse free-text job descriptions and requirements fields using keyword matching to populate a structured skills column and generate the long-format technologies file.
Salary parsing
Standardise salary fields: convert ranges to midpoints, normalise to annual EUR, and flag rows with missing or unparseable salary information.
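A minimal sketch of the salary step is shown below. It assumes salaries arrive as free text such as "30.000 - 40.000 €", treats a dot or comma followed by exactly three digits as a thousands separator, and returns a flag alongside the value so unparseable rows can be marked. The function name, the period table, and especially the exchange rate are placeholders, not values from the project.

```python
import re

# Assumed multipliers for converting a pay period to a yearly figure.
PERIODS_PER_YEAR = {"year": 1, "month": 12}
# Placeholder conversion rates to EUR; the USD figure is NOT a real rate.
EUR_RATES = {"EUR": 1.0, "USD": 0.9}


def parse_salary(raw, period="year", currency="EUR"):
    """Return (annual_eur, parsed_ok); ranges are collapsed to their midpoint."""
    factor = PERIODS_PER_YEAR.get(period)
    rate = EUR_RATES.get(currency)
    if raw is None or factor is None or rate is None:
        return None, False
    # Drop thousands separators: "30.000" and "30,000" both become "30000".
    text = re.sub(r"(?<=\d)[.,](?=\d{3}\b)", "", str(raw))
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers or len(numbers) > 2:
        return None, False  # nothing usable, or too ambiguous to trust
    midpoint = sum(numbers) / len(numbers)
    return midpoint * factor * rate, True
```

For example, `parse_salary("30.000 - 40.000 €")` yields `(35000.0, True)`, while a value such as `"Competitive"` yields `(None, False)` and would be flagged as missing.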
Location cleaning
Normalise Spanish city names, resolve abbreviations, and map entries to canonical province names for consistent geographic analysis.
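Skill extraction and location cleaning, as described above, are both lookup-based, so one short sketch can illustrate them together. The keyword and alias tables here are tiny hypothetical samples, and the function names and the `job_id`, `description`, and `technology` column names are assumptions, not the notebook's schema.

```python
import re
import unicodedata

# Hypothetical keyword table: lowercase pattern -> canonical skill name.
SKILL_KEYWORDS = {"python": "Python", "sql": "SQL", "power bi": "Power BI", "aws": "AWS"}

# Hypothetical alias table: normalised spelling -> canonical province name.
PROVINCE_ALIASES = {
    "madrid": "Madrid", "alcobendas": "Madrid",
    "barcelona": "Barcelona", "bcn": "Barcelona",
    "malaga": "Málaga",
}


def extract_skills(text):
    """Return canonical skills whose keyword appears as a whole token."""
    found = []
    for keyword, skill in SKILL_KEYWORDS.items():
        # The lookarounds stop "sql" from matching inside "MySQL".
        pattern = rf"(?<![\w+#]){re.escape(keyword)}(?![\w+#])"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(skill)
    return found


def to_long_format(jobs):
    """Emit one row per (job, technology) pair."""
    return [
        {"job_id": job["job_id"], "technology": skill}
        for job in jobs
        for skill in extract_skills(job["description"])
    ]


def canonical_province(raw):
    """Map a raw location string to a province, or None if it is unknown."""
    decomposed = unicodedata.normalize("NFKD", raw)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", no_accents)
    return PROVINCE_ALIASES.get(" ".join(cleaned.split()).casefold())
```

Returning `None` for unmapped locations keeps unknown values visible, so the alias table can be extended instead of silently guessing a province.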
Stage 4 — Notebook 03: EDA
File: notebooks/03_eda.ipynb
Notebook 03 performs structured exploratory analysis on the cleaned data and exports aggregated summaries for use by the visualisations notebook.
| Property | Value |
|---|---|
| Input | data/clean/*.csv |
| Output | data/eda/*.csv |
| Key libraries | pandas 3.0.3, scipy 1.17.1, statsmodels 0.14.6 |
Structure analysis
Examine dataset shape (rows, columns), data types, and memory usage. Confirm all schema constraints from the cleaning stage were applied correctly.
Null analysis
Compute null rates per column across all cleaned files. Identify columns with >50% missing data and document them as known limitations. Output: null rate summary added to data/eda/jobs_eda.csv.
Distribution analysis
Compute frequency distributions for categorical columns (role, location, work mode, contract type) and descriptive statistics for numeric columns (salary). Identify skew and outliers.
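The null and distribution analyses above can be illustrated with a small standard-library sketch. The notebook relies on pandas and scipy, so the function names here are hypothetical, and the 1.5 × IQR fence is one common outlier rule chosen for the example, not necessarily the one the notebook applies.

```python
import statistics
from collections import Counter


def null_rates(rows, columns):
    """Share of rows in which each column is None or an empty string."""
    total = len(rows)
    return {c: sum(1 for r in rows if r.get(c) in (None, "")) / total for c in columns}


def frequency_table(values):
    """Return (value, count, share) tuples, most frequent first."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return [(value, n, n / total) for value, n in counts.most_common()]


def describe(values):
    """Basic descriptive statistics plus outliers beyond the 1.5 * IQR fences."""
    data = sorted(v for v in values if v is not None)
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mean = statistics.fmean(data)
    return {
        "count": len(data),
        "mean": mean,
        "median": median,
        "mean_minus_median": mean - median,  # positive values hint at right skew
        "outliers": [v for v in data if v < low or v > high],
    }
```

Columns whose null rate exceeds 0.5 can then be listed with `[c for c, rate in null_rates(rows, columns).items() if rate > 0.5]`.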
Ranking generation
Aggregate technology mention counts from technologies_clean_long_format.csv and cross-reference with Stack Overflow survey data to produce ranked lists of demanded and used technologies. Output: data/clean/technology_rankings.csv.
Stage 5 — Notebook 04: Visualizations
File: notebooks/04-visualizations.ipynb
Notebook 04 produces all charts for the project. Plotly is used for interactive exploration; kaleido exports each chart as a static PNG for documentation and presentations.
| Property | Value |
|---|---|
| Input | data/eda/*.csv and selected data/clean/*.csv files |
| Output | images/*.png — one file per visualisation block |
| Key libraries | matplotlib 3.10.9, seaborn 0.13.2, plotly 6.7.0, kaleido 1.3.0, squarify 0.4.4 |
| File | Chart description |
|---|---|
| 00_calidad_datos.png | Data quality summary — null rates and completeness by column |
| 01_distribucion_volumen.png | Volume distribution — job count by source dataset |
| Additional blocks | Role frequency, location heatmap, salary box plots, technology rankings, work mode breakdown, skills treemap |
Stage 6 — Notebook 05: Bias analysis
File: notebooks/05_bias_analysis.ipynb
Notebook 05 applies a structured bias-identification framework to the unified dataset. The outputs are in-notebook conclusions rather than additional CSV files.
| Property | Value |
|---|---|
| Input | data/clean/jobs_all_clean.csv |
| Output | In-notebook analytical conclusions |
| Key libraries | pandas 3.0.3, scipy 1.17.1 |
Representation bias
Are certain roles (e.g., Data Engineer vs. Data Analyst) over- or under-represented relative to the actual job market? Quantified by comparing source-level frequencies.
Location bias
Do Madrid and Barcelona dominate to a degree that skews national conclusions? Measured by listing concentration ratios across provinces.
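One simple way to express such a concentration ratio is the share of listings that falls in the k largest provinces. The sketch below is an assumption about how the measure could be computed; the notebook may define it differently, and the function name is hypothetical.

```python
from collections import Counter


def concentration_ratio(provinces, k=2):
    """Share of listings located in the k most frequent provinces."""
    counts = Counter(p for p in provinces if p)  # ignore missing locations
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total
```

With five listings in Madrid, three in Barcelona, and one each in Valencia and Sevilla, `concentration_ratio(..., k=2)` is 0.8, meaning the two largest provinces account for 80% of the sample.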
Seniority bias
Are junior, mid-level, and senior roles proportionally represented? Assessed by parsing seniority signals from job titles and descriptions.
Salary data availability bias
Does the ~69% salary disclosure rate (with ~31% missing) correlate with specific roles, locations, or sources? Analysis identifies whether missingness is random or systematic.
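To illustrate the random-versus-systematic question, the sketch below compares missing-salary rates across groups and computes a Pearson chi-square statistic for the missing-by-group contingency table by hand. It is an assumption about the approach, not the notebook's code; in practice `scipy.stats.chi2_contingency` would also return the p-value. The function and column names are hypothetical.

```python
from collections import Counter


def _count_missing(rows, group_col, value_col):
    """Count total and missing-value rows per group."""
    totals, missing = Counter(), Counter()
    for row in rows:
        group = row[group_col]
        totals[group] += 1
        if row.get(value_col) is None:
            missing[group] += 1
    return totals, missing


def missing_rate_by_group(rows, group_col, value_col="salary"):
    """Fraction of rows with a missing value, per group."""
    totals, missing = _count_missing(rows, group_col, value_col)
    return {group: missing[group] / totals[group] for group in totals}


def chi_square_statistic(rows, group_col, value_col="salary"):
    """Pearson chi-square statistic for independence of missingness and group."""
    totals, missing = _count_missing(rows, group_col, value_col)
    n = sum(totals.values())
    n_missing = sum(missing.values())
    stat = 0.0
    for group, size in totals.items():
        cells = ((missing[group], n_missing), (size - missing[group], n - n_missing))
        for observed, column_total in cells:
            expected = size * column_total / n
            if expected:
                stat += (observed - expected) ** 2 / expected
    return stat
```

Similar missing rates across groups and a statistic near zero are consistent with random missingness; large differences suggest that disclosure depends on the grouping variable.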
Data lineage table
The table below maps every input file to the output files it contributes to, tracing the complete lineage through the pipeline.
| Input file | Stage | Output file(s) |
|---|---|---|
| data_science_job_posts_2025.csv | Cleaning (02) | jobs_clean.csv, jobs_all_clean.csv |
| tecnoempleo_spain_2026.csv | Cleaning (02) | tecno_jobs_clean.csv, jobs_all_clean.csv |
| stackoverflow_2025_results.csv | Cleaning (02) | technology_rankings.csv, technologies_clean_long_format.csv |
| scraping_jobs_raw.csv (Adzuna) | Cleaning (02) | jobs_all_clean.csv |
| jobs_all_clean.csv | EDA (03) | jobs_eda.csv, in-notebook statistics |
| technologies_clean_long_format.csv | EDA (03) | technology_rankings.csv |
| jobs_eda.csv | Visualizations (04) | images/00_calidad_datos.png, images/01_distribucion_volumen.png, … |
| technology_rankings.csv | Visualizations (04) | Technology ranking chart PNGs |
| jobs_all_clean.csv | Bias analysis (05) | In-notebook analytical conclusions |
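Tracing a result back to its origin amounts to walking this table in reverse. The sketch below transcribes the rows that name concrete files (entries without a file name, such as in-notebook conclusions, are left out) and follows them back to the raw inputs. The `LINEAGE` structure and function names are hypothetical helpers, not part of the project.

```python
# (input file, stage, output files), transcribed from the lineage table above.
LINEAGE = [
    ("data_science_job_posts_2025.csv", "Cleaning (02)", ["jobs_clean.csv", "jobs_all_clean.csv"]),
    ("tecnoempleo_spain_2026.csv", "Cleaning (02)", ["tecno_jobs_clean.csv", "jobs_all_clean.csv"]),
    ("stackoverflow_2025_results.csv", "Cleaning (02)",
     ["technology_rankings.csv", "technologies_clean_long_format.csv"]),
    ("scraping_jobs_raw.csv", "Cleaning (02)", ["jobs_all_clean.csv"]),
    ("jobs_all_clean.csv", "EDA (03)", ["jobs_eda.csv"]),
    ("technologies_clean_long_format.csv", "EDA (03)", ["technology_rankings.csv"]),
    ("jobs_eda.csv", "Visualizations (04)",
     ["images/00_calidad_datos.png", "images/01_distribucion_volumen.png"]),
]


def build_parents(lineage):
    """Invert the table: map each output file to the inputs that feed it."""
    parents = {}
    for source, _stage, outputs in lineage:
        for output in outputs:
            parents.setdefault(output, set()).add(source)
    return parents


def trace_origins(target, lineage=LINEAGE):
    """Return the sorted raw inputs that a given file ultimately depends on."""
    parents = build_parents(lineage)
    origins, visited, stack = set(), set(), [target]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        upstream = parents.get(node)
        if upstream:
            stack.extend(upstream)
        else:
            origins.add(node)  # no recorded inputs, so this is a raw source
    return sorted(origins)
```

For instance, `trace_origins("images/00_calidad_datos.png")` walks back through jobs_eda.csv and jobs_all_clean.csv to the three job-listing sources.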
Running individual stages
Each stage can be re-run independently as long as its input files are present. This table shows the minimum files required to start from any given stage:
| Start from | Minimum files required |
|---|---|
| Stage 2 (Notebook 01) | Valid .env with Adzuna credentials |
| Stage 3 (Notebook 02) | data/raw/ — any combination of the 4 source CSVs |
| Stage 4 (Notebook 03) | data/clean/*.csv |
| Stage 5 (Notebook 04) | data/eda/*.csv (and selected data/clean/*.csv) |
| Stage 6 (Notebook 05) | data/clean/jobs_all_clean.csv only |
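Before re-running a stage, a small check like the one below could confirm that its inputs exist. It is a hypothetical helper, not part of the project: the glob patterns are transcribed from the table above, keyed by notebook number, and the check tests only for file presence (it cannot tell whether the credentials in .env are valid).

```python
from pathlib import Path

# Minimum inputs per notebook, transcribed from the table above.
STAGE_REQUIREMENTS = {
    "01": [".env"],
    "02": ["data/raw/*.csv"],
    "03": ["data/clean/*.csv"],
    "04": ["data/eda/*.csv", "data/clean/*.csv"],
    "05": ["data/clean/jobs_all_clean.csv"],
}


def missing_inputs(notebook, root="."):
    """Return the required patterns that match no file under the project root."""
    root = Path(root)
    return [
        pattern
        for pattern in STAGE_REQUIREMENTS[notebook]
        if not any(root.glob(pattern))
    ]
```

An empty list from `missing_inputs("03")` means notebook 03 has at least one cleaned CSV to work with; any returned pattern names an input that still needs to be produced by an earlier stage.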