Six-Stage Data Pipeline Reference: EDA Roles España

The project is built around a linear, reproducible pipeline that transforms raw job listing data from three static datasets and one live API into cleaned CSVs, analytical summaries, and chart images. Stage 1 is the set of data sources; Stages 2 through 6 are each implemented as a self-contained Jupyter notebook whose outputs feed the notebooks that follow. Understanding the full flow makes it straightforward to re-run individual stages, substitute data sources, or trace any result back to its origin.

Pipeline overview

┌─────────────────────────────────────────────────────────────────────┐
│                         DATA SOURCES (Stage 1)                      │
│  data_science_job_posts_2025.csv  ·  tecnoempleo_spain_2026.csv     │
│  stackoverflow_2025_results.csv   ·  Adzuna REST API (live)         │
└───────────────────────┬─────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────────┐
│              NOTEBOOK 01 — Data Collection (Stage 2)                │
│  Input : Adzuna API credentials (.env)                              │
│  Output: data/raw/scraping_jobs_raw.csv                             │
└───────────────────────┬─────────────────────────────────────────────┘
                        │ (optional — skip with existing CSVs)
                        ▼
┌─────────────────────────────────────────────────────────────────────┐
│                NOTEBOOK 02 — Cleaning (Stage 3)                     │
│  Input : data/raw/*.csv                                             │
│  Output: data/clean/*.csv (8+ files)                                │
└───────────────────────┬─────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────────┐
│                  NOTEBOOK 03 — EDA (Stage 4)                        │
│  Input : data/clean/*.csv                                           │
│  Output: data/eda/*.csv                                             │
└───────────────────────┬─────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────────┐
│             NOTEBOOK 04 — Visualizations (Stage 5)                  │
│  Input : data/eda/*.csv  (and data/clean/*.csv for some charts)     │
│  Output: images/*.png (8 visualization blocks)                      │
└───────────────────────┬─────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────────┐
│              NOTEBOOK 05 — Bias Analysis (Stage 6)                  │
│  Input : data/clean/jobs_all_clean.csv                              │
│  Output: analytical conclusions (in-notebook)                       │
└─────────────────────────────────────────────────────────────────────┘

Skipping Stage 2 (notebook 01): Notebooks 02 through 05 do not call the API; they run entirely from the files in data/raw/ and data/clean/. If you do not have Adzuna API credentials, you can skip notebook 01 and work with the three static datasets alone. See the Setup guide for details.

Stage 1 — Data sources

The pipeline ingests data from four sources. Three are static CSV files included with the project; one is a live REST API.

data_science_job_posts_2025.csv

Job postings dataset focused on data science roles. Static file included in data/raw/. Contains salary information and role descriptions.

tecnoempleo_spain_2026.csv

Spanish tech job listings scraped from TecnoEmpleo. Static file included in data/raw/. Provides Spain-specific location and technology data.

stackoverflow_2025_results.csv

Stack Overflow Developer Survey 2025 results. Static file included in data/raw/. Used for technology ranking and skill demand analysis.

Adzuna REST API

Live job listing API providing real-time data for Spain. Requires registration at developer.adzuna.com. Fetched by notebook 01 only.

Stage 2 — Notebook 01: Data collection

File: notebooks/01_data_collection.ipynb

Notebook 01 authenticates against the Adzuna API and performs paginated requests to collect job listings for data roles in Spain. The raw results are deduplicated and exported as a single CSV.

Property	Value
Input	`ADZUNA_APP_ID` and `ADZUNA_APP_KEY` from `.env`
Output	`data/raw/scraping_jobs_raw.csv`
Key library	`requests 2.34.2`

Process steps:

Authentication

Load API credentials from .env using python-dotenv. Construct the base request URL for the Adzuna Jobs API v1 search endpoint for Spain (/api/v1/jobs/es/search/).

Paginated API calls

Iterate through result pages, collecting job listings until all available results are exhausted. Each page returns up to 50 results.

Deduplication

Remove duplicate listings using Adzuna job IDs. Jobs with identical IDs appearing across pages are collapsed to a single row.

Raw CSV export

Write the deduplicated results to data/raw/scraping_jobs_raw.csv with all original API fields preserved (no cleaning at this stage).
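
The pagination, deduplication, and export steps above can be sketched as follows. This is a minimal illustration, not the notebook's code: `collect_listings`, `write_raw_csv`, and the `fetch_page` callable are hypothetical names, and `fetch_page` stands in for the authenticated HTTP request so the loop can be exercised without credentials. It assumes each listing carries an `id` field and that an empty page marks the end of the results.

```python
import csv
from typing import Callable, Iterable


def collect_listings(fetch_page: Callable[[int], list], max_pages: int = 1000) -> list:
    """Request pages 1, 2, ... until an empty page, keeping one row per job ID."""
    seen = {}
    for page in range(1, max_pages + 1):
        results = fetch_page(page)
        if not results:
            # Assumption: an empty page means the results are exhausted.
            break
        for job in results:
            # The first occurrence of an ID wins; later repeats are dropped.
            seen.setdefault(str(job["id"]), job)
    return list(seen.values())


def write_raw_csv(rows: Iterable[dict], path: str) -> None:
    """Write rows unchanged; the header is the union of all keys seen."""
    rows = list(rows)
    fieldnames = sorted({key for row in rows for key in row})
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

In a real run, `fetch_page` would wrap a `requests.get` call that sends the two credentials and returns the list of listings for that page. Passing the fetcher in as an argument keeps the loop testable offline.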

data/raw/ is listed in .gitignore. The raw CSV generated by this notebook is not committed to the repository. The directory ships with only a .gitkeep placeholder. You must run notebook 01 (with valid credentials) or obtain the CSV separately to populate this directory.

Stage 3 — Notebook 02: Cleaning

File: notebooks/02_cleaning.ipynb

Notebook 02 is the most transformation-intensive stage. It ingests all four source files, normalises each independently, then unifies them under a common schema.

Property	Value
Input	`data/raw/*.csv` (all four source files)
Output	`data/clean/*.csv` — 8+ files
Key libraries	`pandas 3.0.3`, `numpy 2.4.6`

Process steps:

Column normalisation

Rename all columns to English snake_case. Map heterogeneous field names (e.g., puesto, job_title, JobTitle) to a single canonical schema.

Deduplication

Remove exact duplicates within each dataset and cross-dataset duplicates in the unified file. Matching is performed on title + company + location combinations.

Skill extraction

Parse free-text job descriptions and requirements fields using keyword matching to populate a structured skills column and generate the long-format technologies file.

Salary parsing

Standardise salary fields: convert ranges to midpoints, normalise to annual EUR, and flag rows with missing or unparseable salary information.

Location cleaning

Normalise Spanish city names, resolve abbreviations, and map entries to canonical province names for consistent geographic analysis.

Validation and export

Run row-count, null-rate, and schema checks. Write each cleaned dataset to data/clean/ and the unified dataset to jobs_all_clean.csv.
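
Several of the cleaning steps above can be illustrated with a short, dependency-free sketch. Everything here is an assumption made for illustration: the function names, the canonical column names, the `SKILLS` keyword list, and the rule that `.` and `,` in salary text are thousands separators. Only the `puesto`, `job_title`, and `JobTitle` source names come from the text; the notebook itself presumably does the equivalent with pandas.

```python
import re

# Only puesto / job_title / JobTitle appear in the text; "title" is an assumed target.
COLUMN_MAP = {"puesto": "title", "job_title": "title", "JobTitle": "title"}
SKILLS = ("python", "sql", "spark")   # hypothetical keyword list
PERIODS = {"year": 1, "month": 12}    # assumption: 12 payments per year


def normalise_columns(row: dict) -> dict:
    """Map known names to the canonical schema; snake_case everything else."""
    return {COLUMN_MAP.get(k, k.strip().lower().replace(" ", "_")): v
            for k, v in row.items()}


def annual_salary_eur(text, period="year"):
    """Midpoint of a salary or salary range, annualised; None if unparseable."""
    numbers = [float(re.sub(r"[.,]", "", m))
               for m in re.findall(r"\d[\d.,]*", text or "")]
    if not numbers or len(numbers) > 2:
        return None
    return sum(numbers) / len(numbers) * PERIODS[period]


def extract_skills(description: str) -> list:
    """Whole-word keyword matching against the free-text description."""
    text = description.casefold()
    return [s for s in SKILLS if re.search(rf"\b{re.escape(s)}\b", text)]


def dedupe(rows: list) -> list:
    """Keep the first row for each title + company + location combination."""
    seen, unique = set(), []
    for row in rows:
        key = tuple((row.get(f) or "").strip().casefold()
                    for f in ("title", "company", "location"))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Rows for which `annual_salary_eur` returns `None` are the ones the salary step would flag as missing or unparseable.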

Stage 4 — Notebook 03: EDA

File: notebooks/03_eda.ipynb

Notebook 03 performs structured exploratory analysis on the cleaned data and exports aggregated summaries for use by the visualisations notebook.

Property	Value
Input	`data/clean/*.csv`
Output	`data/eda/*.csv`
Key libraries	`pandas 3.0.3`, `scipy 1.17.1`, `statsmodels 0.14.6`

Analysis blocks:

Structure analysis

Examine dataset shape (rows, columns), data types, and memory usage. Confirm all schema constraints from the cleaning stage were applied correctly.

Null analysis

Compute null rates per column across all cleaned files. Identify columns with >50% missing data and document them as known limitations. Output: null rate summary added to data/eda/jobs_eda.csv.

Distribution analysis

Compute frequency distributions for categorical columns (role, location, work mode, contract type) and descriptive statistics for numeric columns (salary). Identify skew and outliers.

Ranking generation

Aggregate technology mention counts from technologies_clean_long_format.csv and cross-reference with Stack Overflow survey data to produce ranked lists of demanded and used technologies. Output: data/clean/technology_rankings.csv.
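
The null analysis and ranking blocks above reduce to two small aggregations. The sketch below uses only the standard library so it runs anywhere; the function names and the `technology` column name are assumptions, and rows are plain dictionaries such as `csv.DictReader` would produce.

```python
from collections import Counter


def null_rates(rows: list) -> dict:
    """Share of rows in which each column is missing (None or empty string)."""
    if not rows:
        return {}
    columns = sorted({c for row in rows for c in row})
    return {c: sum(1 for r in rows if r.get(c) in (None, "")) / len(rows)
            for c in columns}


def high_null_columns(rates: dict, threshold: float = 0.5) -> list:
    """Columns whose null rate exceeds the threshold (>50% by default)."""
    return [c for c, rate in rates.items() if rate > threshold]


def rank_technologies(long_rows: list, column: str = "technology") -> list:
    """Count mentions in a long-format table and return them most-frequent first."""
    return Counter(r[column] for r in long_rows).most_common()
```

With pandas, the null rate per column is commonly computed as `df.isna().mean()`, and the mention counts as `value_counts()` on the technology column.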

Stage 5 — Notebook 04: Visualizations

File: notebooks/04-visualizations.ipynb

Notebook 04 produces all charts for the project. Plotly is used for interactive exploration; kaleido exports each chart as a static PNG for documentation and presentations.

Property	Value
Input	`data/eda/*.csv` and selected `data/clean/*.csv` files
Output	`images/*.png` — one file per visualisation block
Key libraries	`matplotlib 3.10.9`, `seaborn 0.13.2`, `plotly 6.7.0`, `kaleido 1.3.0`, `squarify 0.4.4`

Visualisation blocks:

File	Chart description
`00_calidad_datos.png`	Data quality summary — null rates and completeness by column
`01_distribucion_volumen.png`	Volume distribution — job count by source dataset
Additional blocks	Role frequency, location heatmap, salary box plots, technology rankings, work mode breakdown, skills treemap

Plotly charts are exported to PNG using kaleido:

import plotly.io as pio

# `fig` is a Plotly figure created earlier in the notebook; static export requires kaleido.
pio.write_image(fig, "images/01_distribucion_volumen.png", width=1200, height=600)

Stage 6 — Notebook 05: Bias analysis

File: notebooks/05_bias_analysis.ipynb

Notebook 05 applies a structured bias-identification framework to the unified dataset. The outputs are in-notebook conclusions rather than additional CSV files.

Property	Value
Input	`data/clean/jobs_all_clean.csv`
Output	In-notebook analytical conclusions
Key libraries	`pandas 3.0.3`, `scipy 1.17.1`

Bias types investigated:

Representation bias

Are certain roles (e.g., Data Engineer vs. Data Analyst) over- or under-represented relative to the actual job market? Quantified by comparing source-level frequencies.

Location bias

Do Madrid and Barcelona dominate to a degree that skews national conclusions? Measured by listing concentration ratios across provinces.
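
One simple way to express a listing concentration ratio is the share of listings held by the top N provinces. The sketch below is an illustration of that idea only; `concentration_ratio` is a hypothetical name, and the notebook may define its ratios differently.

```python
from collections import Counter


def concentration_ratio(provinces: list, top_n: int = 2) -> float:
    """Share of listings that fall in the top_n most frequent provinces."""
    counts = Counter(p for p in provinces if p)   # ignore missing locations
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(top_n)) / total
```

A value close to 1.0 with `top_n=2` would indicate that two provinces account for nearly all listings, which is the situation the location bias question is probing.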

Seniority bias

Are junior, mid-level, and senior roles proportionally represented? Assessed by parsing seniority signals from job titles and descriptions.
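
Parsing seniority signals can be approximated with ordered keyword patterns. The patterns, labels, and function names below are assumptions chosen for illustration, not the notebook's actual rules; the first matching pattern wins, so a title containing both "senior" and "junior" is labelled senior.

```python
import re
from collections import Counter

# Hypothetical patterns, checked in order.
SENIORITY_PATTERNS = [
    ("senior", r"\b(senior|sr|lead|principal)\b"),
    ("junior", r"\b(junior|jr|trainee|intern)\b"),
    ("mid", r"\b(mid|intermediate)\b"),
]


def seniority(title: str, description: str = "") -> str:
    """Return the first seniority label whose pattern matches the text."""
    text = f"{title} {description}".casefold()
    for label, pattern in SENIORITY_PATTERNS:
        if re.search(pattern, text):
            return label
    return "unspecified"


def seniority_shares(titles: list) -> dict:
    """Proportion of listings per seniority label."""
    counts = Counter(seniority(t) for t in titles)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

The share of `unspecified` listings is worth reporting alongside the other labels, since a large value limits how much the proportions can say.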

Salary data availability bias

Does the ~69% salary disclosure rate (with ~31% missing) correlate with specific roles, locations, or sources? Analysis identifies whether missingness is random or systematic.
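
Whether missingness is random or systematic can be checked by tabulating salary presence per group and computing a Pearson chi-square statistic on that table. The sketch below is one possible approach under stated assumptions: the function names and the `salary` and `source` column names are hypothetical, and the statistic is computed by hand to keep the example dependency-free.

```python
from collections import defaultdict


def disclosure_by_group(rows: list, group_key: str, salary_key: str = "salary") -> dict:
    """Return {group: [rows with salary, rows without salary]}."""
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        missing = row.get(salary_key) in (None, "")
        counts[row[group_key]][1 if missing else 0] += 1
    return dict(counts)


def chi_square_statistic(table: dict) -> float:
    """Pearson chi-square for a {group: [present, missing]} contingency table."""
    col_totals = [sum(v[i] for v in table.values()) for i in (0, 1)]
    grand = sum(col_totals)
    stat = 0.0
    for present_missing in table.values():
        row_total = sum(present_missing)
        for observed, col_total in zip(present_missing, col_totals):
            expected = row_total * col_total / grand
            if expected > 0:   # skip empty cells to avoid dividing by zero
                stat += (observed - expected) ** 2 / expected
    return stat
```

A statistic near zero is consistent with missingness that does not depend on the group. For a p-value, `scipy.stats.chi2_contingency` accepts the same table as a list of rows.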

Data lineage table

The table below maps every input file to the output files it contributes to, tracing the complete lineage through the pipeline.

Input file	Stage	Output file(s)
`data_science_job_posts_2025.csv`	Cleaning (02)	`jobs_clean.csv`, `jobs_all_clean.csv`
`tecnoempleo_spain_2026.csv`	Cleaning (02)	`tecno_jobs_clean.csv`, `jobs_all_clean.csv`
`stackoverflow_2025_results.csv`	Cleaning (02)	`technology_rankings.csv`, `technologies_clean_long_format.csv`
`scraping_jobs_raw.csv` (Adzuna)	Cleaning (02)	`jobs_all_clean.csv`
`jobs_all_clean.csv`	EDA (03)	`jobs_eda.csv`, in-notebook statistics
`technologies_clean_long_format.csv`	EDA (03)	`technology_rankings.csv`
`jobs_eda.csv`	Visualizations (04)	`images/00_calidad_datos.png`, `images/01_distribucion_volumen.png`, …
`technology_rankings.csv`	Visualizations (04)	Technology ranking chart PNGs
`jobs_all_clean.csv`	Bias analysis (05)	In-notebook analytical conclusions

Running individual stages

Each stage can be re-run independently as long as its input files are present. This table shows the minimum files required to start from any given stage:

Start from	Minimum files required
Stage 2 (Notebook 01)	Valid `.env` with Adzuna credentials
Stage 3 (Notebook 02)	`data/raw/` — any combination of the 4 source CSVs
Stage 4 (Notebook 03)	`data/clean/*.csv`
Stage 5 (Notebook 04)	`data/eda/*.csv` (and selected `data/clean/*.csv`)
Stage 6 (Notebook 05)	`data/clean/jobs_all_clean.csv` only

The fastest way to explore the analysis without any setup is to start at Stage 4 (Notebook 03). The data/clean/ directory is committed to the repository with all pre-processed CSVs, so no credentials or prior notebook execution are needed.
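
Before re-running a single stage, the minimum-file table above can be turned into a quick pre-flight check. This is a sketch only: the `REQUIRED` dictionary keys and the `missing_inputs` helper are hypothetical, while the glob patterns are copied from the table.

```python
from pathlib import Path

# Patterns from the minimum-files table; the stage keys are illustrative labels.
REQUIRED = {
    "notebook_02": ["data/raw/*.csv"],
    "notebook_03": ["data/clean/*.csv"],
    "notebook_04": ["data/eda/*.csv"],
    "notebook_05": ["data/clean/jobs_all_clean.csv"],
}


def missing_inputs(stage: str, root: str = ".") -> list:
    """Return the required patterns that match no file under the project root."""
    base = Path(root)
    return [pattern for pattern in REQUIRED[stage]
            if not any(base.glob(pattern))]
```

An empty list from `missing_inputs("notebook_03")` means at least one cleaned CSV is present and the EDA notebook has something to read.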
