The project is built around a linear, reproducible pipeline that transforms raw job listing data from three static datasets and one live API into cleaned CSVs, analytical summaries, and chart images. Each stage is implemented as a self-contained Jupyter notebook, and every notebook’s outputs become the next notebook’s inputs. Understanding the full flow makes it straightforward to re-run individual stages, substitute data sources, or trace any result back to its origin.

Pipeline overview

┌─────────────────────────────────────────────────────────────────────┐
│                         DATA SOURCES (Stage 1)                      │
│  data_science_job_posts_2025.csv  ·  tecnoempleo_spain_2026.csv     │
│  stackoverflow_2025_results.csv   ·  Adzuna REST API (live)         │
└───────────────────────┬─────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────────┐
│              NOTEBOOK 01 — Data Collection (Stage 2)                │
│  Input : Adzuna API credentials (.env)                              │
│  Output: data/raw/scraping_jobs_raw.csv                             │
└───────────────────────┬─────────────────────────────────────────────┘
                        │ (optional: skip if the raw CSVs already exist)
                        ▼
┌─────────────────────────────────────────────────────────────────────┐
│                NOTEBOOK 02 — Cleaning (Stage 3)                     │
│  Input : data/raw/*.csv                                             │
│  Output: data/clean/*.csv (8+ files)                                │
└───────────────────────┬─────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────────┐
│                  NOTEBOOK 03 — EDA (Stage 4)                        │
│  Input : data/clean/*.csv                                           │
│  Output: data/eda/*.csv                                             │
└───────────────────────┬─────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────────┐
│             NOTEBOOK 04 — Visualizations (Stage 5)                  │
│  Input : data/eda/*.csv  (and data/clean/*.csv for some charts)     │
│  Output: images/*.png (8 visualization blocks)                      │
└───────────────────────┬─────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────────┐
│              NOTEBOOK 05 — Bias Analysis (Stage 6)                  │
│  Input : data/clean/jobs_all_clean.csv                              │
│  Output: analytical conclusions (in-notebook)                       │
└─────────────────────────────────────────────────────────────────────┘

Skipping Stage 2 (notebook 01): Notebooks 02 through 05 read only from the data/raw/ and data/clean/ directories; the Adzuna API is called by notebook 01 alone. If you do not have Adzuna API credentials, you can skip notebook 01 and work entirely with the three static datasets. See the Setup guide for details.

Stage 1 — Data sources

The pipeline ingests data from four sources. Three are static CSV files included with the project; one is a live REST API.

data_science_job_posts_2025.csv

Job postings dataset focused on data science roles. Static file included in data/raw/. Contains salary information and role descriptions.

tecnoempleo_spain_2026.csv

Spanish tech job listings scraped from TecnoEmpleo. Static file included in data/raw/. Provides Spain-specific location and technology data.

stackoverflow_2025_results.csv

Stack Overflow Developer Survey 2025 results. Static file included in data/raw/. Used for technology ranking and skill demand analysis.

Adzuna REST API

Live job listing API providing real-time data for Spain. Requires registration at developer.adzuna.com. Fetched by notebook 01 only.

Stage 2 — Notebook 01: Data collection

File: notebooks/01_data_collection.ipynb

Notebook 01 authenticates against the Adzuna API and performs paginated requests to collect job listings for data roles in Spain. The raw results are deduplicated and exported as a single CSV.

| Property | Value |
| --- | --- |
| Input | ADZUNA_APP_ID and ADZUNA_APP_KEY from .env |
| Output | data/raw/scraping_jobs_raw.csv |
| Key library | requests 2.34.2 |

Process steps:

1. Authentication: Load API credentials from .env using python-dotenv. Construct the base request URL for the Adzuna Jobs API v1 search endpoint for Spain (/v1/api/jobs/es/search/).
2. Paginated API calls: Iterate through result pages, collecting job listings until all available results are exhausted. Each page returns up to 50 results.
3. Deduplication: Remove duplicate listings using Adzuna job IDs. Jobs with identical IDs appearing across pages are collapsed to a single row.
4. Raw CSV export: Write the deduplicated results to data/raw/scraping_jobs_raw.csv with all original API fields preserved (no cleaning at this stage).

data/raw/ is listed in .gitignore. The raw CSV generated by this notebook is not committed to the repository. The directory ships with only a .gitkeep placeholder. You must run notebook 01 (with valid credentials) or obtain the CSV separately to populate this directory.
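
The four collection steps can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function names, the `what` search term, and the `max_pages` safety cap are assumptions, and the URL layout follows Adzuna's public search endpoint as understood here. The page fetcher is passed in as a callable so the pagination and deduplication logic can be exercised without network access.

```python
import csv
from typing import Callable

PAGE_SIZE = 50  # page size stated in the steps above


def fetch_adzuna_page(page: int, app_id: str, app_key: str) -> list[dict]:
    """Fetch one result page for Spain (assumed URL layout and query)."""
    import requests  # imported lazily so the helpers below also work offline

    url = f"https://api.adzuna.com/v1/api/jobs/es/search/{page}"
    params = {
        "app_id": app_id,
        "app_key": app_key,
        "results_per_page": PAGE_SIZE,
        "what": "data",  # hypothetical search term
    }
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("results", [])


def collect_listings(
    fetch_page: Callable[[int], list[dict]], max_pages: int = 100
) -> list[dict]:
    """Walk pages from 1 until an empty page, keeping one row per job id."""
    seen: dict[str, dict] = {}
    for page in range(1, max_pages + 1):
        results = fetch_page(page)
        if not results:
            break  # no more results available
        for job in results:
            # The first occurrence of an id wins; later repeats are ignored.
            seen.setdefault(str(job["id"]), job)
    return list(seen.values())


def write_raw_csv(rows: list[dict], path: str) -> None:
    """Write rows unchanged, using the union of all keys as the header."""
    fieldnames = sorted({key for row in rows for key in row})
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

In a notebook, the credentials would typically come from `load_dotenv()` followed by `os.environ["ADZUNA_APP_ID"]` and `os.environ["ADZUNA_APP_KEY"]`, and the fetcher would be wired in as `collect_listings(lambda page: fetch_adzuna_page(page, app_id, app_key))`.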

Stage 3 — Notebook 02: Cleaning

File: notebooks/02_cleaning.ipynb

Notebook 02 is the most transformation-intensive stage. It ingests all four source files, normalises each independently, then unifies them under a common schema.

| Property | Value |
| --- | --- |
| Input | data/raw/*.csv (all four source files) |
| Output | data/clean/*.csv (8+ files) |
| Key libraries | pandas 3.0.3, numpy 2.4.6 |

Process steps:

1. Column normalisation: Rename all columns to English snake_case. Map heterogeneous field names (e.g., puesto, job_title, JobTitle) to a single canonical schema.
2. Deduplication: Remove exact duplicates within each dataset and cross-dataset duplicates in the unified file. Matching is performed on title + company + location combinations.
3. Skill extraction: Parse free-text job descriptions and requirements fields using keyword matching to populate a structured skills column and generate the long-format technologies file.
4. Salary parsing: Standardise salary fields: convert ranges to midpoints, normalise to annual EUR, and flag rows with missing or unparseable salary information.
5. Location cleaning: Normalise Spanish city names, resolve abbreviations, and map entries to canonical province names for consistent geographic analysis.
6. Validation and export: Run row-count, null-rate, and schema checks. Write each cleaned dataset to data/clean/ and the unified dataset to jobs_all_clean.csv.
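
A dependency-free sketch of four of these steps (column normalisation, deduplication, skill extraction, and range-to-midpoint salary parsing) is given below. The alias map, the keyword list, the `company` and `location` field names, and the assumed salary text format (for example "30.000 - 40.000 €") are all illustrative assumptions rather than the notebook's real configuration. Conversion to annual EUR and location cleaning are not covered.

```python
import re

# Hypothetical mapping from source-specific headers to one canonical name.
COLUMN_ALIASES = {"puesto": "job_title", "JobTitle": "job_title"}

# Hypothetical keyword vocabulary for skill extraction.
SKILL_KEYWORDS = ["python", "sql", "spark", "power bi", "aws"]


def normalise_columns(row: dict) -> dict:
    """Rename known aliases; fall back to a simple lower snake_case."""
    return {
        COLUMN_ALIASES.get(key, key.strip().lower().replace(" ", "_")): value
        for key, value in row.items()
    }


def deduplicate(rows: list[dict]) -> list[dict]:
    """Keep the first row for each title + company + location combination."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(
            str(row.get(field, "")).strip().lower()
            for field in ("job_title", "company", "location")
        )
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def extract_skills(description: str) -> list[str]:
    """Return the keywords that appear as whole words in the text."""
    text = description.lower()
    return [
        keyword
        for keyword in SKILL_KEYWORDS
        if re.search(rf"(?<!\w){re.escape(keyword)}(?!\w)", text)
    ]


def salary_midpoint(text):
    """Return the midpoint of the first two amounts, or None if unparseable.

    Assumes '.' and ',' are thousands separators, which is wrong for
    values that carry decimals.
    """
    if not text:
        return None
    amounts = [int(re.sub(r"[.,]", "", m)) for m in re.findall(r"\d[\d.,]*", text)]
    if not amounts:
        return None  # caller can flag the row as missing salary
    bounds = amounts[:2]
    return sum(bounds) / len(bounds)
```

Returning `None` instead of raising keeps unparseable salaries in the dataset, so a later step can flag them as described in the salary parsing step.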

Stage 4 — Notebook 03: EDA

File: notebooks/03_eda.ipynb

Notebook 03 performs structured exploratory analysis on the cleaned data and exports aggregated summaries for use by the visualisations notebook.

| Property | Value |
| --- | --- |
| Input | data/clean/*.csv |
| Output | data/eda/*.csv |
| Key libraries | pandas 3.0.3, scipy 1.17.1, statsmodels 0.14.6 |

Analysis blocks:

Examine dataset shape (rows, columns), data types, and memory usage. Confirm all schema constraints from the cleaning stage were applied correctly.
Compute null rates per column across all cleaned files. Identify columns with >50% missing data and document them as known limitations. Output: null rate summary added to data/eda/jobs_eda.csv.
Compute frequency distributions for categorical columns (role, location, work mode, contract type) and descriptive statistics for numeric columns (salary). Identify skew and outliers.
Aggregate technology mention counts from technologies_clean_long_format.csv and cross-reference with Stack Overflow survey data to produce ranked lists of demanded and used technologies. Output: data/clean/technology_rankings.csv.
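
The null-rate and technology-ranking blocks might look roughly like the pandas sketch below. The function names and the `technology` column name are assumptions, and the cross-reference against the Stack Overflow survey is left out.

```python
import pandas as pd


def null_rate_summary(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Share of missing values per column, flagging columns above the threshold."""
    summary = df.isna().mean().rename("null_rate").to_frame()
    summary["over_threshold"] = summary["null_rate"] > threshold
    return summary.sort_values("null_rate", ascending=False)


def rank_technologies(long_df: pd.DataFrame, column: str = "technology") -> pd.DataFrame:
    """Count mentions in a long-format table (one row per mention) and rank them."""
    counts = long_df[column].value_counts()
    ranking = counts.rename_axis(column).reset_index(name="mentions")
    # method="min" gives tied technologies the same (best) rank.
    ranking["rank"] = ranking["mentions"].rank(method="min", ascending=False).astype(int)
    return ranking
```

With the default threshold of 0.5, `over_threshold` marks exactly the columns with more than 50% missing data that the analysis documents as known limitations.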

Stage 5 — Notebook 04: Visualizations

File: notebooks/04-visualizations.ipynb

Notebook 04 produces all charts for the project. Plotly is used for interactive exploration; kaleido exports each chart as a static PNG for documentation and presentations.

| Property | Value |
| --- | --- |
| Input | data/eda/*.csv and selected data/clean/*.csv files |
| Output | images/*.png (one file per visualisation block) |
| Key libraries | matplotlib 3.10.9, seaborn 0.13.2, plotly 6.7.0, kaleido 1.3.0, squarify 0.4.4 |

Visualisation blocks:

| File | Chart description |
| --- | --- |
| 00_calidad_datos.png | Data quality summary: null rates and completeness by column |
| 01_distribucion_volumen.png | Volume distribution: job count by source dataset |
| Additional blocks | Role frequency, location heatmap, salary box plots, technology rankings, work mode breakdown, skills treemap |

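Plotly charts are exported to PNG using kaleido:

```python
import plotly.io as pio

# `fig` is a Plotly figure built earlier in the notebook.
# write_image relies on the kaleido package to render the static PNG.
fig.write_image("images/01_distribucion_volumen.png", width=1200, height=600)
```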

Stage 6 — Notebook 05: Bias analysis

File: notebooks/05_bias_analysis.ipynb

Notebook 05 applies a structured bias-identification framework to the unified dataset. The outputs are in-notebook conclusions rather than additional CSV files.

| Property | Value |
| --- | --- |
| Input | data/clean/jobs_all_clean.csv |
| Output | In-notebook analytical conclusions |
| Key libraries | pandas 3.0.3, scipy 1.17.1 |

Bias types investigated:

Representation bias

Are certain roles (e.g., Data Engineer vs. Data Analyst) over- or under-represented relative to the actual job market? Quantified by comparing source-level frequencies.

Location bias

Do Madrid and Barcelona dominate to a degree that skews national conclusions? Measured by listing concentration ratios across provinces.

Seniority bias

Are junior, mid-level, and senior roles proportionally represented? Assessed by parsing seniority signals from job titles and descriptions.

Salary data availability bias

Does the ~69% salary disclosure rate (with ~31% missing) correlate with specific roles, locations, or sources? Analysis identifies whether missingness is random or systematic.
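
Two of these checks lend themselves to a short sketch: a concentration ratio for location bias, and a chi-square test of independence for salary missingness. The column names (`province`, `source`, `salary`) and the choice of a chi-square test are assumptions made for illustration; the notebook may use different fields or a different test.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def concentration_ratio(df, column="province", top=("Madrid", "Barcelona")):
    """Share of listings located in the `top` provinces."""
    return float(df[column].isin(list(top)).mean())


def salary_missingness_test(df, group_col="source", salary_col="salary"):
    """Return per-group disclosure rates and the chi-square p-value.

    A small p-value suggests missingness depends on the group, that is,
    salaries are not missing at random with respect to `group_col`.
    """
    is_missing = df[salary_col].isna()
    table = pd.crosstab(df[group_col], is_missing)
    _, p_value, _, _ = chi2_contingency(table)
    disclosure = 1 - is_missing.groupby(df[group_col]).mean()
    return disclosure, p_value
```

The same test can be repeated with role or province as `group_col` to see which dimensions the missingness is associated with.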

Data lineage table

The table below maps every input file to the output files it contributes to, tracing the complete lineage through the pipeline.
| Input file | Stage | Output file(s) |
| --- | --- | --- |
| data_science_job_posts_2025.csv | Cleaning (02) | jobs_clean.csv, jobs_all_clean.csv |
| tecnoempleo_spain_2026.csv | Cleaning (02) | tecno_jobs_clean.csv, jobs_all_clean.csv |
| stackoverflow_2025_results.csv | Cleaning (02) | technology_rankings.csv, technologies_clean_long_format.csv |
| scraping_jobs_raw.csv (Adzuna) | Cleaning (02) | jobs_all_clean.csv |
| jobs_all_clean.csv | EDA (03) | jobs_eda.csv, in-notebook statistics |
| technologies_clean_long_format.csv | EDA (03) | technology_rankings.csv |
| jobs_eda.csv | Visualizations (04) | images/00_calidad_datos.png, images/01_distribucion_volumen.png, … |
| technology_rankings.csv | Visualizations (04) | Technology ranking chart PNGs |
| jobs_all_clean.csv | Bias analysis (05) | In-notebook analytical conclusions |
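
To trace a result back to its raw sources programmatically, the lineage can be encoded as a small mapping and walked recursively. The sketch below transcribes only the rows of the table that name concrete files; `LINEAGE` and `trace_origins` are hypothetical names introduced here, not part of the project.

```python
# Output file -> the input files it is derived from, per the lineage table.
LINEAGE = {
    "jobs_clean.csv": ["data_science_job_posts_2025.csv"],
    "tecno_jobs_clean.csv": ["tecnoempleo_spain_2026.csv"],
    "jobs_all_clean.csv": [
        "data_science_job_posts_2025.csv",
        "tecnoempleo_spain_2026.csv",
        "scraping_jobs_raw.csv",
    ],
    "technologies_clean_long_format.csv": ["stackoverflow_2025_results.csv"],
    "technology_rankings.csv": [
        "stackoverflow_2025_results.csv",
        "technologies_clean_long_format.csv",
    ],
    "jobs_eda.csv": ["jobs_all_clean.csv"],
    "images/00_calidad_datos.png": ["jobs_eda.csv"],
    "images/01_distribucion_volumen.png": ["jobs_eda.csv"],
}


def trace_origins(output_file: str) -> set[str]:
    """Return the raw source files that an output ultimately depends on."""
    parents = LINEAGE.get(output_file)
    if not parents:
        return {output_file}  # no recorded inputs: treat as a raw source
    origins: set[str] = set()
    for parent in parents:
        origins |= trace_origins(parent)
    return origins
```

For example, `trace_origins("images/00_calidad_datos.png")` walks through jobs_eda.csv and jobs_all_clean.csv and returns the two static job datasets plus the Adzuna raw CSV.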

Running individual stages

Each stage can be re-run independently as long as its input files are present. This table shows the minimum files required to start from any given stage:
| Start from | Minimum files required |
| --- | --- |
| Stage 2 (Notebook 01) | Valid .env with Adzuna credentials |
| Stage 3 (Notebook 02) | data/raw/ with any combination of the 4 source CSVs |
| Stage 4 (Notebook 03) | data/clean/*.csv |
| Stage 5 (Notebook 04) | data/eda/*.csv (and selected data/clean/*.csv) |
| Stage 6 (Notebook 05) | data/clean/jobs_all_clean.csv only |

The fastest way to explore the analysis without any setup is to start at Stage 4 (Notebook 03). The data/clean/ directory is committed to the repository with all pre-processed CSVs, so no credentials or prior notebook execution are needed.
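
Before re-running a stage, a quick check that its inputs exist can save a failed notebook run. The helper below is a sketch under the assumption that it is called from the repository root; `STAGE_INPUTS` and `missing_inputs` are hypothetical names, and the check only confirms that files exist, not that .env holds valid credentials.

```python
from pathlib import Path

# Notebook number -> glob patterns that must each match at least one file.
STAGE_INPUTS = {
    "01": [".env"],
    "02": ["data/raw/*.csv"],
    "03": ["data/clean/*.csv"],
    "04": ["data/eda/*.csv"],
    "05": ["data/clean/jobs_all_clean.csv"],
}


def missing_inputs(notebook: str, root: str = ".") -> list[str]:
    """Return the required patterns that match no file under `root`."""
    root_path = Path(root)
    return [
        pattern
        for pattern in STAGE_INPUTS[notebook]
        if not any(root_path.glob(pattern))
    ]
```

An empty list from `missing_inputs("03")` means notebook 03 has what it needs to start.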
