The visualisations notebook (04-visualizations.ipynb) is the deliverable-facing layer of the project. It was produced for the client DataTalent Solutions S.L. and translates the cleaned, EDA-enriched datasets into publication-quality charts, statistical tests, and interactive dashboards. The notebook is version 2.0 and is designed to consume the post-EDA data from data/eda/, with an automatic fallback that generates a realistic simulated dataset if no real data is available.

Libraries

import os, warnings
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import matplotlib.gridspec as gridspec
from matplotlib.colors import LinearSegmentedColormap
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy import stats
from scipy.stats import kruskal, mannwhitneyu

try:
    import squarify           # treemap charts — optional
    HAS_SQUARIFY = True
except ImportError:
    HAS_SQUARIFY = False

try:
    import statsmodels.api as sm   # OLS regression + Q-Q plots — optional
    HAS_STATSMODELS = True
except ImportError:
    HAS_STATSMODELS = False
squarify and statsmodels are optional. If squarify is not installed, treemap charts fall back to horizontal bar charts. If statsmodels is absent, OLS trendlines in Plotly scatter plots are disabled and scipy is used for Q-Q plots instead.

Data loading and fallback behaviour

The notebook resolves its data source in priority order:

  1. Look for post-EDA data. The notebook checks for data/eda/jobs_eda.csv (produced by 03_eda.ipynb). If the file is found, it uses this validated and enriched version of the dataset.
  2. Fall back to clean data. If data/eda/ does not exist or the file is missing, the notebook tries data/clean/jobs_all_clean.csv instead and prints a warning recommending that 03_eda.ipynb be run first.
  3. Generate a simulated dataset. If neither location has data, simular_dataset(n=600, seed=42) generates a 600-record synthetic dataset with realistic distributions of roles, cities, modalities, seniority levels, sectors, salary ranges, and binary skill columns. This allows the notebook to run end-to-end for development and demo purposes without any real data.

def simular_dataset(n=600, seed=42):
    np.random.seed(seed)
    roles    = ['Data Analyst', 'Data Scientist', 'Data Engineer',
                'BI Analyst', 'ML Engineer']
    ciudades = ['Madrid', 'Barcelona', 'Remoto', 'Valencia',
                'Sevilla', 'Bilbao', 'Málaga']
    modos    = ['Presencial', 'Híbrido', 'Remoto']
    # ... salary ranges vary by seniority (Junior 18-32k, Mid 30-52k, Senior 48-85k)
    return df
When DATOS_REALES = False, all charts display the simulated data. Figures and statistics in the exported PNG files will not reflect the real job market. Always check the console output for the DATOS_REALES flag before interpreting exported images.
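
The three-step resolution described above could be sketched roughly as follows. This is a minimal illustration, not the notebook's own code: `resolve_data_source` and the `CANDIDATES` list are hypothetical names, and the way `DATOS_REALES` is derived from the result is an assumption.

```python
from pathlib import Path

# Candidate files in priority order, as listed in the loading steps.
CANDIDATES = [
    ("eda", Path("data/eda/jobs_eda.csv")),
    ("clean", Path("data/clean/jobs_all_clean.csv")),
]


def resolve_data_source(root="."):
    """Return (path, source) for the first candidate file that exists.

    Falls back to (None, "simulated") when neither file is present.
    Hypothetical helper, written only to illustrate the priority order.
    """
    root = Path(root)
    for source, relative in CANDIDATES:
        candidate = root / relative
        if candidate.is_file():
            if source == "clean":
                print("Warning: post-EDA data not found; run 03_eda.ipynb first.")
            return candidate, source
    return None, "simulated"


path, source = resolve_data_source()
DATOS_REALES = source != "simulated"  # assumed meaning of the flag
print(f"source={source}, DATOS_REALES={DATOS_REALES}")
```

When a path is returned it can be passed to `pd.read_csv`. When the result is `(None, "simulated")`, the caller would invoke `simular_dataset(n=600, seed=42)` instead.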

Analysis blocks

| Block | Title | Content | Output image |
| --- | --- | --- | --- |
| 0 | Configuration & Data Quality | Corporate palette setup, null heatmap, data-type distribution | 00_calidad_datos.png |
| 1 | Vacancy Distribution & Volume | Offers by city (bar), work modality (donut), seniority (bar); treemap of roles per city | 01_distribucion_volumen.png, 01b_treemap_roles_ciudad.png |
| 2 | Salary & Compensation Analysis | Boxplot by seniority (Kruskal-Wallis test), violin by modality, salary by city (± σ), salary histogram with normal curve | 02_analisis_salarial.png, 02b_salario_rol.png |
| 3 | Heatmaps & Correlations | Skill co-occurrence heatmap, skill × seniority heatmap, numerical correlation matrix | 03_heatmaps_correlaciones.png |
| 4 | Tech Stack & Used vs Wanted Gap | Lollipop Used vs Wanted gap chart, stacked bar by technology category, dot plot of most demanded skills | 04_tecnologias.png |
| 5 | Advanced Statistical Analysis | Q-Q plots, OLS regression (experience → salary), Mann-Whitney U test (remote vs on-site salary), percentile scatter by role | 05_estadistica_avanzada.png |
| 6 | Interactive Visualisations (Plotly) | Box chart: salary × seniority × role; scatter: experience vs salary with OLS trendline; sunburst: city → modality → role; interactive heatmap: sector × city salary | (rendered in notebook) |
| 7 | Executive Summary & Export | 9-KPI panel card grid; file inventory | 07_panel_kpis.png |

Block details

Block 0: Configuration & Data Quality

Establishes the corporate colour palette (PALETA) used consistently across all charts:

| Key | Hex | Usage |
| --- | --- | --- |
| primary | #1A365D | Main bars, titles, annotations |
| secondary | #2B6CB0 | Secondary bars, pie slices |
| accent | #4299E1 | Highlights, IQR markers |
| warm | #ED8936 | Salary histograms, error bars |
| success | #48BB78 | Junior category, positive indicators |
| muted | #A0AEC0 | Subdued labels |

The quality dashboard shows a null-value heatmap and a pie chart of column data types. Output: 00_calidad_datos.png.

Block 1: Vacancy Distribution & Volume

Three charts in a single figure:
  1. Top 8 cities: horizontal bar chart of offer count per city_clean.
  2. Work modality: donut chart showing the percentage split of remote_modality (Presencial / Híbrido / Remoto).
  3. Seniority: annotated bar chart with count and percentage per level (Junior / Mid / Senior).
A fourth chart, a treemap of roles for the top 4 cities, is produced separately as 01b_treemap_roles_ciudad.png. If squarify is not installed, this chart falls back to a horizontal bar chart.

Block 2: Salary & Compensation Analysis

All salary charts exclude statistical outliers (flagged in salary_clean_outlier). The four sub-charts are:
  • Boxplot by seniority: includes a Kruskal-Wallis p-value annotation to show whether salary differences between Junior, Mid and Senior are statistically significant.
  • Violin by modality: shows the full shape of the salary distribution for each work modality.
  • Mean salary by city (±σ): horizontal bar chart with standard-deviation error bars; only cities with ≥ 5 salary observations are included.
  • Salary histogram: overlays a normal-distribution curve, a median line, and a mean line for visual assessment of skewness.
A separate chart (02b_salario_rol.png) overlays a strip plot on boxplots grouped by job_title.
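
The Kruskal-Wallis annotation on the seniority boxplot can be reproduced along these lines. This is a sketch under stated assumptions: the salaries are synthetic, drawn from the ranges quoted for the simulated dataset, the column name `seniority` is assumed, and `salary_clean_outlier` is assumed to be a boolean flag.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

rng = np.random.default_rng(42)

# Illustrative salaries only, drawn from the ranges of the simulated dataset.
ranges = {"Junior": (18_000, 32_000), "Mid": (30_000, 52_000), "Senior": (48_000, 85_000)}
df = pd.concat(
    [
        pd.DataFrame({"seniority": level, "salary_clean": rng.uniform(lo, hi, 50)})
        for level, (lo, hi) in ranges.items()
    ],
    ignore_index=True,
)
df["salary_clean_outlier"] = False  # assumed boolean flag column

# Drop flagged outliers and missing salaries before testing.
valid = df[~df["salary_clean_outlier"]].dropna(subset=["salary_clean"])
groups = [g["salary_clean"].to_numpy() for _, g in valid.groupby("seniority")]

# Rank-based test of whether the groups share the same distribution.
h_stat, p_value = kruskal(*groups)
print(f"Kruskal-Wallis H = {h_stat:.1f}, p = {p_value:.2e}")
```

A p-value below the chosen significance level (commonly 0.05) indicates that at least one seniority group differs from the others. It does not say which pair differs.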

Block 3: Heatmaps & Correlations

Skill co-occurrence and correlation charts help identify which technical skills tend to appear together in job offers and which skills correlate with higher salary levels. This block requires binary skill columns, which are present in the simulated data and derived from job_skills_long for real data.
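
With binary skill columns, a co-occurrence matrix is a single matrix product. The sketch below uses a toy frame with hypothetical column names (`python`, `sql`, `aws`); the real skill columns may be named differently.

```python
import pandas as pd

# Toy offers with hypothetical binary skill columns (1 = skill mentioned).
skills = pd.DataFrame(
    {
        "python": [1, 1, 0, 1],
        "sql":    [1, 0, 1, 1],
        "aws":    [0, 1, 0, 1],
    }
)

# Entry (i, j) counts the offers that mention both skill i and skill j;
# the diagonal holds each skill's total count.
cooccurrence = skills.T.dot(skills)
print(cooccurrence)
```

The resulting square DataFrame can be passed directly to `sns.heatmap(cooccurrence, annot=True)` to draw the chart.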

Block 4: Tech Stack & Used vs Wanted Gap

Loads technology_rankings_used and technology_rankings_wanted (from data/eda/ or data/clean/). The lollipop gap chart shows technologies where the "wanted" percentage exceeds the "used" percentage. A positive gap indicates that interest in a technology outpaces its current adoption. The positive gap for AWS and Docker suggests emerging demand for cloud and containerisation skills.
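
The gap itself is a simple difference of percentages. The following sketch assumes each ranking has a `technology` column and a `pct` column; both names are assumptions, and the numbers are made up for illustration rather than taken from the project.

```python
import pandas as pd

# Made-up percentages for illustration; not results from the project.
used = pd.DataFrame({"technology": ["Python", "SQL", "AWS", "Docker"],
                     "pct": [60.0, 55.0, 30.0, 25.0]})
wanted = pd.DataFrame({"technology": ["Python", "SQL", "AWS", "Docker"],
                       "pct": [58.0, 50.0, 38.0, 34.0]})

# Align the two rankings on technology name.
gap = used.merge(wanted, on="technology", suffixes=("_used", "_wanted"))
gap["gap"] = gap["pct_wanted"] - gap["pct_used"]

# Keep technologies whose "wanted" share exceeds their "used" share.
positive = gap[gap["gap"] > 0].sort_values("gap", ascending=False)
print(positive[["technology", "gap"]])
```

Sorting by gap gives the order in which the lollipops would typically be drawn, largest gap first.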

Block 5: Advanced Statistical Analysis

Includes four statistical charts (graphs 18–21):
  • Q-Q plots: assess salary normality per seniority group.
  • OLS regression: experience_years as predictor of salary_clean; rendered with statsmodels if available, otherwise disabled.
  • Mann-Whitney U test: compares remote vs on-site salary distributions without assuming normality.
  • Percentile scatter by role: shows P10, P25, median, P75 and P90 salary per job_title for a compact comparison between roles.
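
Two of the computations in this block, the Mann-Whitney U test and the per-role percentiles, can be sketched as below. The data is synthetic, and treating "Remoto" and "Presencial" as the two compared groups is an assumption based on the modality labels used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic salaries for illustration only.
df = pd.DataFrame(
    {
        "job_title": rng.choice(["Data Analyst", "Data Engineer"], size=200),
        "remote_modality": rng.choice(["Remoto", "Presencial"], size=200),
        "salary_clean": rng.normal(40_000, 8_000, size=200),
    }
)

remote = df.loc[df["remote_modality"] == "Remoto", "salary_clean"]
onsite = df.loc[df["remote_modality"] == "Presencial", "salary_clean"]

# Two-sided rank-based test; no normality assumption is needed.
u_stat, p_value = mannwhitneyu(remote, onsite, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.0f}, p = {p_value:.3f}")

# P10, P25, median, P75 and P90 per role, one column per percentile.
percentiles = (
    df.groupby("job_title")["salary_clean"]
    .quantile([0.10, 0.25, 0.50, 0.75, 0.90])
    .unstack()
)
print(percentiles.round(0))
```

Each row of `percentiles` corresponds to one role, which maps naturally onto one horizontal line of markers in a percentile scatter.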

Block 6: Interactive Visualisations (Plotly)

Four fully interactive charts rendered directly in the notebook (not exported as PNG):
  • Chart 22 (px.box): salary distribution by seniority, coloured by role. Supports zoom and hover.
  • Chart 23 (px.scatter): experience vs salary, coloured by seniority, with OLS trendline (requires statsmodels). Hover shows job title and city.
  • Chart 24 (px.sunburst): hierarchical breakdown city → modality → role for the top 6 cities.
  • Chart 25 (go.Heatmap): median salary by sector and city, with in-cell text labels.

Block 7: Executive Summary & Export

Generates a 3×3 KPI panel card grid (07_panel_kpis.png) on a dark background with nine business-critical metrics:

| KPI | Description |
| --- | --- |
| Total Ofertas | Total offer count |
| Salario Mediano | Median salary (outliers excluded) |
| Salario Medio | Mean salary (outliers excluded) |
| Top Ciudad | City with most offers |
| Modalidad + frecuente | Most common work modality |
| % Remoto/Híbrido | Percentage of remote or hybrid offers |
| Rol más demandado | Most frequent job title |
| P10 Salarial | 10th percentile salary |
| P90 Salarial | 90th percentile salary |

At the end of the block, a file inventory lists every PNG in images/ with its size in KB.
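
The nine metrics can be derived from the offers table with a few pandas calls. The sketch below is an illustration rather than the notebook's code: `compute_kpis` is a hypothetical helper, the sample rows are made up, and excluding outliers from P10 and P90 is an assumption (the source states it only for the median and mean).

```python
import pandas as pd


def compute_kpis(df):
    """Compute the nine panel metrics from an offers DataFrame (hypothetical helper)."""
    # Salaries with flagged outliers and missing values removed.
    salaries = df.loc[~df["salary_clean_outlier"], "salary_clean"].dropna()
    remote_or_hybrid = df["remote_modality"].isin(["Remoto", "Híbrido"])
    return {
        "Total Ofertas": len(df),
        "Salario Mediano": salaries.median(),
        "Salario Medio": salaries.mean(),
        "Top Ciudad": df["city_clean"].mode().iloc[0],
        "Modalidad + frecuente": df["remote_modality"].mode().iloc[0],
        "% Remoto/Híbrido": 100 * remote_or_hybrid.mean(),
        "Rol más demandado": df["job_title"].mode().iloc[0],
        "P10 Salarial": salaries.quantile(0.10),  # outlier exclusion assumed
        "P90 Salarial": salaries.quantile(0.90),  # outlier exclusion assumed
    }


# Tiny made-up example; the last row is flagged as an outlier.
offers = pd.DataFrame(
    {
        "city_clean": ["Madrid", "Madrid", "Barcelona", "Valencia"],
        "remote_modality": ["Híbrido", "Remoto", "Híbrido", "Presencial"],
        "job_title": ["Data Analyst", "Data Engineer", "Data Analyst", "Data Analyst"],
        "salary_clean": [30_000.0, 45_000.0, 38_000.0, 200_000.0],
        "salary_clean_outlier": [False, False, False, True],
    }
)
kpis = compute_kpis(offers)
for name, value in kpis.items():
    print(f"{name}: {value}")
```

Returning a plain dictionary keeps the computation separate from the drawing code, so the same values can feed both the card grid and any console summary.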

Exported chart files

All PNG files are saved to images/ at 200 DPI:

| File | Contents |
| --- | --- |
| 00_calidad_datos.png | Null heatmap and data-type distribution |
| 01_distribucion_volumen.png | City bar chart, modality donut, seniority bars |
| 01b_treemap_roles_ciudad.png | Treemap (or fallback bar chart) of roles per city |
| 02_analisis_salarial.png | Salary boxplot, violin, city bars, histogram |
| 02b_salario_rol.png | Salary boxplot + strip by job title |
| 03_heatmaps_correlaciones.png | Skill co-occurrence and correlation heatmaps |
| 04_tecnologias.png | Used vs Wanted gap lollipop and technology stacks |
| 05_estadistica_avanzada.png | Q-Q plots, OLS regression, Mann-Whitney, percentiles |
| 07_panel_kpis.png | Executive KPI panel (dark background) |

Analytical conclusions

Madrid & Barcelona dominate

These two cities account for the majority of tech offers, with Madrid in first place across all sources.

Hybrid is the new standard

Hybrid modality is the most common across the dataset, surpassing both fully remote and on-site.

Python + SQL lead skills

Python and SQL have the highest penetration in job offers. Cloud skills (AWS, Docker) show a positive Used→Wanted gap, signalling rising demand.

Salary differences are significant

Kruskal-Wallis tests confirm statistically significant salary differences between seniority levels. Remote roles trend slightly higher in median salary.
All analysis was produced for DataTalent Solutions S.L. using data from 02_cleaning.ipynb. The notebook is self-contained and reproducible: run the cells in order after placing the clean CSVs in data/clean/ or the post-EDA exports in data/eda/.
