The EDA notebook (03_eda.ipynb) is explicitly positioned as a diagnostic phase, not an inferential one. Its job is to reveal the shape of the data, surface patterns, and flag limitations — not to draw final conclusions. Everything produced here feeds directly into the visualisation and bias-analysis notebooks that follow. The primary dataset is jobs_all_clean.csv, the 2,167-offer unified file produced by the cleaning pipeline. Several auxiliary datasets are loaded alongside it to allow cross-source comparisons.

Imports and configuration

```python
import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

# Configure pandas and plots / Configurar pandas y gráficos
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 120)
sns.set_theme(style="whitegrid", palette="Set2")

# Define project paths / Definir rutas del proyecto
PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
DATA_CLEAN   = PROJECT_ROOT / "data" / "clean"
```
The ensure_location_columns() helper reconstructs location_clean, city_clean, and is_remote in memory if the loaded CSV was produced by a partial run of the cleaning notebook. This makes the EDA resilient to incremental execution.
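
The notebook's own implementation of this helper is not reproduced here. As a rough sketch of the idea, the version below rebuilds each column only when it is missing. The raw column name `location`, the "text before the first comma" rule for the city, and the remote keywords are all assumptions made for illustration.

```python
import pandas as pd


def ensure_location_columns(df: pd.DataFrame, raw_col: str = "location") -> pd.DataFrame:
    """Return a copy of df with location_clean, city_clean and is_remote present.

    Hypothetical reconstruction: raw_col and the parsing rules below are
    assumptions, not the notebook's actual logic.
    """
    out = df.copy()
    if raw_col not in out.columns:
        raise KeyError(f"Cannot rebuild location columns: '{raw_col}' is missing")

    if "location_clean" not in out.columns:
        # Trim the text and collapse internal runs of whitespace.
        out["location_clean"] = (
            out[raw_col]
            .astype("string")
            .str.strip()
            .str.replace(r"\s+", " ", regex=True)
        )
    if "city_clean" not in out.columns:
        # Assumption: the city is whatever precedes the first comma.
        out["city_clean"] = out["location_clean"].str.split(",").str[0].str.strip()
    if "is_remote" not in out.columns:
        # Assumption: these keywords mark a remote offer; missing text counts as False.
        out["is_remote"] = out["location_clean"].str.contains(
            r"remote|remoto|teletrabajo", case=False, na=False
        )
    return out
```

Because existing columns are left untouched, calling the function on a fully cleaned CSV changes nothing, which is what makes incremental execution safe.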

Datasets loaded

| Variable | File | Role in EDA |
| --- | --- | --- |
| jobs_all_clean | jobs_all_clean.csv | Main dataset; all analyses run on this |
| jobs_clean | jobs_clean.csv | Cross-check for original offer structure |
| tecno_jobs_clean | tecno_jobs_clean.csv | Spanish market cross-check |
| job_skills_long | job_skills_long.csv | Skill-frequency analysis |
| technology_rankings | technology_rankings.csv | Stack Overflow technology overview |
| technology_rankings_used | technology_rankings_used.csv | Technologies respondents use |
| technology_rankings_wanted | technology_rankings_wanted.csv | Technologies respondents want |
| cleaning_validation_summary | cleaning_validation_summary.csv | Validates cleaning quality before EDA begins |

If any load fails, run 02_cleaning.ipynb first to regenerate the clean CSV files. The EDA notebook does not re-clean data — it assumes the cleaning pipeline has already been executed.
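
One way to make that failure mode explicit is to check for each file before reading it. The sketch below assumes the eight file names listed above; the function names `load_clean_csv` and `load_all` are hypothetical and the notebook may load the files differently.

```python
from pathlib import Path

import pandas as pd

# File stems taken from the datasets table; each maps to <stem>.csv.
DATASET_NAMES = [
    "jobs_all_clean",
    "jobs_clean",
    "tecno_jobs_clean",
    "job_skills_long",
    "technology_rankings",
    "technology_rankings_used",
    "technology_rankings_wanted",
    "cleaning_validation_summary",
]


def load_clean_csv(name: str, data_dir: Path) -> pd.DataFrame:
    """Load one cleaned CSV, failing early with an actionable message."""
    path = data_dir / f"{name}.csv"
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found. Run 02_cleaning.ipynb to regenerate the clean files."
        )
    return pd.read_csv(path)


def load_all(data_dir: Path) -> dict[str, pd.DataFrame]:
    """Load every expected dataset into a dict keyed by variable name."""
    return {name: load_clean_csv(name, data_dir) for name in DATASET_NAMES}
```

With the paths from the configuration cell, `datasets = load_all(DATA_CLEAN)` would either return all eight frames or stop at the first missing file.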

Analysis sections

The notebook works through 20 numbered sections. The table below summarises the purpose of each:
| Section | Topic | What is examined |
| --- | --- | --- |
| 1 | Objective | Research questions and dataset scope |
| 2 | Imports and config | Libraries, pandas options, path setup |
| 3 | Load clean datasets | All eight CSV files loaded and validated |
| 4 | Auxiliary functions | Reusable helpers for summaries, nulls, frequency tables, and plots |
| 5 | Structure review | Row counts, column lists, and data types for every dataset |
| 6 | Main dataset overview | jobs_all_clean dimensions, columns, and first rows |
| 7 | Null values | Per-column null count and percentage; horizontal bar chart |
| 8 | Source coverage | Offer count and share by source_dataset |
| 9 | Role analysis | Top job_title values; approximate role-family classification |
| 10 | Company analysis | Most frequent companies; concentration check |
| 11 | Location and modality | Top cities by city_clean; work_modality distribution |
| 12 | Seniority and industry | seniority_level and industry distributions |
| 13 | Salary analysis | Availability rate; descriptive stats; distribution by source and modality |
| 14 | Job skills | Top 25 skills from job_skills_long; breakdown by source |
| 15 | Stack Overflow tech | Used vs wanted technology rankings; category breakdown |
| 16 | Skills ↔ Tech comparison | Name-normalised overlap between job-offer skills and SO technology rankings |
| 17 | Posting dates | Date parsing; monthly posting trend by source |
| 18 | Limitations | Structural caveats documented before conclusions |
| 19 | Initial findings | Auto-generated summary of key metrics |
| 20 | Post-EDA export | Enriched datasets saved to data/eda/ for the visualisations notebook |
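
As an illustration of the kind of helper section 4 provides for section 7, the sketch below computes the per-column null count and percentage. The name `null_summary` is hypothetical; the notebook's own helper may differ.

```python
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column null count and percentage, worst columns first."""
    summary = pd.DataFrame(
        {
            "null_count": df.isna().sum(),
            # isna().mean() is the fraction of missing rows in each column.
            "null_pct": (df.isna().mean() * 100).round(2),
        }
    )
    return summary.sort_values("null_pct", ascending=False)
```

The horizontal bar chart mentioned in the table can then be drawn from the result, for example with `null_summary(jobs_all_clean)["null_pct"].plot.barh()`.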

Key findings

Dataset size

2,167 offers, 17 columns in the unified dataset as loaded by the EDA notebook. The working EDA copy gains additional derived columns (e.g. post_date_parsed, post_month, work_modality, job_family).
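
Two of those derived columns, post_date_parsed and post_month, come from date parsing. A minimal sketch of how they might be built follows; the raw column name `post_date` and both function names are assumptions, since the page does not show the notebook's code for this step.

```python
import pandas as pd


def add_posting_month(df: pd.DataFrame, raw_col: str = "post_date") -> pd.DataFrame:
    """Add post_date_parsed and post_month to a copy of df.

    raw_col is a hypothetical name for the unparsed date column.
    """
    out = df.copy()
    # Unparseable values become NaT instead of raising.
    out["post_date_parsed"] = pd.to_datetime(out[raw_col], errors="coerce")
    out["post_month"] = out["post_date_parsed"].dt.to_period("M")
    return out


def monthly_trend(df: pd.DataFrame, source_col: str = "source_dataset") -> pd.DataFrame:
    """Count offers per month and source, ignoring rows without a valid date."""
    dated = df.dropna(subset=["post_date_parsed"])
    return (
        dated.groupby(["post_month", source_col])
        .size()
        .rename("offers")
        .reset_index()
    )
```

Coercing bad dates to NaT keeps the row in the dataset while excluding it from the monthly trend, so the share of unparseable dates can still be reported as a limitation.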

Largest source

df_jobs contributes 942 records (43.47 % of the unified dataset), making it the dominant source. Results must be interpreted with this weight in mind.
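
The share can be reproduced with a simple frequency table. The sketch below is one way to compute it, assuming only the source_dataset column named in the sections table; `source_coverage` is a hypothetical function name.

```python
import pandas as pd


def source_coverage(df: pd.DataFrame, col: str = "source_dataset") -> pd.DataFrame:
    """Offer count and percentage share per source, largest first."""
    out = df[col].value_counts(dropna=False).rename("offers").to_frame()
    out["share_pct"] = (out["offers"] / len(df) * 100).round(2)
    return out
```

For the figures quoted above, the arithmetic is 942 / 2,167 × 100 ≈ 43.47.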

Most frequent role family

data_science_ai is the most common job family derived from job_title classification. This may reflect the composition of the source datasets as much as genuine market demand, so it should not be read as a market-share estimate.

Most frequent city

Madrid ranks first among city_clean values, consistent with its position as Spain’s primary tech-employment hub.

Dominant modality

unknown is the most frequent work_modality value — many offers simply do not specify remote, hybrid, or on-site. This is itself a finding relevant to the bias analysis.

Salary availability

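salary_clean is available for 50.95 % of offers. The remaining 49.05 % have no salary value. Treating them as missing at random is a working assumption, not an established fact, and should be checked (for example, by comparing availability across source_dataset) before salary figures are compared between groups.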

Top skill

python is the most frequently mentioned skill in job_skills_long, appearing more often than SQL, which ranks second.

Most wanted technology (SO)

openAI GPT (chatbot models) tops the ai_model_tool category in technology_rankings_wanted. This reflects professional appetite for generative-AI tools rather than market job demand.

Role-family classification

The notebook derives a job_family column from job_title using keyword pattern matching. This is an approximation intended to group similar-sounding titles (e.g. “Data Scientist Sr.”, “Senior Data Scientist”, “Sr. Data Scientist”) under a single label for aggregate analysis. It does not replace a formal job taxonomy.
Role classification is based on simple text rules. Unusual or compound job titles may be classified into the other bucket. Always check raw job_title frequencies alongside job_family distributions to avoid over-aggregating.
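
A minimal sketch of rule-based classification of this kind is shown below. Only the data_science_ai and other labels appear on this page; the remaining labels, the keyword patterns, and the function name are hypothetical stand-ins for the notebook's actual rules.

```python
import re

# Ordered (label, pattern) pairs; the first matching rule wins.
# Patterns and the last two labels are illustrative assumptions.
FAMILY_RULES = [
    ("data_science_ai", r"data scien|machine learning|\bml\b|\bai\b"),
    ("data_engineering", r"data engineer|\betl\b|big data"),
    ("data_analysis", r"analyst|business intelligence|\bbi\b"),
]


def classify_job_family(title: object) -> str:
    """Map a raw job title to a coarse family label, or 'other' if no rule matches."""
    if not isinstance(title, str):
        return "other"  # missing or non-text titles fall into the catch-all bucket
    text = title.lower()
    for label, pattern in FAMILY_RULES:
        if re.search(pattern, text):
            return label
    return "other"
```

Applied with `jobs_eda["job_title"].map(classify_job_family)`, the three example titles above all land in data_science_ai, while a title that matches no pattern falls into other. Rule order matters: a compound title such as "Data Scientist / Analyst" is assigned to the first family whose pattern matches.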

Salary analysis approach

```python
# Salary availability summary / Resumen de disponibilidad salarial
salary_availability = pd.DataFrame({
    "metric": [
        "total_jobs",
        "jobs_with_salary_clean",
        "jobs_without_salary_clean",
    ],
    "count": [
        len(jobs_eda),
        jobs_eda["salary_clean"].notna().sum(),
        jobs_eda["salary_clean"].isna().sum(),
    ],
})
salary_availability["pct"] = (
    salary_availability["count"] / len(jobs_eda) * 100
).round(2)
```
Salary is analysed both as a raw distribution and broken down by source_dataset and work_modality. Because salary_clean was parsed from heterogeneous text formats during cleaning, all salary figures are treated as approximations for exploratory purposes only.
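
The per-group breakdown could look roughly like the sketch below, which reports descriptive statistics alongside the availability rate for each group. The function name `salary_stats_by` is hypothetical, and the choice of statistics is an assumption rather than the notebook's exact output.

```python
import pandas as pd


def salary_stats_by(
    df: pd.DataFrame, group_col: str, salary_col: str = "salary_clean"
) -> pd.DataFrame:
    """Descriptive salary stats and availability rate per group."""
    grouped = df.groupby(group_col)[salary_col]
    # count, median and mean all ignore missing salaries.
    stats = grouped.agg(["count", "median", "mean"])
    # Share of rows in each group that have a salary value at all.
    stats["available_pct"] = (grouped.count() / grouped.size() * 100).round(2)
    return stats
```

Calling it once with `"source_dataset"` and once with `"work_modality"` gives the two breakdowns. Reporting `available_pct` next to the median makes it visible when a group's statistics rest on only a handful of offers.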

Skills vs Stack Overflow technology overlap

A normalised name-matching comparison is run between the top 25 job-offer skills and the top 50 used/wanted technologies from Stack Overflow. The overlap count gives a rough signal of alignment between what employers list in offers and what professionals report using or wanting to learn. Due to naming inconsistencies across sources (e.g. "powerbi" vs "Power BI"), the overlap is conservative.
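
The notebook's normalisation rules are not shown on this page. As an assumption-laden sketch, the version below lowercases each name and strips everything except letters, digits, `+` and `#`, then intersects the two sets. Both function names are hypothetical.

```python
import re
from typing import Iterable


def normalise_name(name: str) -> str:
    """Lowercase and drop spaces and punctuation, keeping + and # (C++, C#)."""
    return re.sub(r"[^a-z0-9+#]", "", name.lower())


def name_overlap(skills: Iterable[str], technologies: Iterable[str]) -> list[str]:
    """Return the skills whose normalised name also appears among the technologies."""
    tech_keys = {normalise_name(t) for t in technologies}
    return sorted({s for s in skills if normalise_name(s) in tech_keys})
```

This particular normalisation would reconcile spacing and casing variants, but not aliases or abbreviations such as "js" versus "JavaScript". Any count it produces is therefore still a lower bound on the true overlap.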

Post-EDA exports

At the end of the notebook, enriched datasets are saved to data/eda/ so that 04-visualizations.ipynb consumes the validated, in-memory-enriched versions:
| Export file | Contents |
| --- | --- |
| jobs_eda.csv | jobs_all_clean plus all derived EDA columns |
| technology_rankings_eda.csv | Full technology ranking |
| technology_rankings_used_eda.csv | Used-technology ranking |
| technology_rankings_wanted_eda.csv | Wanted-technology ranking |
| cleaning_validation_summary_eda.csv | Validation summary with any EDA-level fixes applied |
| skill_technology_overlap_eda.csv | Skill ↔ technology overlap comparison table |

The visualisations notebook (04-visualizations.ipynb) looks for these files in data/eda/ first and falls back to data/clean/ if they are absent. Running the EDA notebook before the visualisations notebook is therefore recommended but not strictly required.
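
The fallback behaviour can be sketched as a small path resolver. The function names below are hypothetical, and the pairing of jobs_eda.csv with jobs_all_clean.csv is inferred from the export table rather than taken from the visualisations notebook itself.

```python
from pathlib import Path

import pandas as pd


def resolve_dataset_path(eda_name: str, clean_name: str, project_root: Path) -> Path:
    """Prefer data/eda/<eda_name>; otherwise fall back to data/clean/<clean_name>."""
    eda_path = project_root / "data" / "eda" / eda_name
    if eda_path.exists():
        return eda_path
    return project_root / "data" / "clean" / clean_name


def load_with_fallback(eda_name: str, clean_name: str, project_root: Path) -> pd.DataFrame:
    """Read whichever version of the dataset is available."""
    return pd.read_csv(resolve_dataset_path(eda_name, clean_name, project_root))
```

Note that a frame loaded from the fallback path lacks the derived EDA columns, so any plot that depends on them would need the EDA notebook to have been run first.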
