Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/Gema-Villanueva/proyecto-eda-roles-datos/llms.txt

Use this file to discover all available pages before exploring further.

The cleaning notebook (02_cleaning.ipynb) is the connective tissue of the project. Its single responsibility is to take four raw, structurally different datasets and produce a family of clean, consistently named files that every downstream notebook can load without extra wrangling. The pipeline is written by Ele and covers everything from path resolution and column normalisation through salary parsing, location cleaning, and a suite of automated validation checks.

Datasets processed

df_jobs

data_science_job_posts_2025.csv — 944 international data-science offers with skills and salary. Main source for the unified dataset.

df_tecno

tecnoempleo_spain_2026.csv — 600 Spanish tech offers with Spanish column names. Requires full column rename before it can be joined.

df_stack

stackoverflow_2025_results.csv — Annual survey responses. Technology columns use PascalCase names that must be converted to snake_case.

df_scraping

scraping_jobs_raw.csv — Offers collected via the Adzuna API. May not exist yet; the notebook creates an empty template if absent.

Path setup

The notebook resolves the project root dynamically so it runs correctly whether it is launched from the notebooks/ subdirectory or the project root:
from pathlib import Path

# Resolve project root regardless of working directory
# Resolver raíz del proyecto independientemente del directorio actual
PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()

DATA_RAW   = PROJECT_ROOT / "data" / "raw"
DATA_CLEAN = PROJECT_ROOT / "data" / "clean"

DATA_RAW.mkdir(parents=True, exist_ok=True)
DATA_CLEAN.mkdir(parents=True, exist_ok=True)
All output files land in data/clean/. The DATA_CLEAN.mkdir(parents=True, exist_ok=True) call ensures the folder is created if it does not already exist, making the notebook safe to run on a fresh clone.

Naming conventions

The project enforces a consistent style across all produced artefacts:
ElementConvention
Column namesEnglish snake_case only
Markdown cellsSpanish (project language)
Code commentsBilingual — Spanish first, English translation after /
Variable and function namesEnglish throughout

Processing blocks

Loads pandas, numpy, os, re, and ast. Sets display.max_columns and display.max_rows. Creates DATA_CLEAN directory if absent.
Defines reusable helpers used across all three datasets:
  • normalize_column_names(df) — strips accents, lowercases, converts spaces and special characters to underscores.
  • apply_column_mapping(df, mapping) — renames columns from a dictionary, skipping keys that do not exist.
  • add_missing_columns(df, required_columns) — pads a DataFrame with NaN columns for any required field that is absent.
  • clean_text(value) / clean_text_columns(df) — trims whitespace, converts empty strings and "nan"/"null"/"none" literals to NaN.
Also defines the job_standard_columns list and the two column-mapping dictionaries (see below).
Checks for data/raw/scraping_jobs_raw.csv. If found, normalises and maps its columns to the standard schema and saves scraping_jobs_clean.csv. If not found, writes an empty file with the correct headers so downstream notebooks do not break.
Reads all four source CSVs from DATA_RAW into df_jobs, df_tecno, df_stack, and df_scraping. Prints shapes and shows a comparative quality summary (rows, columns, duplicate rows, missing-cell count and percentage).
Each dataset gets its own cleaning pass:
  • df_jobs → text cleaning, deduplication, standard columns added.
  • df_tecnonormalize_column_names + tecno_column_mapping rename (see table below), then text cleaning.
  • df_stackstack_technology_column_mapping rename, technology columns split from wide to long format.
Vertically concatenates df_jobs_clean, df_tecno_clean, and scraping_jobs_clean (all sharing the same job_standard_columns schema) into jobs_all_clean. Assigns a unique job_id to every row and records source_dataset so the origin of each offer is traceable.Result: 2,167 offers, 12 columns.
  • Explodes multi-value skills strings into job_skills_long.csv (one row per offer + skill), linked by job_id.
  • Parses salary strings (e.g. "35000€ - 45000€") into a numeric salary_clean column; flags statistical outliers in salary_clean_outlier.
  • Derives location_clean, city_clean, and boolean is_remote from the raw location field.
Runs 11 automated checks covering required columns, non-empty assertion, uniqueness of job_id, snake_case compliance of Stack Overflow tech columns, response_id presence in technology datasets, boolean type of is_remote, numeric type of salary_clean, expected columns in job_skills_long, scraping integration consistency, and presence of all expected output files. Results are saved to cleaning_validation_summary.csv.

Column standardisation maps

TecnoEmpleo → standard schema

Original (Spanish)Standardised (English snake_case)
titulojob_title
empresacompany
salariosalary
ubicacionlocation
tipo_de_trabajojob_type
fecha_de_publicacionpost_date
enlacelink

Stack Overflow → snake_case

Original (PascalCase)Standardised (snake_case)
ResponseIdresponse_id
LanguageHaveWorkedWithlanguage_have_worked_with
LanguageWantToWorkWithlanguage_want_to_work_with
DatabaseHaveWorkedWithdatabase_have_worked_with
DatabaseWantToWorkWithdatabase_want_to_work_with
PlatformHaveWorkedWithplatform_have_worked_with
PlatformWantToWorkWithplatform_want_to_work_with
WebframeHaveWorkedWithweb_framework_have_worked_with
WebframeWantToWorkWithweb_framework_want_to_work_with
DevEnvsHaveWorkedWithdevelopment_environment_have_worked_with
DevEnvsWantToWorkWithdevelopment_environment_want_to_work_with
AIModelsHaveWorkedWithai_model_have_worked_with
AIModelsWantToWorkWithai_model_want_to_work_with

Output files in data/clean/

FileRowsKey columnsPurpose
jobs_clean.csv944job_title, company, location, salary, skills, source_datasetCleaned original job postings from df_jobs
tecno_jobs_clean.csv600job_title, company, location, salary, job_type, post_date, linkTecnoEmpleo offers with standardised English column names
scraping_jobs_clean.csvvariesall job_standard_columnsAdzuna-scraped offers adapted to the common schema
scraping_jobs_template.csv0all job_standard_columnsEmpty header-only template written to data/clean/ whether or not raw scraping data exists
jobs_all_clean.csv2,167job_id, job_title, company, location, salary_clean, city_clean, is_remoteUnified dataset — primary input for EDA and visualisations
job_skills_long.csvvariesjob_id, job_title, source_dataset, skillOne row per offer × skill — used for skill-frequency and gap analysis
stack_tech_columns_clean.csvsurvey rowsresponse_id + tech columns in snake_caseReduced Stack Overflow base for technology analysis
technologies_clean_long_format.csvvariesresponse_id, technology, category, typeOne row per respondent × technology — used and wanted
technology_rankings.csvvariescategory, type, technology, countFull ranking of technologies used and wanted
technology_rankings_used.csv15+technology, countRanking of technologies respondents have worked with
technology_rankings_wanted.csv15+technology, countRanking of technologies respondents want to work with
clean_datasets_dictionary.csvfile, rows, main_columns, intended_useSelf-documenting index of all clean files
cleaning_validation_summary.csv11check, passed, detailAutomated validation results for downstream verification
jobs_all_clean.csv is assembled by a vertical concatenation (pd.concat), not a JOIN. There is no reliable common key between job offers and Stack Overflow responses, so the two families of datasets remain separate. Comparisons between offer skills and Stack Overflow technology preferences are aggregate indicators, not row-level matches.

Known data limitations

salary_clean is parsed from free-text salary strings that may represent ranges, annual figures, or monthly figures depending on the source. It is suitable for exploratory analysis but should not be treated as an exact salary value.
The is_remote boolean and city_clean field are derived from text pattern matching on the raw location column. Ambiguous values (e.g. “Hybrid – Spain”) may be classified incorrectly in edge cases.
The Stack Overflow survey reflects self-reported technology experience from a global community of developers who chose to take the survey. It does not directly represent the Spanish job market and should only be used for directional technology-preference analysis.
TecnoEmpleo offers rarely include structured skill lists. job_skills_long.csv is therefore dominated by data from df_jobs (the international dataset), which consistently includes a skills column.

Build docs developers (and LLMs) love