The cleaning notebook (Documentation Index
Fetch the complete documentation index at: https://mintlify.com/Gema-Villanueva/proyecto-eda-roles-datos/llms.txt
Use this file to discover all available pages before exploring further.
02_cleaning.ipynb) is the connective tissue of the project. Its single responsibility is to take four raw, structurally different datasets and produce a family of clean, consistently named files that every downstream notebook can load without extra wrangling. The pipeline is written by Ele and covers everything from path resolution and column normalisation through salary parsing, location cleaning, and a suite of automated validation checks.
Datasets processed
df_jobs
data_science_job_posts_2025.csv — 944 international data-science offers with skills and salary. Main source for the unified dataset.df_tecno
tecnoempleo_spain_2026.csv — 600 Spanish tech offers with Spanish column names. Requires full column rename before it can be joined.df_stack
stackoverflow_2025_results.csv — Annual survey responses. Technology columns use PascalCase names that must be converted to snake_case.df_scraping
scraping_jobs_raw.csv — Offers collected via the Adzuna API. May not exist yet; the notebook creates an empty template if absent.Path setup
The notebook resolves the project root dynamically so it runs correctly whether it is launched from thenotebooks/ subdirectory or the project root:
data/clean/. The DATA_CLEAN.mkdir(parents=True, exist_ok=True) call ensures the folder is created if it does not already exist, making the notebook safe to run on a fresh clone.Naming conventions
The project enforces a consistent style across all produced artefacts:| Element | Convention |
|---|---|
| Column names | English snake_case only |
| Markdown cells | Spanish (project language) |
| Code comments | Bilingual — Spanish first, English translation after / |
| Variable and function names | English throughout |
Processing blocks
Block 1 — Imports and initial configuration
Block 1 — Imports and initial configuration
pandas, numpy, os, re, and ast. Sets display.max_columns and display.max_rows. Creates DATA_CLEAN directory if absent.Block 2 — General cleaning functions
Block 2 — General cleaning functions
normalize_column_names(df)— strips accents, lowercases, converts spaces and special characters to underscores.apply_column_mapping(df, mapping)— renames columns from a dictionary, skipping keys that do not exist.add_missing_columns(df, required_columns)— pads a DataFrame withNaNcolumns for any required field that is absent.clean_text(value)/clean_text_columns(df)— trims whitespace, converts empty strings and"nan"/"null"/"none"literals toNaN.
job_standard_columns list and the two column-mapping dictionaries (see below).Block 3 — Scraping template and dataset
Block 3 — Scraping template and dataset
data/raw/scraping_jobs_raw.csv. If found, normalises and maps its columns to the standard schema and saves scraping_jobs_clean.csv. If not found, writes an empty file with the correct headers so downstream notebooks do not break.Block 4 — Dataset loading
Block 4 — Dataset loading
DATA_RAW into df_jobs, df_tecno, df_stack, and df_scraping. Prints shapes and shows a comparative quality summary (rows, columns, duplicate rows, missing-cell count and percentage).Block 5 — Per-source cleaning
Block 5 — Per-source cleaning
- df_jobs → text cleaning, deduplication, standard columns added.
- df_tecno →
normalize_column_names+tecno_column_mappingrename (see table below), then text cleaning. - df_stack →
stack_technology_column_mappingrename, technology columns split from wide to long format.
Block 6 — Offer unification
Block 6 — Offer unification
df_jobs_clean, df_tecno_clean, and scraping_jobs_clean (all sharing the same job_standard_columns schema) into jobs_all_clean. Assigns a unique job_id to every row and records source_dataset so the origin of each offer is traceable.Result: 2,167 offers, 12 columns.Block 7 — Skills, salaries, and locations
Block 7 — Skills, salaries, and locations
- Explodes multi-value
skillsstrings intojob_skills_long.csv(one row per offer + skill), linked byjob_id. - Parses salary strings (e.g.
"35000€ - 45000€") into a numericsalary_cleancolumn; flags statistical outliers insalary_clean_outlier. - Derives
location_clean,city_clean, and booleanis_remotefrom the raw location field.
Block 8 — Final validations
Block 8 — Final validations
job_id, snake_case compliance of Stack Overflow tech columns, response_id presence in technology datasets, boolean type of is_remote, numeric type of salary_clean, expected columns in job_skills_long, scraping integration consistency, and presence of all expected output files. Results are saved to cleaning_validation_summary.csv.Column standardisation maps
TecnoEmpleo → standard schema
| Original (Spanish) | Standardised (English snake_case) |
|---|---|
titulo | job_title |
empresa | company |
salario | salary |
ubicacion | location |
tipo_de_trabajo | job_type |
fecha_de_publicacion | post_date |
enlace | link |
Stack Overflow → snake_case
| Original (PascalCase) | Standardised (snake_case) |
|---|---|
ResponseId | response_id |
LanguageHaveWorkedWith | language_have_worked_with |
LanguageWantToWorkWith | language_want_to_work_with |
DatabaseHaveWorkedWith | database_have_worked_with |
DatabaseWantToWorkWith | database_want_to_work_with |
PlatformHaveWorkedWith | platform_have_worked_with |
PlatformWantToWorkWith | platform_want_to_work_with |
WebframeHaveWorkedWith | web_framework_have_worked_with |
WebframeWantToWorkWith | web_framework_want_to_work_with |
DevEnvsHaveWorkedWith | development_environment_have_worked_with |
DevEnvsWantToWorkWith | development_environment_want_to_work_with |
AIModelsHaveWorkedWith | ai_model_have_worked_with |
AIModelsWantToWorkWith | ai_model_want_to_work_with |
Output files in data/clean/
| File | Rows | Key columns | Purpose |
|---|---|---|---|
jobs_clean.csv | 944 | job_title, company, location, salary, skills, source_dataset | Cleaned original job postings from df_jobs |
tecno_jobs_clean.csv | 600 | job_title, company, location, salary, job_type, post_date, link | TecnoEmpleo offers with standardised English column names |
scraping_jobs_clean.csv | varies | all job_standard_columns | Adzuna-scraped offers adapted to the common schema |
scraping_jobs_template.csv | 0 | all job_standard_columns | Empty header-only template written to data/clean/ whether or not raw scraping data exists |
jobs_all_clean.csv | 2,167 | job_id, job_title, company, location, salary_clean, city_clean, is_remote | Unified dataset — primary input for EDA and visualisations |
job_skills_long.csv | varies | job_id, job_title, source_dataset, skill | One row per offer × skill — used for skill-frequency and gap analysis |
stack_tech_columns_clean.csv | survey rows | response_id + tech columns in snake_case | Reduced Stack Overflow base for technology analysis |
technologies_clean_long_format.csv | varies | response_id, technology, category, type | One row per respondent × technology — used and wanted |
technology_rankings.csv | varies | category, type, technology, count | Full ranking of technologies used and wanted |
technology_rankings_used.csv | 15+ | technology, count | Ranking of technologies respondents have worked with |
technology_rankings_wanted.csv | 15+ | technology, count | Ranking of technologies respondents want to work with |
clean_datasets_dictionary.csv | — | file, rows, main_columns, intended_use | Self-documenting index of all clean files |
cleaning_validation_summary.csv | 11 | check, passed, detail | Automated validation results for downstream verification |
Known data limitations
Salary is an approximation
Salary is an approximation
salary_clean is parsed from free-text salary strings that may represent ranges, annual figures, or monthly figures depending on the source. It is suitable for exploratory analysis but should not be treated as an exact salary value.Remote detection is heuristic
Remote detection is heuristic
is_remote boolean and city_clean field are derived from text pattern matching on the raw location column. Ambiguous values (e.g. “Hybrid – Spain”) may be classified incorrectly in edge cases.Stack Overflow is not a job-market sample
Stack Overflow is not a job-market sample
Skills coverage depends on source
Skills coverage depends on source
job_skills_long.csv is therefore dominated by data from df_jobs (the international dataset), which consistently includes a skills column.