Cleaned Datasets: Structure, Contents, and Row Counts

The project produces two families of CSV files: cleaned datasets under data/clean/ that are the direct output of notebook 02_cleaning.ipynb, and EDA output datasets under data/eda/ that are written by 03_eda.ipynb as analysis artifacts. Both families are version-controlled. Raw source files in data/raw/ are gitignored and must be obtained separately — see Data Sources for acquisition details.

Raw data files (data/raw/*.csv) are excluded from the repository via .gitignore. You must download or recreate them before running any notebook. The cleaned files listed here are committed and can be used directly without re-running the pipeline.

Job Offer Datasets

These files contain the cleaned and unified job posting records. They are produced by 02_cleaning.ipynb.

jobs_all_clean.csv

The primary unified dataset. All three job offer sources (df_jobs, df_tecno, df_scraping) merged into a single file with 17 standardized columns, including cleaned salary, city, and remote-work flags. 2,175 rows.

jobs_clean.csv

Cleaned version of the data_science_job_posts_2025.csv source only (df_jobs). Contains richer metadata columns such as status, headquarter, ownership, company_size, and revenue not present in the merged file. 944 rows.

tecno_jobs_clean.csv

Cleaned version of the TecnoEmpleo scrape (df_tecno). Standardized to the 11-column common schema shared with the merged dataset. High salary-null rate (~78%). 608 rows.

scraping_jobs_template.csv

An empty schema template for Adzuna API scraping results. Used by 01_data_collection.ipynb to define the expected column structure before new scraping runs are appended. Contains headers only — no data rows.

Column summary — common job offer schema

The first 11 columns below form the common schema shared by the individual source files and the merged jobs_all_clean.csv. The last five are cleaned or derived fields that exist only in the merged file.

Column	Present in merged?	Description
`job_title`	✅	Normalized lowercase title
`company`	✅	Company name or anonymized ID
`location`	✅	Raw location string
`salary`	✅	Raw salary string
`job_type`	✅	Contract type
`post_date`	✅	Raw posting date
`link`	✅	URL to original offer
`skills`	✅	Python list string of skills
`industry`	✅	Industry sector
`seniority_level`	✅	junior / mid / senior / lead
`source_dataset`	✅	Origin: df_jobs / df_tecno / scraping
`salary_clean`	✅ (merged only)	Numeric mid-point salary (EUR)
`location_clean`	✅ (merged only)	Cleaned location string
`city_clean`	✅ (merged only)	Extracted city name
`is_remote`	✅ (merged only)	Boolean remote flag
`salary_clean_outlier`	✅ (merged only)	Boolean outlier flag

See the full Schema Reference for detailed per-column documentation including types, examples, and null rates.
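
Because `skills` is stored as a Python list string, it has to be parsed before use. The following is a minimal sketch, using only the standard library, of checking a header against the common schema and parsing one `skills` cell. The helper names `parse_skills` and `missing_columns` and the one-row sample are hypothetical illustrations, not code from the notebooks.

```python
import ast
import csv
import io

# The 11 common columns listed in the table above.
COMMON_COLUMNS = [
    "job_title", "company", "location", "salary", "job_type", "post_date",
    "link", "skills", "industry", "seniority_level", "source_dataset",
]

def parse_skills(raw):
    """Turn a stringified list such as "['python', 'sql']" into a real list."""
    if not raw or not raw.strip():
        return []
    try:
        value = ast.literal_eval(raw)  # safe: evaluates literals only
    except (ValueError, SyntaxError):
        return []
    return [str(s) for s in value] if isinstance(value, (list, tuple)) else []

def missing_columns(header):
    """Return the common-schema columns absent from a CSV header."""
    return [c for c in COMMON_COLUMNS if c not in header]

# Hypothetical one-row sample standing in for data/clean/jobs_all_clean.csv.
sample = io.StringIO(
    "job_title,company,location,salary,job_type,post_date,link,skills,"
    "industry,seniority_level,source_dataset\n"
    "data analyst,acme,Madrid,,full-time,2025-01-10,http://example.com/1,"
    "\"['python', 'sql']\",tech,junior,df_tecno\n"
)
reader = csv.DictReader(sample)
rows = list(reader)
print(missing_columns(reader.fieldnames))  # []
print(parse_skills(rows[0]["skills"]))     # ['python', 'sql']
```

To run this against the real file, replace `sample` with an open handle on `data/clean/jobs_all_clean.csv` (opened with `newline=""`).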

Technology Datasets

These files are derived from the Stack Overflow 2025 Developer Survey (stackoverflow_2025_results.csv). They are generated by 02_cleaning.ipynb.

technology_rankings.csv

Combined ranking of all technologies (used and wanted) across all categories. 4 columns: category, type, technology, count. 372 rows (186 used + 186 wanted entries).

technology_rankings_used.csv

Subset of technology_rankings.csv filtered to type = used. Technologies developers reported currently using. 186 rows.

technology_rankings_wanted.csv

Subset of technology_rankings.csv filtered to type = wanted. Technologies developers reported wanting to learn. 186 rows.
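
The relationship between the combined file and its two subsets can be sketched as a plain filter on `type`. The rows and counts below are made-up placeholders, and `subset_by_type` is a hypothetical helper; the actual notebook code may differ.

```python
def subset_by_type(rankings, type_value):
    """Keep only the ranking rows whose `type` equals type_value."""
    return [row for row in rankings if row["type"] == type_value]

# Placeholder rows using the four documented columns; counts are invented.
rankings = [
    {"category": "language", "type": "used", "technology": "Python", "count": 3},
    {"category": "language", "type": "wanted", "technology": "Rust", "count": 2},
    {"category": "database", "type": "used", "technology": "PostgreSQL", "count": 1},
    {"category": "database", "type": "wanted", "technology": "PostgreSQL", "count": 1},
]

used = subset_by_type(rankings, "used")
wanted = subset_by_type(rankings, "wanted")

# The two subsets partition the combined file, which is why 186 + 186 = 372.
print(len(used), len(wanted), len(rankings))  # 2 2 4
```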

technologies_clean_long_format.csv

Long-format (tidy) representation of the Stack Overflow technology data. One row per respondent–technology pair. Columns: response_id, technology, category, type. ~1.18 million rows.

stack_tech_columns_clean.csv

Wide-format Stack Overflow data with technology columns in their original survey form (semicolon-separated strings). Used as the source for generating the long-format and ranking files. Columns follow the _have_worked_with / _want_to_work_with naming convention. ~49,191 rows.
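
One way to picture the wide-to-long step is the sketch below: each semicolon-separated cell is split into one row per technology, and the long rows are then counted into rankings. It assumes that the column prefix equals the category value (for example `language_have_worked_with`) and that the wide file carries a `response_id` column; both are assumptions, and `wide_to_long` and `rank` are hypothetical names.

```python
from collections import Counter

# Assumed mapping from column suffix to the `type` value in the long file.
SUFFIX_TO_TYPE = {"_have_worked_with": "used", "_want_to_work_with": "wanted"}

def wide_to_long(wide_rows):
    """Emit one dict per respondent-technology pair."""
    long_rows = []
    for row in wide_rows:
        for column, cell in row.items():
            for suffix, type_value in SUFFIX_TO_TYPE.items():
                if not (column.endswith(suffix) and cell):
                    continue
                category = column[: -len(suffix)]
                for tech in cell.split(";"):
                    tech = tech.strip()
                    if tech:
                        long_rows.append({
                            "response_id": row["response_id"],
                            "technology": tech,
                            "category": category,
                            "type": type_value,
                        })
    return long_rows

def rank(long_rows):
    """Count rows per (category, type, technology), most frequent first."""
    counts = Counter(
        (r["category"], r["type"], r["technology"]) for r in long_rows
    )
    return [
        {"category": c, "type": t, "technology": tech, "count": n}
        for (c, t, tech), n in counts.most_common()
    ]

# Made-up respondents for illustration.
wide = [
    {"response_id": 1, "language_have_worked_with": "Python;SQL",
     "language_want_to_work_with": "Rust"},
    {"response_id": 2, "language_have_worked_with": "Python",
     "language_want_to_work_with": ""},
]
long_rows = wide_to_long(wide)
rankings = rank(long_rows)
print(len(long_rows))  # 4
print(rankings[0])     # Python, used, count 2
```

Splitting one respondent row into many is what turns roughly 49 thousand wide rows into more than a million long rows.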

Technology categories

The category column in the rankings and long-format files uses the following values:

Category value	Description
`language`	Programming languages (Python, SQL, JavaScript…)
`database`	Databases and data stores (PostgreSQL, MongoDB…)
`platform`	Cloud and DevOps platforms (AWS, Docker, Kubernetes…)
`web_framework`	Web and API frameworks (FastAPI, React, Spring Boot…)
`development_environment`	IDEs and editors (VS Code, JupyterLab…)
`ai_model_tool`	AI/ML model tools (OpenAI GPT, Claude, Gemini…)

EDA Output Datasets

These files are written by 03_eda.ipynb during analysis. They capture intermediate results and validation snapshots and are useful for auditing or downstream reporting without re-running the full EDA notebook.

jobs_eda.csv — Enriched job offers (2,175 rows)

An extended version of jobs_all_clean.csv with four additional columns derived during EDA:

Column	Description
`job_family`	Broad role grouping (e.g. `data_science_ai`, `data_engineering`)
`work_modality`	Cleaned work modality: `remote`, `hybrid`, `onsite`, `unknown`
`post_date_parsed`	Parsed datetime version of `post_date`
`post_month`	Year-month string extracted from `post_date_parsed`

Generated by: 03_eda.ipynb
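
The two date columns can be illustrated with a short sketch. It assumes `post_date` holds ISO-style strings such as `2025-03-14`, which this page does not confirm, and it maps unparseable values to `None`. The function names are hypothetical.

```python
from datetime import datetime

def parse_post_date(raw):
    """Parse an ISO date string (assumed format); return None on failure."""
    if not raw:
        return None
    try:
        return datetime.fromisoformat(raw.strip())
    except ValueError:
        return None

def post_month(parsed):
    """Year-month string such as '2025-03', or None when the date is missing."""
    return parsed.strftime("%Y-%m") if parsed else None

for raw in ["2025-03-14", "", "yesterday"]:
    parsed = parse_post_date(raw)
    print(repr(raw), parsed, post_month(parsed))
```

If the raw dates use another format, `datetime.strptime` with an explicit format string would be the natural replacement for `fromisoformat`.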

cleaning_validation_summary_eda.csv — Pipeline validation log (11 rows)

A structured log of automated data-quality checks run at the end of the cleaning pipeline. Each row represents one assertion with columns check, passed, and detail. All checks must pass (True) before EDA proceeds.

Example checks include: job_id_unique, salary_clean_numeric, is_remote_boolean, expected_clean_files_exist.

Generated by: 03_eda.ipynb

skill_technology_overlap_eda.csv — Cross-source overlap analysis (2 rows)

Compares the top skills extracted from job offers against the top technologies from the Stack Overflow survey. Contains columns comparison, overlap_count, and overlap_values.

The two comparisons are top_job_skills_vs_top_used_technologies and top_job_skills_vs_top_wanted_technologies. An overlap count of 0 indicates vocabulary mismatches between sources (e.g. "python" vs "Python"), a known data quality consideration.

Generated by: 03_eda.ipynb
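
The vocabulary mismatch is easy to reproduce. The sketch below compares two made-up top lists, first verbatim and then after case folding. The `overlap` helper and the sample values are hypothetical, and the normalization shown is one possible remedy rather than something the pipeline is known to apply.

```python
def overlap(job_skills, technologies, normalize=str):
    """Sorted intersection of two name lists after applying `normalize`."""
    left = {normalize(s) for s in job_skills}
    right = {normalize(t) for t in technologies}
    return sorted(left & right)

def fold(name):
    """Trim whitespace and ignore case."""
    return name.strip().casefold()

# Made-up top lists: job offers use lowercase, the survey uses title case.
job_skills = ["python", "sql", "aws"]
technologies = ["Python", "SQL", "JavaScript"]

exact = overlap(job_skills, technologies)
folded = overlap(job_skills, technologies, normalize=fold)
print(len(exact), exact)    # 0 []
print(len(folded), folded)  # 2 ['python', 'sql']
```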

technology_rankings_eda.csv / _used_eda.csv / _wanted_eda.csv — Snapshot rankings

Copies of the three technology ranking files as they appear at EDA time, preserved to ensure reproducibility of analysis even if the clean versions are regenerated. Schema is identical to their data/clean/ counterparts.

Generated by: 03_eda.ipynb

File Inventory

File	Location	Rows	Notebook
`jobs_all_clean.csv`	`data/clean/`	2,175	`02_cleaning.ipynb`
`jobs_clean.csv`	`data/clean/`	944	`02_cleaning.ipynb`
`tecno_jobs_clean.csv`	`data/clean/`	608	`02_cleaning.ipynb`
`scraping_jobs_template.csv`	`data/clean/`	0 (headers only)	`01_data_collection.ipynb`
`stack_tech_columns_clean.csv`	`data/clean/`	~49,191	`02_cleaning.ipynb`
`technologies_clean_long_format.csv`	`data/clean/`	~1,176,875	`02_cleaning.ipynb`
`technology_rankings.csv`	`data/clean/`	372	`02_cleaning.ipynb`
`technology_rankings_used.csv`	`data/clean/`	186	`02_cleaning.ipynb`
`technology_rankings_wanted.csv`	`data/clean/`	186	`02_cleaning.ipynb`
`jobs_eda.csv`	`data/eda/`	2,175	`03_eda.ipynb`
`cleaning_validation_summary_eda.csv`	`data/eda/`	11	`03_eda.ipynb`
`skill_technology_overlap_eda.csv`	`data/eda/`	2	`03_eda.ipynb`
`technology_rankings_eda.csv`	`data/eda/`	372	`03_eda.ipynb`
`technology_rankings_used_eda.csv`	`data/eda/`	186	`03_eda.ipynb`
`technology_rankings_wanted_eda.csv`	`data/eda/`	186	`03_eda.ipynb`
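
The exact counts in this table can be checked mechanically. The sketch below is a hedged example that counts data rows per file and reports mismatches. It assumes the listed locations are relative to the repository root, and it leaves out the two approximate (`~`) entries. `count_data_rows` and `check_inventory` are hypothetical helpers.

```python
import csv
import tempfile
from pathlib import Path

# Exact row counts copied from the inventory table.
EXPECTED_ROWS = {
    "data/clean/jobs_all_clean.csv": 2175,
    "data/clean/jobs_clean.csv": 944,
    "data/clean/tecno_jobs_clean.csv": 608,
    "data/clean/scraping_jobs_template.csv": 0,
    "data/clean/technology_rankings.csv": 372,
    "data/clean/technology_rankings_used.csv": 186,
    "data/clean/technology_rankings_wanted.csv": 186,
    "data/eda/jobs_eda.csv": 2175,
    "data/eda/cleaning_validation_summary_eda.csv": 11,
    "data/eda/skill_technology_overlap_eda.csv": 2,
    "data/eda/technology_rankings_eda.csv": 372,
    "data/eda/technology_rankings_used_eda.csv": 186,
    "data/eda/technology_rankings_wanted_eda.csv": 186,
}

def count_data_rows(path):
    """Count CSV records excluding the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header if present
        return sum(1 for _ in reader)

def check_inventory(root, expected=EXPECTED_ROWS):
    """Return {path: (expected, actual or None if missing)} for mismatches."""
    problems = {}
    for rel, want in expected.items():
        path = Path(root) / rel
        got = count_data_rows(path) if path.exists() else None
        if got != want:
            problems[rel] = (want, got)
    return problems

# Self-contained demo on a throwaway directory instead of the real repo.
with tempfile.TemporaryDirectory() as root:
    demo = Path(root) / "data" / "clean" / "demo.csv"
    demo.parent.mkdir(parents=True)
    demo.write_text("a,b\n1,2\n3,4\n", encoding="utf-8")
    ok = check_inventory(root, {"data/clean/demo.csv": 2})
    bad = check_inventory(
        root, {"data/clean/demo.csv": 3, "data/clean/missing.csv": 1}
    )
print(ok)   # {}
print(bad)  # demo.csv has 2 rows instead of 3; missing.csv is absent
```

From the repository root, `check_inventory(".")` would return an empty dict when every exact count matches.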

To quickly verify that all expected files are present after cloning, load and inspect data/eda/cleaning_validation_summary_eda.csv. The row expected_clean_files_exist should show passed = True.
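
A minimal sketch of that check follows. It assumes the `passed` column is serialized as the text `True` or `False`, and it uses an in-memory sample so that it runs without the repository. `failed_checks` is a hypothetical helper.

```python
import csv
import io

def failed_checks(rows):
    """Names of checks whose `passed` value is anything other than true."""
    return [r["check"] for r in rows if r["passed"].strip().lower() != "true"]

# Sample with the three documented columns; the values are illustrative.
sample = io.StringIO(
    "check,passed,detail\n"
    "job_id_unique,True,\n"
    "expected_clean_files_exist,True,\n"
)
rows = list(csv.DictReader(sample))
print(failed_checks(rows))  # []

# Against the real file, load the rows like this instead:
# with open("data/eda/cleaning_validation_summary_eda.csv",
#           newline="", encoding="utf-8") as f:
#     rows = list(csv.DictReader(f))
```

An empty list means every check passed, including `expected_clean_files_exist`.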
