The project produces two families of CSV files: cleaned datasets under data/clean/ that are the direct output of notebook 02_cleaning.ipynb, and EDA output datasets under data/eda/ that are written by 03_eda.ipynb as analysis artifacts. Both families are version-controlled. Raw source files in data/raw/ are gitignored and must be obtained separately; see Data Sources for acquisition details.
Raw data files (data/raw/*.csv) are excluded from the repository via .gitignore. You must download or recreate them before running any notebook. The cleaned files listed here are committed and can be used directly without re-running the pipeline.
Job Offer Datasets
These files contain the cleaned and unified job posting records. They are produced by 02_cleaning.ipynb.
jobs_all_clean.csv
The primary unified dataset. All three job offer sources (df_jobs, df_tecno, df_scraping) merged into a single file with 17 standardized columns, including cleaned salary, city, and remote-work flags. 2,175 rows.
jobs_clean.csv
Cleaned version of the data_science_job_posts_2025.csv source only (df_jobs). Contains richer metadata columns, such as status, headquarter, ownership, company_size, and revenue, that are not present in the merged file. 944 rows.
tecno_jobs_clean.csv
Cleaned version of the TecnoEmpleo scrape (df_tecno). Standardized to the 11-column common schema shared with the merged dataset. High salary-null rate (~78%). 608 rows.
scraping_jobs_template.csv
An empty schema template for Adzuna API scraping results. Used by 01_data_collection.ipynb to define the expected column structure before new scraping runs are appended. Contains headers only, with no data rows.
Column summary: common job offer schema
All individual source files and the merged jobs_all_clean.csv share these core columns before the cleaned/derived fields are added. The rows marked "merged only" are those cleaned/derived additions:
| Column | Present in merged? | Description |
|---|---|---|
| job_title | ✅ | Normalized lowercase title |
| company | ✅ | Company name or anonymized ID |
| location | ✅ | Raw location string |
| salary | ✅ | Raw salary string |
| job_type | ✅ | Contract type |
| post_date | ✅ | Raw posting date |
| link | ✅ | URL to original offer |
| skills | ✅ | Python list string of skills |
| industry | ✅ | Industry sector |
| seniority_level | ✅ | junior / mid / senior / lead |
| source_dataset | ✅ | Origin: df_jobs / df_tecno / scraping |
| salary_clean | ✅ (merged only) | Numeric mid-point salary (EUR) |
| location_clean | ✅ (merged only) | Cleaned location string |
| city_clean | ✅ (merged only) | Extracted city name |
| is_remote | ✅ (merged only) | Boolean remote flag |
| salary_clean_outlier | ✅ (merged only) | Boolean outlier flag |
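The skills column stores each offer's skills as the string form of a Python list, so it has to be parsed before the values can be counted or compared. The following is a minimal sketch, assuming values shaped like "['python', 'sql']"; the helper name parse_skills and the sample values are illustrative and are not taken from the notebooks.

```python
import ast
from collections import Counter


def parse_skills(raw):
    """Turn a list-like string such as "['python', 'sql']" into a list of str.

    Returns an empty list for blanks and for values that are not a list literal.
    """
    if not raw or not raw.strip():
        return []
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(value, (list, tuple)):
        return []
    return [str(item).strip() for item in value if str(item).strip()]


# Made-up sample values in the assumed format (not real rows from the dataset).
sample_skills = ["['python', 'sql']", "['python', 'aws']", "", "not a list"]

# Count how often each skill appears across all parsed rows.
skill_counts = Counter(
    skill for raw in sample_skills for skill in parse_skills(raw)
)
```

ast.literal_eval is used instead of eval because it only accepts Python literals, which makes it a safer choice for strings read from a CSV file.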
Technology Datasets
These files are derived from the Stack Overflow 2025 Developer Survey (stackoverflow_2025_results.csv). They are generated by 02_cleaning.ipynb.
technology_rankings.csv
Combined ranking of all technologies (used and wanted) across all categories. 4 columns: category, type, technology, count. 372 rows (186 used + 186 wanted entries).
technology_rankings_used.csv
Subset of technology_rankings.csv filtered to type = used. Technologies developers reported currently using. 186 rows.
technology_rankings_wanted.csv
Subset of technology_rankings.csv filtered to type = wanted. Technologies developers reported wanting to learn. 186 rows.
technologies_clean_long_format.csv
Long-format (tidy) representation of the Stack Overflow technology data. One row per respondent–technology pair. Columns: response_id, technology, category, type. ~1.18 million rows.
stack_tech_columns_clean.csv
Wide-format Stack Overflow data with technology columns in their original survey form (semicolon-separated strings). Used as the source for generating the long-format and ranking files. Columns follow the _have_worked_with / _want_to_work_with naming convention. ~49,191 rows.
Technology categories
The category column in the rankings and long-format files uses the following values:
| Category value | Description |
|---|---|
| language | Programming languages (Python, SQL, JavaScript…) |
| database | Databases and data stores (PostgreSQL, MongoDB…) |
| platform | Cloud and DevOps platforms (AWS, Docker, Kubernetes…) |
| web_framework | Web and API frameworks (FastAPI, React, Spring Boot…) |
| development_environment | IDEs and editors (VS Code, JupyterLab…) |
| ai_model_tool | AI/ML model tools (OpenAI GPT, Claude, Gemini…) |
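The relationship between the wide, long-format, and ranking files can be sketched in a few lines. This is an illustration under stated assumptions, not the code in 02_cleaning.ipynb: the exact column names (language_have_worked_with, language_want_to_work_with), the idea that the category is the column-name prefix, and the sample values are all hypothetical.

```python
from collections import Counter

# Hypothetical wide-format rows: one dict per respondent, with
# semicolon-separated technology strings (values are made up).
wide_rows = [
    {"response_id": 1,
     "language_have_worked_with": "Python;SQL",
     "language_want_to_work_with": "Rust"},
    {"response_id": 2,
     "language_have_worked_with": "Python",
     "language_want_to_work_with": "Python;Go"},
]

# Suffix of the wide column name -> value of the `type` column.
SUFFIX_TO_TYPE = {
    "_have_worked_with": "used",
    "_want_to_work_with": "wanted",
}


def to_long(rows):
    """Yield one (response_id, technology, category, type) tuple per pair."""
    for row in rows:
        for column, cell in row.items():
            for suffix, kind in SUFFIX_TO_TYPE.items():
                if column.endswith(suffix) and cell:
                    # Assumption: the category is the column-name prefix.
                    category = column[: -len(suffix)]
                    for tech in cell.split(";"):
                        if tech.strip():
                            yield (row["response_id"], tech.strip(), category, kind)


long_rows = list(to_long(wide_rows))

# Ranking rows: (category, type, technology, count), most frequent first.
counts = Counter((cat, kind, tech) for _, tech, cat, kind in long_rows)
rankings = [(cat, kind, tech, n) for (cat, kind, tech), n in counts.most_common()]

# Equivalent of the "used" subset: keep only rows whose type is "used".
used_only = [row for row in rankings if row[1] == "used"]
```

Splitting the pipeline this way keeps the long-format table as the single source for both ranking subsets, which mirrors how the three ranking files share one schema.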
EDA Output Datasets
These files are written by 03_eda.ipynb during analysis. They capture intermediate results and validation snapshots and are useful for auditing or downstream reporting without re-running the full EDA notebook.
jobs_eda.csv — Enriched job offers (2,175 rows)
An extended version of jobs_all_clean.csv with four additional columns derived during EDA:
| Column | Description |
|---|---|
| job_family | Broad role grouping (e.g. data_science_ai, data_engineering) |
| work_modality | Cleaned work modality: remote, hybrid, onsite, unknown |
| post_date_parsed | Parsed datetime version of post_date |
| post_month | Year-month string extracted from post_date_parsed |
Generated by: 03_eda.ipynb
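As a rough illustration of how post_date_parsed and post_month relate, the sketch below parses a raw date and derives a year-month string from it. Both the input format (ISO YYYY-MM-DD) and the output format (YYYY-MM) are assumptions; the source does not state either, and the helper names are hypothetical.

```python
from datetime import datetime


def parse_post_date(raw, fmt="%Y-%m-%d"):
    """Return a datetime for `raw`, or None when it is missing or unparseable.

    The default format is an assumption about how post_date is stored.
    """
    try:
        return datetime.strptime(raw.strip(), fmt)
    except (AttributeError, ValueError):
        return None


def to_post_month(parsed):
    """Return a 'YYYY-MM' string, or None when the parsed date is missing."""
    return parsed.strftime("%Y-%m") if parsed is not None else None


post_date_parsed = parse_post_date("2025-03-14")
post_month = to_post_month(post_date_parsed)
```

Returning None for unparseable values, instead of raising, keeps rows with malformed dates in the dataset so they can be reported separately.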
cleaning_validation_summary_eda.csv — Pipeline validation log (11 rows)
A structured log of automated data-quality checks run at the end of the cleaning pipeline. Each row represents one assertion with columns check, passed, and detail. All checks must pass (True) before EDA proceeds.
Example checks include: job_id_unique, salary_clean_numeric, is_remote_boolean, expected_clean_files_exist.
Generated by: 03_eda.ipynb
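A validation log with the check, passed, and detail columns could be assembled along the following lines. This is a sketch only: the check names come from the examples above, but their logic here is a guess, the job_id column is inferred from the check name, missing salaries are assumed to be allowed, and the sample rows are made up.

```python
def run_checks(rows):
    """Return one {check, passed, detail} dict per assertion."""
    job_ids = [row["job_id"] for row in rows]
    duplicates = len(job_ids) - len(set(job_ids))

    # None stands for a missing salary and is allowed; bool is rejected
    # explicitly because it is a subclass of int in Python.
    bad_salaries = sum(
        1 for row in rows
        if row["salary_clean"] is not None
        and (isinstance(row["salary_clean"], bool)
             or not isinstance(row["salary_clean"], (int, float)))
    )
    bad_flags = sum(1 for row in rows if not isinstance(row["is_remote"], bool))

    return [
        {"check": "job_id_unique", "passed": duplicates == 0,
         "detail": f"{duplicates} duplicate job_id values"},
        {"check": "salary_clean_numeric", "passed": bad_salaries == 0,
         "detail": f"{bad_salaries} non-numeric salary_clean values"},
        {"check": "is_remote_boolean", "passed": bad_flags == 0,
         "detail": f"{bad_flags} non-boolean is_remote values"},
    ]


# Made-up rows for illustration; job_id is assumed from the check name.
sample_rows = [
    {"job_id": 1, "salary_clean": 42000.0, "is_remote": True},
    {"job_id": 2, "salary_clean": None, "is_remote": False},
]

summary = run_checks(sample_rows)
all_passed = all(row["passed"] for row in summary)
```

Recording a detail string for every check, including the passing ones, makes the log useful for auditing without re-running the notebook.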
skill_technology_overlap_eda.csv — Cross-source overlap analysis (2 rows)
Compares the top skills extracted from job offers against the top technologies from the Stack Overflow survey. Contains columns comparison, overlap_count, and overlap_values.
The two comparisons are top_job_skills_vs_top_used_technologies and top_job_skills_vs_top_wanted_technologies. An overlap count of 0 indicates vocabulary mismatches between sources (e.g. "python" vs "Python"), a known data quality consideration.
Generated by: 03_eda.ipynb
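The vocabulary mismatch described here is easy to reproduce. The sketch below, using made-up values and a hypothetical overlap helper, shows how an exact-match comparison yields an empty overlap while a case-insensitive one does not. It is not the comparison logic from 03_eda.ipynb.

```python
def overlap(left, right, normalize=False):
    """Return the sorted values present in both collections.

    With normalize=True, values are stripped and lowercased before comparing.
    """
    if normalize:
        left = {value.strip().lower() for value in left}
        right = {value.strip().lower() for value in right}
    return sorted(set(left) & set(right))


# Made-up values for illustration only.
top_job_skills = ["python", "sql", "aws"]
top_used_technologies = ["Python", "SQL", "JavaScript"]

raw_overlap = overlap(top_job_skills, top_used_technologies)
normalized_overlap = overlap(
    top_job_skills, top_used_technologies, normalize=True
)
```

With these sample values, raw_overlap is empty and normalized_overlap contains "python" and "sql", which suggests that normalizing case on both sides is worth considering before interpreting an overlap count of 0.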
technology_rankings_eda.csv / _used_eda.csv / _wanted_eda.csv — Snapshot rankings
Copies of the three technology ranking files as they appear at EDA time, preserved to ensure reproducibility of analysis even if the clean versions are regenerated. Schema is identical to their data/clean/ counterparts.
Generated by: 03_eda.ipynb
File Inventory
| File | Location | Rows | Notebook |
|---|---|---|---|
| jobs_all_clean.csv | data/clean/ | 2,175 | 02_cleaning.ipynb |
| jobs_clean.csv | data/clean/ | 944 | 02_cleaning.ipynb |
| tecno_jobs_clean.csv | data/clean/ | 608 | 02_cleaning.ipynb |
| scraping_jobs_template.csv | data/clean/ | 0 (headers only) | 01_data_collection.ipynb |
| stack_tech_columns_clean.csv | data/clean/ | ~49,191 | 02_cleaning.ipynb |
| technologies_clean_long_format.csv | data/clean/ | ~1,176,875 | 02_cleaning.ipynb |
| technology_rankings.csv | data/clean/ | 372 | 02_cleaning.ipynb |
| technology_rankings_used.csv | data/clean/ | 186 | 02_cleaning.ipynb |
| technology_rankings_wanted.csv | data/clean/ | 186 | 02_cleaning.ipynb |
| jobs_eda.csv | data/eda/ | 2,175 | 03_eda.ipynb |
| cleaning_validation_summary_eda.csv | data/eda/ | 11 | 03_eda.ipynb |
| skill_technology_overlap_eda.csv | data/eda/ | 2 | 03_eda.ipynb |
| technology_rankings_eda.csv | data/eda/ | 372 | 03_eda.ipynb |
| technology_rankings_used_eda.csv | data/eda/ | 186 | 03_eda.ipynb |
| technology_rankings_wanted_eda.csv | data/eda/ | 186 | 03_eda.ipynb |
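The row counts in the inventory can be checked against a local checkout with a short script. The following is a sketch under assumptions: it covers only a few files with exact counts, assumes the files are UTF-8 encoded with a single header row, and the function names are hypothetical.

```python
import csv
from pathlib import Path

# Expected data-row counts copied from the inventory (exact counts only).
EXPECTED_ROWS = {
    "data/clean/jobs_all_clean.csv": 2175,
    "data/clean/jobs_clean.csv": 944,
    "data/clean/tecno_jobs_clean.csv": 608,
    "data/clean/scraping_jobs_template.csv": 0,
    "data/eda/jobs_eda.csv": 2175,
}


def count_data_rows(path):
    """Count CSV records after the header, honoring quoted multi-line fields."""
    # Assumption: files are UTF-8 encoded.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def check_inventory(root, expected=EXPECTED_ROWS):
    """Return {relative_path: (expected, actual)} for files that do not match.

    `actual` is None when the file is missing.
    """
    mismatches = {}
    for relative_path, expected_rows in expected.items():
        path = Path(root) / relative_path
        actual = count_data_rows(path) if path.is_file() else None
        if actual != expected_rows:
            mismatches[relative_path] = (expected_rows, actual)
    return mismatches
```

Calling check_inventory(".") from the repository root returns an empty dict when every listed file matches its expected count. Counting records with csv.reader instead of counting lines avoids overcounting when a quoted field contains a line break.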