The project produces two families of CSV files: cleaned datasets under data/clean/ that are the direct output of notebook 02_cleaning.ipynb, and EDA output datasets under data/eda/ that are written by 03_eda.ipynb as analysis artifacts. Both families are version-controlled. Raw source files in data/raw/ are gitignored and must be obtained separately — see Data Sources for acquisition details.
Raw data files (data/raw/*.csv) are excluded from the repository via .gitignore. You must download or recreate them before running any notebook. The cleaned files listed here are committed and can be used directly without re-running the pipeline.

Job Offer Datasets

These files contain the cleaned and unified job posting records. They are produced by 02_cleaning.ipynb.

jobs_all_clean.csv

The primary unified dataset. All three job offer sources (df_jobs, df_tecno, df_scraping) merged into a single file with 17 standardized columns, including cleaned salary, city, and remote-work flags. 2,175 rows.

jobs_clean.csv

Cleaned version of the data_science_job_posts_2025.csv source only (df_jobs). Contains richer metadata columns such as status, headquarter, ownership, company_size, and revenue not present in the merged file. 944 rows.

tecno_jobs_clean.csv

Cleaned version of the TecnoEmpleo scrape (df_tecno). Standardized to the 11-column common schema shared with the merged dataset. High salary-null rate (~78%). 608 rows.

scraping_jobs_template.csv

An empty schema template for Adzuna API scraping results. Used by 01_data_collection.ipynb to define the expected column structure before new scraping runs are appended. Contains headers only — no data rows.

Column summary — common job offer schema

All individual source files and the merged jobs_all_clean.csv share these core columns before the cleaned/derived fields are added:
| Column | Present in merged? | Description |
| --- | --- | --- |
| job_title | ✅ | Normalized lowercase title |
| company | ✅ | Company name or anonymized ID |
| location | ✅ | Raw location string |
| salary | ✅ | Raw salary string |
| job_type | ✅ | Contract type |
| post_date | ✅ | Raw posting date |
| link | ✅ | URL to original offer |
| skills | ✅ | Python list string of skills |
| industry | ✅ | Industry sector |
| seniority_level | ✅ | junior / mid / senior / lead |
| source_dataset | ✅ | Origin: df_jobs / df_tecno / scraping |
| salary_clean | ✅ (merged only) | Numeric mid-point salary (EUR) |
| location_clean | ✅ (merged only) | Cleaned location string |
| city_clean | ✅ (merged only) | Extracted city name |
| is_remote | ✅ (merged only) | Boolean remote flag |
| salary_clean_outlier | ✅ (merged only) | Boolean outlier flag |
See the full Schema Reference for detailed per-column documentation including types, examples, and null rates.
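
Because the skills column holds a Python list serialized as a string, it usually needs to be parsed back into a list before counting or comparing skills. The following is a minimal sketch using only the standard library; the helper name `parse_skills` and the two sample rows are hypothetical, and the exact string format in the real files is assumed to match a Python list literal.

```python
import ast
import csv
import io


def parse_skills(raw):
    """Turn a stringified Python list such as "['python', 'sql']" into a list."""
    if raw is None or not raw.strip():
        return []
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: treat it as "no skills" rather than fail.
        return []
    if not isinstance(value, (list, tuple)):
        return []
    return [str(item).strip() for item in value if str(item).strip()]


# Hypothetical sample that mimics three of the documented columns.
sample_jobs = io.StringIO(
    'job_title,skills,source_dataset\n'
    'data engineer,"[\'python\', \'sql\']",df_jobs\n'
    'data analyst,,df_tecno\n'
)
rows = list(csv.DictReader(sample_jobs))
parsed = [parse_skills(row["skills"]) for row in rows]
print(parsed)
```

`ast.literal_eval` is used instead of `eval` because it only accepts literals, so a malformed or unexpected cell cannot execute code.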

Technology Datasets

These files are derived from the Stack Overflow 2025 Developer Survey (stackoverflow_2025_results.csv). They are generated by 02_cleaning.ipynb.

technology_rankings.csv

Combined ranking of all technologies (used and wanted) across all categories. 4 columns: category, type, technology, count. 372 rows (186 used + 186 wanted entries).

technology_rankings_used.csv

Subset of technology_rankings.csv filtered to type = used. Technologies developers reported currently using. 186 rows.

technology_rankings_wanted.csv

Subset of technology_rankings.csv filtered to type = wanted. Technologies developers reported wanting to learn. 186 rows.
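
The relationship between the combined ranking file and its two subsets can be sketched as a filter on the type column. This is an illustration only, not the code from 02_cleaning.ipynb; the function name `split_by_type` and the sample rows (including their counts) are made up for the example.

```python
import csv
import io


def split_by_type(rows):
    """Split ranking rows into the 'used' and 'wanted' subsets."""
    used = [row for row in rows if row["type"] == "used"]
    wanted = [row for row in rows if row["type"] == "wanted"]
    return used, wanted


# Hypothetical rows using the documented columns: category, type, technology, count.
sample_rankings = io.StringIO(
    "category,type,technology,count\n"
    "language,used,Python,120\n"
    "language,wanted,Rust,80\n"
    "database,used,PostgreSQL,95\n"
)
ranking_rows = list(csv.DictReader(sample_rankings))
used_rows, wanted_rows = split_by_type(ranking_rows)

# Every row should land in exactly one subset.
print(len(used_rows), len(wanted_rows), len(ranking_rows))
```

Applied to the real technology_rankings.csv, the same check would be expected to give 186 used rows and 186 wanted rows out of 372.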

technologies_clean_long_format.csv

Long-format (tidy) representation of the Stack Overflow technology data. One row per respondent–technology pair. Columns: response_id, technology, category, type. ~1.18 million rows.

stack_tech_columns_clean.csv

Wide-format Stack Overflow data with technology columns in their original survey form (semicolon-separated strings). Used as the source for generating the long-format and ranking files. Columns follow the _have_worked_with / _want_to_work_with naming convention. ~49,191 rows.
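
The step from the wide file to the long-format file can be sketched as follows: each semicolon-separated cell is split, and the column suffix decides whether a technology counts as used or wanted. This is a sketch under assumptions, not the notebook's implementation: the column names `language_have_worked_with` and `language_want_to_work_with`, the presence of a `response_id` column in the wide file, and the function name `wide_to_long` are all hypothetical.

```python
import csv
import io

# Column suffix -> value of the `type` column in the long-format output.
SUFFIX_TO_TYPE = {
    "_have_worked_with": "used",
    "_want_to_work_with": "wanted",
}


def wide_to_long(rows, id_column="response_id"):
    """Emit one dict per respondent-technology pair."""
    long_rows = []
    for row in rows:
        for column, cell in row.items():
            for suffix, kind in SUFFIX_TO_TYPE.items():
                if not column.endswith(suffix) or not cell:
                    continue
                category = column[: -len(suffix)]
                for technology in cell.split(";"):
                    technology = technology.strip()
                    if technology:
                        long_rows.append({
                            "response_id": row[id_column],
                            "technology": technology,
                            "category": category,
                            "type": kind,
                        })
    return long_rows


# Hypothetical two-respondent sample in the wide layout.
sample_wide = io.StringIO(
    "response_id,language_have_worked_with,language_want_to_work_with\n"
    "1,Python;SQL,Rust\n"
    "2,,Python\n"
)
long_rows = wide_to_long(list(csv.DictReader(sample_wide)))
for entry in long_rows:
    print(entry)
```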

Technology categories

The category column in the rankings and long-format files uses the following values:
| Category value | Description |
| --- | --- |
| language | Programming languages (Python, SQL, JavaScript…) |
| database | Databases and data stores (PostgreSQL, MongoDB…) |
| platform | Cloud and DevOps platforms (AWS, Docker, Kubernetes…) |
| web_framework | Web and API frameworks (FastAPI, React, Spring Boot…) |
| development_environment | IDEs and editors (VS Code, JupyterLab…) |
| ai_model_tool | AI/ML model tools (OpenAI GPT, Claude, Gemini…) |

EDA Output Datasets

These files are written by 03_eda.ipynb during analysis. They capture intermediate results and validation snapshots and are useful for auditing or downstream reporting without re-running the full EDA notebook.
jobs_eda.csv

An extended version of jobs_all_clean.csv with four additional columns derived during EDA:

| Column | Description |
| --- | --- |
| job_family | Broad role grouping (e.g. data_science_ai, data_engineering) |
| work_modality | Cleaned work modality: remote, hybrid, onsite, unknown |
| post_date_parsed | Parsed datetime version of post_date |
| post_month | Year-month string extracted from post_date_parsed |

Generated by: 03_eda.ipynb
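
As an illustration of how a year-month string can be derived from a posting date, here is a minimal sketch. The input format `%Y-%m-%d`, the output format `YYYY-MM`, and the function name `to_post_month` are assumptions; the actual formats used in 03_eda.ipynb are not stated here and may differ.

```python
from datetime import datetime


def to_post_month(post_date, date_format="%Y-%m-%d"):
    """Return a 'YYYY-MM' string for a date string, or None if it cannot be parsed."""
    if not post_date:
        return None
    try:
        parsed = datetime.strptime(post_date.strip(), date_format)
    except ValueError:
        # Unparseable dates are kept as missing instead of raising.
        return None
    return parsed.strftime("%Y-%m")


print(to_post_month("2025-03-14"))
print(to_post_month("last week"))
```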

cleaning_validation_summary_eda.csv

A structured log of automated data-quality checks run at the end of the cleaning pipeline. Each row represents one assertion, with columns check, passed, and detail. All checks must pass (True) before EDA proceeds.

Example checks include: job_id_unique, salary_clean_numeric, is_remote_boolean, expected_clean_files_exist.

Generated by: 03_eda.ipynb

skill_technology_overlap_eda.csv

Compares the top skills extracted from job offers against the top technologies from the Stack Overflow survey. Contains columns comparison, overlap_count, and overlap_values.

The two comparisons are top_job_skills_vs_top_used_technologies and top_job_skills_vs_top_wanted_technologies. An overlap count of 0 indicates vocabulary mismatches between sources (e.g. "python" vs "Python"), a known data quality consideration.

Generated by: 03_eda.ipynb
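
The vocabulary mismatch described above can be reproduced with a small set-intersection sketch. The function name `overlap`, the `normalize` flag, and the sample lists are hypothetical; lowercasing both sides is one possible normalization, not necessarily what the notebook does.

```python
def overlap(job_skills, technologies, normalize=False):
    """Intersect two term lists, optionally lowercasing both sides first."""
    if normalize:
        job_skills = [term.strip().lower() for term in job_skills]
        technologies = [term.strip().lower() for term in technologies]
    shared = sorted(set(job_skills) & set(technologies))
    return {"overlap_count": len(shared), "overlap_values": shared}


# Hypothetical top terms: job skills are lowercase, survey names are capitalized.
top_job_skills = ["python", "sql", "aws"]
top_used_technologies = ["Python", "SQL", "JavaScript"]

raw_result = overlap(top_job_skills, top_used_technologies)
normalized_result = overlap(top_job_skills, top_used_technologies, normalize=True)
print(raw_result)
print(normalized_result)
```

With exact string matching the two lists share nothing, while the case-insensitive comparison finds the common terms, which is why an overlap count of 0 points to naming differences rather than a real absence of shared technologies.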

technology_rankings_eda.csv, technology_rankings_used_eda.csv, technology_rankings_wanted_eda.csv

Copies of the three technology ranking files as they appear at EDA time, preserved so that the analysis stays reproducible even if the clean versions are regenerated. The schema is identical to that of their data/clean/ counterparts.

Generated by: 03_eda.ipynb

File Inventory

| File | Location | Rows | Notebook |
| --- | --- | --- | --- |
| jobs_all_clean.csv | data/clean/ | 2,175 | 02_cleaning.ipynb |
| jobs_clean.csv | data/clean/ | 944 | 02_cleaning.ipynb |
| tecno_jobs_clean.csv | data/clean/ | 608 | 02_cleaning.ipynb |
| scraping_jobs_template.csv | data/clean/ | 0 (headers only) | 01_data_collection.ipynb |
| stack_tech_columns_clean.csv | data/clean/ | ~49,191 | 02_cleaning.ipynb |
| technologies_clean_long_format.csv | data/clean/ | ~1,176,875 | 02_cleaning.ipynb |
| technology_rankings.csv | data/clean/ | 372 | 02_cleaning.ipynb |
| technology_rankings_used.csv | data/clean/ | 186 | 02_cleaning.ipynb |
| technology_rankings_wanted.csv | data/clean/ | 186 | 02_cleaning.ipynb |
| jobs_eda.csv | data/eda/ | 2,175 | 03_eda.ipynb |
| cleaning_validation_summary_eda.csv | data/eda/ | 11 | 03_eda.ipynb |
| skill_technology_overlap_eda.csv | data/eda/ | 2 | 03_eda.ipynb |
| technology_rankings_eda.csv | data/eda/ | 372 | 03_eda.ipynb |
| technology_rankings_used_eda.csv | data/eda/ | 186 | 03_eda.ipynb |
| technology_rankings_wanted_eda.csv | data/eda/ | 186 | 03_eda.ipynb |
To quickly verify that all expected files are present after cloning, load and inspect data/eda/cleaning_validation_summary_eda.csv. The row expected_clean_files_exist should show passed = True.
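
The check described above could be scripted along these lines. This is a sketch using the standard library: the helper names `load_rows` and `failed_checks` are hypothetical, the sample rows and their detail text are invented, and it assumes the passed column is stored as the text True or False.

```python
import csv
import io


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def failed_checks(rows):
    """Return the names of checks whose `passed` value is not the text 'True'."""
    return [row["check"] for row in rows if row["passed"].strip().lower() != "true"]


# Hypothetical sample with the documented columns: check, passed, detail.
sample_summary = list(csv.DictReader(io.StringIO(
    "check,passed,detail\n"
    "job_id_unique,True,\n"
    "expected_clean_files_exist,False,missing file\n"
)))
print(failed_checks(sample_summary))
```

Against the real file, `failed_checks(load_rows("data/eda/cleaning_validation_summary_eda.csv"))`, run from the repository root, would be expected to return an empty list when every check has passed.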
