This page documents the column schemas for the three most frequently queried datasets in the project. For a broader inventory of all available files, see Datasets. For details on how each source was collected, see Data Sources.
jobs_all_clean.csv — Main Unified Dataset
The primary analysis table. Combines all three job-offer sources into 2,175 rows and 17 columns. The first 11 columns are present in all individual source files; the last 6 are derived during cleaning in 02_cleaning.ipynb.
Loading the dataset
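A minimal loading sketch using only the Python standard library. The helper name load_rows is hypothetical, UTF-8 encoding is an assumption, and the path is left to the caller because this page does not document where the file lives in the repository.

```python
import csv
from pathlib import Path


def load_rows(path):
    """Read a CSV file into a list of dicts, one dict per row.

    Empty cells become None so that missing values are explicit.
    `load_rows` is an illustrative helper, not part of the project;
    the UTF-8 encoding is an assumption.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return [
            {column: (value if value != "" else None) for column, value in row.items()}
            for row in reader
        ]
```

If the file matches the description above, calling load_rows on jobs_all_clean.csv should return 2,175 dicts with 17 keys each.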
Parsing the skills column
The skills column is stored as a Python list literal string, not a JSON array. Use ast.literal_eval to convert it to a proper Python list before analysis:
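The conversion can be sketched as follows. The helper name parse_skills is hypothetical, and returning an empty list for missing or unparseable values is an assumption of this sketch rather than documented project behavior.

```python
import ast


def parse_skills(raw):
    """Convert a stringified Python list such as "['python', 'sql']" to a list.

    Missing values (None, NaN, empty string) and unparseable strings
    return an empty list. That fallback is an assumption of this sketch.
    """
    if not isinstance(raw, str) or not raw.strip():
        return []  # covers None and float NaN
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    return list(value) if isinstance(value, (list, tuple)) else []
```

With pandas, the same function could be applied to the whole column, for example df["skills"].apply(parse_skills), assuming the dataset was loaded into a DataFrame named df.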
Column reference
Unique offer identifier assigned during cleaning. Format: job_00001 through job_02175.
Example: job_00001
Nulls: None. Every row has a unique job_id.

Normalized job title, lowercased and stripped of extra whitespace. Common values include data scientist, data engineer, data analyst, machine learning engineer, business intelligence analyst.
Example: data scientist
Nulls: Very low. A title is required in all source datasets.

Company name. May be anonymized as company_NNN for records originating from df_jobs (the international dataset uses anonymized employer names for privacy).
Example: company_003 (anonymized) · Telefónica (named)
Nulls: Some scraping records may have empty company fields.

Raw location string exactly as it appeared in the source. May contain multiple cities separated by ".", country names, or hybrid/remote annotations. Not suitable for geographic filtering; use city_clean or is_remote instead.
Example: "Grapevine, TX . Hybrid" · "Madrid" · "España"
Nulls: Low for df_jobs and df_tecno; possible in scraping records.

Raw salary string as scraped, preserving the original format including currency symbols, ranges, and period descriptors. Not suitable for numeric analysis; use salary_clean instead.
Example: "€100,472 - €200,938" · "€118,733" · "€30.000 - €40.000 al año"
Nulls: High for df_tecno (~78% null rate). Null for most scraping records.

Contract type or employment classification. Values vary by source (e.g. Full-time, Jornada completa, Contrato indefinido). Not standardized across sources.
Example: Full-time
Nulls: Frequent. Many sources do not provide contract type.

Posting date as a raw string from the source. Format varies significantly across sources: LinkedIn-style relative dates ("17 days ago"), absolute dates, or scraping timestamps.
Example: "17 days ago" · "a month ago" · "2025-01-15"
Nulls: Some records have no date information.

URL to the original job offer page. May be empty or a redirect URL depending on the source. Adzuna records use redirect_url.
Example: "https://www.adzuna.es/jobs/details/..."
Nulls: Common. Many df_jobs and df_tecno records have no link preserved.

Skills associated with the offer, stored as a Python list literal string. Each skill is a lowercased technology or competency keyword. Must be parsed with ast.literal_eval before use. See the parsing note above.
Example: "['python', 'sql', 'machine learning', 'spark']"
Nulls: Some offers have no skills listed (NaN) or an empty list ("[]"). Scraping records from Adzuna do not include structured skill extraction.

Industry sector of the hiring company. Present mainly in df_jobs records. Values include: Technology, Finance, Healthcare, Retail, Manufacturing, Education, Energy.
Example: Technology
Nulls: High for df_tecno and scraping records. Industry is not scraped from those sources.

Experience level required. Standardized during cleaning to: junior, mid, senior, lead. Original source values are normalized to this vocabulary.
Example: senior
Nulls: High overall. Seniority is frequently absent in Spanish job postings from TecnoEmpleo.

Identifies which raw source contributed this row. Used for source-level segmentation throughout the analysis.
Nulls: None.
| Value | Source |
|---|---|
| df_jobs | data_science_job_posts_2025.csv |
| df_tecno | tecnoempleo_spain_2026.csv |
| scraping | Adzuna API results |
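The salary_clean entry that follows describes a midpoint rule for salary ranges. A minimal sketch of that rule is shown here; it is not the code from 02_cleaning.ipynb. The helper name salary_midpoint is hypothetical, and the sketch assumes that "," and "." inside a number are thousands separators, that the amounts are already in EUR, and that unparseable input yields None rather than NaN.

```python
import re


def salary_midpoint(raw):
    """Return the midpoint of the amounts in a raw salary string, or None.

    Assumption: "," and "." inside a number are thousands separators,
    as in "€100,472" and "€30.000"; decimals and currency conversion
    are not handled.
    """
    if not isinstance(raw, str):
        return None
    amounts = [
        float(re.sub(r"[.,]", "", token))
        for token in re.findall(r"\d[\d.,]*", raw)
    ]
    if not amounts:
        return None
    # A single amount is its own midpoint; a range uses its two bounds.
    low, high = amounts[0], amounts[-1]
    return (low + high) / 2
```

For the documented example, salary_midpoint("€100,472 - €200,938") gives 150705.0, which matches the salary_clean value listed on this page.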
Numeric salary in EUR, computed as the midpoint of salary ranges. Outliers are flagged separately in salary_clean_outlier. Non-parseable salary strings are set to NaN.
Example: 150705.0 (midpoint of €100,472 – €200,938) · 118733.0
Nulls: High. Inherits nulls from salary. Coverage is best for df_jobs records.

Lightly cleaned version of location: trimmed whitespace and normalized encoding. Still a free-text string. For structured geographic analysis, use city_clean.
Example: "Grapevine, TX . Hybrid"
Nulls: Mirrors nulls in location.

City name extracted from location_clean using pattern matching. Best-effort extraction: complex multi-city strings yield the first recognizable city.
Example: Grapevine · Madrid · Barcelona
Nulls: Present where city extraction failed or location was missing.

True if the offer is classified as remote (full or partial), False otherwise. Derived from keywords in location and job_type fields during cleaning.
Example: False · True
Nulls: None. Defaults to False when evidence is absent.

True if salary_clean was flagged as a statistical outlier using IQR-based detection during cleaning. Outlier records are retained in the dataset, but this flag allows easy exclusion for salary distribution analysis.
Example: False
Nulls: None. The flag is False when salary_clean is null.

technology_rankings.csv — Stack Overflow Technology Counts
Aggregated technology popularity derived from the Stack Overflow 2025 Developer Survey. Each row counts how many survey respondents reported using or wanting to use a particular technology in a given category.
Technology category. One of: language, database, platform, web_framework, development_environment, ai_model_tool.
Example: ai_model_tool

Survey response type. Either used (have worked with) or wanted (want to work with).
Example: used

Technology name exactly as it appears in the Stack Overflow survey. Capitalization and spacing are preserved from the source.
Example: openAI GPT (chatbot models) · Python · PostgreSQL

Number of Stack Overflow survey respondents who selected this technology for the given type. Based on ~90,000 global respondents.
Example: 13424 (openAI GPT chatbot models, used)

Top used AI tools:

| Technology | Count |
|---|---|
| openAI GPT (chatbot models) | 13,424 |
| Anthropic: Claude Sonnet | 7,063 |
| Gemini (Flash general purpose models) | 5,823 |
| openAI Reasoning models | 5,716 |
technology_rankings_used.csv and technology_rankings_wanted.csv are direct subsets of this file filtered by type. Use the combined file for side-by-side used vs. wanted comparisons, or the split files for simpler queries.

technologies_clean_long_format.csv — Respondent-Level Long Format
The tidy, normalized version of the Stack Overflow survey technology data. Every row represents a single respondent’s relationship with a single technology. This is the source used to compute technology_rankings.csv.
Anonymous respondent identifier from the Stack Overflow 2025 survey. Matches response_id in stack_tech_columns_clean.csv for joining back to the wide format.
Example: 1 · 2 · 90231

Name of the technology reported by this respondent. Same vocabulary as technology_rankings.csv.
Example: SQL · Bash/Shell (all shells) · openAI GPT (chatbot models)

Technology category. Same controlled vocabulary as technology_rankings.csv.
Example: language · database · ai_model_tool

Whether this respondent used or wanted this technology.
Example: used · wanted
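
Because this long-format table is the source for technology_rankings.csv, the aggregation can be sketched as a count of rows per category, type, and technology. The column names category, type, technology, and count are assumptions (this page does not list them for this file), rank_technologies is a hypothetical helper, and the sketch assumes each respondent appears at most once per technology and type.

```python
from collections import Counter


def rank_technologies(rows):
    """Count respondents per (category, type, technology).

    `rows` is an iterable of dicts with the keys "category", "type"
    and "technology" (assumed column names, not confirmed by the docs).
    """
    counts = Counter(
        (row["category"], row["type"], row["technology"]) for row in rows
    )
    # most_common() returns the groups sorted by descending count.
    return [
        {"category": c, "type": t, "technology": tech, "count": n}
        for (c, t, tech), n in counts.most_common()
    ]
```

Filtering the result on type then yields the equivalent of the split used and wanted files.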