Column Schema Reference for the Three Core Datasets

This page documents the column schemas for the three most frequently queried datasets in the project. For a broader inventory of all available files, see Datasets. For details on how each source was collected, see Data Sources.

`jobs_all_clean.csv` — Main Unified Dataset

The primary analysis table. Combines all three job-offer sources into 2,175 rows and 17 columns. The first 11 columns are present in all individual source files; the last 6 are derived during cleaning in 02_cleaning.ipynb.

Loading the dataset

import pandas as pd
from pathlib import Path

DATA_CLEAN = Path("data/clean")
df = pd.read_csv(DATA_CLEAN / "jobs_all_clean.csv")

print(df.shape)          # (2175, 17)
print(df.dtypes)
print(df.isnull().sum())
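Before analysis, it can help to confirm that the loaded frame has the columns documented on this page. The sketch below is one way to do that. The helper name `missing_columns` is hypothetical, and the column list is copied from the column reference on this page rather than from the cleaning notebook, so treat it as an assumption about the file. It checks presence only, not column order.

```python
# Column names as documented in the column reference on this page.
EXPECTED_COLUMNS = [
    "job_id", "job_title", "company", "location", "salary", "job_type",
    "post_date", "link", "skills", "industry", "seniority_level",
    "source_dataset", "salary_clean", "location_clean", "city_clean",
    "is_remote", "salary_clean_outlier",
]


def missing_columns(columns):
    """Return documented column names absent from `columns` (hypothetical helper)."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# Usage with the frame loaded above:
#   missing = missing_columns(df.columns)
#   assert not missing, f"Missing columns: {missing}"
```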

Parsing the `skills` column

The skills column is stored as a Python list literal string, not a JSON array. Use ast.literal_eval to convert it to a proper Python list before analysis:

import ast

df['skills_list'] = df['skills'].apply(
    lambda x: ast.literal_eval(x) if pd.notna(x) and x.strip() != '' else []
)

# Explode into one skill per row for frequency analysis
skills_exploded = df.explode('skills_list')
print(skills_exploded['skills_list'].value_counts().head(10))

Some rows have skills = "[]" (an empty list string) or skills = NaN. The lambda above handles both cases safely. Do not use json.loads() — the list uses single quotes, which is not valid JSON.
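The lambda assumes every non-empty value is a well-formed list literal. If that might not hold for some rows, a more defensive variant could look like the sketch below. `parse_skills` is a hypothetical helper, not part of the project code; it falls back to an empty list for anything it cannot parse instead of raising.

```python
import ast
import math


def parse_skills(value):
    """Parse a list-literal string into a list; return [] for unusable input."""
    # Treat None and float NaN (how pandas represents missing strings) as empty.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    text = str(value).strip()
    if not text:
        return []
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Malformed literal: fall back to an empty list rather than failing.
        return []
    # Accept only list-like results; anything else is treated as "no skills".
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


# Usage with the frame loaded above:
#   df["skills_list"] = df["skills"].apply(parse_skills)
```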

Column reference

job_id

string

required

Unique offer identifier assigned during cleaning. Format: job_00001 through job_02175.

Example: job_00001

Nulls: None. Every row has a unique job_id.

job_title

string

required

Normalized job title, lowercased and stripped of extra whitespace. Common values include data scientist, data engineer, data analyst, machine learning engineer, and business intelligence analyst.

Example: data scientist

Nulls: Very low. A title is required in all source datasets.

company

string

Company name. May be anonymized as company_NNN for records originating from df_jobs (the international dataset uses anonymized employer names for privacy).

Example: company_003 (anonymized) · Telefónica (named)

Nulls: Some scraping records may have empty company fields.

location

string

Raw location string exactly as it appeared in the source. May contain multiple cities separated by " . ", country names, or hybrid/remote annotations. Not suitable for geographic filtering; use city_clean or is_remote instead.

Example: "Grapevine, TX . Hybrid" · "Madrid" · "España"

Nulls: Low for df_jobs and df_tecno; possible in scraping records.

salary

string

Raw salary string as scraped, preserving the original format including currency symbols, ranges, and period descriptors. Not suitable for numeric analysis; use salary_clean instead.

Example: "€100,472 - €200,938" · "€118,733" · "€30.000 - €40.000 al año"

Nulls: High for df_tecno (~78% null rate). Null for most scraping records.

job_type

string

Contract type or employment classification. Values vary by source (e.g. Full-time, Jornada completa, Contrato indefinido). Not standardized across sources.

Example: Full-time

Nulls: Frequent. Many sources do not provide contract type.

post_date

string

Posting date as a raw string from the source. Format varies significantly across sources: LinkedIn-style relative dates ("17 days ago"), absolute dates, or scraping timestamps.

Example: "17 days ago" · "a month ago" · "2025-01-15"

Nulls: Some records have no date information.
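Because post_date mixes relative and absolute formats, any time-based analysis needs a normalization step first. The sketch below shows one possible approach, not the project's actual cleaning logic. `parse_post_date` is a hypothetical helper, the reference date is an assumption (the collection date of each source is not documented here), months and years are approximated as 30 and 365 days, and only English relative phrases and ISO dates are handled.

```python
import re
from datetime import date, timedelta

# Approximate day counts per unit; months and years are deliberately rough.
_UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}
_RELATIVE = re.compile(
    r"^(a|an|\d+)\s+(day|week|month|year)s?\s+ago$", re.IGNORECASE
)


def parse_post_date(raw, reference):
    """Convert a raw post_date string to a date, or None if unrecognized.

    `reference` is the date the offer was collected (an assumption the
    caller must supply; it is not stored in the dataset).
    """
    if not isinstance(raw, str):
        return None
    text = raw.strip()
    match = _RELATIVE.match(text)
    if match:
        quantity, unit = match.groups()
        count = 1 if quantity.lower() in ("a", "an") else int(quantity)
        return reference - timedelta(days=count * _UNIT_DAYS[unit.lower()])
    try:
        # Absolute dates in ISO format, e.g. "2025-01-15".
        return date.fromisoformat(text)
    except ValueError:
        return None
```

Anything the helper does not recognize comes back as `None`, so unparsed formats can be counted and reviewed rather than silently coerced.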

link

string

URL to the original job offer page. May be empty or a redirect URL depending on the source. Adzuna records use redirect_url.

Example: "https://www.adzuna.es/jobs/details/..."

Nulls: Common. Many df_jobs and df_tecno records have no link preserved.

skills

string

Skills associated with the offer, stored as a Python list literal string. Each skill is a lowercased technology or competency keyword. Must be parsed with ast.literal_eval before use. See the parsing note above.

Example: "['python', 'sql', 'machine learning', 'spark']"

Nulls: Some offers have no skills listed (NaN) or an empty list ("[]"). Scraping records from Adzuna do not include structured skill extraction.

industry

string

Industry sector of the hiring company. Present mainly in df_jobs records. Values include: Technology, Finance, Healthcare, Retail, Manufacturing, Education, Energy.

Example: Technology

Nulls: High for df_tecno and scraping records, because industry is not scraped from those sources.

seniority_level

string

Experience level required. Standardized during cleaning to: junior, mid, senior, lead. Original source values are normalized to this vocabulary.

Example: senior

Nulls: High overall. Seniority is frequently absent in Spanish job postings from TecnoEmpleo.

source_dataset

string

required

Identifies which raw source contributed this row. Used for source-level segmentation throughout the analysis.

Value	Source
`df_jobs`	`data_science_job_posts_2025.csv`
`df_tecno`	`tecnoempleo_spain_2026.csv`
`scraping`	Adzuna API results

Nulls: None.
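Since null rates differ sharply by source, a per-source null-rate table is a quick way to check coverage before segmenting. The sketch below uses a small made-up frame in place of `jobs_all_clean.csv`; the values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Made-up stand-in rows; replace `sample` with the loaded `df` in practice.
sample = pd.DataFrame({
    "source_dataset": ["df_jobs", "df_jobs", "df_tecno", "df_tecno", "scraping"],
    "salary": ["€118,733", None, None, None, None],
    "industry": ["Technology", "Finance", None, None, None],
})

# Fraction of missing values per column, computed separately for each source.
null_rate_by_source = (
    sample.set_index("source_dataset")
    .isna()
    .groupby(level="source_dataset")
    .mean()
)
print(null_rate_by_source)
```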

salary_clean

float

Numeric salary in EUR, computed as the midpoint of salary ranges. Outliers are flagged separately in salary_clean_outlier. Non-parseable salary strings are set to NaN.

Example: 150705.0 (midpoint of €100,472 – €200,938) · 118733.0

Nulls: High. Inherits nulls from salary. Coverage is best for df_jobs records.
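To make the midpoint rule concrete, here is a minimal sketch of how a raw salary string could be reduced to a single number. It is not the logic from 02_cleaning.ipynb. `salary_midpoint` is a hypothetical helper that assumes both `.` and `,` are thousands separators (no decimal amounts), uses at most the first two numbers in the string, and does no currency conversion.

```python
import re

# A number is a digit followed by any run of digits, dots, or commas.
_NUMBER = re.compile(r"\d[\d.,]*")


def salary_midpoint(raw):
    """Return the midpoint of the first two amounts in `raw`, or NaN."""
    if not isinstance(raw, str):
        return float("nan")
    amounts = []
    for token in _NUMBER.findall(raw):
        # Assumption: '.' and ',' only ever separate thousands.
        digits = re.sub(r"[.,]", "", token)
        if digits:
            amounts.append(float(digits))
    if not amounts:
        return float("nan")
    bounds = amounts[:2]  # a single amount is its own midpoint
    return sum(bounds) / len(bounds)


print(salary_midpoint("€100,472 - €200,938"))       # 150705.0
print(salary_midpoint("€118,733"))                  # 118733.0
print(salary_midpoint("€30.000 - €40.000 al año"))  # 35000.0
```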

location_clean

string

Lightly cleaned version of location: trimmed whitespace and normalized encoding. Still a free-text string. For structured geographic analysis, use city_clean.

Example: "Grapevine, TX . Hybrid"

Nulls: Mirrors nulls in location.

city_clean

string

City name extracted from location_clean using pattern matching. Extraction is best-effort: complex multi-city strings yield the first recognizable city.

Example: Grapevine · Madrid · Barcelona

Nulls: Present where city extraction failed or location was missing.

is_remote

boolean

required

True if the offer is classified as remote (full or partial), False otherwise. Derived from keywords in location and job_type fields during cleaning.

Example: False · True

Nulls: None. Defaults to False when evidence is absent.
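A keyword-based flag of this kind can be sketched as follows. The keyword list is an assumption chosen for illustration; the actual terms used during cleaning are not documented here. `looks_remote` is a hypothetical helper.

```python
import re

# Assumed English and Spanish keywords for full or partial remote work.
_REMOTE = re.compile(
    r"\b(remote|remoto|teletrabajo|hybrid|h[ií]brido)\b", re.IGNORECASE
)


def looks_remote(location, job_type):
    """Return True if either field mentions a remote or hybrid keyword."""
    # Missing values (None or NaN) are skipped, so absent evidence gives False.
    fields = [value for value in (location, job_type) if isinstance(value, str)]
    return any(_REMOTE.search(value) for value in fields)


print(looks_remote("Grapevine, TX . Hybrid", None))  # True
print(looks_remote("Madrid", "Full-time"))           # False
```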

salary_clean_outlier

boolean

required

True if salary_clean was flagged as a statistical outlier using IQR-based detection during cleaning. Outlier records are retained in the dataset, but this flag allows easy exclusion for salary distribution analysis.

Example: False

Nulls: None. The flag is False when salary_clean is also null.
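For readers unfamiliar with IQR-based detection, the sketch below shows the conventional form of the rule. The 1.5 multiplier is the common default and an assumption here; the threshold used in the cleaning notebook is not stated on this page. `flag_iqr_outliers` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def flag_iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN is never flagged."""
    q1 = values.quantile(0.25)  # quantile() ignores NaN
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons against NaN evaluate to False, so nulls stay unflagged.
    return (values < lower) | (values > upper)


salaries = pd.Series([30000, 35000, 40000, 45000, 50000, 500000, float("nan")])
flags = flag_iqr_outliers(salaries)
print(flags.tolist())
```

With the flag in place, salary distributions can be computed on `df[~df["salary_clean_outlier"]]` while the outlier rows stay available for inspection.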

`technology_rankings.csv` — Stack Overflow Technology Counts

Aggregated technology popularity derived from the Stack Overflow 2025 Developer Survey. Each row counts how many survey respondents reported using or wanting to use a particular technology in a given category.

df_rankings = pd.read_csv(DATA_CLEAN / "technology_rankings.csv")
print(df_rankings.shape)  # (372, 4)

# Top 10 most-used languages
langs_used = df_rankings[
    (df_rankings['category'] == 'language') &
    (df_rankings['type'] == 'used')
].nlargest(10, 'count')
print(langs_used[['technology', 'count']])

Technology	Count
openAI GPT (chatbot models)	13,424
Anthropic: Claude Sonnet	7,063
Gemini (Flash general purpose models)	5,823
openAI Reasoning models	5,716

`technologies_clean_long_format.csv` — Respondent-Level Long Format

The tidy, normalized version of the Stack Overflow survey technology data. Every row represents a single respondent’s relationship with a single technology. This is the source used to compute technology_rankings.csv.

df_long = pd.read_csv(DATA_CLEAN / "technologies_clean_long_format.csv")
print(df_long.shape)  # (~1,176,875, 4)

# Count respondents who use Python
python_users = df_long[
    (df_long['technology'] == 'Python') &
    (df_long['type'] == 'used')
].shape[0]
print(f"Python users in survey: {python_users}")

This file is ~1.18 million rows and approximately 60 MB on disk. Load only the columns or filters you need in memory-constrained environments.
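One way to follow that advice is to read only the needed columns and process the file in chunks. The sketch below uses a tiny in-memory CSV in place of the real file so that it runs anywhere. The rows are made up, and the `category` column is an assumption, since only `response_id`, `technology`, and `type` appear on this page.

```python
import io

import pandas as pd

# Made-up stand-in for technologies_clean_long_format.csv.
# In practice, pass DATA_CLEAN / "technologies_clean_long_format.csv" instead.
csv_source = io.StringIO(
    "response_id,category,technology,type\n"
    "1,language,Python,used\n"
    "1,language,SQL,used\n"
    "2,language,Python,wanted\n"
    "2,language,SQL,used\n"
    "3,language,Python,used\n"
)

python_users = 0
# usecols limits memory per row; chunksize limits rows held at once.
for chunk in pd.read_csv(csv_source, usecols=["technology", "type"], chunksize=2):
    mask = (chunk["technology"] == "Python") & (chunk["type"] == "used")
    python_users += int(mask.sum())

print(f"Python users in sample: {python_users}")
```

For the real file a much larger `chunksize` (for example 100,000 rows) would be more sensible; the value 2 here only demonstrates that counts accumulate correctly across chunks.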

response_id

string

required

Anonymous respondent identifier from the Stack Overflow 2025 survey. Matches response_id in stack_tech_columns_clean.csv for joining back to the wide format.

Example: 1 · 2 · 90231

technology

string

required

Name of the technology reported by this respondent. Same vocabulary as technology_rankings.csv.

Example: SQL · Bash/Shell (all shells) · openAI GPT (chatbot models)
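Because this file is described as the source for technology_rankings.csv, the aggregation between the two can be sketched as a group-and-count. This is an illustration under assumptions, not the project's actual pipeline: the `category` column is assumed to exist in the long format, the rows below are made up, and each row is assumed to represent one respondent for a given technology and type.

```python
import pandas as pd

# Made-up stand-in for technologies_clean_long_format.csv.
df_long_sample = pd.DataFrame({
    "response_id": [1, 1, 2, 2, 3],
    "category": ["language"] * 5,  # assumed column
    "technology": ["Python", "SQL", "Python", "SQL", "Python"],
    "type": ["used", "used", "wanted", "used", "used"],
})

# Count rows per (category, type, technology) to get a rankings-style table.
rankings = (
    df_long_sample
    .groupby(["category", "type", "technology"])
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False, ignore_index=True)
)
print(rankings)
```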
