Skill Aggregation Bias: 35 Categories Collapse Skill Data

Skill demand analysis is one of the primary value propositions of a LinkedIn-based HR intelligence product. Employers want to know which skills are in demand, how demand is shifting, and what skill profiles define competitive candidates in their sector. The LinkedIn dataset provides a job_skills table that links postings to skill tags — but these tags are aggregated into 35 broad categories that collapse meaningfully distinct skills into undifferentiated buckets. The result is a skill demand signal that is real but coarse: useful for macro-level category trends, but blind to the fine-grained tool and technology distinctions that actually determine whether a candidate is qualified for a specific role.

The Numbers

Metric	Value
Total skill assignments in dataset	213,768
Distinct skill categories (`skill_abr`)	35
Average skill assignments per posting	~1.7
`skills_desc` column (free text) null rate	~98%

With 213,768 skill assignments spread across only 35 categories, each category carries an average of ~6,108 skill assignments — an enormous amount of information compressed into a single label. The skills_desc column, which was presumably intended to carry free-text skill descriptions and could have provided the granularity the abbreviations lack, is 98% null.
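The per-category average quoted above follows directly from the two figures in the table. A minimal check of that arithmetic (variable names are illustrative only):

```python
# Figures taken from the table above.
total_assignments = 213_768
n_categories = 35

# Average number of skill assignments carried by each category label.
avg_per_category = total_assignments / n_categories
print(f"{avg_per_category:,.0f}")  # → 6,108
```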

The 35-Category Problem

LinkedIn’s skill category abbreviations include labels such as:

IT · DATA · MRKT · DSGN · WRT · SALE · MGMT · FINC · ENGG · HLTH · LEGL · EDUC · OPER · COMS · MNFG

Consider what collapses into a single IT or DATA tag:

Collapsed into `DATA`	Collapsed into `IT`
Python	JavaScript
SQL	TypeScript
R	Java
Apache Spark	Go / Rust
dbt (data build tool)	Kubernetes
Tableau	AWS / Azure / GCP
Power BI	Linux system administration
Apache Airflow	Network security
TensorFlow	Salesforce / SAP
PyTorch	API development

A posting requiring a senior dbt engineer with Snowflake and Airflow experience produces the same DATA tag as a posting requiring a junior Excel analyst. A requirement for PyTorch-specific deep learning expertise is indistinguishable from a requirement for Tableau dashboarding. From skill_abr alone, these roles are identical.
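A toy illustration of this collapse, using two made-up postings. The `title` and `tools_in_description` columns are hypothetical and exist only for this sketch; only `skill_abr` corresponds to a field named in the dataset.

```python
import pandas as pd

# Two invented postings with very different tool requirements.
postings = pd.DataFrame({
    "title": ["Senior dbt Engineer", "Junior Excel Analyst"],
    "tools_in_description": [["dbt", "Snowflake", "Airflow"], ["Excel"]],
    "skill_abr": ["DATA", "DATA"],  # both carry the same category tag
})

# Seen through skill_abr, the two postings are a single profile ...
distinct_tags = postings["skill_abr"].nunique()

# ... while their tool sets are clearly different.
distinct_tool_sets = postings["tools_in_description"].map(frozenset).nunique()

print(distinct_tags, distinct_tool_sets)  # → 1 2
```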

Consequence for DataTalent Solutions

This aggregation creates a specific problem for client-facing skill gap analysis:

Tool-specific hiring signals are invisible: DataTalent cannot determine whether Spanish companies are hiring for Tableau analysts, Power BI developers, or dbt engineers — all appear as “DATA” demand
Seniority cannot be inferred from skill tags: a DATA tag on a junior analyst posting looks the same as a DATA tag on a Principal Data Architect posting
Adjacent skill combinations are lost: the combination of DATA + MGMT might indicate a data product manager role or a data governance lead — two very different profiles that would require different candidate pipelines
Emerging technology demand is invisible: new tools (e.g., dbt, Polars, LangChain) that emerged after LinkedIn’s skill taxonomy was defined may be tagged inconsistently or not at all
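To see how little a tag combination says about the underlying role, one option is to count combinations per posting. The sketch below assumes a long-format `job_skills` table with `job_id` and `skill_abr` columns (the column names are an assumption) and uses invented rows:

```python
import pandas as pd

# Invented rows standing in for the job_skills table (one row per assignment).
job_skills = pd.DataFrame({
    "job_id": [1, 1, 2, 2, 3],
    "skill_abr": ["DATA", "MGMT", "MGMT", "DATA", "IT"],
})

# Collapse each posting's tags into one order-independent combination label,
# then count how many postings share each combination.
combos = (
    job_skills.groupby("job_id")["skill_abr"]
    .apply(lambda tags: " + ".join(sorted(set(tags))))
    .value_counts()
)
print(combos)
```

Postings 1 and 2 both land in the `DATA + MGMT` bucket, whether they describe a data product manager or a data governance lead. The count is valid, but the bucket cannot be split further without another signal.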

Do not present skill_abr category rankings to clients as a fine-grained skill demand analysis. The 35-category taxonomy cannot distinguish between adjacent tools in the same technology family. Frame these rankings as macro-level category trends only.

Inspecting Skill Categories

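import pandas as pd

# Load the mapping from skill abbreviations to category labels.
skills_map = pd.read_csv('archive/mappings/skills.csv')
print(skills_map.to_string())  # All 35 categories

# df is assumed to be the postings DataFrame, with job_skills_list holding
# each posting's skill tags as a single comma-separated string.
top_skills = (
    df['job_skills_list'].str.split(', ')
    .explode()
    .value_counts()
    .head(10)
)
print(top_skills)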

This code reveals the full skill category label set and the frequency distribution of skill assignments across postings. Running it establishes the ceiling of what skill_abr-based analysis can tell you before any mitigation is applied.
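A complementary view is each category's share of all assignments, joined to its readable label. The following is a sketch only: it assumes a long-format `job_skills` table and a mapping with `skill_abr` and `skill_name` columns (both column names are assumptions), and it uses small invented frames so that it runs on its own.

```python
import pandas as pd

# Invented stand-ins for the real tables; labels are placeholders.
job_skills = pd.DataFrame({
    "job_id": [1, 1, 2, 3, 4],
    "skill_abr": ["IT", "DATA", "IT", "SALE", "IT"],
})
skills_map = pd.DataFrame({
    "skill_abr": ["IT", "DATA", "SALE"],
    "skill_name": ["Information Technology", "Data", "Sales"],
})

# Count assignments per category, then attach the readable label.
category_counts = (
    job_skills["skill_abr"]
    .value_counts()
    .rename_axis("skill_abr")
    .reset_index(name="assignments")
    .merge(skills_map, on="skill_abr", how="left")
)

# Share of all assignments carried by each category.
category_counts["share"] = (
    category_counts["assignments"] / category_counts["assignments"].sum()
)
print(category_counts)
```

On the real data, a `share` column concentrated in a handful of categories would indicate how much of the demand signal sits inside a few undifferentiated buckets.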

Mitigation Strategies

Short-term: Description Field NLP

The description field contains free-text job descriptions and is the richest source of fine-grained skill signals in the dataset. NLP-based skill extraction can partially compensate for aggregation bias:

Named entity recognition (NER) to extract technology names, tool names, and certification requirements
Keyword matching against external skill taxonomies (ESCO, O*NET, Lightcast) to tag individual tools
Frequency analysis of technology mentions across description text to reconstruct tool-specific demand rankings
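The keyword-matching and frequency steps above can be sketched as follows. The pattern list is a small hand-written example, not a real taxonomy export, and `extract_tools` is a hypothetical helper. A production version would load its terms from ESCO, O*NET, or Lightcast and handle aliases and ambiguous names (for example "R" or "Go") more carefully.

```python
import re
import pandas as pd

# Hand-written example patterns; a real list would come from a taxonomy.
TOOL_PATTERNS = {
    "dbt": r"\bdbt\b",
    "Tableau": r"\btableau\b",
    "Power BI": r"\bpower\s?bi\b",
    "Airflow": r"\bairflow\b",
    "PyTorch": r"\bpytorch\b",
}
COMPILED = {
    tool: re.compile(pattern, re.IGNORECASE)
    for tool, pattern in TOOL_PATTERNS.items()
}

def extract_tools(text):
    """Return the tools mentioned in one description (empty list if missing)."""
    if not isinstance(text, str):
        return []
    return [tool for tool, pattern in COMPILED.items() if pattern.search(text)]

# Invented descriptions standing in for the description column.
postings = pd.DataFrame({"description": [
    "Senior engineer with dbt, Snowflake and Airflow experience.",
    "Build Tableau dashboards; PowerBI a plus.",
    None,
]})

# One row per (posting, tool) mention, then count mentions per tool.
tool_demand = (
    postings["description"]
    .map(extract_tools)
    .explode()
    .dropna()
    .value_counts()
)
print(tool_demand)
```

Each tool is counted at most once per posting, so the result reads as "number of postings mentioning the tool" rather than raw mention frequency.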

Long-term: External Skill Taxonomy Integration

Supplement the LinkedIn skill tags with established external taxonomies that provide the granularity the dataset lacks:

Taxonomy	Coverage	Use Case
ESCO (European Skills, Competences, Qualifications and Occupations)	EU-focused, multilingual	Spanish market alignment
O*NET (US Dept of Labor)	US-focused, highly granular	US benchmark comparisons
Lightcast (formerly Burning Glass)	Commercial, real-time	Fine-grained tool demand
SFIA (Skills Framework for the Information Age)	IT-specific, leveled	Seniority calibration

The ESCO taxonomy is available in Spanish and aligns with EU labor market frameworks. It is the most appropriate external taxonomy for DataTalent Solutions’ Spanish client work. ESCO provides over 13,890 skills and competences at a granularity that the 35-category LinkedIn taxonomy cannot match.
